* [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
@ 2022-03-10 14:08 Chao Peng
  2022-03-10 14:08 ` [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
                   ` (14 more replies)
  0 siblings, 15 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:08 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

This is v5 of the series that implements fd-based KVM guest private
memory. The patches are based on the latest kvm/queue branch commit:

  d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
 
Introduction
------------
In general this patch series introduces an fd-based memslot which provides
guest memory through a memory file descriptor fd[offset,size] instead of
hva/size. The fd can be created from a supported memory filesystem
like tmpfs/hugetlbfs etc., which we refer to as the memory backing store.
KVM and the memory backing store exchange callbacks when such a memslot
gets created. At runtime KVM calls into the callbacks provided by the
backing store to obtain the pfn for a given fd+offset. The memory backing
store also calls into KVM callbacks when userspace fallocates/punches a
hole in the fd, notifying KVM to map/unmap the secondary MMU page tables.

Compared to the existing hva-based memslot, this new type of memslot
allows guest memory to be unmapped from host userspace (e.g. QEMU) and
even the kernel itself, thereby reducing the attack surface and preventing
bugs.

Based on this fd-based memslot, we can build guest private memory that is
going to be used in confidential computing environments such as Intel TDX
and AMD SEV. When supported, the memory backing store can provide more
enforcement on the fd and KVM can use a single memslot to hold both the
private and shared parts of the guest memory.

mm extension
------------
Introduces a new MFD_INACCESSIBLE flag for memfd_create(). A file created
with this flag cannot be read(), written or mmap()-ed via normal MMU
operations. The file content can only be accessed through the newly
introduced memfile_notifier extension.

The memfile_notifier extension provides two sets of callbacks for KVM to
interact with the memory backing store:
  - memfile_notifier_ops: callbacks for memory backing store to notify
    KVM when memory gets allocated/invalidated.
  - memfile_pfn_ops: callbacks for KVM to call into memory backing store
    to request memory pages for guest private memory.

The memfile_notifier extension also provides APIs for the memory backing
store to register/unregister itself and to trigger the notifiers when the
bookmarked memory gets fallocated/invalidated.

memslot extension
-----------------
Add the private fd and the fd offset to the existing 'shared' memslot so
that both private and shared guest memory can live in one single memslot.
A page in the memslot is either private or shared. A page is private only
when it's already allocated in the backing store fd; in all other cases it
is treated as shared, including pages already mapped as shared as well as
pages that have not been mapped at all. This means the memory backing
store is the single source of truth about which pages are private.

Private memory map/unmap and conversion
---------------------------------------
Userspace's map/unmap operations are done via the fallocate() syscall on
the backing store fd:
  - map: default fallocate() with mode=0.
  - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
A map/unmap triggers the memfile_notifier_ops above so that KVM can
map/unmap the secondary MMU page tables accordingly (a rough userspace
sketch follows).
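
As a minimal userspace sketch (not part of this series; assumes 'fd' is an
MFD_INACCESSIBLE memfd and offset/len are page-aligned):

#define _GNU_SOURCE
#include <fcntl.h>

/* Map (allocate) guest private memory in the backing store. */
static int private_mem_map(int fd, off_t offset, off_t len)
{
	return fallocate(fd, 0, offset, len);
}

/*
 * Unmap it again; punching a hole triggers the invalidate notifier.
 * FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE.
 */
static int private_mem_unmap(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}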

Test
----
To test the new functionalities of this patch series the TDX patchset is
needed. Since the TDX patchset has not been merged yet, I did two kinds of
tests:

-  Regression test on kvm/queue (this series)
   Most of the new code is not covered. I only tested building and booting.

-  New functional test on the latest TDX code
   The series is rebased onto the latest TDX code and the new
   functionalities were tested.

For the TDX tests please see the repos below:
Linux: https://github.com/chao-p/linux/tree/privmem-v5.1
QEMU: https://github.com/chao-p/qemu/tree/privmem-v4

And an example QEMU command line:
-object tdx-guest,id=tdx \
-object memory-backend-memfd-private,id=ram1,size=2G \
-machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1

Changelog
---------
v5:
  - Removed the userspace-visible F_SEAL_INACCESSIBLE; an in-kernel flag
    (SHM_F_INACCESSIBLE for shmem) is used instead. A private fd can only
    be created with MFD_INACCESSIBLE.
  - Introduced new APIs for the backing store to register itself with
    memfile_notifier instead of using direct function calls.
  - Added the accounting and restriction for MFD_INACCESSIBLE memory.
  - Added KVM API doc for new memslot extensions and man page for the new
    MFD_INACCESSIBLE flag.
  - Removed the overlap check for mapping the same file+offset into
    multiple gfns due to performance considerations; a warning was added
    to the documentation instead.
  - Addressed other comments in v4.
v4:
  - Decoupled the callbacks between KVM/mm from memfd and used the new
    name 'memfile_notifier'.
  - Supported registering multiple memslots to the same backing store.
  - Added per-memslot pfn_ops instead of per-system.
  - Reworked the invalidation part.
  - Improved new KVM uAPIs (private memslot extension and memory
    error) per Sean's suggestions.
  - Addressed many other minor comments from v3.
v3:
  - Added locking protection when calling
    invalidate_page_range/fallocate callbacks.
  - Changed the memslot structure to keep using useraddr (hva) for shared
    memory.
  - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
  - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
  - Commit message improvement.
  - Many small fixes for comments from the last version.

Links to previous discussions
-----------------------------
[1] Original design proposal:
https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
[2] Updated proposal and RFC patch v1:
https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
[3] Patch v4: https://lkml.org/lkml/2022/1/18/395

Chao Peng (10):
  mm: Introduce memfile_notifier
  mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  KVM: Extend the memslot to support fd-based private memory
  KVM: Use kvm_userspace_memory_region_ext
  KVM: Add KVM_EXIT_MEMORY_ERROR exit
  KVM: Use memfile_pfn_ops to obtain pfn for private pages
  KVM: Handle page fault for private memory
  KVM: Register private memslot to memory backing store
  KVM: Zap existing KVM mappings when pages changed in the private fd
  KVM: Expose KVM_MEM_PRIVATE

Kirill A. Shutemov (2):
  mm/memfd: Introduce MFD_INACCESSIBLE flag
  mm/shmem: Support memfile_notifier

 Documentation/virt/kvm/api.rst   |  59 +++++++++--
 arch/x86/kvm/Kconfig             |   1 +
 arch/x86/kvm/mmu/mmu.c           |  73 +++++++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h   |  11 ++-
 arch/x86/kvm/x86.c               |  12 +--
 include/linux/kvm_host.h         |  49 ++++++++-
 include/linux/memfile_notifier.h |  64 ++++++++++++
 include/linux/shmem_fs.h         |  11 +++
 include/uapi/linux/kvm.h         |  17 ++++
 include/uapi/linux/memfd.h       |   1 +
 mm/Kconfig                       |   4 +
 mm/Makefile                      |   1 +
 mm/memfd.c                       |  26 ++++-
 mm/memfile_notifier.c            | 114 +++++++++++++++++++++
 mm/shmem.c                       | 156 +++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c              | 165 +++++++++++++++++++++++++++----
 16 files changed, 717 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

-- 
2.17.1




* [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
@ 2022-03-10 14:08 ` Chao Peng
  2022-04-11 15:10   ` Kirill A. Shutemov
  2022-04-23  5:43   ` Vishal Annapurve
  2022-03-10 14:09 ` [PATCH v5 02/13] mm: Introduce memfile_notifier Chao Peng
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:08 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Introduce a new memfd_create() flag indicating that the content of the
created memfd is inaccessible from userspace through ordinary MMU access
(e.g., read/write/mmap). However, the file content can still be accessed
indirectly via a different mechanism (e.g. the KVM MMU).

It provides the semantics required for KVM guest private memory support:
a file descriptor with this flag set is going to be used as the source of
guest memory in confidential computing environments such as Intel TDX and
AMD SEV, but may not be accessible from host userspace.

Since page migration/swapping is not yet supported for such usages, these
pages are currently marked as UNMOVABLE and UNEVICTABLE, which makes them
behave like long-term pinned pages.

The flag cannot coexist with MFD_ALLOW_SEALING; future sealing is also
impossible for a memfd created with this flag.

At this time only shmem implements this flag.
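
A minimal userspace usage sketch (not part of this patch; assumes the
glibc memfd_create() wrapper and the uapi value added below):

#include <sys/mman.h>		/* memfd_create() */
#include <unistd.h>		/* ftruncate(), close() */

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U	/* value added by this patch */
#endif

/*
 * Create an inaccessible memfd of 'size' bytes; size must be page-aligned
 * (see the shmem_setattr() change below).
 */
int create_inaccessible_memfd(size_t size)
{
	int fd = memfd_create("guest-private-mem", MFD_INACCESSIBLE);

	if (fd < 0)
		return -1;
	/* Ordinary read()/write()/mmap() on this fd will now fail. */
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}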

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/shmem_fs.h   |  7 +++++
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 | 26 +++++++++++++++--
 mm/shmem.c                 | 57 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index e65b80ed09e7..2dde843f28ef 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@
 
 /* inode in-kernel data */
 
+/* shmem extended flags */
+#define SHM_F_INACCESSIBLE	0x0001  /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
+
 struct shmem_inode_info {
 	spinlock_t		lock;
 	unsigned int		seals;		/* shmem seals */
@@ -24,6 +27,7 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
+	unsigned int		xflags;		/* shmem extended flags */
 	struct inode		vfs_inode;
 };
 
@@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
 					    unsigned long flags);
+extern struct file *shmem_file_setup_xflags(const char *name, loff_t size,
+					    unsigned long flags,
+					    unsigned int xflags);
 extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
 		const char *name, loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 9f80f162791a..74d45a26cf5d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
 		unsigned int, flags)
 {
+	struct address_space *mapping;
 	unsigned int *file_seals;
+	unsigned int xflags;
 	struct file *file;
 	int fd, error;
 	char *name;
+	gfp_t gfp;
 	long len;
 
 	if (!(flags & MFD_HUGETLB)) {
@@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -301,8 +309,11 @@ SYSCALL_DEFINE2(memfd_create,
 					HUGETLB_ANONHUGE_INODE,
 					(flags >> MFD_HUGE_SHIFT) &
 					MFD_HUGE_MASK);
-	} else
-		file = shmem_file_setup(name, 0, VM_NORESERVE);
+	} else {
+		xflags = flags & MFD_INACCESSIBLE ? SHM_F_INACCESSIBLE : 0;
+		file = shmem_file_setup_xflags(name, 0, VM_NORESERVE, xflags);
+	}
+
 	if (IS_ERR(file)) {
 		error = PTR_ERR(file);
 		goto err_fd;
@@ -313,6 +324,15 @@ SYSCALL_DEFINE2(memfd_create,
 	if (flags & MFD_ALLOW_SEALING) {
 		file_seals = memfd_file_seals_ptr(file);
 		*file_seals &= ~F_SEAL_SEAL;
+	} else if (flags & MFD_INACCESSIBLE) {
+		mapping = file_inode(file)->i_mapping;
+		gfp = mapping_gfp_mask(mapping);
+		gfp &= ~__GFP_MOVABLE;
+		mapping_set_gfp_mask(mapping, gfp);
+		mapping_set_unevictable(mapping);
+
+		file_seals = memfd_file_seals_ptr(file);
+		*file_seals = F_SEAL_SEAL;
 	}
 
 	fd_install(fd, file);
diff --git a/mm/shmem.c b/mm/shmem.c
index a09b29ec2b45..9b31a7056009 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1084,6 +1084,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
 			return -EPERM;
 
+		if (info->xflags & SHM_F_INACCESSIBLE) {
+			if (oldsize)
+				return -EPERM;
+			if (!PAGE_ALIGNED(newsize))
+				return -EINVAL;
+		}
+
 		if (newsize != oldsize) {
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
 					oldsize, newsize);
@@ -1331,6 +1338,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		goto redirty;
 	if (!total_swap_pages)
 		goto redirty;
+	if (info->xflags & SHM_F_INACCESSIBLE)
+		goto redirty;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -2228,6 +2237,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	if (info->xflags & SHM_F_INACCESSIBLE)
+		return -EPERM;
+
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
 
@@ -2433,6 +2445,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
 			return -EPERM;
 	}
+	if (unlikely(info->xflags & SHM_F_INACCESSIBLE))
+		return -EPERM;
 
 	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
 
@@ -2517,6 +2531,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
 			break;
+
+		/*
+		 * inode_lock protects setting up seals as well as writes to
+		 * i_size. Setting SHM_F_INACCESSIBLE is only allowed with
+		 * i_size == 0.
+		 *
+		 * Check SHM_F_INACCESSIBLE after i_size. This effectively
+		 * serializes reads vs. setting SHM_F_INACCESSIBLE without
+		 * taking inode_lock in the read path.
+		 */
+		if (SHMEM_I(inode)->xflags & SHM_F_INACCESSIBLE) {
+			error = -EPERM;
+			break;
+		}
+
 		if (index == end_index) {
 			nr = i_size & ~PAGE_MASK;
 			if (nr <= offset)
@@ -2648,6 +2677,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto out;
 		}
 
+		if ((info->xflags & SHM_F_INACCESSIBLE) &&
+		    (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) {
+			error = -EINVAL;
+			goto out;
+		}
+
 		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -4082,6 +4117,28 @@ struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned lon
 	return __shmem_file_setup(shm_mnt, name, size, flags, S_PRIVATE);
 }
 
+/**
+ * shmem_file_setup_xflags - get an unlinked file living in tmpfs with
+ *      additional xflags.
+ * @name: name for dentry (to be seen in /proc/<pid>/maps)
+ * @size: size to be set for the file
+ * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size
+ * @xflags: SHM_F_INACCESSIBLE prevents ordinary MMU access to the file content
+ */
+
+struct file *shmem_file_setup_xflags(const char *name, loff_t size,
+				     unsigned long flags, unsigned int xflags)
+{
+	struct shmem_inode_info *info;
+	struct file *res = __shmem_file_setup(shm_mnt, name, size, flags, 0);
+
+	if (!IS_ERR(res)) {
+		info = SHMEM_I(file_inode(res));
+		info->xflags = xflags & SHM_F_INACCESSIBLE;
+	}
+	return res;
+}
+
 /**
  * shmem_file_setup - get an unlinked file living in tmpfs
  * @name: name for dentry (to be seen in /proc/<pid>/maps
-- 
2.17.1




* [PATCH v5 02/13] mm: Introduce memfile_notifier
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-03-10 14:08 ` [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-29 18:45   ` Sean Christopherson
  2022-04-12 14:36   ` Hillf Danton
  2022-03-10 14:09 ` [PATCH v5 03/13] mm/shmem: Support memfile_notifier Chao Peng
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

This patch introduces the memfile_notifier facility so that existing
memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to
a third kernel component, which can make use of the memory bookmarked in
the memory file and get notified when the pages in the memory file become
allocated/invalidated.

It will be used by KVM, which uses a file descriptor as the guest memory
backing store and uses this memfile_notifier interface to interact with
the memory file subsystems. In the future there might be other consumers
(e.g. VFIO with encrypted device memory).

It consists of two sets of callbacks:
  - memfile_notifier_ops: callbacks for memory backing store to notify
    KVM when memory gets allocated/invalidated.
  - memfile_pfn_ops: callbacks for KVM to call into memory backing store
    to request memory pages for guest private memory.

Userspace is in charge of the guest memory lifecycle: it first allocates
pages in the memory backing store, then passes the fd to KVM and lets KVM
register each memory slot with the memory backing store via
memfile_register_notifier.

A supported memory backing store should maintain a memfile_notifier list
and provide a routine for memfile_notifier to get the list head address
and the memfile_pfn_ops callbacks for memfile_register_notifier. It should
also call memfile_notifier_fallocate/memfile_notifier_invalidate when the
bookmarked memory gets allocated/invalidated.
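
A rough consumer-side sketch of the registration flow, based on the APIs
declared below (the kvm_private_mem_* names are placeholders, not part of
this patch):

static void kvm_private_mem_invalidate(struct memfile_notifier *notifier,
				       pgoff_t start, pgoff_t end)
{
	/* Zap secondary MMU mappings for [start, end) here. */
}

static struct memfile_notifier_ops kvm_private_mem_notifier_ops = {
	.invalidate	= kvm_private_mem_invalidate,
};

/* 'file' is the private fd provided by userspace for a memslot. */
static int kvm_private_mem_register(struct file *file,
				    struct memfile_notifier *notifier,
				    struct memfile_pfn_ops **pfn_ops)
{
	notifier->ops = &kvm_private_mem_notifier_ops;
	return memfile_register_notifier(file_inode(file), notifier, pfn_ops);
}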

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfile_notifier.h |  64 +++++++++++++++++
 mm/Kconfig                       |   4 ++
 mm/Makefile                      |   1 +
 mm/memfile_notifier.c            | 114 +++++++++++++++++++++++++++++++
 4 files changed, 183 insertions(+)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h
new file mode 100644
index 000000000000..e8d400558adb
--- /dev/null
+++ b/include/linux/memfile_notifier.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMFILE_NOTIFIER_H
+#define _LINUX_MEMFILE_NOTIFIER_H
+
+#include <linux/rculist.h>
+#include <linux/spinlock.h>
+#include <linux/srcu.h>
+#include <linux/fs.h>
+
+struct memfile_notifier;
+
+struct memfile_notifier_ops {
+	void (*invalidate)(struct memfile_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+	void (*fallocate)(struct memfile_notifier *notifier,
+			  pgoff_t start, pgoff_t end);
+};
+
+struct memfile_pfn_ops {
+	long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
+	void (*put_unlock_pfn)(unsigned long pfn);
+};
+
+struct memfile_notifier {
+	struct list_head list;
+	struct memfile_notifier_ops *ops;
+};
+
+struct memfile_notifier_list {
+	struct list_head head;
+	spinlock_t lock;
+};
+
+struct memfile_backing_store {
+	struct list_head list;
+	struct memfile_pfn_ops pfn_ops;
+	struct memfile_notifier_list *(*get_notifier_list)(struct inode *inode);
+};
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+/* APIs for backing stores */
+static inline void memfile_notifier_list_init(struct memfile_notifier_list *list)
+{
+	INIT_LIST_HEAD(&list->head);
+	spin_lock_init(&list->lock);
+}
+
+extern void memfile_notifier_invalidate(struct memfile_notifier_list *list,
+					pgoff_t start, pgoff_t end);
+extern void memfile_notifier_fallocate(struct memfile_notifier_list *list,
+				       pgoff_t start, pgoff_t end);
+extern void memfile_register_backing_store(struct memfile_backing_store *bs);
+extern void memfile_unregister_backing_store(struct memfile_backing_store *bs);
+
+/*APIs for notifier consumers */
+extern int memfile_register_notifier(struct inode *inode,
+				     struct memfile_notifier *notifier,
+				     struct memfile_pfn_ops **pfn_ops);
+extern void memfile_unregister_notifier(struct inode *inode,
+					struct memfile_notifier *notifier);
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
+#endif /* _LINUX_MEMFILE_NOTIFIER_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..7c6b1ad3dade 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -892,6 +892,10 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+config MEMFILE_NOTIFIER
+	bool
+	select SRCU
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 70d4309c9ce3..f628256dce0d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -132,3 +132,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
+obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o
diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
new file mode 100644
index 000000000000..a405db56fde2
--- /dev/null
+++ b/mm/memfile_notifier.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/mm/memfile_notifier.c
+ *
+ *  Copyright (C) 2022  Intel Corporation.
+ *             Chao Peng <chao.p.peng@linux.intel.com>
+ */
+
+#include <linux/memfile_notifier.h>
+#include <linux/srcu.h>
+
+DEFINE_STATIC_SRCU(srcu);
+static LIST_HEAD(backing_store_list);
+
+void memfile_notifier_invalidate(struct memfile_notifier_list *list,
+				 pgoff_t start, pgoff_t end)
+{
+	struct memfile_notifier *notifier;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	list_for_each_entry_srcu(notifier, &list->head, list,
+				 srcu_read_lock_held(&srcu)) {
+		if (notifier->ops && notifier->ops->invalidate)
+			notifier->ops->invalidate(notifier, start, end);
+	}
+	srcu_read_unlock(&srcu, id);
+}
+
+void memfile_notifier_fallocate(struct memfile_notifier_list *list,
+				pgoff_t start, pgoff_t end)
+{
+	struct memfile_notifier *notifier;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	list_for_each_entry_srcu(notifier, &list->head, list,
+				 srcu_read_lock_held(&srcu)) {
+		if (notifier->ops && notifier->ops->fallocate)
+			notifier->ops->fallocate(notifier, start, end);
+	}
+	srcu_read_unlock(&srcu, id);
+}
+
+void memfile_register_backing_store(struct memfile_backing_store *bs)
+{
+	BUG_ON(!bs || !bs->get_notifier_list);
+
+	list_add_tail(&bs->list, &backing_store_list);
+}
+
+void memfile_unregister_backing_store(struct memfile_backing_store *bs)
+{
+	list_del(&bs->list);
+}
+
+static int memfile_get_notifier_info(struct inode *inode,
+				     struct memfile_notifier_list **list,
+				     struct memfile_pfn_ops **ops)
+{
+	struct memfile_backing_store *bs, *iter;
+	struct memfile_notifier_list *tmp;
+
+	list_for_each_entry_safe(bs, iter, &backing_store_list, list) {
+		tmp = bs->get_notifier_list(inode);
+		if (tmp) {
+			*list = tmp;
+			if (ops)
+				*ops = &bs->pfn_ops;
+			return 0;
+		}
+	}
+	return -EOPNOTSUPP;
+}
+
+int memfile_register_notifier(struct inode *inode,
+			      struct memfile_notifier *notifier,
+			      struct memfile_pfn_ops **pfn_ops)
+{
+	struct memfile_notifier_list *list;
+	int ret;
+
+	if (!inode || !notifier || !pfn_ops)
+		return -EINVAL;
+
+	ret = memfile_get_notifier_info(inode, &list, pfn_ops);
+	if (ret)
+		return ret;
+
+	spin_lock(&list->lock);
+	list_add_rcu(&notifier->list, &list->head);
+	spin_unlock(&list->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(memfile_register_notifier);
+
+void memfile_unregister_notifier(struct inode *inode,
+				 struct memfile_notifier *notifier)
+{
+	struct memfile_notifier_list *list;
+
+	if (!inode || !notifier)
+		return;
+
+	BUG_ON(memfile_get_notifier_info(inode, &list, NULL));
+
+	spin_lock(&list->lock);
+	list_del_rcu(&notifier->list);
+	spin_unlock(&list->lock);
+
+	synchronize_srcu(&srcu);
+}
+EXPORT_SYMBOL_GPL(memfile_unregister_notifier);
-- 
2.17.1




* [PATCH v5 03/13] mm/shmem: Support memfile_notifier
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-03-10 14:08 ` [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
  2022-03-10 14:09 ` [PATCH v5 02/13] mm: Introduce memfile_notifier Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-10 23:08   ` Dave Chinner
                     ` (2 more replies)
  2022-03-10 14:09 ` [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK Chao Peng
                   ` (11 subsequent siblings)
  14 siblings, 3 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It maintains a memfile_notifier list in the shmem_inode_info structure and
implements the memfile_pfn_ops callbacks defined by memfile_notifier. It
then exposes them to memfile_notifier via shmem_get_notifier_list, which
is registered through memfile_register_backing_store.

We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should already be
allocated by userspace for private memory. If no pages are allocated at
the offset then an error is returned so KVM knows that the memory is not
private memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/shmem_fs.h |  4 +++
 mm/shmem.c               | 76 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 2dde843f28ef..7bb16f2d2825 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
 #include <linux/percpu_counter.h>
 #include <linux/xattr.h>
 #include <linux/fs_parser.h>
+#include <linux/memfile_notifier.h>
 
 /* inode in-kernel data */
 
@@ -28,6 +29,9 @@ struct shmem_inode_info {
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
 	unsigned int		xflags;		/* shmem extended flags */
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	struct memfile_notifier_list memfile_notifiers;
+#endif
 	struct inode		vfs_inode;
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 9b31a7056009..7b43e274c9a2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	return page ? page_folio(page) : NULL;
 }
 
+static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
+#endif
+}
+
+static void notify_invalidate_page(struct inode *inode, struct folio *folio,
+				   pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	start = max(start, folio->index);
+	end = min(end, folio->index + folio_nr_pages(folio));
+
+	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
+#endif
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -946,6 +968,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			index += folio_nr_pages(folio) - 1;
 
+			notify_invalidate_page(inode, folio, start, end);
+
 			if (!unfalloc || !folio_test_uptodate(folio))
 				truncate_inode_folio(mapping, folio);
 			folio_unlock(folio);
@@ -1019,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					index--;
 					break;
 				}
+
+				notify_invalidate_page(inode, folio, start, end);
+
 				VM_BUG_ON_FOLIO(folio_test_writeback(folio),
 						folio);
 				truncate_inode_folio(mapping, folio);
@@ -2279,6 +2306,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		info->flags = flags & VM_NORESERVE;
 		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
+#ifdef CONFIG_MEMFILE_NOTIFIER
+		memfile_notifier_list_init(&info->memfile_notifiers);
+#endif
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
 		mapping_set_large_folios(inode->i_mapping);
@@ -2802,6 +2832,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
 		i_size_write(inode, offset + len);
 	inode->i_ctime = current_time(inode);
+	notify_fallocate(inode, start, end);
 undone:
 	spin_lock(&inode->i_lock);
 	inode->i_private = NULL;
@@ -3909,6 +3940,47 @@ static struct file_system_type shmem_fs_type = {
 	.fs_flags	= FS_USERNS_MOUNT,
 };
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
+	if (ret)
+		return ret;
+
+	*order = thp_order(compound_head(page));
+
+	return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	set_page_dirty(page);
+	unlock_page(page);
+	put_page(page);
+}
+
+static struct memfile_notifier_list *shmem_get_notifier_list(struct inode *inode)
+{
+	if (!shmem_mapping(inode->i_mapping))
+		return NULL;
+
+	return &SHMEM_I(inode)->memfile_notifiers;
+}
+
+static struct memfile_backing_store shmem_backing_store = {
+	.pfn_ops.get_lock_pfn = shmem_get_lock_pfn,
+	.pfn_ops.put_unlock_pfn = shmem_put_unlock_pfn,
+	.get_notifier_list = shmem_get_notifier_list,
+};
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 int __init shmem_init(void)
 {
 	int error;
@@ -3934,6 +4006,10 @@ int __init shmem_init(void)
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	memfile_register_backing_store(&shmem_backing_store);
+#endif
 	return 0;
 
 out1:
-- 
2.17.1




* [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (2 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 03/13] mm/shmem: Support memfile_notifier Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-04-07 16:05   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory Chao Peng
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

Since page migration/swapping is not supported yet, MFD_INACCESSIBLE
memory behaves like long-term pinned pages and thus should be accounted to
mm->pinned_vm and restricted by RLIMIT_MEMLOCK.
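
A minimal userspace illustration (not part of the patch): a VMM running
without CAP_IPC_LOCK may need to raise RLIMIT_MEMLOCK before
fallocate()-ing MFD_INACCESSIBLE memory, roughly like:

#include <sys/resource.h>

/*
 * Hypothetical helper: make sure the memlock limit covers 'bytes' of
 * MFD_INACCESSIBLE memory; raising rlim_max needs CAP_SYS_RESOURCE.
 */
static int raise_memlock_limit(rlim_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl))
		return -1;
	if (rl.rlim_cur >= bytes)
		return 0;
	rl.rlim_cur = bytes;
	if (rl.rlim_max < bytes)
		rl.rlim_max = bytes;
	return setrlimit(RLIMIT_MEMLOCK, &rl);
}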

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 mm/shmem.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7b43e274c9a2..ae46fb96494b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
 static void notify_invalidate_page(struct inode *inode, struct folio *folio,
 				   pgoff_t start, pgoff_t end)
 {
-#ifdef CONFIG_MEMFILE_NOTIFIER
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
 	start = max(start, folio->index);
 	end = min(end, folio->index + folio_nr_pages(folio));
 
 	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
 #endif
+
+	if (info->xflags & SHM_F_INACCESSIBLE)
+		atomic64_sub(end - start, &current->mm->pinned_vm);
 }
 
 /*
@@ -2680,6 +2683,20 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static bool memlock_limited(unsigned long npages)
+{
+	unsigned long lock_limit;
+	unsigned long pinned;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	pinned = atomic64_add_return(npages, &current->mm->pinned_vm);
+	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
+		atomic64_sub(npages, &current->mm->pinned_vm);
+		return true;
+	}
+	return false;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
@@ -2753,6 +2770,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 	}
 
+	if ((info->xflags & SHM_F_INACCESSIBLE) &&
+			memlock_limited(end - start)) {
+		error = -ENOMEM;
+		goto out;
+	}
+
 	shmem_falloc.waitq = NULL;
 	shmem_falloc.start = start;
 	shmem_falloc.next  = start;
-- 
2.17.1




* [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (3 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-28 21:27   ` Sean Christopherson
  2022-03-28 21:56   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext Chao Peng
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

Extend the memslot definition to provide fd-based private memory support
by adding two new fields (private_fd/private_offset). The memslot then
can maintain memory for both shared pages and private pages in a single
memslot. Shared pages are provided by existing userspace_addr(hva) field
and private pages are provided through the new private_fd/private_offset
fields.

Since there is no 'hva' concept for private memory, we cannot rely on
get_user_pages() to get a pfn; instead we use the newly added
memfile_notifier to do the same job.

This new extension is indicated by a new flag KVM_MEM_PRIVATE.
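
A rough userspace sketch of creating such a memslot (the vm_fd/private_fd
handles and the slot/offset values are hypothetical; the ioctl plumbing
that accepts the extended structure is added in a later patch of this
series):

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

static int set_private_memslot(int vm_fd, int private_fd, __u64 gpa,
			       __u64 size, void *shared_hva)
{
	struct kvm_userspace_memory_region_ext region_ext;

	memset(&region_ext, 0, sizeof(region_ext));
	region_ext.region.slot = 0;
	region_ext.region.flags = KVM_MEM_PRIVATE;
	region_ext.region.guest_phys_addr = gpa;
	region_ext.region.memory_size = size;
	/* The shared part of the slot still comes from an hva mapping. */
	region_ext.region.userspace_addr = (__u64)(unsigned long)shared_hva;
	/* The private part comes from the fd at the given offset. */
	region_ext.private_fd = private_fd;
	region_ext.private_offset = 0;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);
}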

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
 include/linux/kvm_host.h       |  7 +++++++
 include/uapi/linux/kvm.h       |  8 ++++++++
 3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3acbf4d263a5..f76ac598606c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 padding[5];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1353,12 +1361,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields. It also includes additional fields for some specific features. See
+the description of the flags field below for more information. It's recommended
+to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports the following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only.  In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
+  a file descriptor (fd) and the content of the private memory is invisible to
+  userspace. In this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  the guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failure to
+  do this may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9536ffa0473b..3be8116079d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -563,8 +563,15 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *private_file;
+	loff_t private_offset;
 };
 
+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 91a6fe4e02c0..a523d834efc8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,13 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 padding[5];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +117,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
-- 
2.17.1




* [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (4 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-28 22:26   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit Chao Peng
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

Use the new extended memslot structure kvm_userspace_memory_region_ext.
The extended part (private_fd/private_offset) will be copied from
userspace only when KVM_MEM_PRIVATE is set. Internally the old
kvm_userspace_memory_region will still be used in places where the
extended fields are not needed.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/x86.c       | 12 ++++++------
 include/linux/kvm_host.h |  4 ++--
 virt/kvm/kvm_main.c      | 30 ++++++++++++++++++++----------
 3 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c06b8204fca..1d9dbef67715 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11757,13 +11757,13 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_userspace_memory_region_ext m;
 
-		m.slot = id | (i << 16);
-		m.flags = 0;
-		m.guest_phys_addr = gpa;
-		m.userspace_addr = hva;
-		m.memory_size = size;
+		m.region.slot = id | (i << 16);
+		m.region.flags = 0;
+		m.region.guest_phys_addr = gpa;
+		m.region.userspace_addr = hva;
+		m.region.memory_size = size;
 		r = __kvm_set_memory_region(kvm, &m);
 		if (r < 0)
 			return ERR_PTR_USR(r);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3be8116079d4..c92c70174248 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1082,9 +1082,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+		const struct kvm_userspace_memory_region_ext *region_ext);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+		const struct kvm_userspace_memory_region_ext *region_ext);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 69c318fdff61..d11a2628b548 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1809,8 +1809,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+		const struct kvm_userspace_memory_region_ext *region_ext)
 {
+	const struct kvm_userspace_memory_region *mem = &region_ext->region;
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
 	enum kvm_mr_change change;
@@ -1913,24 +1914,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+		const struct kvm_userspace_memory_region_ext *region_ext)
 {
 	int r;
 
 	mutex_lock(&kvm->slots_lock);
-	r = __kvm_set_memory_region(kvm, mem);
+	r = __kvm_set_memory_region(kvm, region_ext);
 	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+			struct kvm_userspace_memory_region_ext *region_ext)
 {
-	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
+	if ((u16)region_ext->region.slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
 
-	return kvm_set_memory_region(kvm, mem);
+	return kvm_set_memory_region(kvm, region_ext);
 }
 
 #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
@@ -4476,14 +4477,23 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_userspace_memory_region_ext region_ext;
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&region_ext, argp,
+				sizeof(struct kvm_userspace_memory_region)))
 			goto out;
+		if (region_ext.region.flags & KVM_MEM_PRIVATE) {
+			int offset = offsetof(
+				struct kvm_userspace_memory_region_ext,
+				private_offset);
+			if (copy_from_user(&region_ext.private_offset,
+					   argp + offset,
+					   sizeof(region_ext) - offset))
+				goto out;
+		}
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &region_ext);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.17.1




* [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (5 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-28 22:33   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages Chao Peng
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at guest memory range
[gpa, gpa+size). The 'flags' field includes additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates the error happened due to a shared memory access.

After private memory is enabled, this new exit will be used by KVM to
exit to userspace for shared <-> private memory conversion in memory
encryption usages.

In such usage, there are typically two kinds of memory conversions (a
rough userspace handling sketch follows the list):
  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared); KVM then exits to userspace to
    do the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler.
    * If the fault is due to a private memory access, it causes a
      userspace exit for a shared->private conversion request when the
      page has not been allocated in the private memory backend.
    * If the fault is due to a shared memory access, it causes a
      userspace exit for a private->shared conversion request when the
      page has already been allocated in the private memory backend.
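
A rough sketch of how a VMM's run loop might handle this exit;
gpa_to_private_offset() and private_fd are hypothetical VMM helpers/state,
and the conversion itself is the fallocate()/punch-hole operation on the
private fd described in the cover letter:

static int handle_memory_error(struct kvm_run *run, int private_fd)
{
	__u64 off = gpa_to_private_offset(run->memory.gpa);

	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		/* Fault on private access: allocate the range as private. */
		return fallocate(private_fd, 0, off, run->memory.size);

	/* Fault on shared access: punch a hole to convert back to shared. */
	return fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, run->memory.size);
}

The VMM would then re-enter the guest with KVM_RUN so the previous memory
access is retried.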

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++++
 2 files changed, 31 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f76ac598606c..bad550c2212b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6216,6 +6216,28 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_ERROR */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+If exit reason is KVM_EXIT_MEMORY_ERROR then it indicates that the VCPU has
+encountered a memory error which is not handled by KVM kernel module and
+userspace may choose to handle it. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+   private memory access when the bit is set otherwise the memory error is
+   caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range the error occurs at. The userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
 		/* Fix the size of the union. */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a523d834efc8..9ad0c8aa0263 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -278,6 +278,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_MEMORY_ERROR     36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -495,6 +496,14 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_MEMORY_ERROR */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.17.1




* [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (6 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-28 23:56   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 09/13] KVM: Handle page fault for private memory Chao Peng
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

Private pages are not mmap-ed into userspace, so KVM cannot rely on
get_user_pages() to obtain the pfn. Instead we add a memfile_pfn_ops
pointer pfn_ops to each private memslot and use it to obtain the pfn for
a gfn. To do that, KVM converts the gfn to an offset into the fd and then
calls the get_lock_pfn callback. Once KVM completes its job it calls
put_unlock_pfn to unlock the pfn. Note the pfn (page) is locked between
get_lock_pfn/put_unlock_pfn to ensure the pfn stays valid while KVM uses
it to establish the mapping in the secondary MMU page table.

The pfn_ops pointer is initialized via memfile_register_notifier from the
memory backing store that provided the private_fd.
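
The intended calling pattern, as a short sketch using the helpers added
below (map_private_gfn() is a hypothetical caller; the surrounding fault
handling is omitted):

static void map_private_gfn(struct kvm_memory_slot *slot, gfn_t gfn)
{
	int order;
	long pfn = kvm_memfile_get_pfn(slot, gfn, &order);

	if (pfn < 0)
		return;		/* not allocated in the private fd */

	/*
	 * The backing page is locked here: install 'pfn' (up to 'order')
	 * into the secondary MMU page tables, then release it.
	 */
	kvm_memfile_put_pfn(slot, pfn);
}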

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/Kconfig     |  1 +
 include/linux/kvm_host.h | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..ca7b2a6a452a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,7 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select MEMFILE_NOTIFIER
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c92c70174248..6e1d770d6bf8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/memfile_notifier.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -565,6 +566,7 @@ struct kvm_memory_slot {
 	u16 as_id;
 	struct file *private_file;
 	loff_t private_offset;
+	struct memfile_pfn_ops *pfn_ops;
 };
 
 static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
@@ -915,6 +917,7 @@ static inline void kvm_irqfd_exit(void)
 {
 }
 #endif
+
 int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		  struct module *module);
 void kvm_exit(void);
@@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+				       int *order)
+{
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->private_offset >> PAGE_SHIFT);
+
+	return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
+					   index, order);
+}
+
+static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
+				       kvm_pfn_t pfn)
+{
+	slot->pfn_ops->put_unlock_pfn(pfn);
+}
+
+#else
+static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+				       int *order)
+{
+	return -1;
+}
+
+static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
+				       kvm_pfn_t pfn)
+{
+}
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 #endif
-- 
2.17.1


* [PATCH v5 09/13] KVM: Handle page fault for private memory
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (7 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-29  1:07   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 10/13] KVM: Register private memslot to memory backing store Chao Peng
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

When a page fault happens for a memslot with KVM_MEM_PRIVATE, we use
kvm_memfile_get_pfn(), which in turn calls into the memfile_pfn_ops
callbacks defined for each memslot, to request the pfn from the memory
backing store.

One assumption is that private pages are persistent and pre-allocated in
the private memory fd (backing store), so KVM uses this information as an
indicator of whether a page is private or shared (i.e. the private fd is
the final source of truth as to whether or not a GPA is private).

Depending on whether the access is private or shared, we take different
paths (a simplified sketch follows the list):
  - For private access, KVM checks whether the page is already allocated
    in the memory backing store; if so, KVM establishes the mapping,
    otherwise it exits to userspace to convert the shared page to a
    private one.

  - For shared access, KVM also checks whether the page is already
    allocated in the memory backing store; if so, it exits to userspace
    to convert the private page to a shared one, otherwise the page is
    treated as traditional hva-based shared memory and KVM lets the
    existing code obtain a pfn with get_user_pages() and establish the
    mapping.
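
A simplified sketch of that decision logic (illustrative only;
private_access, map_private(), map_shared_via_gup() and
exit_to_userspace() are stand-ins for the real code in
kvm_faultin_pfn_private() and kvm_faultin_pfn() below):

          int order;
          /* slot and fault come from the fault path */
          long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order);

          if (private_access) {
                  if (pfn >= 0)
                          map_private(fault, pfn);  /* already in the backing store */
                  else
                          exit_to_userspace(KVM_EXIT_MEMORY_ERROR,
                                            KVM_MEMORY_EXIT_FLAG_PRIVATE);
          } else {
                  if (pfn >= 0) {
                          kvm_memfile_put_pfn(slot, pfn); /* page is currently private */
                          exit_to_userspace(KVM_EXIT_MEMORY_ERROR, 0);
                  } else {
                          map_shared_via_gup(fault);      /* ordinary hva-based path */
                  }
          }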

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 73 ++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
 2 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3b8da8b0745e..f04c823ea09a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2844,6 +2844,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (kvm_slot_is_private(slot))
+		return max_level;
+
 	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
 	return min(host_level, max_level);
 }
@@ -3890,7 +3893,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
-static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
+static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	/*
+	 * At this time private gfn has not been supported yet. Other patch
+	 * that enables it should change this.
+	 */
+	return false;
+}
+
+static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				    struct kvm_page_fault *fault,
+				    bool *is_private_pfn, int *r)
+{
+	int order;
+	unsigned int flags = 0;
+	struct kvm_memory_slot *slot = fault->slot;
+	long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order);
+
+	if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
+		if (pfn < 0)
+			flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+		else {
+			fault->pfn = pfn;
+			if (slot->flags & KVM_MEM_READONLY)
+				fault->map_writable = false;
+			else
+				fault->map_writable = true;
+
+			if (order == 0)
+				fault->max_level = PG_LEVEL_4K;
+			*is_private_pfn = true;
+			*r = RET_PF_FIXED;
+			return true;
+		}
+	} else {
+		if (pfn < 0)
+			return false;
+
+		kvm_memfile_put_pfn(slot, pfn);
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
+	vcpu->run->memory.flags = flags;
+	vcpu->run->memory.padding = 0;
+	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+	vcpu->run->memory.size = PAGE_SIZE;
+	fault->pfn = -1;
+	*r = -1;
+	return true;
+}
+
+static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+			    bool *is_private_pfn, int *r)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
@@ -3924,6 +3979,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		}
 	}
 
+	if (kvm_slot_is_private(slot) &&
+	    kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r))
+		return *r == RET_PF_FIXED ? false : true;
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
 					  fault->write, &fault->map_writable,
@@ -3984,6 +4043,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
 
 	unsigned long mmu_seq;
+	bool is_private_pfn = false;
 	int r;
 
 	fault->gfn = fault->addr >> PAGE_SHIFT;
@@ -4003,7 +4063,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (kvm_faultin_pfn(vcpu, fault, &r))
+	if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
 		return r;
 
 	if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
@@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	else
 		write_lock(&vcpu->kvm->mmu_lock);
 
-	if (is_page_fault_stale(vcpu, fault, mmu_seq))
+	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
 		goto out_unlock;
 
 	r = make_mmu_pages_available(vcpu);
@@ -4033,7 +4093,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+
+	if (is_private_pfn)
+		kvm_memfile_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
+
 	return r;
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 252c77805eb9..6a5736699c0a 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	int r;
 	unsigned long mmu_seq;
 	bool is_self_change_mapping;
+	bool is_private_pfn = false;
+
 
 	pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
 	WARN_ON_ONCE(fault->is_tdp);
@@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (kvm_faultin_pfn(vcpu, fault, &r))
+	if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
 		return r;
 
 	if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
@@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
 
-	if (is_page_fault_stale(vcpu, fault, mmu_seq))
+	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
 		goto out_unlock;
 
 	r = make_mmu_pages_available(vcpu);
@@ -911,7 +913,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (is_private_pfn)
+		kvm_memfile_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
-- 
2.17.1


* [PATCH v5 10/13] KVM: Register private memslot to memory backing store
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (8 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 09/13] KVM: Handle page fault for private memory Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-29 19:01   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd Chao Peng
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

Add a 'notifier' field to the memslot to make it a memfile_notifier node
and register it with the memory backing store via
memfile_register_notifier() when the memslot gets created. When the
memslot is deleted, do the reverse with memfile_unregister_notifier().
Note that each KVM memslot can be registered to a different memory
backing store (or the same backing store at a different offset)
independently.
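
For illustration, two private memslots could even share one backing fd
at different offsets, each registering its own notifier (a sketch using
the fields added in this series; the surrounding setup is elided):

          /* slot A covers the first 1G of the fd, slot B the second 1G */
          slot_a->private_file   = file;
          slot_a->private_offset = 0;
          slot_b->private_file   = file;          /* same backing store  */
          slot_b->private_offset = 1UL << 30;     /* different offset    */

          kvm_memfile_register(slot_a);           /* registers slot_a->notifier */
          kvm_memfile_register(slot_b);           /* registers slot_b->notifier */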

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 75 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6e1d770d6bf8..9b175aeca63f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -567,6 +567,7 @@ struct kvm_memory_slot {
 	struct file *private_file;
 	loff_t private_offset;
 	struct memfile_pfn_ops *pfn_ops;
+	struct memfile_notifier notifier;
 };
 
 static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d11a2628b548..67349421eae3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -840,6 +840,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
+{
+	return memfile_register_notifier(file_inode(slot->private_file),
+					 &slot->notifier,
+					 &slot->pfn_ops);
+}
+
+static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
+{
+	if (slot->private_file) {
+		memfile_unregister_notifier(file_inode(slot->private_file),
+					    &slot->notifier);
+		fput(slot->private_file);
+		slot->private_file = NULL;
+	}
+}
+
+#else /* !CONFIG_MEMFILE_NOTIFIER */
+
+static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
+{
+}
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -884,6 +915,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE)
+		kvm_memfile_unregister(slot);
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1738,6 +1772,12 @@ static int kvm_set_memslot(struct kvm *kvm,
 		kvm_invalidate_memslot(kvm, old, invalid_slot);
 	}
 
+	if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) {
+		r = kvm_memfile_register(new);
+		if (r)
+			return r;
+	}
+
 	r = kvm_prepare_memory_region(kvm, old, new, change);
 	if (r) {
 		/*
@@ -1752,6 +1792,10 @@ static int kvm_set_memslot(struct kvm *kvm,
 		} else {
 			mutex_unlock(&kvm->slots_arch_lock);
 		}
+
+		if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE)
+			kvm_memfile_unregister(new);
+
 		return r;
 	}
 
@@ -1817,6 +1861,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	enum kvm_mr_change change;
 	unsigned long npages;
 	gfn_t base_gfn;
+	struct file *file = NULL;
 	int as_id, id;
 	int r;
 
@@ -1890,14 +1935,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			return 0;
 	}
 
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		file = fdget(region_ext->private_fd).file;
+		if (!file)
+			return -EINVAL;
+	}
+
 	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
-	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
-		return -EEXIST;
+	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) {
+		r = -EEXIST;
+		goto out;
+	}
 
 	/* Allocate a slot that will persist in the memslot. */
 	new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
-	if (!new)
-		return -ENOMEM;
+	if (!new) {
+		r = -ENOMEM;
+		goto out;
+	}
 
 	new->as_id = as_id;
 	new->id = id;
@@ -1905,10 +1960,18 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	new->private_file = file;
+	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
+			      region_ext->private_offset : 0;
 
 	r = kvm_set_memslot(kvm, old, new, change);
-	if (r)
-		kfree(new);
+	if (!r)
+		return r;
+
+	kfree(new);
+out:
+	if (file)
+		fput(file);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
-- 
2.17.1


* [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (9 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 10/13] KVM: Register private memslot to memory backing store Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-29 19:23   ` Sean Christopherson
                     ` (2 more replies)
  2022-03-10 14:09 ` [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE Chao Peng
                   ` (3 subsequent siblings)
  14 siblings, 3 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

KVM gets notified when memory pages change in the memory backing store.
When userspace allocates memory with fallocate() or frees memory with
fallocate(FALLOC_FL_PUNCH_HOLE), the memory backing store calls into
KVM's fallocate/invalidate callbacks, respectively. To ensure KVM never
maps both the private and shared variants of a GPA into the guest, in
the fallocate callback we should zap the existing shared mapping and in
the invalidate callback we should zap the existing private mapping.

In the callbacks, KVM first converts the offset range into a gfn_range
and then calls the existing kvm_unmap_gfn_range(), which will zap the
shared or private mapping. Both callbacks pass in a memslot reference,
but we also need the 'kvm' pointer, so add a reference to it in the
memslot structure.
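
For reference, the userspace operations on the private fd that trigger
these two callbacks look roughly like this (a sketch; offset/size are
whatever range userspace wants to convert):

          /* Populate a range (convert to private): triggers the fallocate
           * callback, which zaps any existing shared mapping. */
          fallocate(private_fd, 0, offset, size);

          /* Punch a hole (convert back to shared): triggers the invalidate
           * callback, which zaps the existing private mapping. */
          fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    offset, size);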

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  3 ++-
 virt/kvm/kvm_main.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9b175aeca63f..186b9b981a65 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -568,6 +568,7 @@ struct kvm_memory_slot {
 	loff_t private_offset;
 	struct memfile_pfn_ops *pfn_ops;
 	struct memfile_notifier notifier;
+	struct kvm *kvm;
 };
 
 static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67349421eae3..52319f49d58a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
 #ifdef CONFIG_MEMFILE_NOTIFIER
+static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
+					 pgoff_t start, pgoff_t end)
+{
+	int idx;
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	struct kvm_gfn_range gfn_range = {
+		.slot		= slot,
+		.start		= start - (slot->private_offset >> PAGE_SHIFT),
+		.end		= end - (slot->private_offset >> PAGE_SHIFT),
+		.may_block 	= true,
+	};
+	struct kvm *kvm = slot->kvm;
+
+	gfn_range.start = max(gfn_range.start, slot->base_gfn);
+	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
+
+	if (gfn_range.start >= gfn_range.end)
+		return;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	KVM_MMU_LOCK(kvm);
+	kvm_unmap_gfn_range(kvm, &gfn_range);
+	kvm_flush_remote_tlbs(kvm);
+	KVM_MMU_UNLOCK(kvm);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
+	.invalidate = kvm_memfile_notifier_handler,
+	.fallocate = kvm_memfile_notifier_handler,
+};
+
 static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
 {
+	slot->notifier.ops = &kvm_memfile_notifier_ops;
 	return memfile_register_notifier(file_inode(slot->private_file),
 					 &slot->notifier,
 					 &slot->pfn_ops);
@@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->private_file = file;
 	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
 			      region_ext->private_offset : 0;
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (!r)
-- 
2.17.1


* [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (10 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-29 19:13   ` Sean Christopherson
  2022-03-10 14:09 ` [PATCH v5 13/13] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

KVM_MEM_PRIVATE is not exposed by default, but architecture code can
turn it on by implementing kvm_arch_private_memory_supported().
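
For illustration, an architecture that wants the flag would override the
weak helper along these lines (a sketch; the predicate shown is a
placeholder, not something defined in this series):

  bool kvm_arch_private_memory_supported(struct kvm *kvm)
  {
          /* e.g. only for VM types whose guest memory can be protected */
          return kvm_is_protected_vm_type(kvm);  /* placeholder predicate */
  }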

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 186b9b981a65..0150e952a131 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1432,6 +1432,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_memory_supported(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 52319f49d58a..df5311755a40 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1485,10 +1485,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
+{
+	return false;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+				const struct kvm_userspace_memory_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_private_memory_supported(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1900,7 +1909,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -1913,10 +1922,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		return -EINVAL;
-	/* We can read the guest memory with __xxx_user() later on. */
 	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
-	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
-	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
+		return -EINVAL;
+	/* We can read the guest memory with __xxx_user() later on. */
+	if (!(mem->flags & KVM_MEM_PRIVATE) &&
+	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
@@ -1957,6 +1968,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
-- 
2.17.1


* [PATCH v5 13/13] memfd_create.2: Describe MFD_INACCESSIBLE flag
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (11 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE Chao Peng
@ 2022-03-10 14:09 ` Chao Peng
  2022-03-24 15:51 ` [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Quentin Perret
  2022-03-28 20:16 ` Andy Lutomirski
  14 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-10 14:09 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 man2/memfd_create.2 | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
index 89e9c4136..2698222ae 100644
--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -101,6 +101,19 @@ meaning that no other seals can be set on the file.
 .\" FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default?
 .\" Is it worth adding some text explaining this?
 .TP
+.BR MFD_INACCESSIBLE
+Disallow userspace access through ordinary MMU accesses via
+.BR read (2),
+.BR write (2)
+and
+.BR mmap (2).
+The file size cannot be changed once initialized.
+This flag cannot coexist with
+.B MFD_ALLOW_SEALING
+and when this flag is set, the initial set of seals will be
+.B F_SEAL_SEAL,
+meaning that no other seals can be set on the file.
+.TP
 .BR MFD_HUGETLB " (since Linux 4.14)"
 .\" commit 749df87bd7bee5a79cef073f5d032ddb2b211de8
 The anonymous file will be created in the hugetlbfs filesystem using
-- 
2.17.1
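
For reference, a minimal userspace sketch of the flag documented above
(assumes the kernel side of this series; the header carrying
MFD_INACCESSIBLE and its value are assumptions, error paths abbreviated):

  #define _GNU_SOURCE
  #include <sys/mman.h>      /* memfd_create() */
  #include <unistd.h>        /* ftruncate()    */

  #ifndef MFD_INACCESSIBLE
  #define MFD_INACCESSIBLE 0x0008U   /* placeholder value, not authoritative */
  #endif

  int create_guest_private_fd(size_t size)
  {
          int fd = memfd_create("guest-private-mem", MFD_INACCESSIBLE);

          if (fd < 0)
                  return -1;
          /* Set the size once; per the text above it cannot change later. */
          if (ftruncate(fd, size) < 0)
                  return -1;
          return fd;      /* hand this to KVM as a private memslot fd */
  }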


* Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
  2022-03-10 14:09 ` [PATCH v5 03/13] mm/shmem: Support memfile_notifier Chao Peng
@ 2022-03-10 23:08   ` Dave Chinner
  2022-03-11  8:42     ` Chao Peng
  2022-04-11 15:26   ` Kirill A. Shutemov
  2022-04-19 22:40   ` Vishal Annapurve
  2 siblings, 1 reply; 118+ messages in thread
From: Dave Chinner @ 2022-03-10 23:08 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022 at 10:09:01PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> It maintains a memfile_notifier list in shmem_inode_info structure and
> implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> then exposes them to memfile_notifier via
> shmem_get_memfile_notifier_info.
> 
> We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> allocated by userspace for private memory. If there is no pages
> allocated at the offset then error should be returned so KVM knows that
> the memory is not private memory.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/shmem_fs.h |  4 +++
>  mm/shmem.c               | 76 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 2dde843f28ef..7bb16f2d2825 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>  #include <linux/percpu_counter.h>
>  #include <linux/xattr.h>
>  #include <linux/fs_parser.h>
> +#include <linux/memfile_notifier.h>
>  
>  /* inode in-kernel data */
>  
> @@ -28,6 +29,9 @@ struct shmem_inode_info {
>  	struct simple_xattrs	xattrs;		/* list of xattrs */
>  	atomic_t		stop_eviction;	/* hold when working on inode */
>  	unsigned int		xflags;		/* shmem extended flags */
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +	struct memfile_notifier_list memfile_notifiers;
> +#endif
>  	struct inode		vfs_inode;
>  };
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9b31a7056009..7b43e274c9a2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>  	return page ? page_folio(page) : NULL;
>  }
>  
> +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> +#endif
> +}

*notify_populate(), not fallocate.  This is a notification that a
range has been populated, not that the fallocate() syscall was run
to populate the backing store of a file.

i.e.  fallocate is the name of a userspace filesystem API that can
be used to manipulate the backing store of a file in various ways.
It can both populate and punch away the backing store of a file, and
some operations that fallocate() can run will do both (e.g.
FALLOC_FL_ZERO_RANGE) and so could generate both
notify_invalidate() and notify_populate() events.

Hence "fallocate" as an internal mm namespace or operation does not
belong anywhere in core MM infrastructure - it should never get used
anywhere other than the VFS/filesystem layers that implement the
fallocate() syscall or use it directly.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
  2022-03-10 23:08   ` Dave Chinner
@ 2022-03-11  8:42     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-03-11  8:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Fri, Mar 11, 2022 at 10:08:22AM +1100, Dave Chinner wrote:
> On Thu, Mar 10, 2022 at 10:09:01PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > It maintains a memfile_notifier list in shmem_inode_info structure and
> > implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> > then exposes them to memfile_notifier via
> > shmem_get_memfile_notifier_info.
> > 
> > We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> > allocated by userspace for private memory. If there is no pages
> > allocated at the offset then error should be returned so KVM knows that
> > the memory is not private memory.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/shmem_fs.h |  4 +++
> >  mm/shmem.c               | 76 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 80 insertions(+)
> > 
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index 2dde843f28ef..7bb16f2d2825 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/percpu_counter.h>
> >  #include <linux/xattr.h>
> >  #include <linux/fs_parser.h>
> > +#include <linux/memfile_notifier.h>
> >  
> >  /* inode in-kernel data */
> >  
> > @@ -28,6 +29,9 @@ struct shmem_inode_info {
> >  	struct simple_xattrs	xattrs;		/* list of xattrs */
> >  	atomic_t		stop_eviction;	/* hold when working on inode */
> >  	unsigned int		xflags;		/* shmem extended flags */
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +	struct memfile_notifier_list memfile_notifiers;
> > +#endif
> >  	struct inode		vfs_inode;
> >  };
> >  
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 9b31a7056009..7b43e274c9a2 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
> >  	return page ? page_folio(page) : NULL;
> >  }
> >  
> > +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +	struct shmem_inode_info *info = SHMEM_I(inode);
> > +
> > +	memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> > +#endif
> > +}
> 
> *notify_populate(), not fallocate.  This is a notification that a
> range has been populated, not that the fallocate() syscall was run
> to populate the backing store of a file.
> 
> i.e.  fallocate is the name of a userspace filesystem API that can
> be used to manipulate the backing store of a file in various ways.
> It can both populate and punch away the backing store of a file, and
> some operations that fallocate() can run will do both (e.g.
> FALLOC_FL_ZERO_RANGE) and so could generate both
> notify_invalidate() and a notify_populate() events.

Yes, I fully agree the fallocate syscall has both populating and hole
punching semantics, so notify_fallocate can be misleading since we
actually mean populate here.

> 
> Hence "fallocate" as an internal mm namespace or operation does not
> belong anywhere in core MM infrastructure - it should never get used
> anywhere other than the VFS/filesystem layers that implement the
> fallocate() syscall or use it directly.

Will use your suggestion throughout the series where it applies. Thanks
for your suggestion.

Chao
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (12 preceding siblings ...)
  2022-03-10 14:09 ` [PATCH v5 13/13] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
@ 2022-03-24 15:51 ` Quentin Perret
  2022-03-28 17:13   ` Sean Christopherson
  2022-03-28 20:16 ` Andy Lutomirski
  14 siblings, 1 reply; 118+ messages in thread
From: Quentin Perret @ 2022-03-24 15:51 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, maz, will

Hi Chao,

+CC Will and Marc for visibility.

On Thursday 10 Mar 2022 at 22:08:58 (+0800), Chao Peng wrote:
> This is the v5 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
>  
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace fallocate/punch hole
> on the fd to notify KVM to map/unmap secondary MMU page tables.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 
> 
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file created
> with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets allocated/invalidated.
>   - memfile_pfn_ops: callbacks for KVM to call into memory backing store
>     to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets fallocated/invalidated.
> 
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so that
> both private/shared guest memory can live in one single memslot. A page in
> the memslot is either private or shared. A page is private only when it's
> already allocated in the backing store fd, all the other cases it's treated
> as shared, this includes those already mapped as shared as well as those
> having not been mapped. This means the memory backing store is the place
> which tells the truth of which page is private.
> 
> Private memory map/unmap and conversion
> ---------------------------------------
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.

I recently came across this series which is interesting for the
Protected KVM work that's currently ongoing in the Android world (see
[1], [2] or [3] for more details). The idea is similar in a number of
ways to the Intel TDX stuff (from what I understand, but I'm clearly not
understanding it all so, ...) or the Arm CCA solution, but using stage-2
MMUs instead of encryption; and leverages the caveat of the nVHE
KVM/arm64 implementation to isolate the control of stage-2 MMUs from the
host.

For Protected KVM (and I suspect most other confidential computing
solutions), guests have the ability to share some of their pages back
with the host kernel using a dedicated hypercall. This is necessary
for e.g. virtio communications, so these shared pages need to be mapped
back into the VMM's address space. I'm a bit confused about how that
would work with the approach proposed here. What is going to be the
approach for TDX?

It feels like the most 'natural' thing would be to have a KVM exit
reason describing which pages have been shared back by the guest, and to
then allow the VMM to mmap those specific pages in response in the
memfd. Is this something that has been discussed or considered?

Thanks,
Quentin

[1] https://lwn.net/Articles/836693/
[2] https://www.youtube.com/watch?v=wY-u6n75iXc
[3] https://www.youtube.com/watch?v=54q6RzS9BpQ&t=10862s



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-24 15:51 ` [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Quentin Perret
@ 2022-03-28 17:13   ` Sean Christopherson
  2022-03-28 18:00     ` Quentin Perret
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-28 17:13 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, maz, will

On Thu, Mar 24, 2022, Quentin Perret wrote:
> For Protected KVM (and I suspect most other confidential computing
> solutions), guests have the ability to share some of their pages back
> with the host kernel using a dedicated hypercall. This is necessary
> for e.g. virtio communications, so these shared pages need to be mapped
> back into the VMM's address space. I'm a bit confused about how that
> would work with the approach proposed here. What is going to be the
> approach for TDX?
> 
> It feels like the most 'natural' thing would be to have a KVM exit
> reason describing which pages have been shared back by the guest, and to
> then allow the VMM to mmap those specific pages in response in the
> memfd. Is this something that has been discussed or considered?

The proposed solution is to exit to userspace with a new exit reason, KVM_EXIT_MEMORY_ERROR,
when the guest makes the hypercall to request conversion[1].  The private fd itself
will never allow mapping memory into userspace; instead, userspace will need to punch
a hole in the private fd backing store.  The absence of a valid mapping in the private
fd is how KVM detects that a pfn is "shared" (memslots without a private fd are always
shared)[2].
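
A minimal sketch of that userspace flow (illustrative only; run is the
mmap()ed vcpu run struct, and the gpa-to-fd-offset translation helper is
an assumed piece of VMM bookkeeping, not part of the series):

          ioctl(vcpu_fd, KVM_RUN, 0);

          if (run->exit_reason == KVM_EXIT_MEMORY_ERROR) {
                  uint64_t off = gpa_to_private_fd_offset(run->memory.gpa); /* assumed helper */

                  if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
                          /* guest wants the range private: populate the private fd */
                          fallocate(private_fd, 0, off, run->memory.size);
                  else
                          /* guest wants the range shared: punch a hole in the private fd */
                          fallocate(private_fd,
                                    FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                    off, run->memory.size);
          }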

The key point is that KVM never decides to convert between shared and private; it's
always a userspace decision.  Like normal memslots, where userspace has full control
over what gfns are valid, this gives userspace full control over whether a gfn is
shared or private at any given time.

Another important detail is that this approach means the kernel and KVM treat the
shared backing store and private backing store as independent, albeit related,
entities.  This is very deliberate as it makes it easier to reason about what is
and isn't allowed/required.  E.g. the kernel only needs to handle freeing private
memory, there is no special handling for conversion to shared because no such path
exists as far as host pfns are concerned.  And userspace doesn't need any new "rules"
for protecting itself against a malicious guest, e.g. userspace already needs to
ensure that it has a valid mapping prior to accessing guest memory (or be able to
handle any resulting signals).  A malicious guest can DoS itself by instructing
userspace to communicate over memory that is currently mapped private, but there
are no new novel attack vectors from the host's perspective as coercing the host
into accessing an invalid mapping after shared=>private conversion is just a variant
of a use-after-free.

One potential conversion case that's TBD (at least, I think it is; I haven't read through
this most recent version) is how to support populating guest private memory with
non-zero data, e.g. to allow in-place conversion of the initial guest firmware instead
of having to do an extra memcpy().

[1] KVM will also exit to userspace with the same info on "implicit" conversions,
    i.e. if the guest accesses the "wrong" GPA.  Neither SEV-SNP nor TDX mandate
    explicit conversions in their guest<->host ABIs, so KVM has to support implicit
    conversions :-/

[2] Ideally (IMO), KVM would require userspace to completely remove the private memslot,
    but that's too slow due to use of SRCU in both KVM and userspace (QEMU at least uses
    SRCU for memslot changes).



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-28 17:13   ` Sean Christopherson
@ 2022-03-28 18:00     ` Quentin Perret
  2022-03-28 18:58       ` Sean Christopherson
  0 siblings, 1 reply; 118+ messages in thread
From: Quentin Perret @ 2022-03-28 18:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, maz, will

Hi Sean,

Thanks for the reply, this helps a lot.

On Monday 28 Mar 2022 at 17:13:10 (+0000), Sean Christopherson wrote:
> On Thu, Mar 24, 2022, Quentin Perret wrote:
> > For Protected KVM (and I suspect most other confidential computing
> > solutions), guests have the ability to share some of their pages back
> > with the host kernel using a dedicated hypercall. This is necessary
> > for e.g. virtio communications, so these shared pages need to be mapped
> > back into the VMM's address space. I'm a bit confused about how that
> > would work with the approach proposed here. What is going to be the
> > approach for TDX?
> > 
> > It feels like the most 'natural' thing would be to have a KVM exit
> > reason describing which pages have been shared back by the guest, and to
> > then allow the VMM to mmap those specific pages in response in the
> > memfd. Is this something that has been discussed or considered?
> 
> The proposed solution is to exit to userspace with a new exit reason, KVM_EXIT_MEMORY_ERROR,
> when the guest makes the hypercall to request conversion[1].  The private fd itself
> will never allow mapping memory into userspace, instead userspace will need to punch
> a hole in the private fd backing store.  The absense of a valid mapping in the private
> fd is how KVM detects that a pfn is "shared" (memslots without a private fd are always
> shared)[2].

Right. I'm still a bit confused about how the VMM is going to get the
shared page mapped in its page-table. Once it has punched a hole into
the private fd, how is it supposed to access the actual physical page
that the guest shared? Is there an assumption somewhere that the VMM
should have this page mapped in via an alias that it can legally access
only once it has punched a hole at the corresponding offset in the
private fd or something along those lines?

> The key point is that KVM never decides to convert between shared and private, it's
> always a userspace decision.  Like normal memslots, where userspace has full control
> over what gfns are a valid, this gives userspace full control over whether a gfn is
> shared or private at any given time.

I'm understanding this as 'the VMM is allowed to punch holes in the
private fd whenever it wants'. Is this correct? What happens if it does
so for a page that a guest hasn't shared back?

> Another important detail is that this approach means the kernel and KVM treat the
> shared backing store and private backing store as independent, albeit related,
> entities.  This is very deliberate as it makes it easier to reason about what is
> and isn't allowed/required.  E.g. the kernel only needs to handle freeing private
> memory, there is no special handling for conversion to shared because no such path
> exists as far as host pfns are concerned.  And userspace doesn't need any new "rules"
> for protecting itself against a malicious guest, e.g. userspace already needs to
> ensure that it has a valid mapping prior to accessing guest memory (or be able to
> handle any resulting signals).  A malicious guest can DoS itself by instructing
> userspace to communicate over memory that is currently mapped private, but there
> are no new novel attack vectors from the host's perspective as coercing the host
> into accessing an invalid mapping after shared=>private conversion is just a variant
> of a use-after-free.

Interesting. I was (maybe incorrectly) assuming that it would be
difficult to handle illegal host accesses w/ TDX. IOW, this would
essentially crash the host. Is this remotely correct or did I get that
wrong?

> One potential conversions that's TBD (at least, I think it is, I haven't read through
> this most recent version) is how to support populating guest private memory with
> non-zero data, e.g. to allow in-place conversion of the initial guest firmware instead
> of having to an extra memcpy().

Right. FWIW, in the pKVM case we should be pretty immune to this I
think. The initial firmware is loaded in guest memory by the hypervisor
itself (the EL2 code in arm64 speak) as the first vCPU starts running.
And that firmware can then use e.g. virtio to load the guest payload and
measure/check it. IOW, we currently don't have expectations regarding
the initial state of guest memory, but it might be handy to have support
for pre-loading the payload in the future (should save a copy as you
said).

> [1] KVM will also exit to userspace with the same info on "implicit" conversions,
>     i.e. if the guest accesses the "wrong" GPA.  Neither SEV-SNP nor TDX mandate
>     explicit conversions in their guest<->host ABIs, so KVM has to support implicit
>     conversions :-/
> 
> [2] Ideally (IMO), KVM would require userspace to completely remove the private memslot,
>     but that's too slow due to use of SRCU in both KVM and userspace (QEMU at least uses
>     SRCU for memslot changes).



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-28 18:00     ` Quentin Perret
@ 2022-03-28 18:58       ` Sean Christopherson
  2022-03-29 17:01         ` Quentin Perret
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-28 18:58 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, maz, will

On Mon, Mar 28, 2022, Quentin Perret wrote:
> Hi Sean,
> 
> Thanks for the reply, this helps a lot.
> 
> On Monday 28 Mar 2022 at 17:13:10 (+0000), Sean Christopherson wrote:
> > On Thu, Mar 24, 2022, Quentin Perret wrote:
> > > For Protected KVM (and I suspect most other confidential computing
> > > solutions), guests have the ability to share some of their pages back
> > > with the host kernel using a dedicated hypercall. This is necessary
> > > for e.g. virtio communications, so these shared pages need to be mapped
> > > back into the VMM's address space. I'm a bit confused about how that
> > > would work with the approach proposed here. What is going to be the
> > > approach for TDX?
> > > 
> > > It feels like the most 'natural' thing would be to have a KVM exit
> > > reason describing which pages have been shared back by the guest, and to
> > > then allow the VMM to mmap those specific pages in response in the
> > > memfd. Is this something that has been discussed or considered?
> > 
> > The proposed solution is to exit to userspace with a new exit reason, KVM_EXIT_MEMORY_ERROR,
> > when the guest makes the hypercall to request conversion[1].  The private fd itself
> > will never allow mapping memory into userspace, instead userspace will need to punch
> > a hole in the private fd backing store.  The absense of a valid mapping in the private
> > fd is how KVM detects that a pfn is "shared" (memslots without a private fd are always
> > shared)[2].
> 
> Right. I'm still a bit confused about how the VMM is going to get the
> shared page mapped in its page-table. Once it has punched a hole into
> the private fd, how is it supposed to access the actual physical page
> that the guest shared?

The guest doesn't share a _host_ physical page, the guest shares a _guest_ physical
page.  Until host userspace converts the gfn to shared and thus maps the gfn=>hva
via mmap(), the guest is blocked and can't read/write/exec the memory.  AFAIK, no
architecture allows in-place decryption of guest private memory.  s390 allows a
page to be "made accessible" to the host for the purposes of swap, and other
architectures will have similar behavior for migrating a protected VM, but those
scenarios are not sharing the page (and they also make the page inaccessible to
the guest).

> Is there an assumption somewhere that the VMM should have this page mapped in
> via an alias that it can legally access only once it has punched a hole at
> the corresponding offset in the private fd or something along those lines?

Yes, the VMM must have a completely separate VMA.  The VMM doesn't have to
wait until the conversion to mmap() the shared variant, though obviously it will
potentially consume double the memory if the VMM actually populates both the
private and shared backing stores.
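
Sketching that layout (illustrative only; a plain memfd is just one
possible shared backing, and sizing/error handling is elided):

          /* shared variant: ordinary mmap()able memory, used for the
           * memslot's userspace_addr */
          int shared_fd = memfd_create("guest-shared-mem", MFD_CLOEXEC);
          ftruncate(shared_fd, slot_size);
          void *shared_hva = mmap(NULL, slot_size, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, shared_fd, 0);

          /* private variant: never mmap()ed, only handed to KVM */
          int private_fd = memfd_create("guest-private-mem", MFD_INACCESSIBLE);
          ftruncate(private_fd, slot_size);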

> > The key point is that KVM never decides to convert between shared and private, it's
> > always a userspace decision.  Like normal memslots, where userspace has full control
> > over what gfns are a valid, this gives userspace full control over whether a gfn is
> > shared or private at any given time.
> 
> I'm understanding this as 'the VMM is allowed to punch holes in the
> private fd whenever it wants'. Is this correct?

From the kernel's perspective, yes, the VMM can punch holes at any time.  From a
"do I want to DoS my guest" perspective, the VMM must honor its contract with the
guest and not spuriously unmap private memory.

> What happens if it does so for a page that a guest hasn't shared back?

When the hole is punched, KVM will unmap the corresponding private SPTEs.  If the
guest is still accessing the page as private, the next access will fault and KVM
will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest is probably
hosed if the hole punch was truly spurious, as at least hardware-based protected VMs
effectively destroy data when a private page is unmapped from the guest private SPTEs.

E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario as they
will get a fault (injected by trusted hardware/firmware) saying that the guest is
trying to access an unaccepted/unvalidated page (TDX and SNP require the guest to
explicit accept all private pages that aren't part of the guest's initial pre-boot
image).

> > Another important detail is that this approach means the kernel and KVM treat the
> > shared backing store and private backing store as independent, albeit related,
> > entities.  This is very deliberate as it makes it easier to reason about what is
> > and isn't allowed/required.  E.g. the kernel only needs to handle freeing private
> > memory, there is no special handling for conversion to shared because no such path
> > exists as far as host pfns are concerned.  And userspace doesn't need any new "rules"
> > for protecting itself against a malicious guest, e.g. userspace already needs to
> > ensure that it has a valid mapping prior to accessing guest memory (or be able to
> > handle any resulting signals).  A malicious guest can DoS itself by instructing
> > userspace to communicate over memory that is currently mapped private, but there
> > are no new novel attack vectors from the host's perspective as coercing the host
> > into accessing an invalid mapping after shared=>private conversion is just a variant
> > of a use-after-free.
> 
> Interesting. I was (maybe incorrectly) assuming that it would be
> difficult to handle illegal host accesses w/ TDX. IOW, this would
> essentially crash the host. Is this remotely correct or did I get that
> wrong?

Handling illegal host kernel accesses for both TDX and SEV-SNP is extremely
difficult, bordering on impossible.  That's one of the biggest, if not _the_
biggest, motivations for the private fd approach.  On "conversion", the page that is
used to back the shared variant is a completely different, unrelated host physical
page.  Whether or not the private/shared backing page is freed is orthogonal to
what version is mapped into the guest.  E.g. if the guest converts a 4kb chunk of
a 2mb hugepage, the private backing store could keep the physical page on hole
punch (example only, I don't know if this is the actual proposed implementation).

The idea is that it'll be much, much more difficult for the host to perform an
illegal access if the actual private memory is not mapped anywhere (modulo the
kernel's direct map, which we may or may not leave intact).  The private backing
store just needs to ensure it properly sanitizes pages before freeing them.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (13 preceding siblings ...)
  2022-03-24 15:51 ` [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Quentin Perret
@ 2022-03-28 20:16 ` Andy Lutomirski
  2022-03-28 22:48   ` Nakajima, Jun
                     ` (2 more replies)
  14 siblings, 3 replies; 118+ messages in thread
From: Andy Lutomirski @ 2022-03-28 20:16 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This is the v5 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
>   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2

Can this series be run and a VM booted without TDX?  A feature like
that might help push it forward.

--Andy


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory
  2022-03-10 14:09 ` [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-03-28 21:27   ` Sean Christopherson
  2022-04-08 13:21     ` Chao Peng
  2022-03-28 21:56   ` Sean Christopherson
  1 sibling, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-28 21:27 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> Extend the memslot definition to provide fd-based private memory support
> by adding two new fields (private_fd/private_offset). The memslot can
> then maintain both shared pages and private pages in a single memslot.
> Shared pages are provided by the existing userspace_addr (hva) field and
> private pages are provided through the new private_fd/private_offset
> fields.
> 
> Since there is no 'hva' concept anymore for private memory, we cannot
> rely on get_user_pages() to get a pfn; instead we use the newly added
> memfile_notifier to do the same job.
> 
> This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>

Needs a Co-developed-by: for Yu, or a From: if Yu is the sole author.

> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
>  include/linux/kvm_host.h       |  7 +++++++
>  include/uapi/linux/kvm.h       |  8 ++++++++
>  3 files changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 3acbf4d263a5..f76ac598606c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>  
>  ::
> @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>  
> +  struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 padding[5];

Uber nit, I'd prefer we pad the u32 for private_fd separately from padding the size of
the structure for future expansion.

Regarding future expansion, any reason not to go crazy and pad like 128+ bytes?
It'd be rather embarrassing if the next memslot extension needs 3 u64s and we end
up with region_ext2 :-)
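
In other words, something along these lines (sketch of the suggestion, not
part of the posted patch):

  struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region;
	__u64 private_offset;
	__u32 private_fd;
	__u32 pad1;		/* pad the u32 private_fd */
	__u64 pad2[16];		/* room for future expansion */
  };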

> +};
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_PRIVATE		(1UL << 2)
>  
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value

...

> +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)

I 100% think we should usurp the name "private" for these memslots, but as prep
work this series should first rename KVM_PRIVATE_MEM_SLOTS to avoid confusion.
Maybe KVM_INTERNAL_MEM_SLOTS?


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory
  2022-03-10 14:09 ` [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-03-28 21:27   ` Sean Christopherson
@ 2022-03-28 21:56   ` Sean Christopherson
  2022-04-08 13:46     ` Chao Peng
  1 sibling, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-28 21:56 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> Extend the memslot definition to provide fd-based private memory support
> by adding two new fields (private_fd/private_offset). The memslot can
> then maintain both shared pages and private pages in a single memslot.
> Shared pages are provided by the existing userspace_addr (hva) field and
> private pages are provided through the new private_fd/private_offset
> fields.
> 
> Since there is no 'hva' concept anymore for private memory, we cannot
> rely on get_user_pages() to get a pfn; instead we use the newly added
> memfile_notifier to do the same job.
> 
> This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
>  include/linux/kvm_host.h       |  7 +++++++
>  include/uapi/linux/kvm.h       |  8 ++++++++
>  3 files changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 3acbf4d263a5..f76ac598606c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>  
>  ::
> @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>  
> +  struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;

Peeking ahead, the partial switch to the _ext variant is rather gross.  I would
prefer that KVM use an entirely different, but binary compatible, struct internally.
And once the kernel supports C11[*], I'm pretty sure we can make the "region" in
_ext an anonymous struct, and make KVM's internal struct a #define of _ext.  That
should minimize the churn (no need to go through the embedded "region" field), reduce
line lengths, and avoid confusion due to some flows taking the _ext but others
dealing with only the "base" struct.

Maybe kvm_user_memory_region or kvm_user_mem_region?  Though it's tempting to be
evil and usurp the old kvm_memory_region :-)

E.g. pre-C11 do

struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region;
	__u64 private_offset;
	__u32 private_fd;
	__u32 padding[5];
};

#ifdef __KERNEL__
struct kvm_user_mem_region {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size; /* bytes */
	__u64 userspace_addr; /* start of the userspace allocated memory */
	__u64 private_offset;
	__u32 private_fd;
	__u32 padding[5];
};
#endif

and then post-C11 do

struct kvm_userspace_memory_region_ext {
#ifdef __KERNEL__
	struct kvm_userspace_memory_region region;
#else
	struct kvm_userspace_memory_region;
#endif
	__u64 private_offset;
	__u32 private_fd;
	__u32 padding[5];
};

#ifdef __KERNEL__
#define kvm_user_mem_region kvm_userspace_memory_region_ext
#endif

[*] https://lore.kernel.org/all/20220301145233.3689119-1-arnd@kernel.org

> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 padding[5];
> +};


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext
  2022-03-10 14:09 ` [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext Chao Peng
@ 2022-03-28 22:26   ` Sean Christopherson
  2022-04-08 13:58     ` Chao Peng
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-28 22:26 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> @@ -4476,14 +4477,23 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  	case KVM_SET_USER_MEMORY_REGION: {
> -		struct kvm_userspace_memory_region kvm_userspace_mem;
> +		struct kvm_userspace_memory_region_ext region_ext;

It's probably a good idea to zero initialize the full region to avoid consuming
garbage stack data if there's a bug and an _ext field is accessed without first
checking KVM_MEM_PRIVATE.  I'm usually opposed to unnecessary initialization, but
this seems like something we could screw up quite easily.

>  		r = -EFAULT;
> -		if (copy_from_user(&kvm_userspace_mem, argp,
> -						sizeof(kvm_userspace_mem)))
> +		if (copy_from_user(&region_ext, argp,
> +				sizeof(struct kvm_userspace_memory_region)))
>  			goto out;
> +		if (region_ext.region.flags & KVM_MEM_PRIVATE) {
> +			int offset = offsetof(
> +				struct kvm_userspace_memory_region_ext,
> +				private_offset);
> +			if (copy_from_user(&region_ext.private_offset,
> +					   argp + offset,
> +					   sizeof(region_ext) - offset))

In this patch, KVM_MEM_PRIVATE should result in an -EINVAL as it's not yet
supported.  Copying the _ext on KVM_MEM_PRIVATE belongs in the "Expose KVM_MEM_PRIVATE"
patch.

Mechanically, what about first reading flags via get_user(), and then doing a single
copy_from_user()?  It's technically more work in the common case, and requires an
extra check to guard against TOCTOU attacks, but this isn't a fast path by any means
and IMO the end result makes it easier to understand the relationship between
KVM_MEM_PRIVATE and the two different structs.

E.g.

	case KVM_SET_USER_MEMORY_REGION: {
		struct kvm_user_mem_region region;
		unsigned long size;
		u32 flags;

		memset(&region, 0, sizeof(region));

		r = -EFAULT;
		if (get_user(flags, (u32 __user *)(argp + offsetof(typeof(region), flags))))
			goto out;

		if (flags & KVM_MEM_PRIVATE)
			size = sizeof(struct kvm_userspace_memory_region_ext);
		else
			size = sizeof(struct kvm_userspace_memory_region);
		if (copy_from_user(&region, argp, size))
			goto out;

		r = -EINVAL;
		if ((flags ^ region.flags) & KVM_MEM_PRIVATE)
			goto out;

		r = kvm_vm_ioctl_set_memory_region(kvm, &region);
		break;
	}

> +				goto out;
> +		}
>  
> -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +		r = kvm_vm_ioctl_set_memory_region(kvm, &region_ext);
>  		break;
>  	}
>  	case KVM_GET_DIRTY_LOG: {
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit
  2022-03-10 14:09 ` [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit Chao Peng
@ 2022-03-28 22:33   ` Sean Christopherson
  2022-04-08 13:59     ` Chao Peng
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-28 22:33 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates that an error happened in KVM at guest memory range [gpa, gpa+size).
> The flags field includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory', where '1'
> indicates the error happened due to a private memory access and '0'
> indicates it happened due to a shared memory access.
> 
> After private memory is enabled, this new exit will be used by KVM to
> exit to userspace for shared memory <-> private memory conversion in
> memory encryption usage.
> 
> In such usage, there are typically two kinds of memory conversion:
>   - explicit conversion: happens when the guest explicitly calls into KVM
>     to map a range (as private or shared); KVM then exits to userspace to
>     do the map/unmap operations.
>   - implicit conversion: happens in the KVM page fault handler.
>     * If the fault is due to a private memory access, it causes a
>       userspace exit for a shared->private conversion request when the
>       page has not been allocated in the private memory backend.
>     * If the fault is due to a shared memory access, it causes a
>       userspace exit for a private->shared conversion request when the
>       page has already been allocated in the private memory backend.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f76ac598606c..bad550c2212b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6216,6 +6216,28 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_ERROR */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +If exit reason is KVM_EXIT_MEMORY_ERROR then it indicates that the VCPU has

Doh, I'm pretty sure I suggested KVM_EXIT_MEMORY_ERROR.  Any objection to using
KVM_EXIT_MEMORY_FAULT instead of KVM_EXIT_MEMORY_ERROR?  "ERROR" makes me think
of ECC errors, i.e. uncorrected #MC in x86 land, not more generic "faults".  That
would align nicely with -EFAULT.

> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-28 20:16 ` Andy Lutomirski
@ 2022-03-28 22:48   ` Nakajima, Jun
  2022-03-29  0:04     ` Sean Christopherson
  2022-04-08 21:35   ` Vishal Annapurve
  2022-04-12 19:58   ` Kirill A. Shutemov
  2 siblings, 1 reply; 118+ messages in thread
From: Nakajima, Jun @ 2022-03-28 22:48 UTC (permalink / raw)
  To: Lutomirski, Andy
  Cc: Chao Peng, KVM list, LKML, Linux Memory Management List,
	linux-fsdevel, linux-api, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Christopherson,,
	Sean, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	Hansen, Dave, ak, david

> On Mar 28, 2022, at 1:16 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>> 
>> This is the v5 of this series which tries to implement the fd-based KVM
>> guest private memory. The patches are based on latest kvm/queue branch
>> commit:
>> 
>>  d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> 
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.
> 
> —Andy

Since the userspace VMM (e.g. QEMU) loses direct access to private memory of
the VM, the guest needs to avoid using the private memory for (virtual) DMA
buffers, for example. Otherwise, it would need to use bounce buffers, i.e. we
would need changes to the VM. I think we can try that (i.e. add only bounce
buffer changes). What do you think?

Thanks,
--- 
Jun



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages
  2022-03-10 14:09 ` [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages Chao Peng
@ 2022-03-28 23:56   ` Sean Christopherson
  2022-04-08 14:07     ` Chao Peng
  2022-04-28 12:37     ` Chao Peng
  0 siblings, 2 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-03-28 23:56 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> @@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
> +				       int *order)
> +{
> +	pgoff_t index = gfn - slot->base_gfn +
> +			(slot->private_offset >> PAGE_SHIFT);

This is broken for 32-bit kernels, where gfn_t is a 64-bit value but pgoff_t is a
32-bit value.  There's no reason to support this for 32-bit kernels, so...

The easiest fix, and likely most maintainable for other code too, would be to
add a dedicated CONFIG for private memory, and then have KVM check that for all
the memfile stuff.  x86 can then select it only for 64-bit kernels, and in turn
select MEMFILE_NOTIFIER iff private memory is supported.

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ca7b2a6a452a..ee9c8c155300 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,7 +48,9 @@ config KVM
        select SRCU
        select INTERVAL_TREE
        select HAVE_KVM_PM_NOTIFIER if PM
-       select MEMFILE_NOTIFIER
+       select HAVE_KVM_PRIVATE_MEM if X86_64
+       select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
+
        help
          Support hosting fully virtualized guest machines using hardware
          virtualization extensions.  You will need a fairly recent

And in addition to replacing checks on CONFIG_MEMFILE_NOTIFIER, the probing of
whether or not KVM_MEM_PRIVATE is allowed can be:

@@ -1499,23 +1499,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
        }
 }

-bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
-{
-       return false;
-}
-
 static int check_memory_region_flags(struct kvm *kvm,
                                const struct kvm_userspace_memory_region *mem)
 {
        u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

-       if (kvm_arch_private_memory_supported(kvm))
-               valid_flags |= KVM_MEM_PRIVATE;
-
 #ifdef __KVM_HAVE_READONLY_MEM
        valid_flags |= KVM_MEM_READONLY;
 #endif

+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+       valid_flags |= KVM_MEM_PRIVATE;
+#endif
+
        if (mem->flags & ~valid_flags)
                return -EINVAL;

> +
> +	return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
> +					   index, order);

In a similar vein, get_lock_pfn() shouldn't return a "long".  KVM likely won't use
these APIs on 32-bit kernels, but that may not hold true for other subsystems, and
this code is confusing and technically wrong.  The pfns for struct page squeeze
into an unsigned long because PAE support is capped at 64gb, but casting to a
signed long could result in a pfn with bit 31 set being misinterpreted as an error.

Even returning an "unsigned long" for the pfn is wrong.  It "works" for the shmem
code because shmem deals only with struct page, but it's technically wrong, especially
since one of the selling points of this approach is that it can work without struct
page.

OUT params suck, but I don't see a better option than having the return value be
0/-errno, with "pfn_t *pfn" for the resolved pfn.
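
I.e. roughly this shape (hypothetical signature, just to illustrate):

	int (*get_lock_pfn)(struct inode *inode, pgoff_t offset, pfn_t *pfn,
			    int *order);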

> +}
> +
> +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
> +				       kvm_pfn_t pfn)
> +{
> +	slot->pfn_ops->put_unlock_pfn(pfn);
> +}
> +
> +#else
> +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
> +				       int *order)
> +{

This should be a WARN_ON() as its usage should be guarded by a KVM_PRIVATE_MEM
check, and private memslots should be disallowed in this case.

Alternatively, it might be a good idea to #ifdef these out entirely and not provide
stubs.  That'd likely require a stub or two in arch code, but overall it might be
less painful in the long run, e.g. would force us to more carefully consider the
touch points for private memory.  Definitely not a requirement, just an idea.
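
E.g. if the stubs are kept, something like (sketch):

static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
				       int *order)
{
	/* Private memslots are disallowed without KVM_PRIVATE_MEM. */
	WARN_ON_ONCE(1);
	return -EOPNOTSUPP;
}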


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-28 22:48   ` Nakajima, Jun
@ 2022-03-29  0:04     ` Sean Christopherson
  0 siblings, 0 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-03-29  0:04 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Lutomirski, Andy, Chao Peng, KVM list, LKML,
	Linux Memory Management List, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, Hansen, Dave,
	ak, david

On Mon, Mar 28, 2022, Nakajima, Jun wrote:
> > On Mar 28, 2022, at 1:16 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > 
> > On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >> 
> >> This is the v5 of this series which tries to implement the fd-based KVM
> >> guest private memory. The patches are based on latest kvm/queue branch
> >> commit:
> >> 
> >>  d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> > 
> > Can this series be run and a VM booted without TDX?  A feature like
> > that might help push it forward.
> > 
> > —Andy
> 
> Since the userspace VMM (e.g. QEMU) loses direct access to private memory of
> the VM, the guest needs to avoid using the private memory for (virtual) DMA
> buffers, for example. Otherwise, it would need to use bounce buffers, i.e. we
> would need changes to the VM. I think we can try that (i.e. add only bounce
> buffer changes). What do you think?

I would love to be able to test this series and run full-blown VMs without TDX or
SEV hardware.

The other option for getting test coverage is KVM selftests, which don't have an
existing guest that needs to be enlightened.  Vishal is doing work on that front,
though I think it's still in early stages.  Long term, selftests will also be great
for negative testing.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 09/13] KVM: Handle page fault for private memory
  2022-03-10 14:09 ` [PATCH v5 09/13] KVM: Handle page fault for private memory Chao Peng
@ 2022-03-29  1:07   ` Sean Christopherson
  2022-04-12 12:10     ` Chao Peng
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-29  1:07 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> @@ -3890,7 +3893,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
>  }
>  
> -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	/*
> +	 * At this time private gfn has not been supported yet. Other patch
> +	 * that enables it should change this.
> +	 */
> +	return false;
> +}
> +
> +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +				    struct kvm_page_fault *fault,
> +				    bool *is_private_pfn, int *r)

@is_private_pfn should be a field in @fault, not a separate parameter, and it
should be a const property set by the original caller.  I would also name it
"is_private", because if KVM proceeds past this point, it will be a property of
the fault/access _and_ the pfn.

I say it's a property of the fault because the below kvm_vcpu_is_private_gfn()
should instead be:

	if (fault->is_private)

The kvm_vcpu_is_private_gfn() check is TDX centric.  For SNP, private vs. shared
is communicated via error code.  For software-only (I'm being optimistic ;-) ),
we'd probably need to track private vs. shared internally in KVM, I don't think
we'd want to force it to be a property of the gfn.

Then you can also move the fault->is_private waiver into is_page_fault_stale(),
and drop the local is_private_pfn in direct_page_fault().

> +{
> +	int order;
> +	unsigned int flags = 0;
> +	struct kvm_memory_slot *slot = fault->slot;
> +	long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order);

If get_lock_pfn() and thus kvm_memfile_get_pfn() returns a pure error code instead
of multiplexing the pfn, then this can be:

	bool is_private_pfn;

	is_private_pfn = !!kvm_memfile_get_pfn(slot, fault->gfn, &fault->pfn, &order);

That self-documents the "pfn < 0" == shared logic.

> +
> +	if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
> +		if (pfn < 0)
> +			flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +		else {
> +			fault->pfn = pfn;
> +			if (slot->flags & KVM_MEM_READONLY)
> +				fault->map_writable = false;
> +			else
> +				fault->map_writable = true;
> +
> +			if (order == 0)
> +				fault->max_level = PG_LEVEL_4K;

This doesn't correctly handle an order > 0 that is less than the next page size, in
which case max_level needs to be PG_LEVEL_4K.  It also doesn't handle the case where
max_level > PG_LEVEL_2M.

That said, I think the proper fix is to have the get_lock_pfn() API return the max
mapping level, not the order.  KVM, and presumably any other secondary MMU that might
use these APIs, doesn't care about the order of the struct page, KVM cares about the
max size/level of page it can map into the guest.  And similar to the previous patch,
"order" is specific to struct page, which we are trying to avoid.

> +			*is_private_pfn = true;

This is where KVM guarantees that is_private_pfn == fault->is_private.

> +			*r = RET_PF_FIXED;
> +			return true;

Ewww.  This is super confusing.  Ditto for the "*r = -1" magic number.  I totally
understand why you took this approach, it's just hard to follow because it kinda
follows the kvm_faultin_pfn() semantics, but then inverts true and false in this
one case.

I think the least awful option is to forego the helper and open code everything.
If we ever refactor kvm_faultin_pfn() to be less weird then we can maybe move this
to a helper.

Open coding isn't too bad if you reorganize things so that the exit-to-userspace
path is a dedicated, early check.  IMO, it's a lot easier to read this way, open
coded or not.

I think this is correct?  "is_private_pfn" and "level" are locals, everything else
is in @fault.

	if (kvm_slot_is_private(slot)) {
		is_private_pfn = !!kvm_memfile_get_pfn(slot, fault->gfn,
						       &fault->pfn, &level);

		if (fault->is_private != is_private_pfn) {
			if (is_private_pfn)
				kvm_memfile_put_pfn(slot, fault->pfn);

			vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
			if (fault->is_private)
				vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
			else
				vcpu->run->memory.flags = 0;
			vcpu->run->memory.padding = 0;
			vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
			vcpu->run->memory.size = PAGE_SIZE;
			*r = 0;
			return true;
		}

		/*
		 * fault->pfn is all set if the fault is for a private pfn, just
		 * need to update other metadata.
		 */
		if (fault->is_private) {
			fault->max_level = min(fault->max_level, level);
			fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
			return false;
		}

		/* Fault is shared, fallthrough to the standard path. */
	}

	async = false;

> @@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	else
>  		write_lock(&vcpu->kvm->mmu_lock);
>  
> -	if (is_page_fault_stale(vcpu, fault, mmu_seq))
> +	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))

As above, I'd prefer this check go in is_page_fault_stale().  It means shadow MMUs
will suffer a pointless check, but I don't think that's a big issue.  Oooh, unless
we support software-only, which would play nice with nested and probably even legacy
shadow paging.  Fun :-)

>  		goto out_unlock;
>  
>  	r = make_mmu_pages_available(vcpu);
> @@ -4033,7 +4093,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  	else
>  		write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +
> +	if (is_private_pfn)

And this can be

	if (fault->is_private)

Same feedback for paging_tmpl.h.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-28 18:58       ` Sean Christopherson
@ 2022-03-29 17:01         ` Quentin Perret
  2022-03-30  8:58           ` Steven Price
  0 siblings, 1 reply; 118+ messages in thread
From: Quentin Perret @ 2022-03-29 17:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, maz, will

On Monday 28 Mar 2022 at 18:58:35 (+0000), Sean Christopherson wrote:
> On Mon, Mar 28, 2022, Quentin Perret wrote:
> > Hi Sean,
> > 
> > Thanks for the reply, this helps a lot.
> > 
> > On Monday 28 Mar 2022 at 17:13:10 (+0000), Sean Christopherson wrote:
> > > On Thu, Mar 24, 2022, Quentin Perret wrote:
> > > > For Protected KVM (and I suspect most other confidential computing
> > > > solutions), guests have the ability to share some of their pages back
> > > > with the host kernel using a dedicated hypercall. This is necessary
> > > > for e.g. virtio communications, so these shared pages need to be mapped
> > > > back into the VMM's address space. I'm a bit confused about how that
> > > > would work with the approach proposed here. What is going to be the
> > > > approach for TDX?
> > > > 
> > > > It feels like the most 'natural' thing would be to have a KVM exit
> > > > reason describing which pages have been shared back by the guest, and to
> > > > then allow the VMM to mmap those specific pages in response in the
> > > > memfd. Is this something that has been discussed or considered?
> > > 
> > > The proposed solution is to exit to userspace with a new exit reason, KVM_EXIT_MEMORY_ERROR,
> > > when the guest makes the hypercall to request conversion[1].  The private fd itself
> > > will never allow mapping memory into userspace, instead userspace will need to punch
> > > a hole in the private fd backing store.  The absense of a valid mapping in the private
> > > fd is how KVM detects that a pfn is "shared" (memslots without a private fd are always
> > > shared)[2].
> > 
> > Right. I'm still a bit confused about how the VMM is going to get the
> > shared page mapped in its page-table. Once it has punched a hole into
> > the private fd, how is it supposed to access the actual physical page
> > that the guest shared?
> 
> The guest doesn't share a _host_ physical page, the guest shares a _guest_ physical
> page.  Until host userspace converts the gfn to shared and thus maps the gfn=>hva
> via mmap(), the guest is blocked and can't read/write/exec the memory.  AFAIK, no
> architecture allows in-place decryption of guest private memory.  s390 allows a
> page to be "made accessible" to the host for the purposes of swap, and other
> architectures will have similar behavior for migrating a protected VM, but those
> scenarios are not sharing the page (and they also make the page inaccessible to
> the guest).

I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
doing in-place sharing, which also means the content of the page can be
retained as part of the conversion.

Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
I _believe_ it should be technically possible to do in-place sharing for
them too.

> > Is there an assumption somewhere that the VMM should have this page mapped in
> > via an alias that it can legally access only once it has punched a hole at
> > the corresponding offset in the private fd or something along those lines?
> 
> Yes, the VMM must have a completely separate VMA.  The VMM doesn't have to
> wait until the conversion to mmap() the shared variant, though obviously it will
> potentially consume double the memory if the VMM actually populates both the
> private and shared backing stores.

Gotcha. This is what confused me I think -- in this approach private and
shared pages are in fact entirely different.

In which scenario could you end up with both the private and shared
pages live at the same time? Would this be something like follows?

 - userspace creates a private fd, fallocates into it, and associates
   the <fd, offset, size> tuple with a private memslot;

 - userspace then mmaps anonymous memory (for ex.), and associates it
   with a standard memslot, which happens to be positioned at exactly
   the right offset w.r.t. the private memslot (with this offset
   defined by the bit that is set for the private addresses in the gpa
   space);

 - the guest runs, and accesses both 'aliases' of the page without doing
   an explicit share hypercall.

Is there another option?

Is implicit sharing a thing? E.g., if a guest makes a memory access in
the shared gpa range at an address that doesn't have a backing memslot,
will KVM check whether there is a corresponding private memslot at the
right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
would that just generate an MMIO exit as usual?

> > > The key point is that KVM never decides to convert between shared and private, it's
> > > always a userspace decision.  Like normal memslots, where userspace has full control
> > > over which gfns are valid, this gives userspace full control over whether a gfn is
> > > shared or private at any given time.
> > 
> > I'm understanding this as 'the VMM is allowed to punch holes in the
> > private fd whenever it wants'. Is this correct?
> 
> From the kernel's perspective, yes, the VMM can punch holes at any time.  From a
> "do I want to DoS my guest" perspective, the VMM must honor its contract with the
> guest and not spuriously unmap private memory.
> 
> > What happens if it does so for a page that a guest hasn't shared back?
> 
> When the hole is punched, KVM will unmap the corresponding private SPTEs.  If the
> guest is still accessing the page as private, the next access will fault and KVM
> will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest is probably
> hosed if the hole punch was truly spurious, as at least hardware-based protected VMs
> effectively destroy data when a private page is unmapped from the guest private SPTEs.
>
> E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario as they
> will get a fault (injected by trusted hardware/firmware) saying that the guest is
> trying to access an unaccepted/unvalidated page (TDX and SNP require the guest to
> explicitly accept all private pages that aren't part of the guest's initial pre-boot
> image).

I suppose this is necessary to prevent the VMM from re-fallocating into
a hole it previously punched and re-entering the guest without
notifying it?

> > > Another important detail is that this approach means the kernel and KVM treat the
> > > shared backing store and private backing store as independent, albeit related,
> > > entities.  This is very deliberate as it makes it easier to reason about what is
> > > and isn't allowed/required.  E.g. the kernel only needs to handle freeing private
> > > memory, there is no special handling for conversion to shared because no such path
> > > exists as far as host pfns are concerned.  And userspace doesn't need any new "rules"
> > > for protecting itself against a malicious guest, e.g. userspace already needs to
> > > ensure that it has a valid mapping prior to accessing guest memory (or be able to
> > > handle any resulting signals).  A malicious guest can DoS itself by instructing
> > > userspace to communicate over memory that is currently mapped private, but there
> > > are no new novel attack vectors from the host's perspective as coercing the host
> > > into accessing an invalid mapping after shared=>private conversion is just a variant
> > > of a use-after-free.
> > 
> > Interesting. I was (maybe incorrectly) assuming that it would be
> > difficult to handle illegal host accesses w/ TDX. IOW, this would
> > essentially crash the host. Is this remotely correct or did I get that
> > wrong?
> 
> Handling illegal host kernel accesses for both TDX and SEV-SNP is extremely
> difficult, bordering on impossible.  That's one of the biggest, if not _the_
> biggest, motivations for the private fd approach.  On "conversion", the page that is
> used to back the shared variant is a completely different, unrelated host physical
> page.  Whether or not the private/shared backing page is freed is orthogonal to
> what version is mapped into the guest.  E.g. if the guest converts a 4kb chunk of
> a 2mb hugepage, the private backing store could keep the physical page on hole
> punch (example only, I don't know if this is the actual proposed implementation).
> 
> The idea is that it'll be much, much more difficult for the host to perform an
> illegal access if the actual private memory is not mapped anywhere (modulo the
> kernel's direct map, which we may or may not leave intact).  The private backing
> store just needs to ensure it properly sanitizes pages before freeing them.

Understood.

I'm overall inclined to think that while this abstraction works nicely
for TDX and the likes, it might not suit pKVM all that well in the
current form, but it's close.

What do you think of extending the model proposed here to also address
the needs of implementations that support in-place sharing? One option
would be to have KVM notify the private-fd backing store when a page is
shared back by a guest, which would then allow host userspace to mmap
that particular page in the private fd instead of punching a hole.

This should retain the main property you're after: private pages that
are actually mapped in the guest SPTE aren't mmap-able, but all the
others are fair game.
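
For illustration only, one hypothetical shape for that notification (the
callback name and signature below are made up, not something in this series):

	/* Hypothetical addition to memfile_pfn_ops, called by KVM when the
	 * guest shares [start, end) back, so that the backing store can
	 * start allowing mmap() of that range instead of requiring a hole
	 * punch.
	 */
	void (*page_shared)(struct inode *inode, pgoff_t start, pgoff_t end);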

Thoughts?


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 02/13] mm: Introduce memfile_notifier
  2022-03-10 14:09 ` [PATCH v5 02/13] mm: Introduce memfile_notifier Chao Peng
@ 2022-03-29 18:45   ` Sean Christopherson
  2022-04-08 12:54     ` Chao Peng
  2022-04-12 14:36   ` Hillf Danton
  1 sibling, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-29 18:45 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> diff --git a/mm/Makefile b/mm/Makefile
> index 70d4309c9ce3..f628256dce0d 100644
> +void memfile_notifier_invalidate(struct memfile_notifier_list *list,
> +				 pgoff_t start, pgoff_t end)
> +{
> +	struct memfile_notifier *notifier;
> +	int id;
> +
> +	id = srcu_read_lock(&srcu);
> +	list_for_each_entry_srcu(notifier, &list->head, list,
> +				 srcu_read_lock_held(&srcu)) {
> +		if (notifier->ops && notifier->ops->invalidate)

Any reason notifier->ops isn't mandatory?

> +			notifier->ops->invalidate(notifier, start, end);
> +	}
> +	srcu_read_unlock(&srcu, id);
> +}
> +
> +void memfile_notifier_fallocate(struct memfile_notifier_list *list,
> +				pgoff_t start, pgoff_t end)
> +{
> +	struct memfile_notifier *notifier;
> +	int id;
> +
> +	id = srcu_read_lock(&srcu);
> +	list_for_each_entry_srcu(notifier, &list->head, list,
> +				 srcu_read_lock_held(&srcu)) {
> +		if (notifier->ops && notifier->ops->fallocate)
> +			notifier->ops->fallocate(notifier, start, end);
> +	}
> +	srcu_read_unlock(&srcu, id);
> +}
> +
> +void memfile_register_backing_store(struct memfile_backing_store *bs)
> +{
> +	BUG_ON(!bs || !bs->get_notifier_list);
> +
> +	list_add_tail(&bs->list, &backing_store_list);
> +}
> +
> +void memfile_unregister_backing_store(struct memfile_backing_store *bs)
> +{
> +	list_del(&bs->list);

Allowing unregistration of a backing store is broken.  Using the _safe() variant
is not sufficient to guard against concurrent modification.  I don't see any reason
to support this out of the gate; the only reason to support unregistering a backing
store is if the backing store is implemented as a module, and AFAIK none of the
backing stores we plan on supporting initially support being built as a module.
These aren't exported, so it's not like that's even possible.  Registration would
also be broken if modules are allowed; I'm pretty sure module init doesn't run
under a global lock.

We can always add this complexity if it's needed in the future, but for now the
easiest thing would be to tag memfile_register_backing_store() with __init and
make backing_store_list __ro_after_init.
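
I.e. something like (sketch of the suggestion):

static struct list_head backing_store_list __ro_after_init =
	LIST_HEAD_INIT(backing_store_list);

void __init memfile_register_backing_store(struct memfile_backing_store *bs)
{
	/* Registration only happens during boot, before rodata is sealed. */
	list_add_tail(&bs->list, &backing_store_list);
}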

> +}
> +
> +static int memfile_get_notifier_info(struct inode *inode,
> +				     struct memfile_notifier_list **list,
> +				     struct memfile_pfn_ops **ops)
> +{
> +	struct memfile_backing_store *bs, *iter;
> +	struct memfile_notifier_list *tmp;
> +
> +	list_for_each_entry_safe(bs, iter, &backing_store_list, list) {
> +		tmp = bs->get_notifier_list(inode);
> +		if (tmp) {
> +			*list = tmp;
> +			if (ops)
> +				*ops = &bs->pfn_ops;
> +			return 0;
> +		}
> +	}
> +	return -EOPNOTSUPP;
> +}
> +
> +int memfile_register_notifier(struct inode *inode,

Taking an inode is a bit odd from a user perspective.  Any reason not to take a
"struct file *" and get the inode here?  That would give callers a hint that they
need to hold a reference to the file for the lifetime of the registration.

> +			      struct memfile_notifier *notifier,
> +			      struct memfile_pfn_ops **pfn_ops)
> +{
> +	struct memfile_notifier_list *list;
> +	int ret;
> +
> +	if (!inode || !notifier | !pfn_ops)

Bitwise | instead of logical ||.  But IMO taking in a pfn_ops pointer is silly.
More below.

> +		return -EINVAL;
> +
> +	ret = memfile_get_notifier_info(inode, &list, pfn_ops);
> +	if (ret)
> +		return ret;
> +
> +	spin_lock(&list->lock);
> +	list_add_rcu(&notifier->list, &list->head);
> +	spin_unlock(&list->lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(memfile_register_notifier);
> +
> +void memfile_unregister_notifier(struct inode *inode,
> +				 struct memfile_notifier *notifier)
> +{
> +	struct memfile_notifier_list *list;
> +
> +	if (!inode || !notifier)
> +		return;
> +
> +	BUG_ON(memfile_get_notifier_info(inode, &list, NULL));

Eww.  Rather than force the caller to provide the inode/file and the notifier,
what about grabbing the backing store itself in the notifier?

	struct memfile_notifier {
		struct list_head list;
		struct memfile_notifier_ops *ops;

		struct memfile_backing_store *bs;
	};

That also helps avoid confusion between "ops" and "pfn_ops".  IMO, exposing
memfile_backing_store to the caller isn't a big deal, and is preferable to having
to rewalk multiple lists just to delete a notifier.

Then this can become:

  void memfile_unregister_notifier(struct memfile_notifier *notifier)
  {
	spin_lock(&notifier->bs->list->lock);
	list_del_rcu(&notifier->list);
	spin_unlock(&notifier->bs->list->lock);

	synchronize_srcu(&srcu);
  }

and registration can be:

  int memfile_register_notifier(const struct file *file,
			      struct memfile_notifier *notifier)
  {
	struct memfile_notifier_list *list;
	struct memfile_backing_store *bs;
	int ret;

	if (!file || !notifier)
		return -EINVAL;

	list_for_each_entry(bs, &backing_store_list, list) {
		list = bs->get_notifier_list(file_inode(file));
		if (list) {
			notifier->bs = bs;

			spin_lock(&list->lock);
			list_add_rcu(&notifier->list, &list->head);
			spin_unlock(&list->lock);
			return 0;
		}
	}

	return -EOPNOTSUPP;
  }


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 10/13] KVM: Register private memslot to memory backing store
  2022-03-10 14:09 ` [PATCH v5 10/13] KVM: Register private memslot to memory backing store Chao Peng
@ 2022-03-29 19:01   ` Sean Christopherson
  2022-04-12 12:40     ` Chao Peng
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-29 19:01 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> Add 'notifier' to memslot to make it a memfile_notifier node and then
> register it to memory backing store via memfile_register_notifier() when
> memslot gets created. When memslot is deleted, do the reverse with
> memfile_unregister_notifier(). Note each KVM memslot can be registered
> to different memory backing stores (or the same backing store but at
> different offset) independently.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/kvm_main.c      | 75 ++++++++++++++++++++++++++++++++++++----
>  2 files changed, 70 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6e1d770d6bf8..9b175aeca63f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -567,6 +567,7 @@ struct kvm_memory_slot {
>  	struct file *private_file;
>  	loff_t private_offset;
>  	struct memfile_pfn_ops *pfn_ops;
> +	struct memfile_notifier notifier;
>  };
>  
>  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d11a2628b548..67349421eae3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -840,6 +840,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>  
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static inline int kvm_memfile_register(struct kvm_memory_slot *slot)

This is a good opportunity to hide away the memfile details a bit.  Maybe
kvm_private_mem_{,un}register()?

> +{
> +	return memfile_register_notifier(file_inode(slot->private_file),
> +					 &slot->notifier,
> +					 &slot->pfn_ops);
> +}
> +
> +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
> +{
> +	if (slot->private_file) {
> +		memfile_unregister_notifier(file_inode(slot->private_file),
> +					    &slot->notifier);
> +		fput(slot->private_file);

This should not do fput(); it makes the helper imbalanced with respect to the
register path and will likely lead to a double fput().  Indeed, if preparing the
region fails, __kvm_set_memory_region() will double up on fput() due to checking
its local "file" for null, not slot->private_file for null.

> +		slot->private_file = NULL;
> +	}
> +}
> +
> +#else /* !CONFIG_MEMFILE_NOTIFIER */
> +
> +static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
> +{

This should WARN_ON_ONCE().  Ditto for unregister.

> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
> +{
> +}
> +
> +#endif /* CONFIG_MEMFILE_NOTIFIER */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>  				unsigned long state,
> @@ -884,6 +915,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>  /* This does not remove the slot from struct kvm_memslots data structures */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
> +	if (slot->flags & KVM_MEM_PRIVATE)
> +		kvm_memfile_unregister(slot);

With fput() move out of unregister, this needs to be:

	if (slot->flags & KVM_MEM_PRIVATE) {
		kvm_private_mem_unregister(slot);
		fput(slot->private_file);
	}
> +
>  	kvm_destroy_dirty_bitmap(slot);
>  
>  	kvm_arch_free_memslot(kvm, slot);
> @@ -1738,6 +1772,12 @@ static int kvm_set_memslot(struct kvm *kvm,
>  		kvm_invalidate_memslot(kvm, old, invalid_slot);
>  	}
>  
> +	if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) {
> +		r = kvm_memfile_register(new);
> +		if (r)
> +			return r;
> +	}

This belongs in kvm_prepare_memory_region().  The shenanigans for DELETE and MOVE
are special.

> +
>  	r = kvm_prepare_memory_region(kvm, old, new, change);
>  	if (r) {
>  		/*
> @@ -1752,6 +1792,10 @@ static int kvm_set_memslot(struct kvm *kvm,
>  		} else {
>  			mutex_unlock(&kvm->slots_arch_lock);
>  		}
> +
> +		if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE)
> +			kvm_memfile_unregister(new);
> +
>  		return r;
>  	}
>  
> @@ -1817,6 +1861,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	enum kvm_mr_change change;
>  	unsigned long npages;
>  	gfn_t base_gfn;
> +	struct file *file = NULL;

Nit, naming this private_file would help clarify its use.  Though I think it's
easier to not have a local variable.  More below.

>  	int as_id, id;
>  	int r;
>  
> @@ -1890,14 +1935,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  			return 0;
>  	}
>  
> +	if (mem->flags & KVM_MEM_PRIVATE) {
> +		file = fdget(region_ext->private_fd).file;

This can use fget() instead of fdget().

> +		if (!file)
> +			return -EINVAL;
> +	}
> +
>  	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
> -	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
> -		return -EEXIST;
> +	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) {
> +		r = -EEXIST;
> +		goto out;
> +	}
>  
>  	/* Allocate a slot that will persist in the memslot. */
>  	new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
> -	if (!new)
> -		return -ENOMEM;
> +	if (!new) {
> +		r = -ENOMEM;
> +		goto out;
> +	}
>  
>  	new->as_id = as_id;
>  	new->id = id;
> @@ -1905,10 +1960,18 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	new->npages = npages;
>  	new->flags = mem->flags;
>  	new->userspace_addr = mem->userspace_addr;
> +	new->private_file = file;
> +	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
> +			      region_ext->private_offset : 0;

"new" is zero-allocated, so all the private stuff, including the fget(), can be
wrapped in a single KVM_MEM_PRIVATE check.  Moving fget() reduces the number
of gotos needed (the above -EEXIST and -ENOMEM paths don't need to be modified).

>  	r = kvm_set_memslot(kvm, old, new, change);
> -	if (r)
> -		kfree(new);
> +	if (!r)
> +		return r;

Use goto, e.g.

	if (r)
		goto out;

	return 0;

Burying the happy path in a taken if-statement is confusing and error prone,
mostly because it breaks well-established kernel patterns.  Note, there's no need
for a separate out_free since new->private_file will be NULL in either case.  I
don't have a strong preference; I just find it easier to read code that's more
explicit, but I'm a-ok collapsing them into a single label.

	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
		return -EEXIST;

	/* Allocate a slot that will persist in the memslot. */
	new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
	if (!new)
		return -ENOMEM;

	new->as_id = as_id;
	new->id = id;
	new->base_gfn = base_gfn;
	new->npages = npages;
	new->flags = mem->flags;
	new->userspace_addr = mem->userspace_addr;

	if (mem->flags & KVM_MEM_PRIVATE) {
		new->private_file = fget(mem->private_fd);
		if (!new->private_file) {
			r = -EINVAL;
			goto out_free;
		}
		new->private_offset = mem->private_offset;
	}

	r = kvm_set_memslot(kvm, old, new, change);
	if (r)
		goto out;

	return 0;

out:
	if (new->private_file)
		fput(new->private_file);

out_free:
	kfree(new);
	return r;


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE
  2022-03-10 14:09 ` [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE Chao Peng
@ 2022-03-29 19:13   ` Sean Christopherson
  2022-04-12 12:56     ` Chao Peng
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-29 19:13 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> KVM_MEM_PRIVATE is not exposed by default but architecture code can turn
> it on by implementing kvm_arch_private_memory_supported().
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/kvm_main.c      | 24 +++++++++++++++++++-----
>  2 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 186b9b981a65..0150e952a131 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1432,6 +1432,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_private_memory_supported(struct kvm *kvm);
>  
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 52319f49d58a..df5311755a40 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1485,10 +1485,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
>  	}
>  }
>  
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
> +{
> +	return false;
> +}
> +
> +static int check_memory_region_flags(struct kvm *kvm,
> +				const struct kvm_userspace_memory_region *mem)
>  {
>  	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>  
> +	if (kvm_arch_private_memory_supported(kvm))
> +		valid_flags |= KVM_MEM_PRIVATE;
> +
>  #ifdef __KVM_HAVE_READONLY_MEM
>  	valid_flags |= KVM_MEM_READONLY;
>  #endif
> @@ -1900,7 +1909,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	int as_id, id;
>  	int r;
>  
> -	r = check_memory_region_flags(mem);
> +	r = check_memory_region_flags(kvm, mem);
>  	if (r)
>  		return r;
>  
> @@ -1913,10 +1922,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  		return -EINVAL;
>  	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
>  		return -EINVAL;
> -	/* We can read the guest memory with __xxx_user() later on. */
>  	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> -	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> -	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> +	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
> +		return -EINVAL;
> +	/* We can read the guest memory with __xxx_user() later on. */
> +	if (!(mem->flags & KVM_MEM_PRIVATE) &&
> +	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,

This should sanity check private_offset for private memslots.  At a bare minimum,
wrapping should be disallowed.
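
E.g. (sketch only; the alignment requirement is an extra suggestion on top of the
bare-minimum wrapping check, and the field names follow region_ext in this series):

	if (mem->flags & KVM_MEM_PRIVATE) {
		if (region_ext->private_offset & (PAGE_SIZE - 1))
			return -EINVAL;

		/* Disallow offset + size wrapping around. */
		if (region_ext->private_offset + mem->memory_size <
		    region_ext->private_offset)
			return -EINVAL;
	}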

>  			mem->memory_size))
>  		return -EINVAL;
>  	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> @@ -1957,6 +1968,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>  			return -EINVAL;
>  	} else { /* Modify an existing slot. */
> +		/* Private memslots are immutable, they can only be deleted. */
> +		if (mem->flags & KVM_MEM_PRIVATE)
> +			return -EINVAL;

These sanity checks belong in "KVM: Register private memslot to memory backing store",
e.g. that patch is "broken" without the immutability restriction.  It's somewhat moot
because the code is unreachable, but it makes reviewing confusing/difficult.

But rather than move the sanity checks back, I think I'd prefer to pull all of patch 10
here.  I think it also makes sense to drop "KVM: Use memfile_pfn_ops to obtain pfn for
private pages" and add the pointer in "struct kvm_memory_slot" in patch "KVM: Extend the
memslot to support fd-based private memory", with the use of the ops folded into
"KVM: Handle page fault for private memory".  Adding code to KVM and KVM-x86 in a single
patch is ok, and overall makes things easier to review because the new helpers have a
user right away, especially since there will be #ifdeffery.

I.e. end up with something like:

  mm: Introduce memfile_notifier
  mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  KVM: Extend the memslot to support fd-based private memory
  KVM: Use kvm_userspace_memory_region_ext
  KVM: Add KVM_EXIT_MEMORY_ERROR exit
  KVM: Handle page fault for private memory
  KVM: Register private memslot to memory backing store
  KVM: Zap existing KVM mappings when pages changed in the private fd
  KVM: Enable and expose KVM_MEM_PRIVATE

>  		if ((mem->userspace_addr != old->userspace_addr) ||
>  		    (npages != old->npages) ||
>  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
  2022-03-10 14:09 ` [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd Chao Peng
@ 2022-03-29 19:23   ` Sean Christopherson
  2022-04-12 12:43     ` Chao Peng
  2022-04-05 23:45   ` Michael Roth
  2022-04-19 22:43   ` Vishal Annapurve
  2 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-29 19:23 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 67349421eae3..52319f49d58a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>  
>  #ifdef CONFIG_MEMFILE_NOTIFIER
> +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> +					 pgoff_t start, pgoff_t end)
> +{
> +	int idx;
> +	struct kvm_memory_slot *slot = container_of(notifier,
> +						    struct kvm_memory_slot,
> +						    notifier);
> +	struct kvm_gfn_range gfn_range = {
> +		.slot		= slot,
> +		.start		= start - (slot->private_offset >> PAGE_SHIFT),
> +		.end		= end - (slot->private_offset >> PAGE_SHIFT),
> +		.may_block 	= true,
> +	};
> +	struct kvm *kvm = slot->kvm;
> +
> +	gfn_range.start = max(gfn_range.start, slot->base_gfn);
> +	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> +
> +	if (gfn_range.start >= gfn_range.end)
> +		return;
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	KVM_MMU_LOCK(kvm);
> +	kvm_unmap_gfn_range(kvm, &gfn_range);
> +	kvm_flush_remote_tlbs(kvm);

This should check the result of kvm_unmap_gfn_range() and flush only if necessary.

kvm->mmu_notifier_seq needs to be incremented, otherwise KVM will incorrectly
install a SPTE if the mapping is zapped between retrieving the pfn in faultin and
installing it after acquiring mmu_lock.
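
E.g. (sketch only, still under mmu_lock; the bookkeeping around the sequence count
may need more care than a bare increment):

	/* Bump the sequence count so a concurrent fault retries. */
	kvm->mmu_notifier_seq++;

	if (kvm_unmap_gfn_range(kvm, &gfn_range))
		kvm_flush_remote_tlbs(kvm);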


> +	KVM_MMU_UNLOCK(kvm);
> +	srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
> +	.invalidate = kvm_memfile_notifier_handler,
> +	.fallocate = kvm_memfile_notifier_handler,
> +};
> +
>  static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
>  {
> +	slot->notifier.ops = &kvm_memfile_notifier_ops;
>  	return memfile_register_notifier(file_inode(slot->private_file),
>  					 &slot->notifier,
>  					 &slot->pfn_ops);
> @@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	new->private_file = file;
>  	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
>  			      region_ext->private_offset : 0;
> +	new->kvm = kvm;
>  
>  	r = kvm_set_memslot(kvm, old, new, change);
>  	if (!r)
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-29 17:01         ` Quentin Perret
@ 2022-03-30  8:58           ` Steven Price
  2022-03-30 10:39             ` Quentin Perret
  2022-03-30 16:18             ` Sean Christopherson
  0 siblings, 2 replies; 118+ messages in thread
From: Steven Price @ 2022-03-30  8:58 UTC (permalink / raw)
  To: Quentin Perret, Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, maz, will

On 29/03/2022 18:01, Quentin Perret wrote:
> On Monday 28 Mar 2022 at 18:58:35 (+0000), Sean Christopherson wrote:
>> On Mon, Mar 28, 2022, Quentin Perret wrote:
>>> Hi Sean,
>>>
>>> Thanks for the reply, this helps a lot.
>>>
>>> On Monday 28 Mar 2022 at 17:13:10 (+0000), Sean Christopherson wrote:
>>>> On Thu, Mar 24, 2022, Quentin Perret wrote:
>>>>> For Protected KVM (and I suspect most other confidential computing
>>>>> solutions), guests have the ability to share some of their pages back
>>>>> with the host kernel using a dedicated hypercall. This is necessary
>>>>> for e.g. virtio communications, so these shared pages need to be mapped
>>>>> back into the VMM's address space. I'm a bit confused about how that
>>>>> would work with the approach proposed here. What is going to be the
>>>>> approach for TDX?
>>>>>
>>>>> It feels like the most 'natural' thing would be to have a KVM exit
>>>>> reason describing which pages have been shared back by the guest, and to
>>>>> then allow the VMM to mmap those specific pages in response in the
>>>>> memfd. Is this something that has been discussed or considered?
>>>>
>>>> The proposed solution is to exit to userspace with a new exit reason, KVM_EXIT_MEMORY_ERROR,
>>>> when the guest makes the hypercall to request conversion[1].  The private fd itself
>>>> will never allow mapping memory into userspace, instead userspace will need to punch
>>>> a hole in the private fd backing store.  The absence of a valid mapping in the private
>>>> fd is how KVM detects that a pfn is "shared" (memslots without a private fd are always
>>>> shared)[2].
>>>
>>> Right. I'm still a bit confused about how the VMM is going to get the
>>> shared page mapped in its page-table. Once it has punched a hole into
>>> the private fd, how is it supposed to access the actual physical page
>>> that the guest shared?
>>
>> The guest doesn't share a _host_ physical page, the guest shares a _guest_ physical
>> page.  Until host userspace converts the gfn to shared and thus maps the gfn=>hva
>> via mmap(), the guest is blocked and can't read/write/exec the memory.  AFAIK, no
>> architecture allows in-place decryption of guest private memory.  s390 allows a
>> page to be "made accessible" to the host for the purposes of swap, and other
>> architectures will have similar behavior for migrating a protected VM, but those
>> scenarios are not sharing the page (and they also make the page inaccessible to
>> the guest).
> 
> I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
> doing in-place sharing, which also means it can retain the content of
> the page as part of the conversion.
> 
> Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
> I _believe_ it should be technically possible to do in-place sharing for
> them too.

In general this isn't possible as the physical memory could be
encrypted, so some temporary memory is required. We have prototyped
having a single temporary page for the setup when populating the guest's
initial memory - this has the nice property of not requiring any
additional allocation during the process but with the downside of
effectively two memcpy()s per page (one to the temporary page and
another, with optional encryption, into the now private page).
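
Roughly (purely illustrative, the helper name is invented):

	memcpy(scratch_page, shared_src, PAGE_SIZE);	/* copy #1, into the temporary page */
	populate_private_page(gfn, scratch_page);	/* copy #2, with optional encryption */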

>>> Is there an assumption somewhere that the VMM should have this page mapped in
>>> via an alias that it can legally access only once it has punched a hole at
>>> the corresponding offset in the private fd or something along those lines?
>>
>> Yes, the VMM must have a completely separate VMA.  The VMM doesn't have to
>> wait until the conversion to mmap() the shared variant, though obviously it will
>> potentially consume double the memory if the VMM actually populates both the
>> private and shared backing stores.
> 
> Gotcha. This is what confused me I think -- in this approach private and
> shared pages are in fact entirely different.
> 
> In which scenario could you end up with both the private and shared
> pages live at the same time? Would this be something like follows?
> 
>  - userspace creates a private fd, fallocates into it, and associates
>    the <fd, offset, size> tuple with a private memslot;
> 
>  - userspace then mmaps anonymous memory (for ex.), and associates it
>    with a standard memslot, which happens to be positioned at exactly
>    the right offset w.r.t to the private memslot (with this offset
>    defined by the bit that is set for the private addresses in the gpa
>    space);
> 
>  - the guest runs, and accesses both 'aliases' of the page without doing
>    an explicit share hypercall.
> 
> Is there another option?

AIUI you can't have both private and shared "live" at the same time. But
you can have a page allocated both in the private fd and in the same
location in the (shared) memslot in the VMM's memory map. In this
situation the private fd page effectively hides the shared page.

> Is implicit sharing a thing? E.g., if a guest makes a memory access in
> the shared gpa range at an address that doesn't have a backing memslot,
> will KVM check whether there is a corresponding private memslot at the
> right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> would that just generate an MMIO exit as usual?

My understanding is that the guest needs some way of tagging whether a
page is expected to be shared or private. On the architectures I'm aware
of this is done by effectively stealing a bit from the IPA space and
pretending it's a flag bit.

So when a guest access causes a fault, the flag bit (really part of the
intermediate physical address) is compared against whether the page is
present in the private fd. If they correspond (i.e. a private access and
the private fd has a page, or a shared access and there's a hole in the
private fd) then the appropriate page is mapped and the guest continues.
If there's a mismatch then a KVM_EXIT_MEMORY_ERROR exit is triggered and
the VMM is expected to fix up the situation (either convert the page or
kill the guest if this was unexpected).
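
In pseudo-code the check is roughly the following (helper names made up purely for
illustration):

	bool priv_access = gpa_has_private_flag(gpa);	   /* the "stolen" IPA bit */
	bool priv_backed = private_fd_has_page(slot, gfn);

	if (priv_access == priv_backed)
		/* Map the private or shared page and resume the guest. */
		return map_and_resume(vcpu, gfn, priv_access);

	/* Mismatch: let the VMM convert the page or kill the guest. */
	return exit_to_userspace(vcpu, KVM_EXIT_MEMORY_ERROR);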

>>>> The key point is that KVM never decides to convert between shared and private, it's
>>>> always a userspace decision.  Like normal memslots, where userspace has full control
>>>> over what gfns are valid, this gives userspace full control over whether a gfn is
>>>> shared or private at any given time.
>>>
>>> I'm understanding this as 'the VMM is allowed to punch holes in the
>>> private fd whenever it wants'. Is this correct?
>>
>> From the kernel's perspective, yes, the VMM can punch holes at any time.  From a
>> "do I want to DoS my guest" perspective, the VMM must honor its contract with the
>> guest and not spuriously unmap private memory.
>>
>>> What happens if it does so for a page that a guest hasn't shared back?
>>
>> When the hole is punched, KVM will unmap the corresponding private SPTEs.  If the
>> guest is still accessing the page as private, the next access will fault and KVM
>> will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest is probably
>> hosed if the hole punch was truly spurious, as at least hardware-based protected VMs
>> effectively destroy data when a private page is unmapped from the guest private SPTEs.
>>
>> E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario as they
>> will get a fault (injected by trusted hardware/firmware) saying that the guest is
>> trying to access an unaccepted/unvalidated page (TDX and SNP require the guest to
>> explicitly accept all private pages that aren't part of the guest's initial pre-boot
>> image).
> 
> I suppose this is necessary to prevent the VMM from re-fallocating
> in a hole it previously punched and re-entering the guest without
> notifying it?

I don't know specifically about TDX/SNP, but one thing we want to
prevent with CCA is the VMM deallocating/reallocating a private page
without the guest being aware (i.e. corrupting the guest's state). So
punching a hole will taint the address such that a future access by the
guest is fatal (unless the guest first jumps through the right hoops to
acknowledge that it was expecting such a thing).

>>>> Another important detail is that this approach means the kernel and KVM treat the
>>>> shared backing store and private backing store as independent, albeit related,
>>>> entities.  This is very deliberate as it makes it easier to reason about what is
>>>> and isn't allowed/required.  E.g. the kernel only needs to handle freeing private
>>>> memory, there is no special handling for conversion to shared because no such path
>>>> exists as far as host pfns are concerned.  And userspace doesn't need any new "rules"
>>>> for protecting itself against a malicious guest, e.g. userspace already needs to
>>>> ensure that it has a valid mapping prior to accessing guest memory (or be able to
>>>> handle any resulting signals).  A malicious guest can DoS itself by instructing
>>>> userspace to communicate over memory that is currently mapped private, but there
>>>> are no new novel attack vectors from the host's perspective as coercing the host
>>>> into accessing an invalid mapping after shared=>private conversion is just a variant
>>>> of a use-after-free.
>>>
>>> Interesting. I was (maybe incorrectly) assuming that it would be
>>> difficult to handle illegal host accesses w/ TDX. IOW, this would
>>> essentially crash the host. Is this remotely correct or did I get that
>>> wrong?
>>
>> Handling illegal host kernel accesses for both TDX and SEV-SNP is extremely
>> difficult, bordering on impossible.  That's one of the biggest, if not _the_
>> biggest, motivations for the private fd approach.  On "conversion", the page that is
>> used to back the shared variant is a completely different, unrelated host physical
>> page.  Whether or not the private/shared backing page is freed is orthogonal to
>> what version is mapped into the guest.  E.g. if the guest converts a 4kb chunk of
>> a 2mb hugepage, the private backing store could keep the physical page on hole
>> punch (example only, I don't know if this is the actual proposed implementation).
>>
>> The idea is that it'll be much, much more difficult for the host to perform an
>> illegal access if the actual private memory is not mapped anywhere (modulo the
>> kernel's direct map, which we may or may not leave intact).  The private backing
>> store just needs to ensure it properly sanitizes pages before freeing them.
> 
> Understood.
> 
> I'm overall inclined to think that while this abstraction works nicely
> for TDX and the likes, it might not suit pKVM all that well in the
> current form, but it's close.
> 
> What do you think of extending the model proposed here to also address
> the needs of implementations that support in-place sharing? One option
> would be to have KVM notify the private-fd backing store when a page is
> shared back by a guest, which would then allow host userspace to mmap
> that particular page in the private fd instead of punching a hole.
> 
> This should retain the main property you're after: private pages that
> are actually mapped in the guest SPTE aren't mmap-able, but all the
> others are fair game.
> 
> Thoughts?

How do you propose this works if the page shared by the guest then needs
to be made private again? If there's no hole punched then it's not
possible to just repopulate the private-fd. I'm struggling to see how
that could work. Having said that, if we can work out a way to safely
mmap() pages from the private-fd there are definitely some benefits to be
had - e.g. it could be used to populate the initial memory before the
guest is started.

Steve


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-30  8:58           ` Steven Price
@ 2022-03-30 10:39             ` Quentin Perret
  2022-03-30 17:58               ` Sean Christopherson
  2022-03-30 16:18             ` Sean Christopherson
  1 sibling, 1 reply; 118+ messages in thread
From: Quentin Perret @ 2022-03-30 10:39 UTC (permalink / raw)
  To: Steven Price
  Cc: Sean Christopherson, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, maz, will

On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> On 29/03/2022 18:01, Quentin Perret wrote:
> > On Monday 28 Mar 2022 at 18:58:35 (+0000), Sean Christopherson wrote:
> >> On Mon, Mar 28, 2022, Quentin Perret wrote:
> >>> Hi Sean,
> >>>
> >>> Thanks for the reply, this helps a lot.
> >>>
> >>> On Monday 28 Mar 2022 at 17:13:10 (+0000), Sean Christopherson wrote:
> >>>> On Thu, Mar 24, 2022, Quentin Perret wrote:
> >>>>> For Protected KVM (and I suspect most other confidential computing
> >>>>> solutions), guests have the ability to share some of their pages back
> >>>>> with the host kernel using a dedicated hypercall. This is necessary
> >>>>> for e.g. virtio communications, so these shared pages need to be mapped
> >>>>> back into the VMM's address space. I'm a bit confused about how that
> >>>>> would work with the approach proposed here. What is going to be the
> >>>>> approach for TDX?
> >>>>>
> >>>>> It feels like the most 'natural' thing would be to have a KVM exit
> >>>>> reason describing which pages have been shared back by the guest, and to
> >>>>> then allow the VMM to mmap those specific pages in response in the
> >>>>> memfd. Is this something that has been discussed or considered?
> >>>>
> >>>> The proposed solution is to exit to userspace with a new exit reason, KVM_EXIT_MEMORY_ERROR,
> >>>> when the guest makes the hypercall to request conversion[1].  The private fd itself
> >>>> will never allow mapping memory into userspace, instead userspace will need to punch
> >>>> a hole in the private fd backing store.  The absence of a valid mapping in the private
> >>>> fd is how KVM detects that a pfn is "shared" (memslots without a private fd are always
> >>>> shared)[2].
> >>>
> >>> Right. I'm still a bit confused about how the VMM is going to get the
> >>> shared page mapped in its page-table. Once it has punched a hole into
> >>> the private fd, how is it supposed to access the actual physical page
> >>> that the guest shared?
> >>
> >> The guest doesn't share a _host_ physical page, the guest shares a _guest_ physical
> >> page.  Until host userspace converts the gfn to shared and thus maps the gfn=>hva
> >> via mmap(), the guest is blocked and can't read/write/exec the memory.  AFAIK, no
> >> architecture allows in-place decryption of guest private memory.  s390 allows a
> >> page to be "made accessible" to the host for the purposes of swap, and other
> >> architectures will have similar behavior for migrating a protected VM, but those
> >> scenarios are not sharing the page (and they also make the page inaccessible to
> >> the guest).
> > 
> > I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
> > doing in-place sharing, which also means it can retain the content of
> > the page as part of the conversion.
> > 
> > Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
> > I _believe_ it should be technically possible to do in-place sharing for
> > them too.
> 
> In general this isn't possible as the physical memory could be
> encrypted, so some temporary memory is required. We have prototyped
> having a single temporary page for the setup when populating the guest's
> initial memory - this has the nice property of not requiring any
> additional allocation during the process but with the downside of
> effectively two memcpy()s per page (one to the temporary page and
> another, with optional encryption, into the now private page).

Interesting, thanks for the explanation.

> >>> Is there an assumption somewhere that the VMM should have this page mapped in
> >>> via an alias that it can legally access only once it has punched a hole at
> >>> the corresponding offset in the private fd or something along those lines?
> >>
> >> Yes, the VMM must have a completely separate VMA.  The VMM doesn't have to
> >> wait until the conversion to mmap() the shared variant, though obviously it will
> >> potentially consume double the memory if the VMM actually populates both the
> >> private and shared backing stores.
> > 
> > Gotcha. This is what confused me I think -- in this approach private and
> > shared pages are in fact entirely different.
> > 
> > In which scenario could you end up with both the private and shared
> > pages live at the same time? Would this be something like follows?
> > 
> >  - userspace creates a private fd, fallocates into it, and associates
> >    the <fd, offset, size> tuple with a private memslot;
> > 
> >  - userspace then mmaps anonymous memory (for ex.), and associates it
> >    with a standard memslot, which happens to be positioned at exactly
> >    the right offset w.r.t to the private memslot (with this offset
> >    defined by the bit that is set for the private addresses in the gpa
> >    space);
> > 
> >  - the guest runs, and accesses both 'aliases' of the page without doing
> >    an explicit share hypercall.
> > 
> > Is there another option?
> 
> AIUI you can't have both private and shared "live" at the same time. But
> you can have a page allocated both in the private fd and in the same
> location in the (shared) memslot in the VMM's memory map. In this
> situation the private fd page effectively hides the shared page.
> 
> > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> > the shared gpa range at an address that doesn't have a backing memslot,
> > will KVM check whether there is a corresponding private memslot at the
> > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> > would that just generate an MMIO exit as usual?
> 
> My understanding is that the guest needs some way of tagging whether a
> page is expected to be shared or private. On the architectures I'm aware
> of this is done by effectively stealing a bit from the IPA space and
> pretending it's a flag bit.

Right, and that is in fact the main point of divergence we have I think.
While I understand this might be necessary for TDX and the likes, this
makes little sense for pKVM. This would effectively embed into the IPA a
purely software-defined non-architectural property/protocol although we
don't actually need to: we (pKVM) can reasonably expect the guest to
explicitly issue hypercalls to share pages in-place. So I'd be really
keen to avoid baking in assumptions about that model too deep in the
host mm bits if at all possible.

> So when a guest access causes a fault, the flag bit (really part of the
> intermediate physical address) is compared against whether the page is
> present in the private fd. If they correspond (i.e. a private access and
> the private fd has a page, or a shared access and there's a hole in the
> private fd) then the appropriate page is mapped and the guest continues.
> If there's a mismatch then a KVM_EXIT_MEMORY_ERROR exit is triggered and
> the VMM is expected to fix up the situation (either convert the page or
> kill the guest if this was unexpected).
> 
> >>>> The key point is that KVM never decides to convert between shared and private, it's
> >>>> always a userspace decision.  Like normal memslots, where userspace has full control
> >>>> over what gfns are valid, this gives userspace full control over whether a gfn is
> >>>> shared or private at any given time.
> >>>
> >>> I'm understanding this as 'the VMM is allowed to punch holes in the
> >>> private fd whenever it wants'. Is this correct?
> >>
> >> From the kernel's perspective, yes, the VMM can punch holes at any time.  From a
> >> "do I want to DoS my guest" perspective, the VMM must honor its contract with the
> >> guest and not spuriously unmap private memory.
> >>
> >>> What happens if it does so for a page that a guest hasn't shared back?
> >>
> >> When the hole is punched, KVM will unmap the corresponding private SPTEs.  If the
> >> guest is still accessing the page as private, the next access will fault and KVM
> >> will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest is probably
> >> hosed if the hole punch was truly spurious, as at least hardware-based protected VMs
> >> effectively destroy data when a private page is unmapped from the guest private SPTEs.
> >>
> >> E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario as they
> >> will get a fault (injected by trusted hardware/firmware) saying that the guest is
> >> trying to access an unaccepted/unvalidated page (TDX and SNP require the guest to
> >> explicitly accept all private pages that aren't part of the guest's initial pre-boot
> >> image).
> > 
> > I suppose this is necessary to prevent the VMM from re-fallocating
> > in a hole it previously punched and re-entering the guest without
> > notifying it?
> 
> I don't know specifically about TDX/SNP, but one thing we want to
> prevent with CCA is the VMM deallocating/reallocating a private page
> without the guest being aware (i.e. corrupting the guest's state). So
> punching a hole will taint the address such that a future access by the
> guest is fatal (unless the guest first jumps through the right hoops to
> acknowledge that it was expecting such a thing).

Makes sense.

> >>>> Another important detail is that this approach means the kernel and KVM treat the
> >>>> shared backing store and private backing store as independent, albeit related,
> >>>> entities.  This is very deliberate as it makes it easier to reason about what is
> >>>> and isn't allowed/required.  E.g. the kernel only needs to handle freeing private
> >>>> memory, there is no special handling for conversion to shared because no such path
> >>>> exists as far as host pfns are concerned.  And userspace doesn't need any new "rules"
> >>>> for protecting itself against a malicious guest, e.g. userspace already needs to
> >>>> ensure that it has a valid mapping prior to accessing guest memory (or be able to
> >>>> handle any resulting signals).  A malicious guest can DoS itself by instructing
> >>>> userspace to communicate over memory that is currently mapped private, but there
> >>>> are no new novel attack vectors from the host's perspective as coercing the host
> >>>> into accessing an invalid mapping after shared=>private conversion is just a variant
> >>>> of a use-after-free.
> >>>
> >>> Interesting. I was (maybe incorrectly) assuming that it would be
> >>> difficult to handle illegal host accesses w/ TDX. IOW, this would
> >>> essentially crash the host. Is this remotely correct or did I get that
> >>> wrong?
> >>
> >> Handling illegal host kernel accesses for both TDX and SEV-SNP is extremely
> >> difficult, bordering on impossible.  That's one of the biggest, if not _the_
> >> biggest, motivations for the private fd approach.  On "conversion", the page that is
> >> used to back the shared variant is a completely different, unrelated host physical
> >> page.  Whether or not the private/shared backing page is freed is orthogonal to
> >> what version is mapped into the guest.  E.g. if the guest converts a 4kb chunk of
> >> a 2mb hugepage, the private backing store could keep the physical page on hole
> >> punch (example only, I don't know if this is the actual proposed implementation).
> >>
> >> The idea is that it'll be much, much more difficult for the host to perform an
> >> illegal access if the actual private memory is not mapped anywhere (modulo the
> >> kernel's direct map, which we may or may not leave intact).  The private backing
> >> store just needs to ensure it properly sanitizes pages before freeing them.
> > 
> > Understood.
> > 
> > I'm overall inclined to think that while this abstraction works nicely
> > for TDX and the likes, it might not suit pKVM all that well in the
> > current form, but it's close.
> > 
> > What do you think of extending the model proposed here to also address
> > the needs of implementations that support in-place sharing? One option
> > would be to have KVM notify the private-fd backing store when a page is
> > shared back by a guest, which would then allow host userspace to mmap
> > that particular page in the private fd instead of punching a hole.
> > 
> > This should retain the main property you're after: private pages that
> > are actually mapped in the guest SPTE aren't mmap-able, but all the
> > others are fair game.
> > 
> > Thoughts?
> 
> How do you propose this works if the page shared by the guest then needs
> to be made private again? If there's no hole punched then it's not
> possible to just repopulate the private-fd. I'm struggling to see how
> that could work.

Yes, some discussion might be required, but I was thinking about
something along those lines:

 - a guest requests a shared->private page conversion;

 - the conversion request is routed all the way back to the VMM;

 - the VMM is expected to either decline the conversion (which may be
   fatal for the guest if it can't handle this), or to tear-down its
   mappings (via munmap()) of the shared page, and accept the
   conversion;

 - upon return from the VMM, KVM will be expected to check how many
   references to the shared page are still held (probably by asking the
   fd backing store) to verify that userspace has indeed torn down its
   mappings. If all is fine, KVM will instruct the hypervisor to
   repopulate the private range of the guest, otherwise it'll return an
   error to the VMM;

 - if the conversion has been successful, the guest can resume its
   execution normally.

Note: this should still allow to use the hole-punching method just fine
on systems that require it. The invariant here is just that KVM (with
help from the backing store) is now responsible for refusing to
instruct the hypervisor (or TDX module, or RMM, or whatever) to map a
private page if there are existing mappings to it.
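
A very rough sketch of that last check, with invented helper names, just to make
it concrete:

	/* After the VMM has accepted the conversion and returned to KVM. */
	if (backing_store_has_mappings(slot->private_file, offset))
		return -EBUSY;	/* userspace still has the page mapped */

	/* No host mappings left, ask the hypervisor to map the page as private. */
	ret = pkvm_host_donate_private(vcpu->kvm, gfn, nr_pages);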

> Having said that; if we can work out a way to safely
> mmap() pages from the private-fd there's definitely some benefits to be
> had - e.g. it could be used to populate the initial memory before the
> guest is started.

Right, so assuming the approach proposed above isn't entirely bogus,
this might now become possible by having the VMM mmap the private-fd,
load the payload, and then unmap it all, and only then instruct the
hypervisor to use this as private memory.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-30  8:58           ` Steven Price
  2022-03-30 10:39             ` Quentin Perret
@ 2022-03-30 16:18             ` Sean Christopherson
  1 sibling, 0 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-03-30 16:18 UTC (permalink / raw)
  To: Steven Price
  Cc: Quentin Perret, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, maz, will

On Wed, Mar 30, 2022, Steven Price wrote:
> On 29/03/2022 18:01, Quentin Perret wrote:
> > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> > the shared gpa range at an address that doesn't have a backing memslot,
> > will KVM check whether there is a corresponding private memslot at the
> > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> > would that just generate an MMIO exit as usual?
> 
> My understanding is that the guest needs some way of tagging whether a
> page is expected to be shared or private. On the architectures I'm aware
> of this is done by effectively stealing a bit from the IPA space and
> pretending it's a flag bit.
> 
> So when a guest access causes a fault, the flag bit (really part of the
> intermediate physical address) is compared against whether the page is
> present in the private fd. If they correspond (i.e. a private access and
> the private fd has a page, or a shared access and there's a hole in the
> private fd) then the appropriate page is mapped and the guest continues.
> If there's a mismatch then a KVM_EXIT_MEMORY_ERROR exit is triggered and
> the VMM is expected to fix up the situation (either convert the page or
> kill the guest if this was unexpected).

x86 architectures do steal a bit, but it's not strictly required.  The guest can
communicate its desired private vs. shared state via hypercall.  I refer to the
hypercall method as explicit conversion, and reacting to a page fault due to
accessing the "wrong" PA variant as implicit conversion.

I have dreams of supporting a software-only implementation on x86, a la pKVM, if
only for testing and debug purposes.  In that case, only explicit conversion is
supported.

I'd actually prefer TDX and SNP only allow explicit conversion, i.e. let the host
treat accesses to the "wrong" PA as illegal, but sadly the guest/host ABIs for
both TDX and SNP require the host to support implicit conversions.
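
For completeness, the VMM side of an explicit conversion then boils down to an
fallocate() on the private fd, roughly (a sketch against the uAPI proposed in this
series; the kvm_run field and flag names below are assumptions on my part):

	case KVM_EXIT_MEMORY_ERROR:
		off = run->memory.gpa - slot_base_gpa + private_offset;
		if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
			/* shared -> private: back the range in the private fd */
			ret = fallocate(private_fd, 0, off, run->memory.size);
		else
			/* private -> shared: punch a hole so KVM treats it as shared */
			ret = fallocate(private_fd,
					FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
					off, run->memory.size);
		break;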

> >>>> The key point is that KVM never decides to convert between shared and private, it's
> >>>> always a userspace decision.  Like normal memslots, where userspace has full control
> >>>> over what gfns are valid, this gives userspace full control over whether a gfn is
> >>>> shared or private at any given time.
> >>>
> >>> I'm understanding this as 'the VMM is allowed to punch holes in the
> >>> private fd whenever it wants'. Is this correct?
> >>
> >> From the kernel's perspective, yes, the VMM can punch holes at any time.  From a
> >> "do I want to DoS my guest" perspective, the VMM must honor its contract with the
> >> guest and not spuriously unmap private memory.
> >>
> >>> What happens if it does so for a page that a guest hasn't shared back?
> >>
> >> When the hole is punched, KVM will unmap the corresponding private SPTEs.  If the
> >> guest is still accessing the page as private, the next access will fault and KVM
> >> will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest is probably
> >> hosed if the hole punch was truly spurious, as at least hardware-based protected VMs
> >> effectively destroy data when a private page is unmapped from the guest private SPTEs.
> >>
> >> E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario as they
> >> will get a fault (injected by trusted hardware/firmware) saying that the guest is
> >> trying to access an unaccepted/unvalidated page (TDX and SNP require the guest to
> >> explicitly accept all private pages that aren't part of the guest's initial pre-boot
> >> image).
> > 
> > I suppose this is necessary to prevent the VMM from re-fallocating
> > in a hole it previously punched and re-entering the guest without
> > notifying it?
> 
> I don't know specifically about TDX/SNP, but one thing we want to
> prevent with CCA is the VMM deallocating/reallocating a private page
> without the guest being aware (i.e. corrupting the guest's state). So
> punching a hole will taint the address such that a future access by the
> guest is fatal (unless the guest first jumps through the right hoops to
> acknowledge that it was expecting such a thing).

Yep, both TDX and SNP will trigger a fault in the guest if the host removes and
reinserts a private page.  The current plan for Linux guests is to track whether
or not a given page has been accepted as private, and panic/die if a fault due
to unaccepted private page occurs.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-30 10:39             ` Quentin Perret
@ 2022-03-30 17:58               ` Sean Christopherson
  2022-03-31 16:04                 ` Andy Lutomirski
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-03-30 17:58 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Steven Price, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, maz, will

On Wed, Mar 30, 2022, Quentin Perret wrote:
> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> > On 29/03/2022 18:01, Quentin Perret wrote:
> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> > > the shared gpa range at an address that doesn't have a backing memslot,
> > > will KVM check whether there is a corresponding private memslot at the
> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> > > would that just generate an MMIO exit as usual?
> > 
> > My understanding is that the guest needs some way of tagging whether a
> > page is expected to be shared or private. On the architectures I'm aware
> > of this is done by effectively stealing a bit from the IPA space and
> > pretending it's a flag bit.
> 
> Right, and that is in fact the main point of divergence we have I think.
> While I understand this might be necessary for TDX and the likes, this
> makes little sense for pKVM. This would effectively embed into the IPA a
> purely software-defined non-architectural property/protocol although we
> don't actually need to: we (pKVM) can reasonably expect the guest to
> explicitly issue hypercalls to share pages in-place. So I'd be really
> keen to avoid baking in assumptions about that model too deep in the
> host mm bits if at all possible.

There is no assumption about stealing PA bits baked into this API.  Even within
x86 KVM, I consider it a hard requirement that the common flows not assume the
private vs. shared information is communicated through the PA.

> > > I'm overall inclined to think that while this abstraction works nicely
> > > for TDX and the likes, it might not suit pKVM all that well in the
> > > current form, but it's close.
> > > 
> > > What do you think of extending the model proposed here to also address
> > > the needs of implementations that support in-place sharing? One option
> > > would be to have KVM notify the private-fd backing store when a page is
> > > shared back by a guest, which would then allow host userspace to mmap
> > > that particular page in the private fd instead of punching a hole.
> > > 
> > > This should retain the main property you're after: private pages that
> > > are actually mapped in the guest SPTE aren't mmap-able, but all the
> > > others are fair game.
> > > 
> > > Thoughts?
> > How do you propose this works if the page shared by the guest then needs
> > to be made private again? If there's no hole punched then it's not
> > possible to just repopulate the private-fd. I'm struggling to see how
> > that could work.
> 
> Yes, some discussion might be required, but I was thinking about
> something along those lines:
> 
>  - a guest requests a shared->private page conversion;
> 
>  - the conversion request is routed all the way back to the VMM;
> 
>  - the VMM is expected to either decline the conversion (which may be
>    fatal for the guest if it can't handle this), or to tear-down its
>    mappings (via munmap()) of the shared page, and accept the
>    conversion;
> 
>  - upon return from the VMM, KVM will be expected to check how many
>    references to the shared page are still held (probably by asking the
>    fd backing store) to check that userspace has indeed torn down its
>    mappings. If all is fine, KVM will instruct the hypervisor to
>    repopulate the private range of the guest, otherwise it'll return an
>    error to the VMM;
> 
>  - if the conversion has been successful, the guest can resume its
>    execution normally.
> 
> Note: this should still allow to use the hole-punching method just fine
> on systems that require it. The invariant here is just that KVM (with
> help from the backing store) is now responsible for refusing to
> instruct the hypervisor (or TDX module, or RMM, or whatever) to map a
> private page if there are existing mappings to it.
> 
> > Having said that; if we can work out a way to safely
> > mmap() pages from the private-fd there's definitely some benefits to be
> > had - e.g. it could be used to populate the initial memory before the
> > guest is started.
> 
> Right, so assuming the approach proposed above isn't entirely bogus,
> this might now become possible by having the VMM mmap the private-fd,
> load the payload, and then unmap it all, and only then instruct the
> hypervisor to use this as private memory.

Hard "no" on mapping the private-fd.  Having the invariant tha the private-fd
can never be mapped greatly simplifies the responsibilities of the backing store,
as well as the interface between the private-fd and the in-kernel consumers of the
memory (KVM in this case).

What is the use case for shared->private conversion?  x86, both TDX and SNP,
effectively do have a flavor of shared->private conversion; SNP can definitely
be in-place, and I think TDX too.  But the only use case in x86 is to populate
the initial guest image, and due to other performance bottlenecks, it's strongly
recommended to keep the initial image as small as possible.  Based on your previous
response about the guest firmware loading the full guest image, my understanding is
that pKVM will also utilize a minimal initial image.

As a result, true in-place conversion to reduce the number of memcpy()s is low
priority, i.e. not planned at this time.  Unless the use case expects to convert
large swaths of memory, the simplest approach would be to have pKVM memcpy() between
the private and shared backing pages during conversion.

In-place conversion that preserves data needs to be a separate and/or additional
hypercall, because "I want to map this page as private/shared" is very, very different
than "I want to map this page as private/shared and consume/expose non-zero data".
I.e. the host is guaranteed to get an explicit request to do the memcpy(), so there
shouldn't be a need to implicitly allow this on any conversion.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-30 17:58               ` Sean Christopherson
@ 2022-03-31 16:04                 ` Andy Lutomirski
  2022-04-01 14:59                   ` Quentin Perret
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-03-31 16:04 UTC (permalink / raw)
  To: Sean Christopherson, Quentin Perret
  Cc: Steven Price, Chao Peng, kvm list, Linux Kernel Mailing List,
	linux-mm, linux-fsdevel, Linux API, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon

On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote:
> On Wed, Mar 30, 2022, Quentin Perret wrote:
>> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
>> > On 29/03/2022 18:01, Quentin Perret wrote:
>> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
>> > > the shared gpa range at an address that doesn't have a backing memslot,
>> > > will KVM check whether there is a corresponding private memslot at the
>> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
>> > > would that just generate an MMIO exit as usual?
>> > 
>> > My understanding is that the guest needs some way of tagging whether a
>> > page is expected to be shared or private. On the architectures I'm aware
>> > of this is done by effectively stealing a bit from the IPA space and
>> > pretending it's a flag bit.
>> 
>> Right, and that is in fact the main point of divergence we have I think.
>> While I understand this might be necessary for TDX and the likes, this
>> makes little sense for pKVM. This would effectively embed into the IPA a
>> purely software-defined non-architectural property/protocol although we
>> don't actually need to: we (pKVM) can reasonably expect the guest to
>> explicitly issue hypercalls to share pages in-place. So I'd be really
>> keen to avoid baking in assumptions about that model too deep in the
>> host mm bits if at all possible.
>
> There is no assumption about stealing PA bits baked into this API.  Even within
> x86 KVM, I consider it a hard requirement that the common flows not assume the
> private vs. shared information is communicated through the PA.

Quentin, I think we might need a clarification.  The API in this patchset indeed has no requirement that a PA bit distinguish between private and shared, but I think it makes at least a weak assumption that *something*, a priori, distinguishes them.  In particular, there are private memslots and shared memslots, so the logical flow of resolving a guest memory access looks like:

1. guest accesses a GVA

2. read guest paging structures

3. determine whether this is a shared or private access

4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures accordingly.  In particular, the memslot to reference is different depending on the access type.

For TDX, this maps on to the fd-based model perfectly: the host-side paging structures for the shared and private slots are completely separate.  For SEV, the structures are shared and KVM will need to figure out what to do in case a private and shared memslot overlap.  Presumably it's sufficient to declare that one of them wins, although actually determining which one is active for a given GPA may involve checking whether the backing store for a given page actually exists.

But I don't understand pKVM well enough to understand how it fits in.  Quentin, how is the shared vs private mode of a memory access determined?  How do the paging structures work?  Can a guest switch between shared and private by issuing a hypercall without changing any guest-side paging structures or anything else?

It's plausible that SEV and (maybe) pKVM would be better served if memslots could be sparse or if there was otherwise a direct way for host userspace to indicate to KVM which address ranges are actually active (not hole-punched) in a given memslot or to otherwise be able to make a rule that two different memslots (one shared and one private) can't claim the same address.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-31 16:04                 ` Andy Lutomirski
@ 2022-04-01 14:59                   ` Quentin Perret
  2022-04-01 17:14                     ` Sean Christopherson
  2022-04-01 19:56                     ` Andy Lutomirski
  0 siblings, 2 replies; 118+ messages in thread
From: Quentin Perret @ 2022-04-01 14:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote:
> > On Wed, Mar 30, 2022, Quentin Perret wrote:
> >> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> >> > On 29/03/2022 18:01, Quentin Perret wrote:
> >> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> >> > > the shared gpa range at an address that doesn't have a backing memslot,
> >> > > will KVM check whether there is a corresponding private memslot at the
> >> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> >> > > would that just generate an MMIO exit as usual?
> >> > 
> >> > My understanding is that the guest needs some way of tagging whether a
> >> > page is expected to be shared or private. On the architectures I'm aware
> >> > of this is done by effectively stealing a bit from the IPA space and
> >> > pretending it's a flag bit.
> >> 
> >> Right, and that is in fact the main point of divergence we have I think.
> >> While I understand this might be necessary for TDX and the likes, this
> >> makes little sense for pKVM. This would effectively embed into the IPA a
> >> purely software-defined non-architectural property/protocol although we
> >> don't actually need to: we (pKVM) can reasonably expect the guest to
> >> explicitly issue hypercalls to share pages in-place. So I'd be really
> >> keen to avoid baking in assumptions about that model too deep in the
> >> host mm bits if at all possible.
> >
> > There is no assumption about stealing PA bits baked into this API.  Even within
> > x86 KVM, I consider it a hard requirement that the common flows not assume the
> > private vs. shared information is communicated through the PA.
> 
> Quentin, I think we might need a clarification.  The API in this patchset indeed has no requirement that a PA bit distinguish between private and shared, but I think it makes at least a weak assumption that *something*, a priori, distinguishes them.  In particular, there are private memslots and shared memslots, so the logical flow of resolving a guest memory access looks like:
> 
> 1. guest accesses a GVA
> 
> 2. read guest paging structures
> 
> 3. determine whether this is a shared or private access
> 
> 4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures accordingly.  In particular, the memslot to reference is different depending on the access type.
> 
> For TDX, this maps on to the fd-based model perfectly: the host-side paging structures for the shared and private slots are completely separate.  For SEV, the structures are shared and KVM will need to figure out what to do in case a private and shared memslot overlap.  Presumably it's sufficient to declare that one of them wins, although actually determining which one is active for a given GPA may involve checking whether the backing store for a given page actually exists.
> 
> But I don't understand pKVM well enough to understand how it fits in.  Quentin, how is the shared vs private mode of a memory access determined?  How do the paging structures work?  Can a guest switch between shared and private by issuing a hypercall without changing any guest-side paging structures or anything else?

My apologies, I've indeed shared very few details about how pKVM
works. We'll be posting patches upstream really soon that will hopefully
help with this, but in the meantime, here is the idea.

pKVM is designed around MMU-based protection, as opposed to the
encryption-based approach taken by many confidential computing
solutions. It's probably worth mentioning that, although it targets
arm64, pKVM is distinct from the Arm CC-A stuff and requires no fancy
hardware extensions -- it is applicable all the way back to Arm v8.0,
which makes it an interesting solution for mobile.

Another particularity of the pKVM approach is that the code of the
hypervisor itself lives in the kernel source tree (see
arch/arm64/kvm/hyp/nvhe/). The hypervisor is built with the rest of the
kernel but as a self-sufficient object, and ends up in its own dedicated
ELF section (.hyp.*) in the kernel image. The main requirement for pKVM
(and KVM on arm64 in general) is to have the bootloader enter the kernel
at the hypervisor exception level (a.k.a. EL2). The boot procedure is a
bit involved, but eventually the hypervisor object is installed at EL2,
and the kernel is deprivileged to EL1 and proceeds to boot. From that
point on the hypervisor no longer trusts the kernel and will enable the
stage-2 MMU to impose access-control restrictions on all memory accesses
from the host.

All that to say: the pKVM approach offers a great deal of flexibility
when it comes to hypervisor behaviour. We have control over the
hypervisor code and can change it as we see fit. Since both the
hypervisor and the host kernel are part of the same image, the ABI
between them is very much *not* stable and can be adjusted to whatever
makes the most sense. So, I think we'd be quite keen to use that
flexibility to align some of the pKVM behaviours with other players
(TDX, SEV, CC-A), especially when it comes to host mm APIs. But that
flexibility also means we can do some things a bit better (e.g. pKVM can
handle illegal accesses from the host mostly fine -- the hypervisor can
re-inject the fault in the host) so I would definitely like to use this
to our advantage and not be held back by unrelated constraints.

To answer your original question about memory 'conversion', the key
thing is that the pKVM hypervisor controls the stage-2 page-tables for
everyone in the system, all guests as well as the host. As such, a page
'conversion' is nothing more than a permission change in the relevant
page-tables.

The typical flow is as follows (a rough C sketch of the host-side steps
appears after the list):

 - the host asks the hypervisor to run a guest;

 - the hypervisor does the context switch, which includes switching
   stage-2 page-tables;

 - initially the guest has an empty stage-2 (we don't require
   pre-faulting everything), which means it'll immediately fault;

 - the hypervisor switches back to host context to handle the guest
   fault;

 - the host handler finds the corresponding memslot and does the
   ipa->hva conversion. In our current implementation it uses a longterm
   GUP pin on the corresponding page;

 - once it has a page, the host handler issues a hypercall to donate the
   page to the guest;

 - the hypervisor does a bunch of checks to make sure the host owns the
   page, and if all is fine it will unmap it from the host stage-2 and
   map it in the guest stage-2, and do some bookkeeping as it needs to
   track page ownership, etc;

 - the guest can then proceed to run, and possibly faults in many more
   pages;

 - when it wants to, the guest can then issue a hypercall to share a
   page back with the host;

 - the hypervisor checks the request, maps the page back in the host
   stage-2, does more bookkeeping and returns back to the host to notify
   it of the share;

 - the host kernel at that point can exit back to userspace to relay
   that information to the VMM;

 - rinse and repeat.
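
For illustration only, the fault-handling and donation steps on the host
side could look roughly like the C sketch below. Every name in it
(find_memslot(), gup_longterm_pin(), hyp_donate_page(), struct memslot)
is a made-up placeholder; the real pKVM host code is structured quite
differently.

  #include <errno.h>

  /* Hypothetical types/helpers standing in for the real host-side code. */
  struct memslot { unsigned long base_ipa, size, hva; };

  extern struct memslot *find_memslot(unsigned long ipa);
  /* Longterm-GUP-pin the page backing @hva, returning its physical address. */
  extern int gup_longterm_pin(unsigned long hva, unsigned long *pa);
  /* Hypercall: donate the page at @pa to the faulting guest at @ipa. */
  extern int hyp_donate_page(unsigned long pa, unsigned long ipa);

  static int host_handle_guest_fault(unsigned long ipa)
  {
      struct memslot *slot = find_memslot(ipa);
      unsigned long pa;
      int ret;

      if (!slot)
          return -EFAULT;    /* no memslot backs this ipa */

      /* ipa->hva conversion within the memslot, then pin the page. */
      ret = gup_longterm_pin(slot->hva + (ipa - slot->base_ipa), &pa);
      if (ret)
          return ret;

      /*
       * The hypervisor checks that the host owns the page, unmaps it
       * from the host stage-2, maps it into the guest stage-2 and
       * updates its ownership tracking.
       */
      return hyp_donate_page(pa, ipa);
  }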

We currently don't allow the host to punch holes in the guest IPA space.
Once it has donated a page to a guest, it can't have it back until the
guest has been entirely torn down (at which point all of memory is
poisoned by the hypervisor obviously). But we could certainly reconsider
that part. OTOH, I'm still inclined to think that in-place sharing is
desirable. In our case it's dirt cheap, and could even work on huge
pages, which would allow very efficient sharing of large amounts of
data. So, I'm a bit hesitant to use the private-fd approach as-is since
it's not immediately obvious how we'll ever be able to reconcile these
things if mmap-ing the fd is a firm no. With that said, I don't think
our *current* use-cases have a strong need for this, so I mostly agree
with Sean's point earlier. But since we're talking about committing to a
userspace ABI, I would feel better if there was a clear path towards
having support for in-place sharing -- I can certainly see it being
useful. I'll think about it, but if folks have ideas in the meantime
I'll be happy to discuss.

I hope the above was useful and clears up the confusion.

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-01 14:59                   ` Quentin Perret
@ 2022-04-01 17:14                     ` Sean Christopherson
  2022-04-01 18:03                       ` Quentin Perret
  2022-04-01 19:56                     ` Andy Lutomirski
  1 sibling, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-04-01 17:14 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Andy Lutomirski, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Fri, Apr 01, 2022, Quentin Perret wrote:
> The typical flow is as follows:
> 
>  - the host asks the hypervisor to run a guest;
> 
>  - the hypervisor does the context switch, which includes switching
>    stage-2 page-tables;
> 
>  - initially the guest has an empty stage-2 (we don't require
>    pre-faulting everything), which means it'll immediately fault;
> 
>  - the hypervisor switches back to host context to handle the guest
>    fault;
> 
>  - the host handler finds the corresponding memslot and does the
>    ipa->hva conversion. In our current implementation it uses a longterm
>    GUP pin on the corresponding page;
> 
>  - once it has a page, the host handler issues a hypercall to donate the
>    page to the guest;
> 
>  - the hypervisor does a bunch of checks to make sure the host owns the
>    page, and if all is fine it will unmap it from the host stage-2 and
>    map it in the guest stage-2, and do some bookkeeping as it needs to
>    track page ownership, etc;
> 
>  - the guest can then proceed to run, and possibly faults in many more
>    pages;
> 
>  - when it wants to, the guest can then issue a hypercall to share a
>    page back with the host;
> 
>  - the hypervisor checks the request, maps the page back in the host
>    stage-2, does more bookkeeping and returns back to the host to notify
>    it of the share;
> 
>  - the host kernel at that point can exit back to userspace to relay
>    that information to the VMM;
> 
>  - rinse and repeat.

I assume there is a scenario where a page can be converted from shared=>private?
If so, is there a use case where that happens post-boot _and_ the contents of the
page are preserved?

> We currently don't allow the host to punch holes in the guest IPA space.

The hole doesn't get punched in guest IPA space, it gets punched in the private
backing store, which is host PA space.

> Once it has donated a page to a guest, it can't have it back until the
> guest has been entirely torn down (at which point all of memory is
> poisoned by the hypervisor obviously).

The guest doesn't have to know that it was handed back a different page.  It will
require defining the semantics to state that the trusted hypervisor will clear
that page on conversion, but IMO the trusted hypervisor should be doing that
anyways.  IMO, forcing the guest to correctly zero pages on conversion is
unnecessarily risky because converting private=>shared and preserving the contents
should be a very, very rare scenario, i.e. it's just one more thing for the guest
to get wrong.

If there is a use case where the page contents need to be preserved, then that can
and should be an explicit request from the guest, and can be handled through
export/import style functions.  Export/import would be slow-ish due to memcpy(),
which is why I asked if there's a need to do this specific action frequently (or
at all).


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-01 17:14                     ` Sean Christopherson
@ 2022-04-01 18:03                       ` Quentin Perret
  2022-04-01 18:24                         ` Sean Christopherson
  0 siblings, 1 reply; 118+ messages in thread
From: Quentin Perret @ 2022-04-01 18:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Friday 01 Apr 2022 at 17:14:21 (+0000), Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Quentin Perret wrote:
> > The typical flow is as follows:
> > 
> >  - the host asks the hypervisor to run a guest;
> > 
> >  - the hypervisor does the context switch, which includes switching
> >    stage-2 page-tables;
> > 
> >  - initially the guest has an empty stage-2 (we don't require
> >    pre-faulting everything), which means it'll immediately fault;
> > 
> >  - the hypervisor switches back to host context to handle the guest
> >    fault;
> > 
> >  - the host handler finds the corresponding memslot and does the
> >    ipa->hva conversion. In our current implementation it uses a longterm
> >    GUP pin on the corresponding page;
> > 
> >  - once it has a page, the host handler issues a hypercall to donate the
> >    page to the guest;
> > 
> >  - the hypervisor does a bunch of checks to make sure the host owns the
> >    page, and if all is fine it will unmap it from the host stage-2 and
> >    map it in the guest stage-2, and do some bookkeeping as it needs to
> >    track page ownership, etc;
> > 
> >  - the guest can then proceed to run, and possibly faults in many more
> >    pages;
> > 
> >  - when it wants to, the guest can then issue a hypercall to share a
> >    page back with the host;
> > 
> >  - the hypervisor checks the request, maps the page back in the host
> >    stage-2, does more bookkeeping and returns back to the host to notify
> >    it of the share;
> > 
> >  - the host kernel at that point can exit back to userspace to relay
> >    that information to the VMM;
> > 
> >  - rinse and repeat.
> 
> I assume there is a scenario where a page can be converted from shared=>private?
> If so, is there a use case where that happens post-boot _and_ the contents of the
> page are preserved?

I think most of our use-cases are private=>shared, but how is that
different?

> > We currently don't allow the host to punch holes in the guest IPA space.
> 
> The hole doesn't get punched in guest IPA space, it gets punched in the private
> backing store, which is host PA space.

Hmm, in a previous message I thought that you mentioned when a hole
gets punched in the fd KVM will go and unmap the page in the private
SPTEs, which will cause a fatal error for any subsequent access from the
guest to the corresponding IPA?

If that's correct, I meant that we currently don't support that - the
host can't unmap anything from the guest stage-2, it can only tear it
down entirely. But again, I'm not too worried about that, we could
certainly implement that part without too many issues.

> > Once it has donated a page to a guest, it can't have it back until the
> > guest has been entirely torn down (at which point all of memory is
> > poisoned by the hypervisor obviously).
> 
> The guest doesn't have to know that it was handed back a different page.  It will
> require defining the semantics to state that the trusted hypervisor will clear
> that page on conversion, but IMO the trusted hypervisor should be doing that
> anyways.  IMO, forcing the guest to correctly zero pages on conversion is
> unnecessarily risky because converting private=>shared and preserving the contents
> should be a very, very rare scenario, i.e. it's just one more thing for the guest
> to get wrong.

I'm not sure I agree. The guest is going to communicate with an
untrusted entity via that shared page, so it better be careful. Guest
hardening in general is a major topic, and of all problems, zeroing the
page before sharing is probably one of the simplest to solve.

Also, note that in pKVM all the hypervisor code at EL2 runs with
preemption disabled, which is a strict constraint. As such one of the
main goals is to spend as little time as possible in that context.
We're trying hard to keep the amount of zeroing/memcpy-ing to an
absolute minimum. And that's especially true as we introduce support for
huge pages. So, we'll take every opportunity we get to have the guest
or the host do that work.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-01 18:03                       ` Quentin Perret
@ 2022-04-01 18:24                         ` Sean Christopherson
  0 siblings, 0 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-04-01 18:24 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Andy Lutomirski, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Fri, Apr 01, 2022, Quentin Perret wrote:
> On Friday 01 Apr 2022 at 17:14:21 (+0000), Sean Christopherson wrote:
> > On Fri, Apr 01, 2022, Quentin Perret wrote:
> > I assume there is a scenario where a page can be converted from shared=>private?
> > If so, is there a use case where that happens post-boot _and_ the contents of the
> > page are preserved?
> 
> I think most of our use-cases are private=>shared, but how is that
> different?

Ah, it's not really different.  What I really was trying to understand is if there
are post-boot conversions that preserve data.  I asked about shared=>private because
there are known pre-boot conversions, e.g. populating the initial guest image, but
AFAIK there are no use cases for post-boot conversions, which might be more
performance-sensitive.

> > > We currently don't allow the host to punch holes in the guest IPA space.
> > 
> > The hole doesn't get punched in guest IPA space, it gets punched in the private
> > backing store, which is host PA space.
> 
> Hmm, in a previous message I thought that you mentioned when a hole
> gets punched in the fd KVM will go and unmap the page in the private
> SPTEs, which will cause a fatal error for any subsequent access from the
> guest to the corresponding IPA?

Oooh, that was in the context of TDX.  Mixing VMX and arm64 terminology... TDX has
two separate stage-2 roots, one for private IPAs and one for shared IPAs.  The
guest selects private/shared by toggling a bit stolen from the guest IPA space.
Upon conversion, KVM will remove from one stage-2 tree and insert into the other.

But even then, subsequent accesses to the wrong IPA won't be fatal, as KVM will
treat them as implicit conversions.  I wish they could be fatal, but that's not
"allowed" given the guest/host contract dictated by the TDX specs.

> If that's correct, I meant that we currently don't support that - the
> host can't unmap anything from the guest stage-2, it can only tear it
> down entirely. But again, I'm not too worried about that, we could
> certainly implement that part without too many issues.

I believe for the pKVM case it wouldn't be unmapping, it would be a PFN change.

> > > Once it has donated a page to a guest, it can't have it back until the
> > > guest has been entirely torn down (at which point all of memory is
> > > poisoned by the hypervisor obviously).
> > 
> > The guest doesn't have to know that it was handed back a different page.  It will
> > require defining the semantics to state that the trusted hypervisor will clear
> > that page on conversion, but IMO the trusted hypervisor should be doing that
> > anyways.  IMO, forcing the guest to correctly zero pages on conversion is
> > unnecessarily risky because converting private=>shared and preserving the contents
> > should be a very, very rare scenario, i.e. it's just one more thing for the guest
> > to get wrong.
> 
> I'm not sure I agree. The guest is going to communicate with an
> untrusted entity via that shared page, so it better be careful. Guest
> hardening in general is a major topic, and of all problems, zeroing the
> page before sharing is probably one of the simplest to solve.

Yes, for private=>shared you're correct, the guest needs to be paranoid as
there are no guarantees as to what data may be in the shared page.

I was thinking more in the context of shared=>private conversions, e.g. the guest
is done sharing a page and wants it back.  In that case, forcing the guest to zero
the private page upon re-acceptance is dicey.  Hmm, but if the guest needs to
explicitly re-accept the page, then putting the onus on the guest to zero the page
isn't a big deal.  The pKVM contract would just need to make it clear that the
guest cannot make any assumptions about the state of private data.

Oh, now I remember why I'm biased toward the trusted entity doing the work.
IIRC, thanks to TDX's lovely memory poisoning and cache aliasing behavior, the
guest can't be trusted to properly initialize private memory with the guest key,
i.e. the guest could induce a #MC and crash the host.

Anywho, I agree that for performance reasons, requiring the guest to zero private
pages is preferable so long as the guest must explicitly accept/initiate conversions.
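
A guest-side sketch of that contract, assuming hypothetical
hyp_share_page()/hyp_unshare_page() hypercall wrappers and an
identity-mapped guest for brevity (the real pKVM guest ABI will differ):

  #include <stdint.h>
  #include <string.h>

  #define PAGE_SIZE 4096UL

  extern int hyp_share_page(uint64_t ipa);    /* hypothetical hypercall wrapper */
  extern int hyp_unshare_page(uint64_t ipa);  /* hypothetical hypercall wrapper */

  /* Scrub before exposing the page to the untrusted host. */
  static int guest_share_page(void *page)
  {
      memset(page, 0, PAGE_SIZE);
      return hyp_share_page((uintptr_t)page);
  }

  /*
   * Scrub after taking the page back: the guest makes no assumption
   * about the contents of freshly private memory.
   */
  static int guest_unshare_page(void *page)
  {
      int ret = hyp_unshare_page((uintptr_t)page);

      if (!ret)
          memset(page, 0, PAGE_SIZE);
      return ret;
  }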

> Also, note that in pKVM all the hypervisor code at EL2 runs with
> preemption disabled, which is a strict constraint. As such one of the
> main goals is to spend as little time as possible in that context.
> We're trying hard to keep the amount of zeroing/memcpy-ing to an
> absolute minimum. And that's especially true as we introduce support for
> huge pages. So, we'll take every opportunity we get to have the guest
> or the host do that work.

FWIW, TDX has the exact same constraints (they're actually worse as the trusted
entity runs with _all_ interrupts blocked).  And yeah, it needs to be careful when
dealing with huge pages, e.g. many flows force the guest/host to do 512 * 4kb operations.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-01 14:59                   ` Quentin Perret
  2022-04-01 17:14                     ` Sean Christopherson
@ 2022-04-01 19:56                     ` Andy Lutomirski
  2022-04-04 15:01                       ` Quentin Perret
  1 sibling, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-01 19:56 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:


> To answer your original question about memory 'conversion', the key
> thing is that the pKVM hypervisor controls the stage-2 page-tables for
> everyone in the system, all guests as well as the host. As such, a page
> 'conversion' is nothing more than a permission change in the relevant
> page-tables.
>

So I can see two different ways to approach this.

One is that you split the whole address space in half and, just like SEV and TDX, allocate one bit to indicate the shared/private status of a page.  This makes it work a lot like SEV and TDX.

The other is to have shared and private pages be distinguished only by their hypercall history and the (protected) page tables.  This saves some address space and some page table allocations, but it opens some cans of worms too.  In particular, the guest and the hypervisor need to coordinate, in a way that the guest can trust, to ensure that the guest's idea of which pages are private match the host's.  This model seems a bit harder to support nicely with the private memory fd model, but not necessarily impossible.

Also, what are you trying to accomplish by having the host userspace mmap private pages?  Is the idea that multiple guest could share the same page until such time as one of them tries to write to it?  That would be kind of like having a third kind of memory that's visible to host and guests but is read-only for everyone.  TDX and SEV can't support this at all (a private page belongs to one guest and one guest only, at least in SEV and in the current TDX SEAM spec).  I imagine that this could be supported with private memory fds with some care without mmap, though -- the host could still populate the page with memcpy.  Or I suppose a memslot could support using MAP_PRIVATE fds and have approximately the right semantics.

--Andy




^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-01 19:56                     ` Andy Lutomirski
@ 2022-04-04 15:01                       ` Quentin Perret
  2022-04-04 17:06                         ` Sean Christopherson
  0 siblings, 1 reply; 118+ messages in thread
From: Quentin Perret @ 2022-04-04 15:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> 
> 
> > To answer your original question about memory 'conversion', the key
> > thing is that the pKVM hypervisor controls the stage-2 page-tables for
> > everyone in the system, all guests as well as the host. As such, a page
> > 'conversion' is nothing more than a permission change in the relevant
> > page-tables.
> >
> 
> So I can see two different ways to approach this.
> 
> One is that you split the whole address space in half and, just like SEV and TDX, allocate one bit to indicate the shared/private status of a page.  This makes it work a lot like SEV and TDX.
>
> The other is to have shared and private pages be distinguished only by their hypercall history and the (protected) page tables.  This saves some address space and some page table allocations, but it opens some cans of worms too.  In particular, the guest and the hypervisor need to coordinate, in a way that the guest can trust, to ensure that the guest's idea of which pages are private match the host's.  This model seems a bit harder to support nicely with the private memory fd model, but not necessarily impossible.

Right. Perhaps one thing I should clarify as well: pKVM (as opposed to
TDX) has only _one_ page-table per guest, and it is controlled by the
hypervisor only. So the hypervisor needs to be involved for both shared
and private mappings. As such, shared pages have relatively similar
constraints when it comes to host mm stuff --  we can't migrate shared
pages or swap them out without getting the hypervisor involved.

> Also, what are you trying to accomplish by having the host userspace mmap private pages?

What I would really like to have is non-destructive in-place conversions
of pages. mmap-ing the pages that have been shared back felt like a good
fit for the private=>shared conversion, but in fact I'm not all that
opinionated about the API as long as the behaviour and the performance
are there. Happy to look into alternatives.

FWIW, there are a couple of reasons why I'd like to have in-place
conversions:

 - one goal of pKVM is to migrate some things away from the Arm
   Trustzone environment (e.g. DRM and the likes) and into protected VMs
   instead. This will give Linux a fighting chance to defend itself
   against these things -- they currently have access to _all_ memory.
   And transitioning pages between Linux and Trustzone (donations and
   shares) is fast and non-destructive, so we really do not want pKVM to
   regress by requiring the hypervisor to memcpy things;

 - it can be very useful for protected VMs to do shared=>private
   conversions. Think of a VM receiving some data from the host in a
   shared buffer, and then it wants to operate on that buffer without
   risking to leak confidential information in a transient state. In
   that case the most logical thing to do is to convert the buffer back
   to private, do whatever needs to be done on that buffer (decrypting a
   frame, ...), and then share it back with the host to consume it;

 - similar to the previous point, a protected VM might want to
   temporarily turn a buffer private to avoid ToCToU issues;

 - once we're able to do device assignment to protected VMs, this might
   allow DMA-ing to a private buffer, and make it shared later w/o
   bouncing.

And there is probably more.

IIUC, the private fd proposal as it stands requires shared and private
pages to come from entirely distinct places. So it's not entirely clear
to me how any of the above could be supported without having the
hypervisor memcpy the data during conversions, which I really don't want
to do for performance reasons.

> Is the idea that multiple guest could share the same page until such time as one of them tries to write to it?

That would certainly be possible to implement in the pKVM
environment with the right tracking, so I think it is worth considering
as a future goal.

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-04 15:01                       ` Quentin Perret
@ 2022-04-04 17:06                         ` Sean Christopherson
  2022-04-04 22:04                           ` Andy Lutomirski
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-04-04 17:06 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Andy Lutomirski, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Mon, Apr 04, 2022, Quentin Perret wrote:
> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> FWIW, there are a couple of reasons why I'd like to have in-place
> conversions:
> 
>  - one goal of pKVM is to migrate some things away from the Arm
>    Trustzone environment (e.g. DRM and the likes) and into protected VMs
>    instead. This will give Linux a fighting chance to defend itself
>    against these things -- they currently have access to _all_ memory.
>    And transitioning pages between Linux and Trustzone (donations and
>    shares) is fast and non-destructive, so we really do not want pKVM to
>    regress by requiring the hypervisor to memcpy things;

Is there actually a _need_ for the conversion to be non-destructive?  E.g. I assume
the "trusted" side of things will need to be reworked to run as a pKVM guest, at
which point reworking its logic to understand that conversions are destructive and
slow-ish doesn't seem too onerous.

>  - it can be very useful for protected VMs to do shared=>private
>    conversions. Think of a VM receiving some data from the host in a
>    shared buffer, and then it wants to operate on that buffer without
> >    risking to leak confidential information in a transient state. In
>    that case the most logical thing to do is to convert the buffer back
>    to private, do whatever needs to be done on that buffer (decrypting a
>    frame, ...), and then share it back with the host to consume it;

If performance is a motivation, why would the guest want to do two conversions
instead of just doing internal memcpy() to/from a private page?  I would be quite
surprised if multiple exits and TLB shootdowns is actually faster, especially at
any kind of scale where zapping stage-2 PTEs will cause lock contention and IPIs.

>  - similar to the previous point, a protected VM might want to
>    temporarily turn a buffer private to avoid ToCToU issues;

Again, bounce buffer the page in the guest.

>  - once we're able to do device assignment to protected VMs, this might
>    allow DMA-ing to a private buffer, and make it shared later w/o
>    bouncing.

Exposing a private buffer to a device doesn't require in-place conversion.  The
proper way to handle this would be to teach e.g. VFIO to retrieve the PFN from
the backing store.  I don't understand the use case for sharing a DMA'd page at a
later time; with whom would the guest share the page?  E.g. if a NIC has access to
guest private data then there should never be a need to convert/bounce the page.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-04 17:06                         ` Sean Christopherson
@ 2022-04-04 22:04                           ` Andy Lutomirski
  2022-04-05 10:36                             ` Quentin Perret
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-04 22:04 UTC (permalink / raw)
  To: Sean Christopherson, Quentin Perret
  Cc: Steven Price, Chao Peng, kvm list, Linux Kernel Mailing List,
	linux-mm, linux-fsdevel, Linux API, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon



On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Quentin Perret wrote:
>> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
>> FWIW, there are a couple of reasons why I'd like to have in-place
>> conversions:
>> 
>>  - one goal of pKVM is to migrate some things away from the Arm
>>    Trustzone environment (e.g. DRM and the likes) and into protected VMs
>>    instead. This will give Linux a fighting chance to defend itself
>>    against these things -- they currently have access to _all_ memory.
>>    And transitioning pages between Linux and Trustzone (donations and
>>    shares) is fast and non-destructive, so we really do not want pKVM to
>>    regress by requiring the hypervisor to memcpy things;
>
> Is there actually a _need_ for the conversion to be non-destructive?  
> E.g. I assume
> the "trusted" side of things will need to be reworked to run as a pKVM 
> guest, at
> which point reworking its logic to understand that conversions are 
> destructive and
> slow-ish doesn't seem too onerous.
>
>>  - it can be very useful for protected VMs to do shared=>private
>>    conversions. Think of a VM receiving some data from the host in a
>>    shared buffer, and then it wants to operate on that buffer without
>>    risking to leak confidential information in a transient state. In
>>    that case the most logical thing to do is to convert the buffer back
>>    to private, do whatever needs to be done on that buffer (decrypting a
>>    frame, ...), and then share it back with the host to consume it;
>
> If performance is a motivation, why would the guest want to do two 
> conversions
> instead of just doing internal memcpy() to/from a private page?  I 
> would be quite
> surprised if multiple exits and TLB shootdowns is actually faster, 
> especially at
> any kind of scale where zapping stage-2 PTEs will cause lock contention 
> and IPIs.

I don't know the numbers or all the details, but this is arm64, which is a rather better architecture than x86 in this regard.  So maybe it's not so bad, at least in very simple cases, ignoring all implementation details.  (But see below.)  Also the systems in question tend to have fewer CPUs than some of the massive x86 systems out there.

If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation.  Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done.  The other direction is a mess: all existing uses of the page need to be torn down.  If the page has been recently used for DMA, this includes IOMMU entries.

Quentin: let's ignore any API issues for now.  Do you have a concept of how a nondestructive shared -> private transition could work well, even in principle?  The best I can come up with is a special type of shared page that is not GUP-able and maybe not even mmappable, having a clear option for transitions to fail, and generally preventing the nasty cases from happening in the first place.

Maybe there could be a special mode for the private memory fds in which specific pages are marked as "managed by this fd but actually shared".  pread() and pwrite() would work on those pages, but not mmap().  (Or maybe mmap() but the resulting mappings would not permit GUP.)  And transitioning them would be a special operation on the fd that is specific to pKVM and wouldn't work on TDX or SEV.

Hmm.  Sean and Chao, are we making a bit of a mistake by making these fds technology-agnostic?  That is, would we want to distinguish between a TDX backing fd, a SEV backing fd, a software-based backing fd, etc?  API-wise this could work by requiring the fd to be bound to a KVM VM instance and possibly even configured a bit before any other operations would be allowed.

(Destructive transitions nicely avoid all the nasty cases.  If something is still pinning a shared page when it's "transitioned" to private (really just replaced with a new page), then the old page continues existing for as long as needed as a separate object.)


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-04 22:04                           ` Andy Lutomirski
@ 2022-04-05 10:36                             ` Quentin Perret
  2022-04-05 17:51                               ` Andy Lutomirski
  2022-04-05 18:03                               ` Sean Christopherson
  0 siblings, 2 replies; 118+ messages in thread
From: Quentin Perret @ 2022-04-05 10:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> 
> 
> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
> > On Mon, Apr 04, 2022, Quentin Perret wrote:
> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> >> FWIW, there are a couple of reasons why I'd like to have in-place
> >> conversions:
> >> 
> >>  - one goal of pKVM is to migrate some things away from the Arm
> >>    Trustzone environment (e.g. DRM and the likes) and into protected VMs
> >>    instead. This will give Linux a fighting chance to defend itself
> >>    against these things -- they currently have access to _all_ memory.
> >>    And transitioning pages between Linux and Trustzone (donations and
> >>    shares) is fast and non-destructive, so we really do not want pKVM to
> >>    regress by requiring the hypervisor to memcpy things;
> >
> > Is there actually a _need_ for the conversion to be non-destructive?  
> > E.g. I assume
> > the "trusted" side of things will need to be reworked to run as a pKVM 
> > guest, at
> > which point reworking its logic to understand that conversions are 
> > destructive and
> > slow-ish doesn't seem too onerous.
> >
> >>  - it can be very useful for protected VMs to do shared=>private
> >>    conversions. Think of a VM receiving some data from the host in a
> >>    shared buffer, and then it wants to operate on that buffer without
> >>    risking to leak confidential information in a transient state. In
> >>    that case the most logical thing to do is to convert the buffer back
> >>    to private, do whatever needs to be done on that buffer (decrypting a
> >>    frame, ...), and then share it back with the host to consume it;
> >
> > If performance is a motivation, why would the guest want to do two 
> > conversions
> > instead of just doing internal memcpy() to/from a private page?  I 
> > would be quite
> > surprised if multiple exits and TLB shootdowns is actually faster, 
> > especially at
> > any kind of scale where zapping stage-2 PTEs will cause lock contention 
> > and IPIs.
> 
> I don't know the numbers or all the details, but this is arm64, which is a rather better architecture than x86 in this regard.  So maybe it's not so bad, at least in very simple cases, ignoring all implementation details.  (But see below.)  Also the systems in question tend to have fewer CPUs than some of the massive x86 systems out there.

Yep. I can try and do some measurements if that's really necessary, but
I'm really convinced the cost of the TLBI for the shared->private
conversion is going to be significantly smaller than the cost of memcpy
the buffer twice in the guest for us. To be fair, although the cost for
the CPU update is going to be low, the cost for IOMMU updates _might_ be
higher, but that very much depends on the hardware. On systems that use
e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and
the iotlb invalidation is done on the back of the CPU invalidation. So,
on systems with sane hardware the overhead is *really* quite small.

Also, memcpy requires double the memory, it is pretty bad for power, and
it causes memory traffic which can't be a good thing for things running
concurrently.

> If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation.  Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done.  The other direction is a mess: all existing uses of the page need to be torn down.  If the page has been recently used for DMA, this includes IOMMU entries.
>
> Quentin: let's ignore any API issues for now.  Do you have a concept of how a nondestructive shared -> private transition could work well, even in principle?

I had a high level idea for the workflow, but I haven't looked into the
implementation details.

The idea would be to allow KVM *or* userspace to take a reference
to a page in the fd in an exclusive manner. KVM could take a reference
on a page (which would be necessary before donating it to a guest)
using some kind of memfile_notifier as proposed in this series, and
userspace could do the same some other way (mmap presumably?). In both
cases, the operation might fail.

I would imagine the boot and private->shared flow as follows (a rough
userspace-side sketch appears after the list):

 - the VMM uses fallocate on the private fd, and associates the <fd,
   offset, size> with a memslot;

 - the guest boots, and as part of that KVM takes references to all the
   pages that are donated to the guest. If userspace happens to have a
   mapping to a page, KVM will fail to take the reference, which would
   be fatal for the guest.

 - once the guest has booted, it issues a hypercall to share a page back
   with the host;

 - KVM is notified, and at that point it drops its reference to the
   page. It then exits to userspace to notify it of the share;

 - host userspace receives the share, and mmaps the shared page with
   MAP_FIXED to access it, which takes a reference on the fd-backed
   page.
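
A rough userspace-side sketch of that flow, where kvm_set_private_memslot()
is a made-up placeholder for whatever memslot API this series settles on,
and where mmap() on the private fd is assumed to be allowed for pages that
have been shared back:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Placeholder for the eventual KVM memslot ioctl taking <fd, offset, size>. */
  extern int kvm_set_private_memslot(int vmfd, int memfd, uint64_t offset,
                                     uint64_t gpa, uint64_t size);

  static int vmm_setup_guest_memory(int vmfd, int memfd, uint64_t gpa,
                                    uint64_t size)
  {
      /* Back the whole range so KVM can later take references to the pages. */
      if (fallocate(memfd, 0, 0, size))
          return -1;

      return kvm_set_private_memslot(vmfd, memfd, 0, gpa, size);
  }

  /* On a share notification: map the now-shared page at its usual hva. */
  static void *vmm_map_shared_page(int memfd, uint64_t offset, void *hva,
                                   size_t pgsz)
  {
      return mmap(hva, pgsz, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_FIXED, memfd, (off_t)offset);
  }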

There are variations of that idea: e.g. allow userspace to mmap the
entire private fd but w/o taking a reference on pages mapped with
PROT_NONE. And then the VMM can use mprotect() in response to
share/unshare requests. I think Marc liked that idea as it keeps the
userspace API closer to normal KVM -- there actually is a
straightforward gpa->hva relation. Not sure how much that would impact
the implementation at this point.

For the shared=>private conversion, this would be something like this (a
small userspace-side sketch follows below):

 - the guest issues a hypercall to unshare a page;

 - the hypervisor forwards the request to the host;

 - the host kernel forwards the request to userspace;

 - userspace then munmap()s the shared page;

 - KVM then tries to take a reference to the page. If it succeeds, it
   re-enters the guest with a flag of some sort saying that the unshare
   succeeded, and the hypervisor will adjust pgtables accordingly. If
   KVM failed to take a reference, it flags this and the hypervisor will
   be responsible for communicating that back to the guest. This means
   the guest must handle failures (possibly fatal).

(There are probably many ways in which we can optimize this, e.g. by
having the host proactively munmap() pages it no longer needs so that
the unshare hypercall from the guest doesn't need to exit all the way
back to host userspace.)
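
And a matching userspace-side sketch for the unshare path; how the request
reaches userspace and how KVM reports the outcome are placeholders for the
eventual ABI:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/kvm.h>

  static int vmm_handle_unshare(int vcpufd, void *hva, size_t size)
  {
      /* Drop our mapping so KVM can take its exclusive reference. */
      if (munmap(hva, size))
          return -1;

      /* Re-enter the guest; KVM then reports whether the unshare succeeded. */
      return ioctl(vcpufd, KVM_RUN, 0);
  }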

A nice side-effect of the above is that it allows userspace to dump a
payload in the private fd before booting the guest. It just needs to
mmap the fd, copy what it wants in there, munmap, and only then pass the
fd to KVM which will be happy enough as long as there are no current
references to the pages. Note: in a previous email I've said that
Android doesn't need this (which is correct as our guest bootloader
currently receives the payload over virtio) but this might change some
day, and there might be other implementations as well, so it's a nice
bonus if we can make this work.

> The best I can come up with is a special type of shared page that is not GUP-able and maybe not even mmappable, having a clear option for transitions to fail, and generally preventing the nasty cases from happening in the first place.

Right, that sounds reasonable to me.

> Maybe there could be a special mode for the private memory fds in which specific pages are marked as "managed by this fd but actually shared".  pread() and pwrite() would work on those pages, but not mmap().  (Or maybe mmap() but the resulting mappings would not permit GUP.)  And transitioning them would be a special operation on the fd that is specific to pKVM and wouldn't work on TDX or SEV.

Aha, didn't think of pread()/pwrite(). Very interesting.

I'd need to check what our VMM actually does, but as an initial
reaction it feels like this might require a pretty significant rework in
userspace. Maybe it's a good thing? Dunno. Maybe more important, those
shared pages are used for virtio communications, so the cost of issuing
syscalls every time the VMM needs to access the shared page will need to
be considered...

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-05 10:36                             ` Quentin Perret
@ 2022-04-05 17:51                               ` Andy Lutomirski
  2022-04-05 18:30                                 ` Sean Christopherson
  2022-04-06 13:05                                 ` Quentin Perret
  2022-04-05 18:03                               ` Sean Christopherson
  1 sibling, 2 replies; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-05 17:51 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon



On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote:
> On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
>> 
>> 
>> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
>> > On Mon, Apr 04, 2022, Quentin Perret wrote:
>> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
>> >> FWIW, there are a couple of reasons why I'd like to have in-place
>> >> conversions:
>> >> 
>> >>  - one goal of pKVM is to migrate some things away from the Arm
>> >>    Trustzone environment (e.g. DRM and the likes) and into protected VMs
>> >>    instead. This will give Linux a fighting chance to defend itself
>> >>    against these things -- they currently have access to _all_ memory.
>> >>    And transitioning pages between Linux and Trustzone (donations and
>> >>    shares) is fast and non-destructive, so we really do not want pKVM to
>> >>    regress by requiring the hypervisor to memcpy things;
>> >
>> > Is there actually a _need_ for the conversion to be non-destructive?  
>> > E.g. I assume
>> > the "trusted" side of things will need to be reworked to run as a pKVM 
>> > guest, at
>> > which point reworking its logic to understand that conversions are 
>> > destructive and
>> > slow-ish doesn't seem too onerous.
>> >
>> >>  - it can be very useful for protected VMs to do shared=>private
>> >>    conversions. Think of a VM receiving some data from the host in a
>> >>    shared buffer, and then it wants to operate on that buffer without
>> >>    risking to leak confidential information in a transient state. In
>> >>    that case the most logical thing to do is to convert the buffer back
>> >>    to private, do whatever needs to be done on that buffer (decrypting a
>> >>    frame, ...), and then share it back with the host to consume it;
>> >
>> > If performance is a motivation, why would the guest want to do two 
>> > conversions
>> > instead of just doing internal memcpy() to/from a private page?  I 
>> > would be quite
>> > surprised if multiple exits and TLB shootdowns is actually faster, 
>> > especially at
>> > any kind of scale where zapping stage-2 PTEs will cause lock contention 
>> > and IPIs.
>> 
>> I don't know the numbers or all the details, but this is arm64, which is a rather better architecture than x86 in this regard.  So maybe it's not so bad, at least in very simple cases, ignoring all implementation details.  (But see below.)  Also the systems in question tend to have fewer CPUs than some of the massive x86 systems out there.
>
> Yep. I can try and do some measurements if that's really necessary, but
> I'm really convinced the cost of the TLBI for the shared->private
> conversion is going to be significantly smaller than the cost of memcpy
> the buffer twice in the guest for us. To be fair, although the cost for
> the CPU update is going to be low, the cost for IOMMU updates _might_ be
> higher, but that very much depends on the hardware. On systems that use
> e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and
> the iotlb invalidation is done on the back of the CPU invalidation. So,
> on systems with sane hardware the overhead is *really* quite small.
>
> Also, memcpy requires double the memory, it is pretty bad for power, and
> it causes memory traffic which can't be a good thing for things running
> concurrently.
>
>> If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation.  Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done.  The other direction is a mess: all existing uses of the page need to be torn down.  If the page has been recently used for DMA, this includes IOMMU entries.
>>
>> Quentin: let's ignore any API issues for now.  Do you have a concept of how a nondestructive shared -> private transition could work well, even in principle?
>
> I had a high level idea for the workflow, but I haven't looked into the
> implementation details.
>
> The idea would be to allow KVM *or* userspace to take a reference
> to a page in the fd in an exclusive manner. KVM could take a reference
> on a page (which would be necessary before donating it to a guest)
> using some kind of memfile_notifier as proposed in this series, and
> userspace could do the same some other way (mmap presumably?). In both
> cases, the operation might fail.
>
> I would imagine the boot and private->shared flow as follows:
>
>  - the VMM uses fallocate on the private fd, and associates the <fd,
>    offset, size> with a memslot;
>
>  - the guest boots, and as part of that KVM takes references to all the
>    pages that are donated to the guest. If userspace happens to have a
>    mapping to a page, KVM will fail to take the reference, which would
>    be fatal for the guest.
>
>  - once the guest has booted, it issues a hypercall to share a page back
>    with the host;
>
>  - KVM is notified, and at that point it drops its reference to the
>    page. It then exits to userspace to notify it of the share;
>
>  - host userspace receives the share, and mmaps the shared page with
>    MAP_FIXED to access it, which takes a reference on the fd-backed
>    page.
>
> There are variations of that idea: e.g. allow userspace to mmap the
> entire private fd but w/o taking a reference on pages mapped with
> PROT_NONE. And then the VMM can use mprotect() in response to
> share/unshare requests. I think Marc liked that idea as it keeps the
> userspace API closer to normal KVM -- there actually is a
> straightforward gpa->hva relation. Not sure how much that would impact
> the implementation at this point.
>
> For the shared=>private conversion, this would be something like so:
>
>  - the guest issues a hypercall to unshare a page;
>
>  - the hypervisor forwards the request to the host;
>
>  - the host kernel forwards the request to userspace;
>
>  - userspace then munmap()s the shared page;
>
>  - KVM then tries to take a reference to the page. If it succeeds, it
>    re-enters the guest with a flag of some sort saying that the unshare
>    succeeded, and the hypervisor will adjust pgtables accordingly. If
>    KVM failed to take a reference, it flags this and the hypervisor will
>    be responsible for communicating that back to the guest. This means
>    the guest must handle failures (possibly fatal).
>
> (There are probably many ways in which we can optimize this, e.g. by
> having the host proactively munmap() pages it no longer needs so that
> the unshare hypercall from the guest doesn't need to exit all the way
> back to host userspace.)
>
> A nice side-effect of the above is that it allows userspace to dump a
> payload in the private fd before booting the guest. It just needs to
> mmap the fd, copy what it wants in there, munmap, and only then pass the
> fd to KVM which will be happy enough as long as there are no current
> references to the pages. Note: in a previous email I've said that
> Android doesn't need this (which is correct as our guest bootloader
> currently receives the payload over virtio) but this might change some
> day, and there might be other implementations as well, so it's a nice
> bonus if we can make this work.
>
>> The best I can come up with is a special type of shared page that is not GUP-able and maybe not even mmappable, having a clear option for transitions to fail, and generally preventing the nasty cases from happening in the first place.
>
> Right, that sounds reasonable to me.

At least as a v1, this is probably more straightforward than allowing mmap().  Also, there's much to be said for a simpler, limited API, to be expanded if genuinely needed, as opposed to starting out with a very featureful API.

>
>> Maybe there could be a special mode for the private memory fds in which specific pages are marked as "managed by this fd but actually shared".  pread() and pwrite() would work on those pages, but not mmap().  (Or maybe mmap() but the resulting mappings would not permit GUP.)  And transitioning them would be a special operation on the fd that is specific to pKVM and wouldn't work on TDX or SEV.
>
> Aha, didn't think of pread()/pwrite(). Very interesting.

There are plenty of use cases for which pread()/pwrite()/splice() will be as fast or even much faster than mmap()+memcpy().
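
For instance, the host VMM could copy a guest buffer out of the backing fd
with a single pread() call instead of keeping a mapping around. A minimal
sketch, assuming 'memfd' is the backing fd and offset/len come from the
virtio descriptor:

  #include <stdint.h>
  #include <unistd.h>

  static ssize_t vmm_read_guest_buffer(int memfd, void *dst,
                                       uint64_t offset, size_t len)
  {
      /* One syscall; no mapping to set up or tear down, nothing to GUP. */
      return pread(memfd, dst, len, (off_t)offset);
  }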

>
> I'd need to check what our VMM actually does, but as an initial
> reaction it feels like this might require a pretty significant rework in
> userspace. Maybe it's a good thing? Dunno. Maybe more important, those
> shared pages are used for virtio communications, so the cost of issuing
> syscalls every time the VMM needs to access the shared page will need to
> be considered...

Let's try actually counting syscalls and mode transitions, at least approximately.  For non-direct IO (DMA allocation on guest side, not straight to/from pagecache or similar):

Guest writes to shared DMA buffer.  Assume the guest is smart and reuses the buffer.
Guest writes descriptor to shared virtio ring.
Guest rings virtio doorbell, which causes an exit.
*** guest -> hypervisor -> host ***
host reads virtio ring (mmaped shared memory)
host does pread() to read the DMA buffer or reads mmapped buffer
host does the IO
resume guest
*** host -> hypervisor -> guest ***

This is essentially optimal in terms of transitions.  The data is copied on the guest side (which may well be mandatory depending on what guest userspace did to initiate the IO) and on the host (which may well be mandatory depending on what the host is doing with the data).

Now let's try straight-from-guest-pagecache or otherwise zero-copy on the guest side.  Without nondestructive changes, the guest needs a bounce buffer and it looks just like the above.  One extra copy, zero extra mode transitions.  With nondestructive changes, it's a bit more like physical hardware with an IOMMU:

Guest shares the page.
*** guest -> hypervisor ***
Hypervisor adds a PTE.  Let's assume we're being very optimal and the host is not synchronously notified.
*** hypervisor -> guest ***
Guest writes descriptor to shared virtio ring.
Guest rings virtio doorbell, which causes an exit.
*** guest -> hypervisor -> host ***
host reads virtio ring (mmaped shared memory)

mmap  *** syscall ***
host does the IO
munmap *** syscall, TLBI ***

resume guest
*** host -> hypervisor -> guest ***
Guest unshares the page.
*** guest -> hypervisor ***
Hypervisor removes PTE.  TLBI.
*** hypervisor -> guest ***

This is quite expensive.  For small IO, pread() or splice() in the host may be a lot faster.  Even for large IO, splice() may still win.
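For instance, a splice()-based transmit path can push the data out of the backing fd without mapping it at all. A rough sketch, assuming splice() from the fd is permitted (a pipe sits in the middle because splice() requires one end to be a pipe):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: send 'len' bytes of guest data at (memfd, off) to a socket. */
static int send_shared(int memfd, int sockfd, loff_t off, size_t len)
{
	int p[2];

	if (pipe(p))
		return -1;
	while (len) {
		ssize_t n = splice(memfd, &off, p[1], NULL, len, 0);

		if (n <= 0 || splice(p[0], NULL, sockfd, NULL, n, 0) != n)
			break;
		len -= n;
	}
	close(p[0]);
	close(p[1]);
	return len ? -1 : 0;
}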


I can imagine clever improvements.  First, let's get rid of mmap() + munmap().  Instead use a special device mapping with special semantics, not regular memory.  (mmap and munmap are expensive even ignoring any arch and TLB stuff.)  The rule is that, if the page is shared, access works, and if private, access doesn't, but it's still mapped.  The hypervisor and the host cooperate to make it so.  Now it's more like:

Guest shares the page.
*** guest -> hypervisor ***
Hypervisor adds a PTE.  Let's assume we're being very optimal and the host is not synchronously notified.
*** hypervisor -> guest ***
Guest writes descriptor to shared virtio ring.
Guest rings virtio doorbell, which causes an exit.
*** guest -> hypervisor -> host ***
host reads virtio ring (mmaped shared memory)
memcpy(): just works without a fault
resume guest
*** host -> hypervisor -> guest ***
Guest unshares the page.
*** guest -> hypervisor ***
Hypervisor removes PTE.  TLBI.
*** hypervisor -> guest ***

That's *much* better.  On x86, it's still pretty terrible, but ARM64 is superior.  (For now.  I keep loudly asking Intel and AMD to catch up.  Hasn't happened yet.)
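In host userspace terms, the difference is that the mapping is set up once and then left alone. A sketch only; the special device and its fault semantics are hypothetical:

#include <string.h>
#include <sys/mman.h>

/* Once, at VM setup: map guest memory through the hypothetical device.
 * The VMA stays in place across share/unshare transitions. */
static void *map_guest_memory(int guest_dev_fd, size_t size)
{
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    guest_dev_fd, 0);
}

/* Per IO: no syscalls.  Plain loads/stores work while the page is shared;
 * if the page is private, the access faults instead of reading stale data. */
static void copy_from_guest(void *guest_mem, void *buf, size_t off, size_t len)
{
	memcpy(buf, (char *)guest_mem + off, len);
}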

But we can improve it further by making this whole mess look like an IOMMU:


Guest shares the page by writing an IOPTE or however it works.
Guest writes descriptor to shared virtio ring.
Guest rings virtio doorbell, which causes an exit.
*** guest -> hypervisor -> host ***
host reads virtio ring (mmaped shared memory)

memcpy(): would fault to the hypervisor, but the IOPTE scheme could be improved by having a ring listing recently added IOPTEs so the hypervisor could install the corresponding mappings as part of the exit processing.

resume guest
*** host -> hypervisor -> guest ***
Guest unshares the page.
*** guest -> hypervisor ***
Hypervisor removes PTE.  TLBI.
*** hypervisor -> guest ***

Obviously considerable cleverness is needed to make a virt IOMMU like this work well, but still.

Anyway, my suggestion is that the fd backing proposal be slightly modified to make it ready for multiple subtypes of backing object, which should be a pretty minimal change.  Then, if someone actually needs any of this cleverness, it can be added later.  In the meantime, the pread()/pwrite()/splice() scheme is pretty good.



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-05 10:36                             ` Quentin Perret
  2022-04-05 17:51                               ` Andy Lutomirski
@ 2022-04-05 18:03                               ` Sean Christopherson
  2022-04-06 10:34                                 ` Quentin Perret
  2022-04-22 10:56                                 ` Chao Peng
  1 sibling, 2 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-04-05 18:03 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Andy Lutomirski, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Tue, Apr 05, 2022, Quentin Perret wrote:
> On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > >>  - it can be very useful for protected VMs to do shared=>private
> > >>    conversions. Think of a VM receiving some data from the host in a
> > >>    shared buffer, and then it wants to operate on that buffer without
> > >>    risking leaking confidential information in a transient state. In
> > >>    that case the most logical thing to do is to convert the buffer back
> > >>    to private, do whatever needs to be done on that buffer (decrypting a
> > >>    frame, ...), and then share it back with the host to consume it;
> > >
> > > If performance is a motivation, why would the guest want to do two
> > > conversions instead of just doing internal memcpy() to/from a private
> > > page?  I would be quite surprised if multiple exits and TLB shootdowns are
> > > actually faster, especially at any kind of scale where zapping stage-2
> > > PTEs will cause lock contention and IPIs.
> > 
> > I don't know the numbers or all the details, but this is arm64, which is a
> > rather better architecture than x86 in this regard.  So maybe it's not so
> > bad, at least in very simple cases, ignoring all implementation details.
> > (But see below.)  Also the systems in question tend to have fewer CPUs than
> > some of the massive x86 systems out there.
> 
> Yep. I can try and do some measurements if that's really necessary, but
> I'm really convinced the cost of the TLBI for the shared->private
> conversion is going to be significantly smaller than the cost of memcpy
> the buffer twice in the guest for us.

It's not just the TLB shootdown; the VM-Exits aren't free.  And barring non-trivial
improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page tables will
block all other updates and MMU operations.  Taking mmu_lock for read, should arm64
ever convert to a rwlock, is not an option because KVM needs to block other
conversions to avoid races.

Hmm, though batching multiple pages into a single request would mitigate most of
the overhead.

> There are variations of that idea: e.g. allow userspace to mmap the
> entire private fd but w/o taking a reference on pages mapped with
> PROT_NONE. And then the VMM can use mprotect() in response to
> share/unshare requests. I think Marc liked that idea as it keeps the
> userspace API closer to normal KVM -- there actually is a
> straightforward gpa->hva relation. Not sure how much that would impact
> the implementation at this point.
> 
> For the shared=>private conversion, this would be something like so:
> 
>  - the guest issues a hypercall to unshare a page;
> 
>  - the hypervisor forwards the request to the host;
> 
>  - the host kernel forwards the request to userspace;
> 
>  - userspace then munmap()s the shared page;
> 
>  - KVM then tries to take a reference to the page. If it succeeds, it
>    re-enters the guest with a flag of some sort saying that the share
>    succeeded, and the hypervisor will adjust pgtables accordingly. If
>    KVM failed to take a reference, it flags this and the hypervisor will
>    be responsible for communicating that back to the guest. This means
>    the guest must handle failures (possibly fatal).
> 
> (There are probably many ways in which we can optimize this, e.g. by
> having the host proactively munmap() pages it no longer needs so that
> the unshare hypercall from the guest doesn't need to exit all the way
> back to host userspace.)

...

> > Maybe there could be a special mode for the private memory fds in which
> > specific pages are marked as "managed by this fd but actually shared".
> > pread() and pwrite() would work on those pages, but not mmap().  (Or maybe
> > mmap() but the resulting mappings would not permit GUP.)

Unless I misunderstand what you intend by pread()/pwrite(), I think we'd need to
allow mmap(), otherwise e.g. uaccess from the kernel wouldn't work.

> > And transitioning them would be a special operation on the fd that is
> > specific to pKVM and wouldn't work on TDX or SEV.

To keep things feature agnostic (IMO, baking TDX vs SEV vs pKVM info into private-fd
is a really bad idea), this could be handled by adding a flag and/or callback into
the notifier/client stating whether or not it supports mapping a private-fd, and then
mapping would be allowed if and only if all consumers support/allow mapping.
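A rough sketch of that shape, against the memfile_notifier structures in this series (the may_map hook and the helper are made up here, purely to illustrate the idea):

struct memfile_notifier_ops {
	void (*invalidate)(struct memfile_notifier *notifier,
			   pgoff_t start, pgoff_t end);
	void (*fallocate)(struct memfile_notifier *notifier,
			  pgoff_t start, pgoff_t end);
	/* New: does this consumer tolerate userspace mappings of the fd? */
	bool (*may_map)(struct memfile_notifier *notifier);
};

/* Backing store checks all registered consumers before servicing mmap(). */
bool memfile_notifier_may_map(struct memfile_notifier_list *list)
{
	struct memfile_notifier *notifier;
	bool ret = true;
	int id;

	id = srcu_read_lock(&srcu);
	list_for_each_entry_srcu(notifier, &list->head, list,
				 srcu_read_lock_held(&srcu)) {
		if (!notifier->ops->may_map || !notifier->ops->may_map(notifier))
			ret = false;
	}
	srcu_read_unlock(&srcu, id);
	return ret;
}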

> > Hmm.  Sean and Chao, are we making a bit of a mistake by making these fds
> > technology-agnostic?  That is, would we want to distinguish between a TDX
> > backing fd, a SEV backing fd, a software-based backing fd, etc?  API-wise
> > this could work by requiring the fd to be bound to a KVM VM instance and
> > possibly even configured a bit before any other operations would be
> > allowed.

I really don't want to distinguish between each exact feature, but I've
no objection to adding flags/callbacks to track specific properties of the
downstream consumers, e.g. "can this memory be accessed by userspace" is a fine
abstraction.  It also scales to multiple consumers (see above).



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-05 17:51                               ` Andy Lutomirski
@ 2022-04-05 18:30                                 ` Sean Christopherson
  2022-04-06 18:42                                   ` Andy Lutomirski
  2022-04-06 13:05                                 ` Quentin Perret
  1 sibling, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-04-05 18:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Quentin Perret, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Tue, Apr 05, 2022, Andy Lutomirski wrote:
> On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote:
> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> >> The best I can come up with is a special type of shared page that is not
> >> GUP-able and maybe not even mmappable, having a clear option for
> >> transitions to fail, and generally preventing the nasty cases from
> >> happening in the first place.
> >
> > Right, that sounds reasonable to me.
> 
> At least as a v1, this is probably more straightforward than allowing mmap().
> Also, there's much to be said for a simpler, limited API, to be expanded if
> genuinely needed, as opposed to starting out with a very featureful API.

Regarding "genuinely needed", IMO the same applies to supporting this at all.
Without numbers from something at least approximating a real use case, we're just
speculating on which will be the most performant approach.

> >> Maybe there could be a special mode for the private memory fds in which
> >> specific pages are marked as "managed by this fd but actually shared".
> >> pread() and pwrite() would work on those pages, but not mmap().  (Or maybe
> >> mmap() but the resulting mappings would not permit GUP.)  And
> >> transitioning them would be a special operation on the fd that is specific
> >> to pKVM and wouldn't work on TDX or SEV.
> >
> > Aha, didn't think of pread()/pwrite(). Very interesting.
> 
> There are plenty of use cases for which pread()/pwrite()/splice() will be as
> fast or even much faster than mmap()+memcpy().

...

> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes PTE.  TLBI.
> *** hypervisor -> guest ***
> 
> Obviously considerable cleverness is needed to make a virt IOMMU like this
> work well, but still.
> 
> Anyway, my suggestion is that the fd backing proposal get slightly modified
> to get it ready for multiple subtypes of backing object, which should be a
> pretty minimal change.  Then, if someone actually needs any of this
> cleverness, it can be added later.  In the mean time, the
> pread()/pwrite()/splice() scheme is pretty good.

Tangentially related to getting private-fd ready for multiple things, what about
implementing the pread()/pwrite()/splice() scheme in pKVM itself?  I.e. read() on
the VM fd, with the offset corresponding to gfn in some way.

Ditto for mmap() on the VM fd, though that would require additional changes outside
of pKVM.

That would allow pKVM to support in-place conversions without the private-fd having
to differentiate between types of protected VM, and without having to provide
new APIs from the private-fd.  TDX, SNP, etc... Just Work by not supporting the pKVM
APIs.

And assuming we get multiple consumers down the road, pKVM will need to be able to
communicate the "true" state of a page to other consumers, because in addition to
being a consumer, pKVM is also an owner/enforcer analogous to the TDX Module and
the SEV PSP.



* Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
  2022-03-10 14:09 ` [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd Chao Peng
  2022-03-29 19:23   ` Sean Christopherson
@ 2022-04-05 23:45   ` Michael Roth
  2022-04-08  3:06     ` Sean Christopherson
  2022-04-19 22:43   ` Vishal Annapurve
  2 siblings, 1 reply; 118+ messages in thread
From: Michael Roth @ 2022-04-05 23:45 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022 at 10:09:09PM +0800, Chao Peng wrote:
> KVM gets notified when memory pages changed in the memory backing store.
> When userspace allocates the memory with fallocate() or frees memory
> with fallocate(FALLOC_FL_PUNCH_HOLE), memory backing store calls into
> KVM fallocate/invalidate callbacks respectively. To ensure KVM never
> maps both the private and shared variants of a GPA into the guest, in
> the fallocate callback, we should zap the existing shared mapping and
> in the invalidate callback we should zap the existing private mapping.
> 
> In the callbacks, KVM firstly converts the offset range into the
> gfn_range and then calls existing kvm_unmap_gfn_range() which will zap
> the shared or private mapping. Both callbacks pass in a memslot
> reference but we need 'kvm' so add a reference in memslot structure.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  3 ++-
>  virt/kvm/kvm_main.c      | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9b175aeca63f..186b9b981a65 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>  
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
>  struct kvm_gfn_range {
>  	struct kvm_memory_slot *slot;
>  	gfn_t start;
> @@ -568,6 +568,7 @@ struct kvm_memory_slot {
>  	loff_t private_offset;
>  	struct memfile_pfn_ops *pfn_ops;
>  	struct memfile_notifier notifier;
> +	struct kvm *kvm;
>  };
>  
>  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 67349421eae3..52319f49d58a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>  
>  #ifdef CONFIG_MEMFILE_NOTIFIER
> +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> +					 pgoff_t start, pgoff_t end)
> +{
> +	int idx;
> +	struct kvm_memory_slot *slot = container_of(notifier,
> +						    struct kvm_memory_slot,
> +						    notifier);
> +	struct kvm_gfn_range gfn_range = {
> +		.slot		= slot,
> +		.start		= start - (slot->private_offset >> PAGE_SHIFT),
> +		.end		= end - (slot->private_offset >> PAGE_SHIFT),
> +		.may_block 	= true,
> +	};
> +	struct kvm *kvm = slot->kvm;
> +
> +	gfn_range.start = max(gfn_range.start, slot->base_gfn);
> +	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> +
> +	if (gfn_range.start >= gfn_range.end)
> +		return;
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	KVM_MMU_LOCK(kvm);
> +	kvm_unmap_gfn_range(kvm, &gfn_range);
> +	kvm_flush_remote_tlbs(kvm);
> +	KVM_MMU_UNLOCK(kvm);
> +	srcu_read_unlock(&kvm->srcu, idx);

Should this also invalidate gfn_to_pfn_cache mappings? Otherwise it seems
possible the kernel might end up inadvertently writing to now-private guest
memory via a now-stale gfn_to_pfn_cache entry.



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-05 18:03                               ` Sean Christopherson
@ 2022-04-06 10:34                                 ` Quentin Perret
  2022-04-22 10:56                                 ` Chao Peng
  1 sibling, 0 replies; 118+ messages in thread
From: Quentin Perret @ 2022-04-06 10:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Tuesday 05 Apr 2022 at 18:03:21 (+0000), Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Quentin Perret wrote:
> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > > >>  - it can be very useful for protected VMs to do shared=>private
> > > >>    conversions. Think of a VM receiving some data from the host in a
> > > >>    shared buffer, and then it wants to operate on that buffer without
> > > >>    risking leaking confidential information in a transient state. In
> > > >>    that case the most logical thing to do is to convert the buffer back
> > > >>    to private, do whatever needs to be done on that buffer (decrypting a
> > > >>    frame, ...), and then share it back with the host to consume it;
> > > >
> > > > If performance is a motivation, why would the guest want to do two
> > > > conversions instead of just doing internal memcpy() to/from a private
> > > > page?  I would be quite surprised if multiple exits and TLB shootdowns are
> > > > actually faster, especially at any kind of scale where zapping stage-2
> > > > PTEs will cause lock contention and IPIs.
> > > 
> > > I don't know the numbers or all the details, but this is arm64, which is a
> > > rather better architecture than x86 in this regard.  So maybe it's not so
> > > bad, at least in very simple cases, ignoring all implementation details.
> > > (But see below.)  Also the systems in question tend to have fewer CPUs than
> > > some of the massive x86 systems out there.
> > 
> > Yep. I can try and do some measurements if that's really necessary, but
> > I'm really convinced the cost of the TLBI for the shared->private
> > conversion is going to be significantly smaller than the cost of memcpy
> > the buffer twice in the guest for us.
> 
> It's not just the TLB shootdown, the VM-Exits aren't free.

Ack, but we can at least work on the rest (number of exits, locking, ...).
The costs of the memcpy and the TLBI are really incompressible.

> And barring non-trivial
> improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page tables will
> block all other updates and MMU operations.  Taking mmu_lock for read, should arm64
> ever convert to a rwlock, is not an option because KVM needs to block other
> conversions to avoid races.

FWIW the host mmu_lock isn't all that useful for pKVM. The host doesn't
have _any_ control over guest page-tables, and the hypervisor can't
safely rely on the host for locking, so we have hypervisor-level
synchronization.

> Hmm, though batching multiple pages into a single request would mitigate most of
> the overhead.

Yep, there are a few tricks we can play to make this fairly efficient in
the most common cases. And fine-grain locking at EL2 is really high up
on the todo list :-)

Thanks,
Quentin



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-05 17:51                               ` Andy Lutomirski
  2022-04-05 18:30                                 ` Sean Christopherson
@ 2022-04-06 13:05                                 ` Quentin Perret
  1 sibling, 0 replies; 118+ messages in thread
From: Quentin Perret @ 2022-04-06 13:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Tuesday 05 Apr 2022 at 10:51:36 (-0700), Andy Lutomirski wrote:
> Let's try actually counting syscalls and mode transitions, at least approximately.  For non-direct IO (DMA allocation on guest side, not straight to/from pagecache or similar):
> 
> Guest writes to shared DMA buffer.  Assume the guest is smart and reuses the buffer.
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> host does pread() to read the DMA buffer or reads mmapped buffer
> host does the IO
> resume guest
> *** host -> hypervisor -> guest ***
> 
> This is essentially optimal in terms of transitions.  The data is copied on the guest side (which may well be mandatory depending on what guest userspace did to initiate the IO) and on the host (which may well be mandatory depending on what the host is doing with the data).
> 
> Now let's try straight-from-guest-pagecache or otherwise zero-copy on the guest side.  Without nondestructive changes, the guest needs a bounce buffer and it looks just like the above.  One extra copy, zero extra mode transitions.  With nondestructive changes, it's a bit more like physical hardware with an IOMMU:
> 
> Guest shares the page.
> *** guest -> hypervisor ***
> Hypervisor adds a PTE.  Let's assume we're being very optimal and the host is not synchronously notified.
> *** hypervisor -> guest ***
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> 
> mmap  *** syscall ***
> host does the IO
> munmap *** syscall, TLBI ***
> 
> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes PTE.  TLBI.
> *** hypervisor -> guest ***
> 
> This is quite expensive.  For small IO, pread() or splice() in the host may be a lot faster.  Even for large IO, splice() may still win.

Right, that would work nicely for pages that are shared transiently, but
less so for long-term shares. But I guess your proposal below should do
the trick.

> I can imagine clever improvements.  First, let's get rid of mmap() + munmap().  Instead use a special device mapping with special semantics, not regular memory.  (mmap and munmap are expensive even ignoring any arch and TLB stuff.)  The rule is that, if the page is shared, access works, and if private, access doesn't, but it's still mapped.  The hypervisor and the host cooperate to make it so.

As long as the page can't be GUP'd I _think_ this shouldn't be a
problem. We can have the hypervisor re-inject the fault in the host. And
the host fault handler will deal with it just fine if the fault was
taken from userspace (inject a SEGV), or from the kernel through uaccess
macros. But we do get into issues if the host kernel can be tricked into
accessing the page via e.g. kmap(). I've been able to trigger this by
strace-ing a userspace process which passes a pointer to private memory
to a syscall. strace will inspect the syscall argument using
process_vm_readv(), which will pin_user_pages_remote() and access the
page via kmap(), and then we're in trouble. But preventing GUP would
prevent this by construction I think?

FWIW memfd_secret() did look like a good solution to this, but it lacks
the bidirectional notifiers with KVM that this patch series offers, which
are needed to allow KVM to handle guest faults; the series also offers a
good framework to support future extensions (e.g. hypervisor-assisted
page migration, swap, ...). So yes, ideally pKVM would use a kind of
hybrid between memfd_secret and the private fd proposed here, or
something else providing similar properties.

Thanks,
Quentin



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-05 18:30                                 ` Sean Christopherson
@ 2022-04-06 18:42                                   ` Andy Lutomirski
  0 siblings, 0 replies; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-06 18:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Quentin Perret, Steven Price, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon



On Tue, Apr 5, 2022, at 11:30 AM, Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Andy Lutomirski wrote:

>
>> resume guest
>> *** host -> hypervisor -> guest ***
>> Guest unshares the page.
>> *** guest -> hypervisor ***
>> Hypervisor removes PTE.  TLBI.
>> *** hypervisor -> guest ***
>> 
>> Obviously considerable cleverness is needed to make a virt IOMMU like this
>> work well, but still.
>> 
>> Anyway, my suggestion is that the fd backing proposal get slightly modified
>> to get it ready for multiple subtypes of backing object, which should be a
>> pretty minimal change.  Then, if someone actually needs any of this
>> cleverness, it can be added later.  In the mean time, the
>> pread()/pwrite()/splice() scheme is pretty good.
>
> Tangentially related to getting private-fd ready for multiple things, 
> what about
> implementing the pread()/pwrite()/splice() scheme in pKVM itself?  I.e. 
> read() on
> the VM fd, with the offset corresponding to gfn in some way.
>

Hmm, could make sense.



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-03-10 14:09 ` [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK Chao Peng
@ 2022-04-07 16:05   ` Sean Christopherson
  2022-04-07 17:09     ` Andy Lutomirski
                       ` (2 more replies)
  0 siblings, 3 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-04-07 16:05 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022, Chao Peng wrote:
> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> memory behave like longterm pinned pages and thus should be accounted to
> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  mm/shmem.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7b43e274c9a2..ae46fb96494b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>  				   pgoff_t start, pgoff_t end)
>  {
> -#ifdef CONFIG_MEMFILE_NOTIFIER
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  
> +#ifdef CONFIG_MEMFILE_NOTIFIER
>  	start = max(start, folio->index);
>  	end = min(end, folio->index + folio_nr_pages(folio));
>  
>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>  #endif
> +
> +	if (info->xflags & SHM_F_INACCESSIBLE)
> +		atomic64_sub(end - start, &current->mm->pinned_vm);

As Vishal's to-be-posted selftest discovered, this is broken as current->mm may
be NULL.  Or it may be a completely different mm, e.g. AFAICT there's nothing that
prevents a different process from punching a hole in the shmem backing.

I don't see a sane way of tracking this in the backing store unless the inode is
associated with a single mm when it's created, and that opens up a giant can of
worms, e.g. what happens with the accounting if the creating process goes away?

I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
out sooner rather than later.  KVM isn't planning on supporting migrate/swap for TDX or SNP,
but supporting at least migrate for a software-only implementation a la pKVM should
be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
exceptions and/or data corruption (pre-SNP SEV guests) in the guest.

Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
maybe it's just the page migration path that needs to be updated?
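E.g. something along these lines (an untested sketch; whether to hard-reject or to notify consumers instead is exactly the open question):

/* Replace the direct .migratepage = migrate_page in shmem_aops with a
 * wrapper that refuses to migrate inaccessible pages for now. */
static int shmem_migrate_page(struct address_space *mapping,
			      struct page *newpage, struct page *page,
			      enum migrate_mode mode)
{
	struct shmem_inode_info *info = SHMEM_I(mapping->host);

	if (info->xflags & SHM_F_INACCESSIBLE)
		return -EOPNOTSUPP;

	return migrate_page(mapping, newpage, page, mode);
}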



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-07 16:05   ` Sean Christopherson
@ 2022-04-07 17:09     ` Andy Lutomirski
  2022-04-08 17:56       ` Sean Christopherson
  2022-04-08 13:02     ` Chao Peng
  2022-04-11 15:32     ` Kirill A. Shutemov
  2 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-07 17:09 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	Linux API, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A. Shutemov, Nakajima, Jun,
	Dave Hansen, Andi Kleen, David Hildenbrand



On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>> memory behave like longterm pinned pages and thus should be accounted to
>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>> 
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> ---
>>  mm/shmem.c | 25 ++++++++++++++++++++++++-
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 7b43e274c9a2..ae46fb96494b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>>  				   pgoff_t start, pgoff_t end)
>>  {
>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>  	struct shmem_inode_info *info = SHMEM_I(inode);
>>  
>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>  	start = max(start, folio->index);
>>  	end = min(end, folio->index + folio_nr_pages(folio));
>>  
>>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>  #endif
>> +
>> +	if (info->xflags & SHM_F_INACCESSIBLE)
>> +		atomic64_sub(end - start, &current->mm->pinned_vm);
>
> As Vishal's to-be-posted selftest discovered, this is broken as 
> current->mm may
> be NULL.  Or it may be a completely different mm, e.g. AFAICT there's 
> nothing that
> prevents a different process from punching hole in the shmem backing.
>

How about just not charging the mm in the first place?  There’s precedent: ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current status).

In any case, for an administrator to try to assemble the various rlimits into a coherent policy is, and always has been, quite messy. ISTM cgroup limits, which can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren’t in a filesystem and are thus available by default, I’m not convinced that this accounting is useful or necessary.

Maybe we could just have some switch that is required to enable creation of private memory in the first place, and anyone who flips that switch without configuring cgroups is subject to DoS.



* Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
  2022-04-05 23:45   ` Michael Roth
@ 2022-04-08  3:06     ` Sean Christopherson
  0 siblings, 0 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-04-08  3:06 UTC (permalink / raw)
  To: Michael Roth
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Tue, Apr 05, 2022, Michael Roth wrote:
> On Thu, Mar 10, 2022 at 10:09:09PM +0800, Chao Peng wrote:
> >  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 67349421eae3..52319f49d58a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >  
> >  #ifdef CONFIG_MEMFILE_NOTIFIER
> > +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> > +					 pgoff_t start, pgoff_t end)
> > +{
> > +	int idx;
> > +	struct kvm_memory_slot *slot = container_of(notifier,
> > +						    struct kvm_memory_slot,
> > +						    notifier);
> > +	struct kvm_gfn_range gfn_range = {
> > +		.slot		= slot,
> > +		.start		= start - (slot->private_offset >> PAGE_SHIFT),
> > +		.end		= end - (slot->private_offset >> PAGE_SHIFT),
> > +		.may_block 	= true,
> > +	};
> > +	struct kvm *kvm = slot->kvm;
> > +
> > +	gfn_range.start = max(gfn_range.start, slot->base_gfn);
> > +	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> > +
> > +	if (gfn_range.start >= gfn_range.end)
> > +		return;
> > +
> > +	idx = srcu_read_lock(&kvm->srcu);
> > +	KVM_MMU_LOCK(kvm);
> > +	kvm_unmap_gfn_range(kvm, &gfn_range);
> > +	kvm_flush_remote_tlbs(kvm);
> > +	KVM_MMU_UNLOCK(kvm);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> 
> Should this also invalidate gfn_to_pfn_cache mappings? Otherwise it seems
> possible the kernel might end up inadvertantly writing to now-private guest
> memory via a now-stale gfn_to_pfn_cache entry.

Yes.  Ideally we'd get these flows to share common code and avoid these goofs.
I tried very briefly but they're just different enough to make it ugly.



* Re: [PATCH v5 02/13] mm: Introduce memfile_notifier
  2022-03-29 18:45   ` Sean Christopherson
@ 2022-04-08 12:54     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-08 12:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Tue, Mar 29, 2022 at 06:45:16PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 70d4309c9ce3..f628256dce0d 100644
> > +void memfile_notifier_invalidate(struct memfile_notifier_list *list,
> > +				 pgoff_t start, pgoff_t end)
> > +{
> > +	struct memfile_notifier *notifier;
> > +	int id;
> > +
> > +	id = srcu_read_lock(&srcu);
> > +	list_for_each_entry_srcu(notifier, &list->head, list,
> > +				 srcu_read_lock_held(&srcu)) {
> > +		if (notifier->ops && notifier->ops->invalidate)
> 
> Any reason notifier->ops isn't mandatory?

Yes it's mandatory, will skip the check here.

> 
> > +			notifier->ops->invalidate(notifier, start, end);
> > +	}
> > +	srcu_read_unlock(&srcu, id);
> > +}
> > +
> > +void memfile_notifier_fallocate(struct memfile_notifier_list *list,
> > +				pgoff_t start, pgoff_t end)
> > +{
> > +	struct memfile_notifier *notifier;
> > +	int id;
> > +
> > +	id = srcu_read_lock(&srcu);
> > +	list_for_each_entry_srcu(notifier, &list->head, list,
> > +				 srcu_read_lock_held(&srcu)) {
> > +		if (notifier->ops && notifier->ops->fallocate)
> > +			notifier->ops->fallocate(notifier, start, end);
> > +	}
> > +	srcu_read_unlock(&srcu, id);
> > +}
> > +
> > +void memfile_register_backing_store(struct memfile_backing_store *bs)
> > +{
> > +	BUG_ON(!bs || !bs->get_notifier_list);
> > +
> > +	list_add_tail(&bs->list, &backing_store_list);
> > +}
> > +
> > +void memfile_unregister_backing_store(struct memfile_backing_store *bs)
> > +{
> > +	list_del(&bs->list);
> 
> Allowing unregistration of a backing store is broken.  Using the _safe() variant
> is not sufficient to guard against concurrent modification.  I don't see any reason
> to support this out of the gate, the only reason to support unregistering a backing
> store is if the backing store is implemented as a module, and AFAIK none of the
> backing stores we plan on supporting initially support being built as a module.
> These aren't exported, so it's not like that's even possible.  Registration would
> also be broken if modules are allowed, I'm pretty sure module init doesn't run
> under a global lock.
> 
> We can always add this complexity if it's needed in the future, but for now the
> easiest thing would be to tag memfile_register_backing_store() with __init and
> make backing_store_list __ro_after_init.

The only currently supported backing store (shmem) does not need this, so
I can remove it for now.
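Concretely, that reduces to something like the below (untested), following
your __init/__ro_after_init suggestion:

static struct list_head backing_store_list __ro_after_init =
	LIST_HEAD_INIT(backing_store_list);

void __init memfile_register_backing_store(struct memfile_backing_store *bs)
{
	BUG_ON(!bs || !bs->get_notifier_list);

	list_add_tail(&bs->list, &backing_store_list);
}

/* memfile_unregister_backing_store() goes away entirely. */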

> 
> > +}
> > +
> > +static int memfile_get_notifier_info(struct inode *inode,
> > +				     struct memfile_notifier_list **list,
> > +				     struct memfile_pfn_ops **ops)
> > +{
> > +	struct memfile_backing_store *bs, *iter;
> > +	struct memfile_notifier_list *tmp;
> > +
> > +	list_for_each_entry_safe(bs, iter, &backing_store_list, list) {
> > +		tmp = bs->get_notifier_list(inode);
> > +		if (tmp) {
> > +			*list = tmp;
> > +			if (ops)
> > +				*ops = &bs->pfn_ops;
> > +			return 0;
> > +		}
> > +	}
> > +	return -EOPNOTSUPP;
> > +}
> > +
> > +int memfile_register_notifier(struct inode *inode,
> 
> Taking an inode is a bit odd from a user perspective.  Any reason not to take a
> "struct file *" and get the inode here?  That would give callers a hint that they
> need to hold a reference to the file for the lifetime of the registration.

Yes, I can change.

> 
> > +			      struct memfile_notifier *notifier,
> > +			      struct memfile_pfn_ops **pfn_ops)
> > +{
> > +	struct memfile_notifier_list *list;
> > +	int ret;
> > +
> > +	if (!inode || !notifier | !pfn_ops)
> 
> Bitwise | instead of logical ||.  But IMO taking in a pfn_ops pointer is silly.
> More below.
> 
> > +		return -EINVAL;
> > +
> > +	ret = memfile_get_notifier_info(inode, &list, pfn_ops);
> > +	if (ret)
> > +		return ret;
> > +
> > +	spin_lock(&list->lock);
> > +	list_add_rcu(&notifier->list, &list->head);
> > +	spin_unlock(&list->lock);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(memfile_register_notifier);
> > +
> > +void memfile_unregister_notifier(struct inode *inode,
> > +				 struct memfile_notifier *notifier)
> > +{
> > +	struct memfile_notifier_list *list;
> > +
> > +	if (!inode || !notifier)
> > +		return;
> > +
> > +	BUG_ON(memfile_get_notifier_info(inode, &list, NULL));
> 
> Eww.  Rather than force the caller to provide the inode/file and the notifier,
> what about grabbing the backing store itself in the notifier?
> 
> 	struct memfile_notifier {
> 		struct list_head list;
> 		struct memfile_notifier_ops *ops;
> 
> 		struct memfile_backing_store *bs;
> 	};
> 
> That also helps avoid confusing between "ops" and "pfn_ops".  IMO, exposing
> memfile_backing_store to the caller isn't a big deal, and is preferable to having
> to rewalk multiple lists just to delete a notifier.

Agreed, good suggestion.

> 
> Then this can become:
> 
>   void memfile_unregister_notifier(struct memfile_notifier *notifier)
>   {
> 	spin_lock(&notifier->bs->list->lock);
> 	list_del_rcu(&notifier->list);
> 	spin_unlock(&notifier->bs->list->lock);
> 
> 	synchronize_srcu(&srcu);
>   }
> 
> and registration can be:
> 
>   int memfile_register_notifier(const struct file *file,
> 			      struct memfile_notifier *notifier)
>   {
> 	struct memfile_notifier_list *list;
> 	struct memfile_backing_store *bs;
> 	int ret;
> 
> 	if (!file || !notifier)
> 		return -EINVAL;
> 
> 	list_for_each_entry(bs, &backing_store_list, list) {
> 		list = bs->get_notifier_list(file_inode(file));
> 		if (list) {
> 			notifier->bs = bs;
> 
> 			spin_lock(&list->lock);
> 			list_add_rcu(&notifier->list, &list->head);
> 			spin_unlock(&list->lock);
> 			return 0;
> 		}
> 	}
> 
> 	return -EOPNOTSUPP;
>   }



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-07 16:05   ` Sean Christopherson
  2022-04-07 17:09     ` Andy Lutomirski
@ 2022-04-08 13:02     ` Chao Peng
  2022-04-11 15:34       ` Kirill A. Shutemov
  2022-04-11 15:32     ` Kirill A. Shutemov
  2 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-04-08 13:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> > memory behave like longterm pinned pages and thus should be accounted to
> > mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> > 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  mm/shmem.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 7b43e274c9a2..ae46fb96494b 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> >  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
> >  				   pgoff_t start, pgoff_t end)
> >  {
> > -#ifdef CONFIG_MEMFILE_NOTIFIER
> >  	struct shmem_inode_info *info = SHMEM_I(inode);
> >  
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> >  	start = max(start, folio->index);
> >  	end = min(end, folio->index + folio_nr_pages(folio));
> >  
> >  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> >  #endif
> > +
> > +	if (info->xflags & SHM_F_INACCESSIBLE)
> > +		atomic64_sub(end - start, &current->mm->pinned_vm);
> 
> As Vishal's to-be-posted selftest discovered, this is broken as current->mm may
> be NULL.  Or it may be a completely different mm, e.g. AFAICT there's nothing that
> prevents a different process from punching hole in the shmem backing.
> 
> I don't see a sane way of tracking this in the backing store unless the inode is
> associated with a single mm when it's created, and that opens up a giant can of
> worms, e.g. what happens with the accounting if the creating process goes away?

Yes, I realized this.

> 
> I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
> and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
> consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
> wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
> out sooner than later.  KVM isn't planning on support migrate/swap for TDX or SNP,
> but supporting at least migrate for a software-only implementation a la pKVM should
> be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
> gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
> exceptions and/or data corruption (pre-SNP SEV guests) in the guest.

SHM_LOCK sounds like a good match.

Thanks,
Chao
> 
> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> maybe it's just the page migration path that needs to be updated?



* Re: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory
  2022-03-28 21:27   ` Sean Christopherson
@ 2022-04-08 13:21     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-08 13:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Mon, Mar 28, 2022 at 09:27:32PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Extend the memslot definition to provide fd-based private memory support
> > by adding two new fields (private_fd/private_offset). The memslot then
> > can maintain memory for both shared pages and private pages in a single
> > memslot. Shared pages are provided by existing userspace_addr(hva) field
> > and private pages are provided through the new private_fd/private_offset
> > fields.
> > 
> > Since there is no 'hva' concept anymore for private memory so we cannot
> > rely on get_user_pages() to get a pfn, instead we use the newly added
> > memfile_notifier to complete the same job.
> > 
> > This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> 
> Needs a Co-developed-by: for Yu, or a From: if Yu is the sole author.

Yes, a Co-developed-by for Yu is needed for all the patches throughout the series.

> 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
> >  include/linux/kvm_host.h       |  7 +++++++
> >  include/uapi/linux/kvm.h       |  8 ++++++++
> >  3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 3acbf4d263a5..f76ac598606c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >  
> >  ::
> > @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 padding[5];
> 
> Uber nit, I'd prefer we pad u32 for private_fd separate from padding the size of
> the structure for future expansion.
> 
> Regarding future expansion, any reason not to go crazy and pad like 128+ bytes?
> It'd be rather embarassing if the next memslot extension needs 3 u64s and we end
> up with region_ext2 :-)

OK, so maybe:
	__u64 private_offset;
	__u32 private_fd;
	__u32 pad1;
	__u32 pad2[28];
> 
> > +};
> > +
> >    /* for kvm_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_PRIVATE		(1UL << 2)
> >  
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> 
> ...
> 
> > +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> 
> I 100% think we should usurp the name "private" for these memslots, but as prep
> work this series should first rename KVM_PRIVATE_MEM_SLOTS to avoid confusion.
> Maybe KVM_INTERNAL_MEM_SLOTS?

Oh, I didn't realize 'PRIVATE' is already taken.  KVM_INTERNAL_MEM_SLOTS
sounds good.

Thanks,
Chao



* Re: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory
  2022-03-28 21:56   ` Sean Christopherson
@ 2022-04-08 13:46     ` Chao Peng
  2022-04-08 17:45       ` Sean Christopherson
  0 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-04-08 13:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Mon, Mar 28, 2022 at 09:56:33PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Extend the memslot definition to provide fd-based private memory support
> > by adding two new fields (private_fd/private_offset). The memslot then
> > can maintain memory for both shared pages and private pages in a single
> > memslot. Shared pages are provided by existing userspace_addr(hva) field
> > and private pages are provided through the new private_fd/private_offset
> > fields.
> > 
> > Since there is no 'hva' concept anymore for private memory so we cannot
> > rely on get_user_pages() to get a pfn, instead we use the newly added
> > memfile_notifier to complete the same job.
> > 
> > This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
> >  include/linux/kvm_host.h       |  7 +++++++
> >  include/uapi/linux/kvm.h       |  8 ++++++++
> >  3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 3acbf4d263a5..f76ac598606c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >  
> >  ::
> > @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> 
> Peeking ahead, the partial switch to the _ext variant is rather gross.  I would
> prefer that KVM use an entirely different, but binary compatible, struct internally.
> And once the kernel supports C11[*], I'm pretty sure we can make the "region" in
> _ext an anonymous struct, and make KVM's internal struct a #define of _ext.  That
> should minimize the churn (no need to get the embedded "region" field), reduce
> line lengths, and avoid confusion due to some flows taking the _ext but others
> dealing with only the "base" struct.

Will try that.

> 
> Maybe kvm_user_memory_region or kvm_user_mem_region?  Though it's tempting to be
> evil and usurp the old kvm_memory_region :-)
> 
> E.g. pre-C11 do
> 
> struct kvm_userspace_memory_region_ext {
> 	struct kvm_userspace_memory_region region;
> 	__u64 private_offset;
> 	__u32 private_fd;
> 	__u32 padding[5];
> };
> 
> #ifdef __KERNEL__
> struct kvm_user_mem_region {
> 	__u32 slot;
> 	__u32 flags;
> 	__u64 guest_phys_addr;
> 	__u64 memory_size; /* bytes */
> 	__u64 userspace_addr; /* start of the userspace allocated memory */
> 	__u64 private_offset;
> 	__u32 private_fd;
> 	__u32 padding[5];
> };
> #endif
> 
> and then post-C11 do
> 
> struct kvm_userspace_memory_region_ext {
> #ifdef __KERNEL__

Should this be #ifndef? I think the anonymous struct is only meant for the kernel side?

Thanks,
Chao

> 	struct kvm_userspace_memory_region region;
> #else
> 	struct kvm_userspace_memory_region;
> #endif
> 	__u64 private_offset;
> 	__u32 private_fd;
> 	__u32 padding[5];
> };
> 
> #ifdef __KERNEL__
> #define kvm_user_mem_region kvm_userspace_memory_region_ext
> #endif
> 
> [*] https://lore.kernel.org/all/20220301145233.3689119-1-arnd@kernel.org
> 
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 padding[5];
> > +};
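
For illustration, here is a minimal standalone sketch of the layout argument above (stand-in type and field names, not the series' uAPI): a named "region" member and a C11 anonymous struct with the same fields produce identical sizes and offsets, which is what makes the kernel-internal #define alias safe.

#include <stdint.h>
#include <stddef.h>
#include <assert.h>
#include <stdio.h>

/* Stand-in for the base uAPI struct discussed above. */
struct region_base {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
};

/* Pre-C11 style: callers go through the named "region" member. */
struct region_ext_named {
	struct region_base region;
	uint64_t private_offset;
	uint32_t private_fd;
	uint32_t padding[5];
};

/* C11 style: same fields, but anonymous, so ext.flags works directly. */
struct region_ext_anon {
	struct {
		uint32_t slot;
		uint32_t flags;
		uint64_t guest_phys_addr;
		uint64_t memory_size;
		uint64_t userspace_addr;
	};
	uint64_t private_offset;
	uint32_t private_fd;
	uint32_t padding[5];
};

int main(void)
{
	/* Binary compatible: every field sits at the same offset. */
	assert(sizeof(struct region_ext_named) == sizeof(struct region_ext_anon));
	assert(offsetof(struct region_base, flags) ==
	       offsetof(struct region_ext_anon, flags));
	assert(offsetof(struct region_ext_named, private_fd) ==
	       offsetof(struct region_ext_anon, private_fd));
	printf("layouts match\n");
	return 0;
}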


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext
  2022-03-28 22:26   ` Sean Christopherson
@ 2022-04-08 13:58     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-08 13:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Mon, Mar 28, 2022 at 10:26:55PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > @@ -4476,14 +4477,23 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		break;
> >  	}
> >  	case KVM_SET_USER_MEMORY_REGION: {
> > -		struct kvm_userspace_memory_region kvm_userspace_mem;
> > +		struct kvm_userspace_memory_region_ext region_ext;
> 
> It's probably a good idea to zero initialize the full region to avoid consuming
> garbage stack data if there's a bug and an _ext field is accessed without first
> checking KVM_MEM_PRIVATE.  I'm usually opposed to unnecessary initialization, but
> this seems like something we could screw up quite easily.
> 
> >  		r = -EFAULT;
> > -		if (copy_from_user(&kvm_userspace_mem, argp,
> > -						sizeof(kvm_userspace_mem)))
> > +		if (copy_from_user(&region_ext, argp,
> > +				sizeof(struct kvm_userspace_memory_region)))
> >  			goto out;
> > +		if (region_ext.region.flags & KVM_MEM_PRIVATE) {
> > +			int offset = offsetof(
> > +				struct kvm_userspace_memory_region_ext,
> > +				private_offset);
> > +			if (copy_from_user(&region_ext.private_offset,
> > +					   argp + offset,
> > +					   sizeof(region_ext) - offset))
> 
> In this patch, KVM_MEM_PRIVATE should result in an -EINVAL as it's not yet
> supported.  Copying the _ext on KVM_MEM_PRIVATE belongs in the "Expose KVM_MEM_PRIVATE"
> patch.

Agreed.

> 
> Mechanically, what about first reading the flags via get_user(), and then doing a single
> copy_from_user()?  It's technically more work in the common case, and requires an
> extra check to guard against TOCTOU attacks, but this isn't a fast path by any means
> and IMO the end result makes it easier to understand the relationship between
> KVM_MEM_PRIVATE and the two different structs.

Will use this code, thanks for typing it out.

Chao
> 
> E.g.
> 
> 	case KVM_SET_USER_MEMORY_REGION: {
> 		struct kvm_user_mem_region region;
> 		unsigned long size;
> 		u32 flags;
> 
> 		memset(&region, 0, sizeof(region));
> 
> 		r = -EFAULT;
> 		if (get_user(flags, (u32 __user *)(argp + offsetof(typeof(region), flags))))
> 			goto out;
> 
> 		if (flags & KVM_MEM_PRIVATE)
> 			size = sizeof(struct kvm_userspace_memory_region_ext);
> 		else
> 			size = sizeof(struct kvm_userspace_memory_region);
> 		if (copy_from_user(&region, argp, size))
> 			goto out;
> 
> 		r = -EINVAL;
> 		if ((flags ^ region.flags) & KVM_MEM_PRIVATE)
> 			goto out;
> 
> 		r = kvm_vm_ioctl_set_memory_region(kvm, &region);
> 		break;
> 	}
> 
> > +				goto out;
> > +		}
> >  
> > -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > +		r = kvm_vm_ioctl_set_memory_region(kvm, &region_ext);
> >  		break;
> >  	}
> >  	case KVM_GET_DIRTY_LOG: {
> > -- 
> > 2.17.1
> > 
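
For context, a hedged sketch of the userspace side of the call being discussed: the _ext layout, KVM_MEM_PRIVATE and the private_fd/private_offset fields are proposals from this series rather than released uAPI, so they are declared locally here with placeholder values; only the base struct and KVM_SET_USER_MEMORY_REGION come from <linux/kvm.h>.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* struct kvm_userspace_memory_region, KVM_SET_USER_MEMORY_REGION */

/* Proposed by this series; the value here is a placeholder, not released uAPI. */
#ifndef KVM_MEM_PRIVATE
#define KVM_MEM_PRIVATE	(1UL << 2)
#endif

struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region;
	uint64_t private_offset;
	uint32_t private_fd;
	uint32_t padding[5];
};

/*
 * Register a memslot whose shared half is 'hva' and whose private half is
 * backed by 'private_fd' at 'private_offset'.  KVM only looks at the
 * extension fields when KVM_MEM_PRIVATE is set, which is why the kernel
 * side above peeks at 'flags' before deciding how much to copy in.
 */
int set_private_memslot(int vm_fd, uint32_t slot, uint64_t gpa,
			uint64_t size, void *hva,
			int private_fd, uint64_t private_offset)
{
	struct kvm_userspace_memory_region_ext ext;

	memset(&ext, 0, sizeof(ext));
	ext.region.slot = slot;
	ext.region.flags = KVM_MEM_PRIVATE;
	ext.region.guest_phys_addr = gpa;
	ext.region.memory_size = size;
	ext.region.userspace_addr = (uint64_t)(uintptr_t)hva;
	ext.private_fd = private_fd;
	ext.private_offset = private_offset;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}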


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit
  2022-03-28 22:33   ` Sean Christopherson
@ 2022-04-08 13:59     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-08 13:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Mon, Mar 28, 2022 at 10:33:37PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates that an error happened in KVM at the guest memory range
> > [gpa, gpa+size). The 'flags' field includes additional information for
> > userspace to handle the error. Currently bit 0 is defined as 'private
> > memory', where '1' indicates the error happened due to a private memory
> > access and '0' indicates it happened due to a shared memory access.
> > 
> > After private memory is enabled, KVM will use this new exit to return to
> > userspace for shared memory <-> private memory conversions in memory
> > encryption usage.
> > 
> > In such usage, there are typically two kinds of memory conversion:
> >   - explicit conversion: happens when the guest explicitly calls into KVM
> >     to map a range (as private or shared); KVM then exits to userspace to
> >     do the map/unmap operations.
> >   - implicit conversion: happens in the KVM page fault handler.
> >     * If the fault is due to a private memory access, it causes a
> >       userspace exit with a shared->private conversion request when the
> >       page has not been allocated in the private memory backend.
> >     * If the fault is due to a shared memory access, it causes a
> >       userspace exit with a private->shared conversion request when the
> >       page has already been allocated in the private memory backend.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >  2 files changed, 31 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f76ac598606c..bad550c2212b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6216,6 +6216,28 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >  
> > +::
> > +
> > +		/* KVM_EXIT_MEMORY_ERROR */
> > +		struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > +			__u32 flags;
> > +			__u32 padding;
> > +			__u64 gpa;
> > +			__u64 size;
> > +		} memory;
> > +If exit reason is KVM_EXIT_MEMORY_ERROR then it indicates that the VCPU has
> 
> Doh, I'm pretty sure I suggested KVM_EXIT_MEMORY_ERROR.  Any objection to using
> KVM_EXIT_MEMORY_FAULT instead of KVM_EXIT_MEMORY_ERROR?  "ERROR" makes me think
> of ECC errors, i.e. uncorrected #MC in x86 land, not more generic "faults".  That
> would align nicely with -EFAULT.

Sure.

> 
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
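
As a rough sketch of how a VMM might consume this exit (using the KVM_EXIT_MEMORY_FAULT naming suggested above; the struct layout mirrors the proposed uAPI, and the convert_range_*() helpers are hypothetical stand-ins for the VMM's own conversion logic):

#include <stdint.h>
#include <stdio.h>

/* Placeholder layout mirroring the exit struct proposed above. */
#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1u << 0)

struct memory_fault_exit {
	uint32_t flags;
	uint32_t padding;
	uint64_t gpa;
	uint64_t size;
};

/* Hypothetical VMM helpers; a real VMM would fallocate() or punch a hole
 * in the private fd and flip its own shared/private bookkeeping here. */
static int convert_range_to_private(uint64_t gpa, uint64_t size)
{
	printf("shared -> private: gpa=0x%llx size=0x%llx\n",
	       (unsigned long long)gpa, (unsigned long long)size);
	return 0;
}

static int convert_range_to_shared(uint64_t gpa, uint64_t size)
{
	printf("private -> shared: gpa=0x%llx size=0x%llx\n",
	       (unsigned long long)gpa, (unsigned long long)size);
	return 0;
}

/* Called when KVM_RUN returns with the new exit reason: bit 0 of 'flags'
 * says whether the faulting access was private, which tells the VMM which
 * direction to convert before re-entering the guest. */
static int handle_memory_fault_exit(const struct memory_fault_exit *m)
{
	if (m->flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		return convert_range_to_private(m->gpa, m->size);

	return convert_range_to_shared(m->gpa, m->size);
}

int main(void)
{
	struct memory_fault_exit m = {
		.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE,
		.gpa = 0x100000,
		.size = 0x1000,
	};

	return handle_memory_fault_exit(&m);
}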


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages
  2022-03-28 23:56   ` Sean Christopherson
@ 2022-04-08 14:07     ` Chao Peng
  2022-04-28 12:37     ` Chao Peng
  1 sibling, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-08 14:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Mon, Mar 28, 2022 at 11:56:06PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > @@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >  
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
> > +				       int *order)
> > +{
> > +	pgoff_t index = gfn - slot->base_gfn +
> > +			(slot->private_offset >> PAGE_SHIFT);
> 
> This is broken for 32-bit kernels, where gfn_t is a 64-bit value but pgoff_t is a
> 32-bit value.  There's no reason to support this for 32-bit kernels, so...
> 
> The easiest fix, and likely most maintainable for other code too, would be to
> add a dedicated CONFIG for private memory, and then have KVM check that for all
> the memfile stuff.  x86 can then select it only for 64-bit kernels, and in turn
> select MEMFILE_NOTIFIER iff private memory is supported.

Looks good.
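
A tiny standalone demonstration of the 32-bit problem called out above, with uint32_t standing in for pgoff_t on a 32-bit kernel:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t gfn_t;		/* gfn_t is always 64-bit */
typedef uint32_t pgoff32_t;	/* models pgoff_t on a 32-bit kernel */

int main(void)
{
	/* A guest frame just above the 2^32 page boundary, i.e. a guest
	 * physical address beyond 16TB with 4KiB pages. */
	gfn_t gfn = (1ULL << 32) + 123;
	pgoff32_t index = gfn;	/* silently truncates on 32-bit */

	printf("gfn   = %llu\n", (unsigned long long)gfn);
	printf("index = %llu  <-- high bits lost\n", (unsigned long long)index);
	return 0;
}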

> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ca7b2a6a452a..ee9c8c155300 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -48,7 +48,9 @@ config KVM
>         select SRCU
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
> -       select MEMFILE_NOTIFIER
> +       select HAVE_KVM_PRIVATE_MEM if X86_64
> +       select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
> +
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> 
> And in addition to replacing checks on CONFIG_MEMFILE_NOTIFIER, the probing of
> whether or not KVM_MEM_PRIVATE is allowed can be:
> 
> @@ -1499,23 +1499,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
> 
> -bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
> -{
> -       return false;
> -}
> -
>  static int check_memory_region_flags(struct kvm *kvm,
>                                 const struct kvm_userspace_memory_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> 
> -       if (kvm_arch_private_memory_supported(kvm))
> -               valid_flags |= KVM_MEM_PRIVATE;
> -
>  #ifdef __KVM_HAVE_READONLY_MEM
>         valid_flags |= KVM_MEM_READONLY;
>  #endif
> 
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       valid_flags |= KVM_MEM_PRIVATE;
> +#endif
> +
>         if (mem->flags & ~valid_flags)
>                 return -EINVAL;
> 
> > +
> > +	return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
> > +					   index, order);
> 
> In a similar vein, get_lock_pfn() shouldn't return a "long".  KVM likely won't use
> these APIs on 32-bit kernels, but that may not hold true for other subsystems, and
> this code is confusing and technically wrong.  The pfns for struct page squeeze
> into an unsigned long because PAE support is capped at 64gb, but casting to a
> signed long could result in a pfn with bit 31 set being misinterpreted as an error.
> 
> Even returning an "unsigned long" for the pfn is wrong.  It "works" for the shmem
> code because shmem deals only with struct page, but it's technically wrong, especially
> since one of the selling points of this approach is that it can work without struct
> page.

Hmmm, that's correct.
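
A hedged sketch of the calling convention converged on here, with plain integer types standing in for pfn_t: return 0/-errno and hand the pfn back through an OUT parameter, so no pfn bits are ever lost to error multiplexing.

#include <stdint.h>
#include <stdio.h>
#include <errno.h>

/* Simplified stand-in for the backing-store lookup: 0/-errno return, pfn
 * and order handed back through OUT parameters, unlike "long pfn" where a
 * large pfn could be misread as an error. */
static int get_lock_pfn(uint64_t index, uint64_t *pfn, int *order)
{
	if (index >= 1024)		/* pretend only the first 1024 pages exist */
		return -ENOENT;

	*pfn = 0x100000 + index;	/* made-up pfn for the demo */
	*order = 0;
	return 0;
}

int main(void)
{
	uint64_t pfn;
	int order;

	if (!get_lock_pfn(5, &pfn, &order))
		printf("index 5: pfn=0x%llx order=%d\n",
		       (unsigned long long)pfn, order);

	if (get_lock_pfn(4096, &pfn, &order) == -ENOENT)
		printf("index 4096: not allocated (treated as shared)\n");

	return 0;
}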

> 
> OUT params suck, but I don't see a better option than having the return value be
> 0/-errno, with "pfn_t *pfn" for the resolved pfn.
> 
> > +}
> > +
> > +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
> > +				       kvm_pfn_t pfn)
> > +{
> > +	slot->pfn_ops->put_unlock_pfn(pfn);
> > +}
> > +
> > +#else
> > +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
> > +				       int *order)
> > +{
> 
> This should be a WARN_ON() as its usage should be guarded by a KVM_PRIVATE_MEM
> check, and private memslots should be disallowed in this case.
> 
> Alternatively, it might be a good idea to #ifdef these out entirely and not provide
> stubs.  That'd likely require a stub or two in arch code, but overall it might be
> less painful in the long run, e.g. would force us to more carefully consider the
> touch points for private memory.  Definitely not a requirement, just an idea.

Makes sense, let me try.

Thanks,
Chao


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory
  2022-04-08 13:46     ` Chao Peng
@ 2022-04-08 17:45       ` Sean Christopherson
  0 siblings, 0 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-04-08 17:45 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Fri, Apr 08, 2022, Chao Peng wrote:
> On Mon, Mar 28, 2022 at 09:56:33PM +0000, Sean Christopherson wrote:
> > struct kvm_userspace_memory_region_ext {
> > #ifdef __KERNEL__
> 
> Is this #ifndef? As I think anonymous struct is only for kernel?

Doh, yes, I inverted that.

> Thanks,
> Chao
> 
> > 	struct kvm_userspace_memory_region region;
> > #else
> > 	struct kvm_userspace_memory_region;
> > #endif
> > 	__u64 private_offset;
> > 	__u32 private_fd;
> > 	__u32 padding[5];
> > };
> > 
> > #ifdef __KERNEL__
> > #define kvm_user_mem_region kvm_userspace_memory_region_ext
> > #endif
> > 
> > [*] https://lore.kernel.org/all/20220301145233.3689119-1-arnd@kernel.org
> > 
> > > +	__u64 private_offset;
> > > +	__u32 private_fd;
> > > +	__u32 padding[5];
> > > +};


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-07 17:09     ` Andy Lutomirski
@ 2022-04-08 17:56       ` Sean Christopherson
  2022-04-08 18:54         ` David Hildenbrand
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-04-08 17:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A. Shutemov, Nakajima, Jun,
	Dave Hansen, Andi Kleen, David Hildenbrand

On Thu, Apr 07, 2022, Andy Lutomirski wrote:
> 
> On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> > On Thu, Mar 10, 2022, Chao Peng wrote:
> >> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> >> memory behave like longterm pinned pages and thus should be accounted to
> >> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> >> 
> >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> ---
> >>  mm/shmem.c | 25 ++++++++++++++++++++++++-
> >>  1 file changed, 24 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 7b43e274c9a2..ae46fb96494b 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> >>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
> >>  				   pgoff_t start, pgoff_t end)
> >>  {
> >> -#ifdef CONFIG_MEMFILE_NOTIFIER
> >>  	struct shmem_inode_info *info = SHMEM_I(inode);
> >>  
> >> +#ifdef CONFIG_MEMFILE_NOTIFIER
> >>  	start = max(start, folio->index);
> >>  	end = min(end, folio->index + folio_nr_pages(folio));
> >>  
> >>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> >>  #endif
> >> +
> >> +	if (info->xflags & SHM_F_INACCESSIBLE)
> >> +		atomic64_sub(end - start, &current->mm->pinned_vm);
> >
> > As Vishal's to-be-posted selftest discovered, this is broken as current->mm
> > may be NULL.  Or it may be a completely different mm, e.g. AFAICT there's
> > nothing that prevents a different process from punching hole in the shmem
> > backing.
> >
> 
> How about just not charging the mm in the first place?  There’s precedent:
> ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current
> status).
> 
> In any case, for an administrator to try to assemble the various rlimits into
> a coherent policy is, and always has been, quite messy. ISTM cgroup limits,
> which can actually add across processes usefully, are much better.
> 
> So, aside from the fact that these fds aren’t in a filesystem and are thus
> available by default, I’m not convinced that this accounting is useful or
> necessary.
> 
> Maybe we could just have some switch required to enable creation of private
> memory in the first place, and anyone who flips that switch without
> configuring cgroups is subject to DoS.

I personally have no objection to that, and I'm 99% certain Google doesn't rely
on RLIMIT_MEMLOCK.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-08 17:56       ` Sean Christopherson
@ 2022-04-08 18:54         ` David Hildenbrand
  2022-04-12 14:36           ` Jason Gunthorpe
  0 siblings, 1 reply; 118+ messages in thread
From: David Hildenbrand @ 2022-04-08 18:54 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski
  Cc: Chao Peng, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A. Shutemov, Nakajima, Jun,
	Dave Hansen, Andi Kleen

On 08.04.22 19:56, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Andy Lutomirski wrote:
>>
>> On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
>>> On Thu, Mar 10, 2022, Chao Peng wrote:
>>>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>>>> memory behave like longterm pinned pages and thus should be accounted to
>>>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>>>>
>>>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>>>> ---
>>>>  mm/shmem.c | 25 ++++++++++++++++++++++++-
>>>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 7b43e274c9a2..ae46fb96494b 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>>>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>>>>  				   pgoff_t start, pgoff_t end)
>>>>  {
>>>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>>>  	struct shmem_inode_info *info = SHMEM_I(inode);
>>>>  
>>>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>>>  	start = max(start, folio->index);
>>>>  	end = min(end, folio->index + folio_nr_pages(folio));
>>>>  
>>>>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>>>  #endif
>>>> +
>>>> +	if (info->xflags & SHM_F_INACCESSIBLE)
>>>> +		atomic64_sub(end - start, &current->mm->pinned_vm);
>>>
>>> As Vishal's to-be-posted selftest discovered, this is broken as current->mm
>>> may be NULL.  Or it may be a completely different mm, e.g. AFAICT there's
>>> nothing that prevents a different process from punching hole in the shmem
>>> backing.
>>>
>>
>> How about just not charging the mm in the first place?  There’s precedent:
>> ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current
>> status).
>>
>> In any case, for an administrator to try to assemble the various rlimits into
>> a coherent policy is, and always has been, quite messy. ISTM cgroup limits,
>> which can actually add across processes usefully, are much better.
>>
>> So, aside from the fact that these fds aren’t in a filesystem and are thus
>> available by default, I’m not convinced that this accounting is useful or
>> necessary.
>>
>> Maybe we could just have some switch require to enable creation of private
>> memory in the first place, and anyone who flips that switch without
>> configuring cgroups is subject to DoS.
> 
> I personally have no objection to that, and I'm 99% certain Google doesn't rely
> on RLIMIT_MEMLOCK.
> 

It's unacceptable for distributions to have random unprivileged users
able to allocate an unlimited amount of unmovable memory. And any of
these "switches" won't help a thing, because the distribution will have
to enable them either way.

I raised in the past that accounting might be challenging, so it's no
surprise that something popped up now.

RLIMIT_MEMLOCK was the obvious candidate, but as we already discovered
in the past with secretmem, it's not that good of a fit (unmovable is
worse than mlocked). But it gets the job done for now at least.

So I'm open to alternatives for limiting the amount of unmovable memory
we might allocate for user space, and then we could convert secretmem as
well.

Random switches are not an option.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-28 20:16 ` Andy Lutomirski
  2022-03-28 22:48   ` Nakajima, Jun
@ 2022-04-08 21:35   ` Vishal Annapurve
  2022-04-12 13:00     ` Chao Peng
  2022-04-12 19:58   ` Kirill A. Shutemov
  2 siblings, 1 reply; 118+ messages in thread
From: Vishal Annapurve @ 2022-04-08 21:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Jun Nakajima,
	dave.hansen, ak, david

On Mon, Mar 28, 2022 at 10:17 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This is the v5 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
>
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.
>
> --Andy

I have posted a RFC series with selftests to exercise the UPM feature
with normal non-confidential VMs via
https://lore.kernel.org/kvm/20220408210545.3915712-1-vannapurve@google.com/

-- Vishal


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-03-10 14:08 ` [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
@ 2022-04-11 15:10   ` Kirill A. Shutemov
  2022-04-12 13:11     ` Chao Peng
  2022-04-23  5:43   ` Vishal Annapurve
  1 sibling, 1 reply; 118+ messages in thread
From: Kirill A. Shutemov @ 2022-04-11 15:10 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022 at 10:08:59PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
> 
> It provides the semantics required for KVM guest private memory support:
> a file descriptor with this flag set is going to be used as the source
> of guest memory in confidential computing environments such as Intel
> TDX/AMD SEV, but may not be accessible from host userspace.
> 
> Since page migration/swapping is not yet supported for such usages,
> these pages are currently marked as UNMOVABLE and UNEVICTABLE, which
> makes them behave like long-term pinned pages.
> 
> The flag cannot coexist with MFD_ALLOW_SEALING; future sealing is
> also impossible for a memfd created with this flag.
> 
> At this time only shmem implements this flag.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/shmem_fs.h   |  7 +++++
>  include/uapi/linux/memfd.h |  1 +
>  mm/memfd.c                 | 26 +++++++++++++++--
>  mm/shmem.c                 | 57 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 88 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index e65b80ed09e7..2dde843f28ef 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -12,6 +12,9 @@
>  
>  /* inode in-kernel data */
>  
> +/* shmem extended flags */
> +#define SHM_F_INACCESSIBLE	0x0001  /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
> +
>  struct shmem_inode_info {
>  	spinlock_t		lock;
>  	unsigned int		seals;		/* shmem seals */
> @@ -24,6 +27,7 @@ struct shmem_inode_info {
>  	struct shared_policy	policy;		/* NUMA memory alloc policy */
>  	struct simple_xattrs	xattrs;		/* list of xattrs */
>  	atomic_t		stop_eviction;	/* hold when working on inode */
> +	unsigned int		xflags;		/* shmem extended flags */
>  	struct inode		vfs_inode;
>  };
>  

AFAICS, only two bits of 'flags' are used. And it's very strange that
VM_ flags are used for the purpose. My guess is that someone was too lazy
to introduce new constants for this.

I think we should fix this: introduce SHM_F_LOCKED and SHM_F_NORESERVE
alongside with SHM_F_INACCESSIBLE and stuff them all into info->flags.
It also makes shmem_file_setup_xflags() go away.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
  2022-03-10 14:09 ` [PATCH v5 03/13] mm/shmem: Support memfile_notifier Chao Peng
  2022-03-10 23:08   ` Dave Chinner
@ 2022-04-11 15:26   ` Kirill A. Shutemov
  2022-04-12 13:12     ` Chao Peng
  2022-04-19 22:40   ` Vishal Annapurve
  2 siblings, 1 reply; 118+ messages in thread
From: Kirill A. Shutemov @ 2022-04-11 15:26 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Wanpeng Li, jun.nakajima, david, J . Bruce Fields,
	dave.hansen, H . Peter Anvin, ak, Jonathan Corbet, Joerg Roedel,
	x86, Hugh Dickins, Steven Price, Ingo Molnar,
	Maciej S . Szmigiero, Borislav Petkov, luto, Thomas Gleixner,
	Vitaly Kuznetsov, Vlastimil Babka, Jim Mattson,
	Sean Christopherson, Jeff Layton, Yu Zhang, Kirill A . Shutemov,
	Paolo Bonzini, Andrew Morton, Vishal Annapurve, Mike Rapoport

On Thu, Mar 10, 2022 at 10:09:01PM +0800, Chao Peng wrote:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9b31a7056009..7b43e274c9a2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>  	return page ? page_folio(page) : NULL;
>  }
>  
> +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> +#endif

All these #ifdefs look ugly. Could you provide dummy memfile_* helpers
for the !MEMFILE_NOTIFIER case?

-- 
 Kirill A. Shutemov
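
The dummy-helper pattern being requested, as a minimal standalone sketch; CONFIG_MEMFILE_NOTIFIER is modeled by an ordinary macro here, and the names are illustrative rather than the series' final ones.

#include <stdio.h>

/* Flip this to see both configurations; stands in for the Kconfig option. */
#define CONFIG_MEMFILE_NOTIFIER 1

struct memfile_notifier_list { int dummy; };

#if CONFIG_MEMFILE_NOTIFIER
static void memfile_notifier_fallocate(struct memfile_notifier_list *list,
				       unsigned long start, unsigned long end)
{
	printf("notify fallocate [%lu, %lu)\n", start, end);
}
#else
/* Dummy stub: callers stay free of #ifdefs and this compiles to nothing. */
static inline void memfile_notifier_fallocate(struct memfile_notifier_list *list,
					      unsigned long start, unsigned long end)
{
}
#endif

/* Call site: no #ifdef needed, matching the request above. */
static void notify_fallocate(struct memfile_notifier_list *list,
			     unsigned long start, unsigned long end)
{
	memfile_notifier_fallocate(list, start, end);
}

int main(void)
{
	struct memfile_notifier_list l = { 0 };

	notify_fallocate(&l, 0, 16);
	return 0;
}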


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-07 16:05   ` Sean Christopherson
  2022-04-07 17:09     ` Andy Lutomirski
  2022-04-08 13:02     ` Chao Peng
@ 2022-04-11 15:32     ` Kirill A. Shutemov
  2022-04-12 13:39       ` Chao Peng
  2 siblings, 1 reply; 118+ messages in thread
From: Kirill A. Shutemov @ 2022-04-11 15:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> maybe it's just the page migration path that needs to be updated?

My earlier version prevented migration by returning -ENOTSUPP from
address_space_operations::migratepage().

What's wrong with that approach?

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-08 13:02     ` Chao Peng
@ 2022-04-11 15:34       ` Kirill A. Shutemov
  2022-04-12  5:14         ` Hugh Dickins
  0 siblings, 1 reply; 118+ messages in thread
From: Kirill A. Shutemov @ 2022-04-11 15:34 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Fri, Apr 08, 2022 at 09:02:54PM +0800, Chao Peng wrote:
> > I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
> > and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
> > consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
> > wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
> > out sooner than later.  KVM isn't planning on support migrate/swap for TDX or SNP,
> > but supporting at least migrate for a software-only implementation a la pKVM should
> > be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
> > gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
> > exceptions and/or data corruption (pre-SNP SEV guests) in the guest.
> 
> SHM_LOCK sounds like a good match.

Emm, no. shmctl(2) and SHM_LOCK are a SysV IPC thing. I don't see how they
fit here.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-11 15:34       ` Kirill A. Shutemov
@ 2022-04-12  5:14         ` Hugh Dickins
  0 siblings, 0 replies; 118+ messages in thread
From: Hugh Dickins @ 2022-04-12  5:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Chao Peng, Sean Christopherson, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	Shakeel Butt, luto, jun.nakajima, dave.hansen, ak, david

On Mon, 11 Apr 2022, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 09:02:54PM +0800, Chao Peng wrote:
> > > I think the correct approach is to not do the locking automatically for SHM_F_INACCESSIBLE,
> > > and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace knows the
> > > consumers don't support migrate/swap.  That'd require wrapping migrate_page() and then
> > > wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
> > > out sooner than later.  KVM isn't planning on support migrate/swap for TDX or SNP,
> > > but supporting at least migrate for a software-only implementation a la pKVM should
> > > be relatively straightforward.  On the notifiee side, KVM can terminate the VM if it
> > > gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
> > > exceptions and/or data corruption (pre-SNP SEV guests) in the guest.
> > 
> > SHM_LOCK sounds like a good match.
> 
> Emm, no. shmctl(2) and SHM_LOCK are SysV IPC thing. I don't see how they
> fit here.

I am still struggling to formulate a constructive response on
MFD_INACCESSIBLE in general: but before doing so, let me jump in here
to say that I'm firmly on the side of SHM_LOCK being the right model -
but admittedly not through userspace calling shmctl(2).

Please refer to our last year's posting "[PATCH 10/16] tmpfs: fcntl(fd,
F_MEM_LOCK) to memlock a tmpfs file" for the example of how Shakeel did
it then (though only a small part of that would be needed for this case):
https://lore.kernel.org/linux-mm/54e03798-d836-ae64-f41-4a1d46bc115b@google.com/

And until such time as swapping is enabled, this memlock accounting would
be necessarily entailed by "MFD_INACCESSIBLE", or however that turns out
to be implemented: not something that we could trust userspace to call
separately.

Hugh


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 09/13] KVM: Handle page fault for private memory
  2022-03-29  1:07   ` Sean Christopherson
@ 2022-04-12 12:10     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-12 12:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Tue, Mar 29, 2022 at 01:07:18AM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > @@ -3890,7 +3893,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> >  }
> >  
> > -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
> > +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > +	/*
> > +	 * At this time private gfn has not been supported yet. Other patch
> > +	 * that enables it should change this.
> > +	 */
> > +	return false;
> > +}
> > +
> > +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +				    struct kvm_page_fault *fault,
> > +				    bool *is_private_pfn, int *r)
> 
> @is_private_pfn should be a field in @fault, not a separate parameter, and it
> should be a const property set by the original caller.  I would also name it
> "is_private", because if KVM proceeds past this point, it will be a property of
> the fault/access _and_ the pfn
> 
> I say it's a property of the fault because the below kvm_vcpu_is_private_gfn()
> should instead be:
> 
> 	if (fault->is_private)
> 
> The kvm_vcpu_is_private_gfn() check is TDX centric.  For SNP, private vs. shared
> is communicated via error code.  For software-only (I'm being optimistic ;-) ),
> we'd probably need to track private vs. shared internally in KVM, I don't think
> we'd want to force it to be a property of the gfn.

Make sense.

> 
> Then you can also move the fault->is_private waiver into is_page_fault_stale(),
> and drop the local is_private_pfn in direct_page_fault().
> 
> > +{
> > +	int order;
> > +	unsigned int flags = 0;
> > +	struct kvm_memory_slot *slot = fault->slot;
> > +	long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order);
> 
> If get_lock_pfn() and thus kvm_memfile_get_pfn() returns a pure error code instead
> of multiplexing the pfn, then this can be:
> 
> 	bool is_private_pfn;
> 
> 	is_private_pfn = !!kvm_memfile_get_pfn(slot, fault->gfn, &fault->pfn, &order);
> 
> That self-documents the "pfn < 0" == shared logic.

Yes, agreed.

> 
> > +
> > +	if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
> > +		if (pfn < 0)
> > +			flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > +		else {
> > +			fault->pfn = pfn;
> > +			if (slot->flags & KVM_MEM_READONLY)
> > +				fault->map_writable = false;
> > +			else
> > +				fault->map_writable = true;
> > +
> > +			if (order == 0)
> > +				fault->max_level = PG_LEVEL_4K;
> 
> This doesn't correctly handle an order > 0 that is still less than the next page
> size, in which case max_level needs to be PG_LEVEL_4K.  It also doesn't handle the
> case where max_level > PG_LEVEL_2M.
> 
> That said, I think the proper fix is to have the get_lock_pfn() API return the max
> mapping level, not the order.  KVM, and presumably any other secondary MMU that might
> use these APIs, doesn't care about the order of the struct page, KVM cares about the
> max size/level of page it can map into the guest.  And similar to the previous patch,
> "order" is specific to struct page, which we are trying to avoid.

I remember I suggested returning the max mapping level instead of the
order, but Kirill reminded me that PG_LEVEL_* is x86-specific, so I
changed back to 'order'. It's just a matter of whether the backing store
or KVM converts the 'order' to a mapping level.
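
Either way the conversion is small; a hedged sketch of how the x86 side could turn a backing-store page order into a mapping level, assuming 4KiB base pages (the level numbering matches x86's 4K/2M/1G levels):

#include <stdio.h>

/* x86-style mapping levels; other architectures would map 'order' to
 * their own page-table levels. */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/* order = log2(number of contiguous base pages the backing store provides). */
static int order_to_max_level(int order)
{
	if (order >= 18)	/* 2^18 * 4KiB = 1GiB */
		return PG_LEVEL_1G;
	if (order >= 9)		/* 2^9 * 4KiB = 2MiB */
		return PG_LEVEL_2M;
	return PG_LEVEL_4K;	/* anything smaller maps at 4KiB */
}

int main(void)
{
	printf("order 0  -> level %d\n", order_to_max_level(0));
	printf("order 4  -> level %d\n", order_to_max_level(4));	/* still 4K */
	printf("order 9  -> level %d\n", order_to_max_level(9));
	printf("order 18 -> level %d\n", order_to_max_level(18));
	return 0;
}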

> 
> > +			*is_private_pfn = true;
> 
> This is where KVM guarantees that is_private_pfn == fault->is_private.
> 
> > +			*r = RET_PF_FIXED;
> > +			return true;
> 
> Ewww.  This is super confusing.  Ditto for the "*r = -1" magic number.  I totally
> understand why you took this approach, it's just hard to follow because it kinda
> follows the kvm_faultin_pfn() semantics, but then inverts true and false in this
> one case.
> 
> I think the least awful option is to forego the helper and open code everything.
> If we ever refactor kvm_faultin_pfn() to be less weird then we can maybe move this
> to a helper.
> 
> Open coding isn't too bad if you reorganize things so that the exit-to-userspace
> path is a dedicated, early check.  IMO, it's a lot easier to read this way, open
> coded or not.

Yes, the existing way of handling this is really awful, including the handling of 'r'
that is finally returned to KVM_RUN as part of the uAPI. Let me try your suggestion
above.

> 
> I think this is correct?  "is_private_pfn" and "level" are locals, everything else
> is in @fault.
> 
> 	if (kvm_slot_is_private(slot)) {
> 		is_private_pfn = !!kvm_memfile_get_pfn(slot, fault->gfn,
> 						       &fault->pfn, &level);
> 
> 		if (fault->is_private != is_private_pfn) {
> 			if (is_private_pfn)
> 				kvm_memfile_put_pfn(slot, fault->pfn);
> 
> 			vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
> 			if (fault->is_private)
> 				vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> 			else
> 				vcpu->run->memory.flags = 0;
> 			vcpu->run->memory.padding = 0;
> 			vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> 			vcpu->run->memory.size = PAGE_SIZE;
> 			*r = 0;
> 			return true;
> 		}
> 
> 		/*
> 		 * fault->pfn is all set if the fault is for a private pfn, just
> 		 * need to update other metadata.
> 		 */
> 		if (fault->is_private) {
> 			fault->max_level = min(fault->max_level, level);
> 			fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> 			return false;
> 		}
> 
> 		/* Fault is shared, fallthrough to the standard path. */
> 	}
> 
> 	async = false;
> 
> > @@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	else
> >  		write_lock(&vcpu->kvm->mmu_lock);
> >  
> > -	if (is_page_fault_stale(vcpu, fault, mmu_seq))
> > +	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
> 
> As above, I'd prefer this check go in is_page_fault_stale().  It means shadow MMUs
> will suffer a pointless check, but I don't think that's a big issue.  Oooh, unless
> we support software-only, which would play nice with nested and probably even legacy
> shadow paging.  Fun :-)

Sounds good.

> 
> >  		goto out_unlock;
> >  
> >  	r = make_mmu_pages_available(vcpu);
> > @@ -4033,7 +4093,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  		read_unlock(&vcpu->kvm->mmu_lock);
> >  	else
> >  		write_unlock(&vcpu->kvm->mmu_lock);
> > -	kvm_release_pfn_clean(fault->pfn);
> > +
> > +	if (is_private_pfn)
> 
> And this can be
> 
> 	if (fault->is_private)
> 
> Same feedback for paging_tmpl.h.

Agreed.

Thanks,
Chao


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 10/13] KVM: Register private memslot to memory backing store
  2022-03-29 19:01   ` Sean Christopherson
@ 2022-04-12 12:40     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-12 12:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Tue, Mar 29, 2022 at 07:01:52PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Add 'notifier' to the memslot to make it a memfile_notifier node, and
> > then register it with the memory backing store via
> > memfile_register_notifier() when the memslot gets created. When the
> > memslot is deleted, do the reverse with memfile_unregister_notifier().
> > Note that each KVM memslot can be registered with a different memory
> > backing store (or the same backing store but at a different offset)
> > independently.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/kvm_main.c      | 75 ++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 70 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 6e1d770d6bf8..9b175aeca63f 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -567,6 +567,7 @@ struct kvm_memory_slot {
> >  	struct file *private_file;
> >  	loff_t private_offset;
> >  	struct memfile_pfn_ops *pfn_ops;
> > +	struct memfile_notifier notifier;
> >  };
> >  
> >  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d11a2628b548..67349421eae3 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -840,6 +840,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >  
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >  
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
> 
> This is a good opportunity to hide away the memfile details a bit.  Maybe
> kvm_private_mem_{,un}register()?

Happy to change.

> 
> > +{
> > +	return memfile_register_notifier(file_inode(slot->private_file),
> > +					 &slot->notifier,
> > +					 &slot->pfn_ops);
> > +}
> > +
> > +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
> > +{
> > +	if (slot->private_file) {
> > +		memfile_unregister_notifier(file_inode(slot->private_file),
> > +					    &slot->notifier);
> > +		fput(slot->private_file);
> 
> This should not do fput(); it makes the helper imbalanced with respect to the
> register path and will likely lead to double fput().  Indeed, if preparing the
> region fails, __kvm_set_memory_region() will double up on fput() due to checking
> its local "file" for null, not slot->private for null.

Right.
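
The imbalance can be seen with a toy reference counter (a standalone model, not KVM code): if the unregister helper also drops the file reference, the caller's own cleanup on the error path drops it a second time.

#include <stdio.h>

static int refcount;

static void put_ref(const char *who)	/* models fput() */
{
	if (--refcount < 0)
		printf("  %s: DOUBLE PUT (refcount=%d)\n", who, refcount);
}

/* Balanced helper: only undoes the notifier registration. */
static void unregister_balanced(void) { }

/* Imbalanced helper: also drops the file reference the caller took. */
static void unregister_imbalanced(void) { put_ref("helper"); }

static void error_path(void (*unregister)(void), const char *name)
{
	refcount = 1;			/* fget() at memslot setup */
	unregister();			/* cleanup when preparing the region fails */
	put_ref("caller");		/* caller drops its own reference too */
	printf("%s: final refcount=%d\n", name, refcount);
}

int main(void)
{
	error_path(unregister_balanced, "balanced");	/* ends at 0 */
	error_path(unregister_imbalanced, "imbalanced");/* goes negative */
	return 0;
}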

> 
> > +		slot->private_file = NULL;
> > +	}
> > +}
> > +
> > +#else /* !CONFIG_MEMFILE_NOTIFIER */
> > +
> > +static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
> > +{
> 
> This should WARN_ON_ONCE().  Ditto for unregister.
> 
> > +	return -EOPNOTSUPP;
> > +}
> > +
> > +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_MEMFILE_NOTIFIER */
> > +
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> >  static int kvm_pm_notifier_call(struct notifier_block *bl,
> >  				unsigned long state,
> > @@ -884,6 +915,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> >  /* This does not remove the slot from struct kvm_memslots data structures */
> >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> >  {
> > +	if (slot->flags & KVM_MEM_PRIVATE)
> > +		kvm_memfile_unregister(slot);
> 
> With fput() move out of unregister, this needs to be:

Agreed.

> 
> 	if (slot->flags & KVM_MEM_PRIVATE) {
> 		kvm_private_mem_unregister(slot);
> 		fput(slot->private_file);
> 	}
> > +
> >  	kvm_destroy_dirty_bitmap(slot);
> >  
> >  	kvm_arch_free_memslot(kvm, slot);
> > @@ -1738,6 +1772,12 @@ static int kvm_set_memslot(struct kvm *kvm,
> >  		kvm_invalidate_memslot(kvm, old, invalid_slot);
> >  	}
> >  
> > +	if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) {
> > +		r = kvm_memfile_register(new);
> > +		if (r)
> > +			return r;
> > +	}
> 
> This belongs in kvm_prepare_memory_region().  The shenanigans for DELETE and MOVE
> are special.

Sure.

> 
> > +
> >  	r = kvm_prepare_memory_region(kvm, old, new, change);
> >  	if (r) {
> >  		/*
> > @@ -1752,6 +1792,10 @@ static int kvm_set_memslot(struct kvm *kvm,
> >  		} else {
> >  			mutex_unlock(&kvm->slots_arch_lock);
> >  		}
> > +
> > +		if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE)
> > +			kvm_memfile_unregister(new);
> > +
> >  		return r;
> >  	}
> >  
> > @@ -1817,6 +1861,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	enum kvm_mr_change change;
> >  	unsigned long npages;
> >  	gfn_t base_gfn;
> > +	struct file *file = NULL;
> 
> Nit: naming this private_file would make its use clearer.  Though I think it's
> easier to not have a local variable.  More below.
> 
> >  	int as_id, id;
> >  	int r;
> >  
> > @@ -1890,14 +1935,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  			return 0;
> >  	}
> >  
> > +	if (mem->flags & KVM_MEM_PRIVATE) {
> > +		file = fdget(region_ext->private_fd).file;
> 
> This can use fget() instead of fdget().
> 
> > +		if (!file)
> > +			return -EINVAL;
> > +	}
> > +
> >  	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
> > -	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
> > -		return -EEXIST;
> > +	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) {
> > +		r = -EEXIST;
> > +		goto out;
> > +	}
> >  
> >  	/* Allocate a slot that will persist in the memslot. */
> >  	new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
> > -	if (!new)
> > -		return -ENOMEM;
> > +	if (!new) {
> > +		r = -ENOMEM;
> > +		goto out;
> > +	}
> >  
> >  	new->as_id = as_id;
> >  	new->id = id;
> > @@ -1905,10 +1960,18 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	new->npages = npages;
> >  	new->flags = mem->flags;
> >  	new->userspace_addr = mem->userspace_addr;
> > +	new->private_file = file;
> > +	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
> > +			      region_ext->private_offset : 0;
> 
> "new" is zero-allocated, so all the private stuff, including the fget(), can be
> wrapped in a single KVM_MEM_PRIVATE check.  Moving fget() eliminates the number
> of gotos needed (the above -EEXIST and -ENOMEM paths don't need to be modified).
> 
> >  	r = kvm_set_memslot(kvm, old, new, change);
> > -	if (r)
> > -		kfree(new);
> > +	if (!r)
> > +		return r;
> 
> Use goto, e.g.
> 
> 	if (r)
> 		goto out;
> 
> 	return 0;
> 
> Burying the happy path in a taken if-statement is confusing and error prone,
> mostly because it breaks well-established kernel patterns.  Note, there's no need
> for a separate out_free since new->private_file will be NULL in either case.  I
> don't have a strong preference, I just find it easier to read code that's more
> explicit, but I'm a-ok collapsing them into a single label.

Will follow this, thanks for the detailed suggestion.

Chao
> 
> 	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
> 	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
> 		return -EEXIST;
> 
> 	/* Allocate a slot that will persist in the memslot. */
> 	new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
> 	if (!new)
> 		return -ENOMEM;
> 
> 	new->as_id = as_id;
> 	new->id = id;
> 	new->base_gfn = base_gfn;
> 	new->npages = npages;
> 	new->flags = mem->flags;
> 	new->userspace_addr = mem->userspace_addr;
> 
> 	if (mem->flags & KVM_MEM_PRIVATE) {
> 		new->private_file = fget(mem->private_fd);
> 		if (!new->private_file) {
> 			r = -EINVAL;
> 			goto out_free;
> 		}
> 		new->private_offset = mem->private_offset;
> 	}
> 
> 	r = kvm_set_memslot(kvm, old, new, change);
> 	if (r)
> 		goto out;
> 
> 	return 0;
> 
> out:
> 	if (new->private_file)
> 		fput(new->private_file);
> 
> out_free:
> 	kfree(new);
> 	return r;


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
  2022-03-29 19:23   ` Sean Christopherson
@ 2022-04-12 12:43     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-12 12:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Tue, Mar 29, 2022 at 07:23:04PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 67349421eae3..52319f49d58a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >  
> >  #ifdef CONFIG_MEMFILE_NOTIFIER
> > +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> > +					 pgoff_t start, pgoff_t end)
> > +{
> > +	int idx;
> > +	struct kvm_memory_slot *slot = container_of(notifier,
> > +						    struct kvm_memory_slot,
> > +						    notifier);
> > +	struct kvm_gfn_range gfn_range = {
> > +		.slot		= slot,
> > +		.start		= start - (slot->private_offset >> PAGE_SHIFT),
> > +		.end		= end - (slot->private_offset >> PAGE_SHIFT),
> > +		.may_block 	= true,
> > +	};
> > +	struct kvm *kvm = slot->kvm;
> > +
> > +	gfn_range.start = max(gfn_range.start, slot->base_gfn);
> > +	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> > +
> > +	if (gfn_range.start >= gfn_range.end)
> > +		return;
> > +
> > +	idx = srcu_read_lock(&kvm->srcu);
> > +	KVM_MMU_LOCK(kvm);
> > +	kvm_unmap_gfn_range(kvm, &gfn_range);
> > +	kvm_flush_remote_tlbs(kvm);
> 
> This should check the result of kvm_unmap_gfn_range() and flush only if necessary.

Yep.

> 
> kvm->mmu_notifier_seq needs to be incremented; otherwise KVM will incorrectly
> install a SPTE if the mapping is zapped between retrieving the pfn in faultin and
> installing it after acquiring mmu_lock.

Good catch.

Chao
> 
> 
> > +	KVM_MMU_UNLOCK(kvm);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> > +}
> > +
> > +static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
> > +	.invalidate = kvm_memfile_notifier_handler,
> > +	.fallocate = kvm_memfile_notifier_handler,
> > +};
> > +
> >  static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
> >  {
> > +	slot->notifier.ops = &kvm_memfile_notifier_ops;
> >  	return memfile_register_notifier(file_inode(slot->private_file),
> >  					 &slot->notifier,
> >  					 &slot->pfn_ops);
> > @@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	new->private_file = file;
> >  	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
> >  			      region_ext->private_offset : 0;
> > +	new->kvm = kvm;
> >  
> >  	r = kvm_set_memslot(kvm, old, new, change);
> >  	if (!r)
> > -- 
> > 2.17.1
> > 
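
The sequence-count pattern behind the mmu_notifier_seq comment above, reduced to a single-threaded standalone model (no locking or SRCU, just the snapshot/recheck logic that keeps a zapped mapping from being installed):

#include <stdio.h>

static unsigned long mmu_invalidate_seq;	/* models kvm->mmu_notifier_seq */

/* Invalidation path (e.g. hole punch on the private fd): zap, then bump. */
static void invalidate_range(void)
{
	/* ... zap secondary-MMU mappings for the range ... */
	mmu_invalidate_seq++;
}

/* Fault path: snapshot the sequence before resolving the pfn, re-check it
 * before installing the mapping, and retry if an invalidation slipped in. */
static int install_mapping(int simulate_race)
{
	unsigned long seq = mmu_invalidate_seq;

	/* ... resolve the pfn from the backing store (may sleep) ... */
	if (simulate_race)
		invalidate_range();	/* concurrent hole punch */

	/* ... acquire mmu_lock ... */
	if (seq != mmu_invalidate_seq)
		return -1;		/* stale: drop the pfn and retry */

	/* ... install the SPTE ... */
	return 0;
}

int main(void)
{
	printf("no race:   %s\n", install_mapping(0) ? "retry" : "installed");
	printf("with race: %s\n", install_mapping(1) ? "retry" : "installed");
	return 0;
}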


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE
  2022-03-29 19:13   ` Sean Christopherson
@ 2022-04-12 12:56     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-12 12:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Tue, Mar 29, 2022 at 07:13:00PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > KVM_MEM_PRIVATE is not exposed by default, but architecture code can turn
> > it on by implementing kvm_arch_private_memory_supported().
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/kvm_main.c      | 24 +++++++++++++++++++-----
> >  2 files changed, 20 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 186b9b981a65..0150e952a131 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1432,6 +1432,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> >  int kvm_arch_post_init_vm(struct kvm *kvm);
> >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_private_memory_supported(struct kvm *kvm);
> >  
> >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> >  /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 52319f49d58a..df5311755a40 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1485,10 +1485,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
> >  	}
> >  }
> >  
> > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > +bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
> > +{
> > +	return false;
> > +}
> > +
> > +static int check_memory_region_flags(struct kvm *kvm,
> > +				const struct kvm_userspace_memory_region *mem)
> >  {
> >  	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >  
> > +	if (kvm_arch_private_memory_supported(kvm))
> > +		valid_flags |= KVM_MEM_PRIVATE;
> > +
> >  #ifdef __KVM_HAVE_READONLY_MEM
> >  	valid_flags |= KVM_MEM_READONLY;
> >  #endif
> > @@ -1900,7 +1909,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	int as_id, id;
> >  	int r;
> >  
> > -	r = check_memory_region_flags(mem);
> > +	r = check_memory_region_flags(kvm, mem);
> >  	if (r)
> >  		return r;
> >  
> > @@ -1913,10 +1922,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  		return -EINVAL;
> >  	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
> >  		return -EINVAL;
> > -	/* We can read the guest memory with __xxx_user() later on. */
> >  	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> > -	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> > -	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> > +	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
> > +		return -EINVAL;
> > +	/* We can read the guest memory with __xxx_user() later on. */
> > +	if (!(mem->flags & KVM_MEM_PRIVATE) &&
> > +	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> 
> This should sanity check private_offset for private memslots.  At a bare minimum,
> wrapping should be disallowed.

Will add this.
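For reference, a minimal sketch of such a check (hypothetical, not from the
posted series; the alignment test is an extra assumption beyond the wrapping
Sean asked for) could look like:

	if (mem->flags & KVM_MEM_PRIVATE) {
		/* Reject offsets that are not page aligned or that wrap around. */
		if (region_ext->private_offset & (PAGE_SIZE - 1))
			return -EINVAL;
		if (region_ext->private_offset + mem->memory_size <
		    region_ext->private_offset)
			return -EINVAL;
	}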

> 
> >  			mem->memory_size))
> >  		return -EINVAL;
> >  	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > @@ -1957,6 +1968,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> >  			return -EINVAL;
> >  	} else { /* Modify an existing slot. */
> > +		/* Private memslots are immutable, they can only be deleted. */
> > +		if (mem->flags & KVM_MEM_PRIVATE)
> > +			return -EINVAL;
> 
> These sanity checks belong in "KVM: Register private memslot to memory backing store",
> e.g. that patch is "broken" without the immutability restriction.  It's somewhat moot
> because the code is unreachable, but it makes reviewing confusing/difficult.
> 
> But rather than move the sanity checks back, I think I'd prefer to pull all of patch 10
> here.  I think it also makes sense to drop "KVM: Use memfile_pfn_ops to obtain pfn for
> private pages" and add the pointer in "struct kvm_memory_slot" in patch "KVM: Extend the
> memslot to support fd-based private memory", with the use of the ops folded into
> "KVM: Handle page fault for private memory".  Adding code to KVM and KVM-x86 in a single
> patch is ok, and overall makes things easier to review because the new helpers have a
> user right away, especially since there will be #ifdeffery.
> 
> I.e. end up with something like:
> 
>   mm: Introduce memfile_notifier
>   mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
>   KVM: Extend the memslot to support fd-based private memory
>   KVM: Use kvm_userspace_memory_region_ext
>   KVM: Add KVM_EXIT_MEMORY_ERROR exit
>   KVM: Handle page fault for private memory
>   KVM: Register private memslot to memory backing store
>   KVM: Zap existing KVM mappings when pages changed in the private fd
>   KVM: Enable and expose KVM_MEM_PRIVATE

Thanks for the suggestion. That makes sense.

Chao
> 
> >  		if ((mem->userspace_addr != old->userspace_addr) ||
> >  		    (npages != old->npages) ||
> >  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > -- 
> > 2.17.1
> > 



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-08 21:35   ` Vishal Annapurve
@ 2022-04-12 13:00     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-12 13:00 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Andy Lutomirski, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Jun Nakajima,
	dave.hansen, ak, david

On Fri, Apr 08, 2022 at 11:35:05AM -1000, Vishal Annapurve wrote:
> On Mon, Mar 28, 2022 at 10:17 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > This is the v5 of this series which tries to implement the fd-based KVM
> > > guest private memory. The patches are based on latest kvm/queue branch
> > > commit:
> > >
> > >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> >
> > Can this series be run and a VM booted without TDX?  A feature like
> > that might help push it forward.
> >
> > --Andy
> 
> I have posted a RFC series with selftests to exercise the UPM feature
> with normal non-confidential VMs via
> https://lore.kernel.org/kvm/20220408210545.3915712-1-vannapurve@google.com/

Thanks Vishal, this sounds very helpful; it has already started to find
bugs.

Chao
> 
> -- Vishal



* Re: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-04-11 15:10   ` Kirill A. Shutemov
@ 2022-04-12 13:11     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-12 13:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Mon, Apr 11, 2022 at 06:10:23PM +0300, Kirill A. Shutemov wrote:
> On Thu, Mar 10, 2022 at 10:08:59PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > 
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this flag set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> > 
> > Since page migration/swapping is not yet supported for such usages
> > so these pages are currently marked as UNMOVABLE and UNEVICTABLE
> > which makes them behave like long-term pinned pages.
> > 
> > The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> > also impossible for a memfd created with this flag.
> > 
> > At this time only shmem implements this flag.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/shmem_fs.h   |  7 +++++
> >  include/uapi/linux/memfd.h |  1 +
> >  mm/memfd.c                 | 26 +++++++++++++++--
> >  mm/shmem.c                 | 57 ++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 88 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index e65b80ed09e7..2dde843f28ef 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -12,6 +12,9 @@
> >  
> >  /* inode in-kernel data */
> >  
> > +/* shmem extended flags */
> > +#define SHM_F_INACCESSIBLE	0x0001  /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
> > +
> >  struct shmem_inode_info {
> >  	spinlock_t		lock;
> >  	unsigned int		seals;		/* shmem seals */
> > @@ -24,6 +27,7 @@ struct shmem_inode_info {
> >  	struct shared_policy	policy;		/* NUMA memory alloc policy */
> >  	struct simple_xattrs	xattrs;		/* list of xattrs */
> >  	atomic_t		stop_eviction;	/* hold when working on inode */
> > +	unsigned int		xflags;		/* shmem extended flags */
> >  	struct inode		vfs_inode;
> >  };
> >  
> 
> AFAICS, only two bits of 'flags' are used. And that's very strange that
> VM_ flags are used for the purpose. My guess that someone was lazy to
> introduce new constants for this.
> 
> I think we should fix this: introduce SHM_F_LOCKED and SHM_F_NORESERVE
> alongside with SHM_F_INACCESSIBLE and stuff them all into info->flags.
> It also makes shmem_file_setup_xflags() go away.

Did a quick search and it sounds like we only need SHM_F_LOCKED/SHM_F_NORESERVE,
and those definitely don't have to be VM_ flags.
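If we go that way, a minimal sketch could be (the values are hypothetical and
would need to avoid colliding with the bits shmem already keeps in info->flags):

	/* shmem-internal flags, independent of the VM_ namespace */
	#define SHM_F_LOCKED		0x0001
	#define SHM_F_NORESERVE		0x0002
	#define SHM_F_INACCESSIBLE	0x0004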

Chao
> 
> -- 
>  Kirill A. Shutemov



* Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
  2022-04-11 15:26   ` Kirill A. Shutemov
@ 2022-04-12 13:12     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-12 13:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Wanpeng Li, jun.nakajima, david, J . Bruce Fields,
	dave.hansen, H . Peter Anvin, ak, Jonathan Corbet, Joerg Roedel,
	x86, Hugh Dickins, Steven Price, Ingo Molnar,
	Maciej S . Szmigiero, Borislav Petkov, luto, Thomas Gleixner,
	Vitaly Kuznetsov, Vlastimil Babka, Jim Mattson,
	Sean Christopherson, Jeff Layton, Yu Zhang, Kirill A . Shutemov,
	Paolo Bonzini, Andrew Morton, Vishal Annapurve, Mike Rapoport

On Mon, Apr 11, 2022 at 06:26:47PM +0300, Kirill A. Shutemov wrote:
> On Thu, Mar 10, 2022 at 10:09:01PM +0800, Chao Peng wrote:
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 9b31a7056009..7b43e274c9a2 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
> >  	return page ? page_folio(page) : NULL;
> >  }
> >  
> > +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +	struct shmem_inode_info *info = SHMEM_I(inode);
> > +
> > +	memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> > +#endif
> 
> All these #ifdefs look ugly. Could you provide dummy memfile_* for
> !MEMFILE_NOTIFIER case?

Sure.
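For illustration, a sketch of such stubs (argument types inferred from how the
helpers are called in this patch, the void return type is an assumption; only
the !CONFIG_MEMFILE_NOTIFIER variants are new here) could be:

	#ifdef CONFIG_MEMFILE_NOTIFIER
	void memfile_notifier_fallocate(struct memfile_notifier_list *list,
					pgoff_t start, pgoff_t end);
	void memfile_notifier_invalidate(struct memfile_notifier_list *list,
					 pgoff_t start, pgoff_t end);
	#else
	static inline void memfile_notifier_fallocate(struct memfile_notifier_list *list,
						      pgoff_t start, pgoff_t end)
	{
	}
	static inline void memfile_notifier_invalidate(struct memfile_notifier_list *list,
						       pgoff_t start, pgoff_t end)
	{
	}
	#endif

Assuming the callers can obtain the notifier list in both configurations, the
#ifdef blocks in the shmem helpers can then go away.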

Chao
> 
> -- 
>  Kirill A. Shutemov



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-11 15:32     ` Kirill A. Shutemov
@ 2022-04-12 13:39       ` Chao Peng
  2022-04-12 19:28         ` Kirill A. Shutemov
  0 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-04-12 13:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> > maybe it's just the page migration path that needs to be updated?
> 
> My early version prevented migration with -ENOTSUPP for
> address_space_operations::migratepage().
> 
> What's wrong with that approach?

I previously thought migratepage would not be called since we already
marked the pages as UNMOVABLE; is that not correct?
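For reference, a sketch of that approach (hypothetical; it assumes shmem's
.migratepage currently points at the generic migrate_page() helper and uses
-EOPNOTSUPP as the error value) might be:

	static int shmem_migrate_page(struct address_space *mapping,
				      struct page *newpage, struct page *page,
				      enum migrate_mode mode)
	{
		/* Refuse to migrate pages of an inaccessible (guest private) file. */
		if (SHMEM_I(mapping->host)->xflags & SHM_F_INACCESSIBLE)
			return -EOPNOTSUPP;

		return migrate_page(mapping, newpage, page, mode);
	}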

Thanks,
Chao
> 
> -- 
>  Kirill A. Shutemov



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-08 18:54         ` David Hildenbrand
@ 2022-04-12 14:36           ` Jason Gunthorpe
  2022-04-12 21:27             ` Andy Lutomirski
  2022-04-13 16:24             ` David Hildenbrand
  0 siblings, 2 replies; 118+ messages in thread
From: Jason Gunthorpe @ 2022-04-12 14:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Andy Lutomirski, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen

On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:

> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
> past already with secretmem, it's not 100% that good of a fit (unmovable
> is worth than mlocked). But it gets the job done for now at least.

No, it doesn't. There are too many different interpretations of how
MEMLOCK is supposed to work.

E.g. VFIO accounts per-process, so hostile users can just fork to go
past it.

RDMA is per-process but uses a different counter, so you can double up.

io_uring is per-user and uses a 3rd counter, so it can triple up on
the above two.

> So I'm open for alternative to limit the amount of unmovable memory we
> might allocate for user space, and then we could convert seretmem as well.

I think it has to be cgroup based considering where we are now :\

Jason



* Re: [PATCH v5 02/13] mm: Introduce memfile_notifier
  2022-03-10 14:09 ` [PATCH v5 02/13] mm: Introduce memfile_notifier Chao Peng
  2022-03-29 18:45   ` Sean Christopherson
@ 2022-04-12 14:36   ` Hillf Danton
  2022-04-13  6:47     ` Chao Peng
  1 sibling, 1 reply; 118+ messages in thread
From: Hillf Danton @ 2022-04-12 14:36 UTC (permalink / raw)
  To: Chao Peng; +Cc: linux-kernel, linux-mm, Kirill A . Shutemov

On Thu, 10 Mar 2022 22:09:00 +0800 Chao Peng wrote:
> +
> +void memfile_register_backing_store(struct memfile_backing_store *bs)
> +{
> +	BUG_ON(!bs || !bs->get_notifier_list);
> +
> +	list_add_tail(&bs->list, &backing_store_list);
> +}
> +
> +void memfile_unregister_backing_store(struct memfile_backing_store *bs)
> +{
> +	list_del(&bs->list);
> +}
> +
> +static int memfile_get_notifier_info(struct inode *inode,

Nit, s/get/lookup/

> +				     struct memfile_notifier_list **list,
> +				     struct memfile_pfn_ops **ops)
> +{
> +	struct memfile_backing_store *bs, *iter;
> +	struct memfile_notifier_list *tmp;
> +
> +	list_for_each_entry_safe(bs, iter, &backing_store_list, list) {

Wonder what serializes list walk with list del and add above.

> +		tmp = bs->get_notifier_list(inode);
> +		if (tmp) {
> +			*list = tmp;
> +			if (ops)
> +				*ops = &bs->pfn_ops;
> +			return 0;
> +		}
> +	}
> +	return -EOPNOTSUPP;
> +}



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-12 13:39       ` Chao Peng
@ 2022-04-12 19:28         ` Kirill A. Shutemov
  2022-04-13  9:15           ` Chao Peng
  0 siblings, 1 reply; 118+ messages in thread
From: Kirill A. Shutemov @ 2022-04-12 19:28 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Tue, Apr 12, 2022 at 09:39:25PM +0800, Chao Peng wrote:
> On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> > > maybe it's just the page migration path that needs to be updated?
> > 
> > My early version prevented migration with -ENOTSUPP for
> > address_space_operations::migratepage().
> > 
> > What's wrong with that approach?
> 
> I previously thought migratepage will not be called since we already
> marked the pages as UNMOVABLE, sounds not correct?

Do you mean missing __GFP_MOVABLE? I can be wrong, but I don't see that it
directly affects whether the page is migratable. It is a hint to the page
allocator to group unmovable pages into separate page blocks and improve
availability of higher-order pages this way. The page allocator tries to
allocate unmovable pages from page blocks that already have unmovable pages.

-- 
 Kirill A. Shutemov



* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-03-28 20:16 ` Andy Lutomirski
  2022-03-28 22:48   ` Nakajima, Jun
  2022-04-08 21:35   ` Vishal Annapurve
@ 2022-04-12 19:58   ` Kirill A. Shutemov
  2 siblings, 0 replies; 118+ messages in thread
From: Kirill A. Shutemov @ 2022-04-12 19:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, Wanpeng Li, jun.nakajima, kvm, david, qemu-devel,
	J . Bruce Fields, linux-mm, H . Peter Anvin, ak, Jonathan Corbet,
	Joerg Roedel, x86, Hugh Dickins, Steven Price, Ingo Molnar,
	Maciej S . Szmigiero, Borislav Petkov, Thomas Gleixner,
	Vitaly Kuznetsov, Vlastimil Babka, Jim Mattson, dave.hansen,
	linux-api, Jeff Layton, linux-kernel, Yu Zhang,
	Kirill A . Shutemov, Sean Christopherson, linux-fsdevel,
	Paolo Bonzini, Andrew Morton, Vishal Annapurve, Mike Rapoport

On Mon, Mar 28, 2022 at 01:16:48PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This is the v5 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> 
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.

It would require enlightenment of the guest code. We have two options.

The simple one is to limit enabling to the guest kernel, but it would
require non-destructive shared->private memory conversion. This does not
seem to be compatible with the current design.

The other option is to make the memory private from time 0 of VM boot, but
that requires modifying the virtual BIOS to set up shared ranges as needed.
I'm not sure if anybody will volunteer to work on the BIOS code to make it
happen.

Hm.

-- 
 Kirill A. Shutemov



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-12 14:36           ` Jason Gunthorpe
@ 2022-04-12 21:27             ` Andy Lutomirski
  2022-04-13 16:30               ` David Hildenbrand
  2022-04-13 16:24             ` David Hildenbrand
  1 sibling, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-12 21:27 UTC (permalink / raw)
  To: Jason Gunthorpe, David Hildenbrand
  Cc: Sean Christopherson, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, Eric W. Biederman,
	Linux API

On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worth than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations how
> MELOCK is supposed to work
>
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
>
> RDMA is per-process but uses a different counter, so you can double up
>
> iouring is per-user and users a 3rd counter, so it can triple up on
> the above two
>
>> So I'm open for alternative to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert seretmem as well.
>
> I think it has to be cgroup based considering where we are now :\
>

So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable.  It might even be swappable (in the potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

memory.unmoveable
memory.locked

These can be set to an actual number or they can be set to the special value ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate unmovable or locked memory via this new API (and potentially other new APIs).  If it's 0, then they can't.  If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed.  The default at boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do so.  This avoids introducing a DoS, makes it easy to run tests without configuring cgroups, and lets serious users set up their cgroups.

Nothing is charged per mm.

To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.
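As a very rough sketch of what the charging side of this proposal could look
like (everything here is hypothetical: the unmovable page_counter, the
MEMCG_ROOT_CAP sentinel, and the helper name do not exist today):

	/* Hypothetical: charge an unmovable allocation against a new memcg counter. */
	static int mem_cgroup_charge_unmovable(struct mem_cgroup *memcg,
					       unsigned long nr_pages)
	{
		if (READ_ONCE(memcg->unmovable_limit) == MEMCG_ROOT_CAP)
			return capable(CAP_SYS_RESOURCE) ? 0 : -EPERM;

		if (page_counter_try_charge(&memcg->unmovable, nr_pages, NULL))
			return 0;

		return -ENOMEM;
	}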



* Re: [PATCH v5 02/13] mm: Introduce memfile_notifier
  2022-04-12 14:36   ` Hillf Danton
@ 2022-04-13  6:47     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-13  6:47 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-kernel, linux-mm, Kirill A . Shutemov

On Tue, Apr 12, 2022 at 10:36:54PM +0800, Hillf Danton wrote:
> On Thu, 10 Mar 2022 22:09:00 +0800 Chao Peng wrote:
> > +
> > +void memfile_register_backing_store(struct memfile_backing_store *bs)
> > +{
> > +	BUG_ON(!bs || !bs->get_notifier_list);
> > +
> > +	list_add_tail(&bs->list, &backing_store_list);
> > +}
> > +
> > +void memfile_unregister_backing_store(struct memfile_backing_store *bs)
> > +{
> > +	list_del(&bs->list);
> > +}
> > +
> > +static int memfile_get_notifier_info(struct inode *inode,
> 
> Nit, s/get/lookup/

Thanks.

> 
> > +				     struct memfile_notifier_list **list,
> > +				     struct memfile_pfn_ops **ops)
> > +{
> > +	struct memfile_backing_store *bs, *iter;
> > +	struct memfile_notifier_list *tmp;
> > +
> > +	list_for_each_entry_safe(bs, iter, &backing_store_list, list) {
> 
> Wonder what serializes list walk with list del and add above.

Yes, this needs locking if we want to support backing stores as modules.
As Sean pointed out, at this time that is not quite meaningful, so I will
remove unregister. Also, the register should be done at kernel init time,
so the walk here is actually on a read-only list.

Thanks,
Chao

> 
> > +		tmp = bs->get_notifier_list(inode);
> > +		if (tmp) {
> > +			*list = tmp;
> > +			if (ops)
> > +				*ops = &bs->pfn_ops;
> > +			return 0;
> > +		}
> > +	}
> > +	return -EOPNOTSUPP;
> > +}



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-12 19:28         ` Kirill A. Shutemov
@ 2022-04-13  9:15           ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-13  9:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david

On Tue, Apr 12, 2022 at 10:28:21PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 12, 2022 at 09:39:25PM +0800, Chao Peng wrote:
> > On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > > > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the swap, so
> > > > maybe it's just the page migration path that needs to be updated?
> > > 
> > > My early version prevented migration with -ENOTSUPP for
> > > address_space_operations::migratepage().
> > > 
> > > What's wrong with that approach?
> > 
> > I previously thought migratepage will not be called since we already
> > marked the pages as UNMOVABLE, sounds not correct?
> 
> Do you mean missing __GFP_MOVABLE?

Yes.

> I can be wrong, but I don't see that it
> direclty affects if the page is migratable. It is a hint to page allocator
> to group unmovable pages to separate page block and impove availablity of
> higher order pages this way. Page allocator tries to allocate unmovable
> pages from pages blocks that already have unmovable pages.

OK, thanks.

Chao
> 
> -- 
>  Kirill A. Shutemov



* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-12 14:36           ` Jason Gunthorpe
  2022-04-12 21:27             ` Andy Lutomirski
@ 2022-04-13 16:24             ` David Hildenbrand
  2022-04-13 17:52               ` Jason Gunthorpe
  1 sibling, 1 reply; 118+ messages in thread
From: David Hildenbrand @ 2022-04-13 16:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sean Christopherson, Andy Lutomirski, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen

On 12.04.22 16:36, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
> 
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worth than mlocked). But it gets the job done for now at least.
> 
> No, it doesn't. There are too many different interpretations how
> MELOCK is supposed to work
> 
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
> 
> RDMA is per-process but uses a different counter, so you can double up
> 
> iouring is per-user and users a 3rd counter, so it can triple up on
> the above two

Thanks for that summary, very helpful.

> 
>> So I'm open for alternative to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert seretmem as well.
> 
> I think it has to be cgroup based considering where we are now :\

Most probably. I think the important lessons we learned are that

* mlocked != unmovable.
* RLIMIT_MEMLOCK should most probably never have been abused for
  unmovable memory (especially, long-term pinning)


-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-12 21:27             ` Andy Lutomirski
@ 2022-04-13 16:30               ` David Hildenbrand
  0 siblings, 0 replies; 118+ messages in thread
From: David Hildenbrand @ 2022-04-13 16:30 UTC (permalink / raw)
  To: Andy Lutomirski, Jason Gunthorpe
  Cc: Sean Christopherson, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, Eric W. Biederman

> 
> So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable.  It might even be swappable (in the potentially distant future).

Right. And on a system without swap we don't particularly care about
mlock, but we might (in most cases) care about fragmentation with
unmovable memory.

> 
> Anyway, here's a concrete proposal, with a bit of handwaving:

Thanks for investing some brainpower.

> 
> We add new cgroup limits:
> 
> memory.unmoveable
> memory.locked
> 
> These can be set to an actual number or they can be set to the special value ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate movable or locked memory with this (and potentially other) new APIs.  If it's 0, then they can't.  If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed.  The default at boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do so.  This avoids introducing a DoS, makes it easy to run tests without configuring cgroup, and lets serious users set up their cgroups.

I wonder what the implications are for existing user space.

Assume we want to move page pinning (rdma, vfio, io_uring, ...) to the
new model. How can we be sure

a) We don't break existing user space
b) We don't open the doors unnoticed for the admin to go crazy on
   unmovable memory.

Any ideas?

> 
> Nothing is charge per mm.
> 
> To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.

Right, the support for migration and/or swap defines how to account.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-13 16:24             ` David Hildenbrand
@ 2022-04-13 17:52               ` Jason Gunthorpe
  2022-04-25 14:07                 ` David Hildenbrand
  0 siblings, 1 reply; 118+ messages in thread
From: Jason Gunthorpe @ 2022-04-13 17:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Andy Lutomirski, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen

On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
> On 12.04.22 16:36, Jason Gunthorpe wrote:
> > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
> > 
> >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
> >> past already with secretmem, it's not 100% that good of a fit (unmovable
> >> is worth than mlocked). But it gets the job done for now at least.
> > 
> > No, it doesn't. There are too many different interpretations how
> > MELOCK is supposed to work
> > 
> > eg VFIO accounts per-process so hostile users can just fork to go past
> > it.
> > 
> > RDMA is per-process but uses a different counter, so you can double up
> > 
> > iouring is per-user and users a 3rd counter, so it can triple up on
> > the above two
> 
> Thanks for that summary, very helpful.

I kicked off a big discussion when I suggested changing vfio to use the
same scheme as io_uring.

We may still end up trying it, but the major concern is that libvirt sets
RLIMIT_MEMLOCK, and if we touch anything here (including fixing RDMA, or
anything really) it becomes a uAPI break for libvirt..

> >> So I'm open for alternative to limit the amount of unmovable memory we
> >> might allocate for user space, and then we could convert seretmem as well.
> > 
> > I think it has to be cgroup based considering where we are now :\
> 
> Most probably. I think the important lessons we learned are that
> 
> * mlocked != unmovable.
> * RLIMIT_MEMLOCK should most probably never have been abused for
>   unmovable memory (especially, long-term pinning)

The trouble is I'm not sure how anything can correctly/meaningfully
set a limit.

Consider qemu, where we might have 3 different things all pinning the
same page (rdma, io_uring, vfio): should the cgroup give 3x the limit?
What use is that really?

IMHO there are only two meaningful scenarios: either you are unprivileged
and limited to a very small number for your user/cgroup, or you are
privileged and you can do whatever you want.

The idea that we can fine-tune this to exactly the right amount for a
workload does not seem realistic, and it ends up exporting internal kernel
decisions into a uAPI..

Jason



* Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
  2022-03-10 14:09 ` [PATCH v5 03/13] mm/shmem: Support memfile_notifier Chao Peng
  2022-03-10 23:08   ` Dave Chinner
  2022-04-11 15:26   ` Kirill A. Shutemov
@ 2022-04-19 22:40   ` Vishal Annapurve
  2022-04-20  3:24     ` Chao Peng
  2 siblings, 1 reply; 118+ messages in thread
From: Vishal Annapurve @ 2022-04-19 22:40 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022 at 6:10 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> It maintains a memfile_notifier list in shmem_inode_info structure and
> implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> then exposes them to memfile_notifier via
> shmem_get_memfile_notifier_info.
>
> We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> allocated by userspace for private memory. If there is no pages
> allocated at the offset then error should be returned so KVM knows that
> the memory is not private memory.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/shmem_fs.h |  4 +++
>  mm/shmem.c               | 76 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 2dde843f28ef..7bb16f2d2825 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>  #include <linux/percpu_counter.h>
>  #include <linux/xattr.h>
>  #include <linux/fs_parser.h>
> +#include <linux/memfile_notifier.h>
>
>  /* inode in-kernel data */
>
> @@ -28,6 +29,9 @@ struct shmem_inode_info {
>         struct simple_xattrs    xattrs;         /* list of xattrs */
>         atomic_t                stop_eviction;  /* hold when working on inode */
>         unsigned int            xflags;         /* shmem extended flags */
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +       struct memfile_notifier_list memfile_notifiers;
> +#endif
>         struct inode            vfs_inode;
>  };
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9b31a7056009..7b43e274c9a2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>         return page ? page_folio(page) : NULL;
>  }
>
> +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +       struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +       memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> +#endif
> +}
> +
> +static void notify_invalidate_page(struct inode *inode, struct folio *folio,
> +                                  pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +       struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +       start = max(start, folio->index);
> +       end = min(end, folio->index + folio_nr_pages(folio));
> +
> +       memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> +#endif
> +}
> +
>  /*
>   * Remove range of pages and swap entries from page cache, and free them.
>   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -946,6 +968,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>                         }
>                         index += folio_nr_pages(folio) - 1;
>
> +                       notify_invalidate_page(inode, folio, start, end);
> +
>                         if (!unfalloc || !folio_test_uptodate(folio))
>                                 truncate_inode_folio(mapping, folio);
>                         folio_unlock(folio);
> @@ -1019,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>                                         index--;
>                                         break;
>                                 }
> +
> +                               notify_invalidate_page(inode, folio, start, end);
> +

Should this be done in batches, or once for the whole range [start, end)?

>                                 VM_BUG_ON_FOLIO(folio_test_writeback(folio),
>                                                 folio);
>                                 truncate_inode_folio(mapping, folio);
> @@ -2279,6 +2306,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>                 info->flags = flags & VM_NORESERVE;
>                 INIT_LIST_HEAD(&info->shrinklist);
>                 INIT_LIST_HEAD(&info->swaplist);
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +               memfile_notifier_list_init(&info->memfile_notifiers);
> +#endif
>                 simple_xattrs_init(&info->xattrs);
>                 cache_no_acl(inode);
>                 mapping_set_large_folios(inode->i_mapping);
> @@ -2802,6 +2832,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>         if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
>                 i_size_write(inode, offset + len);
>         inode->i_ctime = current_time(inode);
> +       notify_fallocate(inode, start, end);
>  undone:
>         spin_lock(&inode->i_lock);
>         inode->i_private = NULL;
> @@ -3909,6 +3940,47 @@ static struct file_system_type shmem_fs_type = {
>         .fs_flags       = FS_USERNS_MOUNT,
>  };
>
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
> +{
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
> +       if (ret)
> +               return ret;
> +
> +       *order = thp_order(compound_head(page));
> +
> +       return page_to_pfn(page);
> +}
> +
> +static void shmem_put_unlock_pfn(unsigned long pfn)
> +{
> +       struct page *page = pfn_to_page(pfn);
> +
> +       VM_BUG_ON_PAGE(!PageLocked(page), page);
> +
> +       set_page_dirty(page);
> +       unlock_page(page);
> +       put_page(page);
> +}
> +
> +static struct memfile_notifier_list* shmem_get_notifier_list(struct inode *inode)
> +{
> +       if (!shmem_mapping(inode->i_mapping))
> +               return NULL;
> +
> +       return  &SHMEM_I(inode)->memfile_notifiers;
> +}
> +
> +static struct memfile_backing_store shmem_backing_store = {
> +       .pfn_ops.get_lock_pfn = shmem_get_lock_pfn,
> +       .pfn_ops.put_unlock_pfn = shmem_put_unlock_pfn,
> +       .get_notifier_list = shmem_get_notifier_list,
> +};
> +#endif /* CONFIG_MEMFILE_NOTIFIER */
> +
>  int __init shmem_init(void)
>  {
>         int error;
> @@ -3934,6 +4006,10 @@ int __init shmem_init(void)
>         else
>                 shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>  #endif
> +
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +       memfile_register_backing_store(&shmem_backing_store);
> +#endif
>         return 0;
>
>  out1:
> --
> 2.17.1
>



* Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
  2022-03-10 14:09 ` [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd Chao Peng
  2022-03-29 19:23   ` Sean Christopherson
  2022-04-05 23:45   ` Michael Roth
@ 2022-04-19 22:43   ` Vishal Annapurve
  2022-04-20  3:17     ` Chao Peng
  2 siblings, 1 reply; 118+ messages in thread
From: Vishal Annapurve @ 2022-04-19 22:43 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022 at 6:11 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> KVM gets notified when memory pages changed in the memory backing store.
> When userspace allocates the memory with fallocate() or frees memory
> with fallocate(FALLOC_FL_PUNCH_HOLE), memory backing store calls into
> KVM fallocate/invalidate callbacks respectively. To ensure KVM never
> maps both the private and shared variants of a GPA into the guest, in
> the fallocate callback, we should zap the existing shared mapping and
> in the invalidate callback we should zap the existing private mapping.
>
> In the callbacks, KVM firstly converts the offset range into the
> gfn_range and then calls existing kvm_unmap_gfn_range() which will zap
> the shared or private mapping. Both callbacks pass in a memslot
> reference but we need 'kvm' so add a reference in memslot structure.
>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |  3 ++-
>  virt/kvm/kvm_main.c      | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9b175aeca63f..186b9b981a65 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
>  struct kvm_gfn_range {
>         struct kvm_memory_slot *slot;
>         gfn_t start;
> @@ -568,6 +568,7 @@ struct kvm_memory_slot {
>         loff_t private_offset;
>         struct memfile_pfn_ops *pfn_ops;
>         struct memfile_notifier notifier;
> +       struct kvm *kvm;
>  };
>
>  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 67349421eae3..52319f49d58a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
>  #ifdef CONFIG_MEMFILE_NOTIFIER
> +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> +                                        pgoff_t start, pgoff_t end)
> +{
> +       int idx;
> +       struct kvm_memory_slot *slot = container_of(notifier,
> +                                                   struct kvm_memory_slot,
> +                                                   notifier);
> +       struct kvm_gfn_range gfn_range = {
> +               .slot           = slot,
> +               .start          = start - (slot->private_offset >> PAGE_SHIFT),
> +               .end            = end - (slot->private_offset >> PAGE_SHIFT),
> +               .may_block      = true,
> +       };
> +       struct kvm *kvm = slot->kvm;
> +
> +       gfn_range.start = max(gfn_range.start, slot->base_gfn);

gfn_range.start seems to be page offset within the file. Should this rather be:
gfn_range.start = slot->base_gfn + min(gfn_range.start, slot->npages);

> +       gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> +

Similar to previous comment, should this rather be:
gfn_range.end = slot->base_gfn + min(gfn_range.end, slot->npages);

> +       if (gfn_range.start >= gfn_range.end)
> +               return;
> +
> +       idx = srcu_read_lock(&kvm->srcu);
> +       KVM_MMU_LOCK(kvm);
> +       kvm_unmap_gfn_range(kvm, &gfn_range);
> +       kvm_flush_remote_tlbs(kvm);
> +       KVM_MMU_UNLOCK(kvm);
> +       srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
> +       .invalidate = kvm_memfile_notifier_handler,
> +       .fallocate = kvm_memfile_notifier_handler,
> +};
> +
>  static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
>  {
> +       slot->notifier.ops = &kvm_memfile_notifier_ops;
>         return memfile_register_notifier(file_inode(slot->private_file),
>                                          &slot->notifier,
>                                          &slot->pfn_ops);
> @@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>         new->private_file = file;
>         new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
>                               region_ext->private_offset : 0;
> +       new->kvm = kvm;
>
>         r = kvm_set_memslot(kvm, old, new, change);
>         if (!r)
> --
> 2.17.1
>



* Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
  2022-04-19 22:43   ` Vishal Annapurve
@ 2022-04-20  3:17     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-20  3:17 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david

On Tue, Apr 19, 2022 at 03:43:56PM -0700, Vishal Annapurve wrote:
> On Thu, Mar 10, 2022 at 6:11 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > KVM gets notified when memory pages changed in the memory backing store.
> > When userspace allocates the memory with fallocate() or frees memory
> > with fallocate(FALLOC_FL_PUNCH_HOLE), memory backing store calls into
> > KVM fallocate/invalidate callbacks respectively. To ensure KVM never
> > maps both the private and shared variants of a GPA into the guest, in
> > the fallocate callback, we should zap the existing shared mapping and
> > in the invalidate callback we should zap the existing private mapping.
> >
> > In the callbacks, KVM firstly converts the offset range into the
> > gfn_range and then calls existing kvm_unmap_gfn_range() which will zap
> > the shared or private mapping. Both callbacks pass in a memslot
> > reference but we need 'kvm' so add a reference in memslot structure.
> >
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |  3 ++-
> >  virt/kvm/kvm_main.c      | 36 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 38 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 9b175aeca63f..186b9b981a65 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
> >  struct kvm_gfn_range {
> >         struct kvm_memory_slot *slot;
> >         gfn_t start;
> > @@ -568,6 +568,7 @@ struct kvm_memory_slot {
> >         loff_t private_offset;
> >         struct memfile_pfn_ops *pfn_ops;
> >         struct memfile_notifier notifier;
> > +       struct kvm *kvm;
> >  };
> >
> >  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 67349421eae3..52319f49d58a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >
> >  #ifdef CONFIG_MEMFILE_NOTIFIER
> > +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> > +                                        pgoff_t start, pgoff_t end)
> > +{
> > +       int idx;
> > +       struct kvm_memory_slot *slot = container_of(notifier,
> > +                                                   struct kvm_memory_slot,
> > +                                                   notifier);
> > +       struct kvm_gfn_range gfn_range = {
> > +               .slot           = slot,
> > +               .start          = start - (slot->private_offset >> PAGE_SHIFT),
> > +               .end            = end - (slot->private_offset >> PAGE_SHIFT),
> > +               .may_block      = true,
> > +       };
> > +       struct kvm *kvm = slot->kvm;
> > +
> > +       gfn_range.start = max(gfn_range.start, slot->base_gfn);
> 
> gfn_range.start seems to be page offset within the file. Should this rather be:
> gfn_range.start = slot->base_gfn + min(gfn_range.start, slot->npages);

Right. For start we don't really need to care about the upper bound
here (the check below handles that), so this should be enough:
	gfn_range.start = slot->base_gfn + gfn_range.start;

> 
> > +       gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> > +
> 
> Similar to previous comment, should this rather be:
> gfn_range.end = slot->base_gfn + min(gfn_range.end, slot->npages);

This is correct.
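Putting the two corrections together, the conversion would then look roughly
like this (a sketch based on the discussion above; min_t() is used here to
keep the gfn_t/unsigned long types consistent):

	gfn_range.start = start - (slot->private_offset >> PAGE_SHIFT);
	gfn_range.end = end - (slot->private_offset >> PAGE_SHIFT);
	gfn_range.start = slot->base_gfn + gfn_range.start;
	gfn_range.end = slot->base_gfn + min_t(gfn_t, gfn_range.end, slot->npages);

	if (gfn_range.start >= gfn_range.end)
		return;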

Thanks,
Chao
> 
> > +       if (gfn_range.start >= gfn_range.end)
> > +               return;
> > +
> > +       idx = srcu_read_lock(&kvm->srcu);
> > +       KVM_MMU_LOCK(kvm);
> > +       kvm_unmap_gfn_range(kvm, &gfn_range);
> > +       kvm_flush_remote_tlbs(kvm);
> > +       KVM_MMU_UNLOCK(kvm);
> > +       srcu_read_unlock(&kvm->srcu, idx);
> > +}
> > +
> > +static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
> > +       .invalidate = kvm_memfile_notifier_handler,
> > +       .fallocate = kvm_memfile_notifier_handler,
> > +};
> > +
> >  static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
> >  {
> > +       slot->notifier.ops = &kvm_memfile_notifier_ops;
> >         return memfile_register_notifier(file_inode(slot->private_file),
> >                                          &slot->notifier,
> >                                          &slot->pfn_ops);
> > @@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >         new->private_file = file;
> >         new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
> >                               region_ext->private_offset : 0;
> > +       new->kvm = kvm;
> >
> >         r = kvm_set_memslot(kvm, old, new, change);
> >         if (!r)
> > --
> > 2.17.1
> >



* Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
  2022-04-19 22:40   ` Vishal Annapurve
@ 2022-04-20  3:24     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-20  3:24 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david

On Tue, Apr 19, 2022 at 03:40:09PM -0700, Vishal Annapurve wrote:
> On Thu, Mar 10, 2022 at 6:10 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > It maintains a memfile_notifier list in shmem_inode_info structure and
> > implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> > then exposes them to memfile_notifier via
> > shmem_get_memfile_notifier_info.
> >
> > We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> > allocated by userspace for private memory. If there is no pages
> > allocated at the offset then error should be returned so KVM knows that
> > the memory is not private memory.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/shmem_fs.h |  4 +++
> >  mm/shmem.c               | 76 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 80 insertions(+)
> >
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index 2dde843f28ef..7bb16f2d2825 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/percpu_counter.h>
> >  #include <linux/xattr.h>
> >  #include <linux/fs_parser.h>
> > +#include <linux/memfile_notifier.h>
> >
> >  /* inode in-kernel data */
> >
> > @@ -28,6 +29,9 @@ struct shmem_inode_info {
> >         struct simple_xattrs    xattrs;         /* list of xattrs */
> >         atomic_t                stop_eviction;  /* hold when working on inode */
> >         unsigned int            xflags;         /* shmem extended flags */
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +       struct memfile_notifier_list memfile_notifiers;
> > +#endif
> >         struct inode            vfs_inode;
> >  };
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 9b31a7056009..7b43e274c9a2 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
> >         return page ? page_folio(page) : NULL;
> >  }
> >
> > +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +       struct shmem_inode_info *info = SHMEM_I(inode);
> > +
> > +       memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> > +#endif
> > +}
> > +
> > +static void notify_invalidate_page(struct inode *inode, struct folio *folio,
> > +                                  pgoff_t start, pgoff_t end)
> > +{
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +       struct shmem_inode_info *info = SHMEM_I(inode);
> > +
> > +       start = max(start, folio->index);
> > +       end = min(end, folio->index + folio_nr_pages(folio));
> > +
> > +       memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> > +#endif
> > +}
> > +
> >  /*
> >   * Remove range of pages and swap entries from page cache, and free them.
> >   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> > @@ -946,6 +968,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> >                         }
> >                         index += folio_nr_pages(folio) - 1;
> >
> > +                       notify_invalidate_page(inode, folio, start, end);
> > +
> >                         if (!unfalloc || !folio_test_uptodate(folio))
> >                                 truncate_inode_folio(mapping, folio);
> >                         folio_unlock(folio);
> > @@ -1019,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> >                                         index--;
> >                                         break;
> >                                 }
> > +
> > +                               notify_invalidate_page(inode, folio, start, end);
> > +
> 
> Should this be done in batches or done once for all of range [start, end)?

Batching is definitely preferred. Will look at that.
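
A minimal sketch of that direction (same shape as notify_fallocate() in this
patch, but issuing memfile_notifier_invalidate() once for a whole range;
untested, just to illustrate the shape):

        static void notify_invalidate_range(struct inode *inode,
                                            pgoff_t start, pgoff_t end)
        {
        #ifdef CONFIG_MEMFILE_NOTIFIER
                struct shmem_inode_info *info = SHMEM_I(inode);

                /* One notification covering the whole [start, end) range. */
                memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
        #endif
        }

shmem_undo_range() would then call this once per punched/truncated range
instead of invoking notify_invalidate_page() for every folio.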

Thanks,
Chao
> 
> >                                 VM_BUG_ON_FOLIO(folio_test_writeback(folio),
> >                                                 folio);
> >                                 truncate_inode_folio(mapping, folio);
> > @@ -2279,6 +2306,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
> >                 info->flags = flags & VM_NORESERVE;
> >                 INIT_LIST_HEAD(&info->shrinklist);
> >                 INIT_LIST_HEAD(&info->swaplist);
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +               memfile_notifier_list_init(&info->memfile_notifiers);
> > +#endif
> >                 simple_xattrs_init(&info->xattrs);
> >                 cache_no_acl(inode);
> >                 mapping_set_large_folios(inode->i_mapping);
> > @@ -2802,6 +2832,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> >         if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> >                 i_size_write(inode, offset + len);
> >         inode->i_ctime = current_time(inode);
> > +       notify_fallocate(inode, start, end);
> >  undone:
> >         spin_lock(&inode->i_lock);
> >         inode->i_private = NULL;
> > @@ -3909,6 +3940,47 @@ static struct file_system_type shmem_fs_type = {
> >         .fs_flags       = FS_USERNS_MOUNT,
> >  };
> >
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
> > +{
> > +       struct page *page;
> > +       int ret;
> > +
> > +       ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
> > +       if (ret)
> > +               return ret;
> > +
> > +       *order = thp_order(compound_head(page));
> > +
> > +       return page_to_pfn(page);
> > +}
> > +
> > +static void shmem_put_unlock_pfn(unsigned long pfn)
> > +{
> > +       struct page *page = pfn_to_page(pfn);
> > +
> > +       VM_BUG_ON_PAGE(!PageLocked(page), page);
> > +
> > +       set_page_dirty(page);
> > +       unlock_page(page);
> > +       put_page(page);
> > +}
> > +
> > +static struct memfile_notifier_list* shmem_get_notifier_list(struct inode *inode)
> > +{
> > +       if (!shmem_mapping(inode->i_mapping))
> > +               return NULL;
> > +
> > +       return  &SHMEM_I(inode)->memfile_notifiers;
> > +}
> > +
> > +static struct memfile_backing_store shmem_backing_store = {
> > +       .pfn_ops.get_lock_pfn = shmem_get_lock_pfn,
> > +       .pfn_ops.put_unlock_pfn = shmem_put_unlock_pfn,
> > +       .get_notifier_list = shmem_get_notifier_list,
> > +};
> > +#endif /* CONFIG_MEMFILE_NOTIFIER */
> > +
> >  int __init shmem_init(void)
> >  {
> >         int error;
> > @@ -3934,6 +4006,10 @@ int __init shmem_init(void)
> >         else
> >                 shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> >  #endif
> > +
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +       memfile_register_backing_store(&shmem_backing_store);
> > +#endif
> >         return 0;
> >
> >  out1:
> > --
> > 2.17.1
> >


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-05 18:03                               ` Sean Christopherson
  2022-04-06 10:34                                 ` Quentin Perret
@ 2022-04-22 10:56                                 ` Chao Peng
  2022-04-22 11:06                                   ` Paolo Bonzini
                                                     ` (2 more replies)
  1 sibling, 3 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-22 10:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Quentin Perret, Andy Lutomirski, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Tue, Apr 05, 2022 at 06:03:21PM +0000, Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Quentin Perret wrote:
> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > > >>  - it can be very useful for protected VMs to do shared=>private
> > > >>    conversions. Think of a VM receiving some data from the host in a
> > > >>    shared buffer, and then it wants to operate on that buffer without
> > > >>    risking leaking confidential information in a transient state. In
> > > >>    that case the most logical thing to do is to convert the buffer back
> > > >>    to private, do whatever needs to be done on that buffer (decrypting a
> > > >>    frame, ...), and then share it back with the host to consume it;
> > > >
> > > > If performance is a motivation, why would the guest want to do two
> > > > conversions instead of just doing internal memcpy() to/from a private
> > > > page?  I would be quite surprised if multiple exits and TLB shootdowns is
> > > > actually faster, especially at any kind of scale where zapping stage-2
> > > > PTEs will cause lock contention and IPIs.
> > > 
> > > I don't know the numbers or all the details, but this is arm64, which is a
> > > rather better architecture than x86 in this regard.  So maybe it's not so
> > > bad, at least in very simple cases, ignoring all implementation details.
> > > (But see below.)  Also the systems in question tend to have fewer CPUs than
> > > some of the massive x86 systems out there.
> > 
> > Yep. I can try and do some measurements if that's really necessary, but
> > I'm really convinced the cost of the TLBI for the shared->private
> > conversion is going to be significantly smaller than the cost of memcpy
> > the buffer twice in the guest for us.
> 
> It's not just the TLB shootdown, the VM-Exits aren't free.   And barring non-trivial
> improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page tables will
> block all other updates and MMU operations.  Taking mmu_lock for read, should arm64
> ever convert to a rwlock, is not an option because KVM needs to block other
> conversions to avoid races.
> 
> Hmm, though batching multiple pages into a single request would mitigate most of
> the overhead.
> 
> > There are variations of that idea: e.g. allow userspace to mmap the
> > entire private fd but w/o taking a reference on pages mapped with
> > PROT_NONE. And then the VMM can use mprotect() in response to
> > share/unshare requests. I think Marc liked that idea as it keeps the
> > userspace API closer to normal KVM -- there actually is a
> > straightforward gpa->hva relation. Not sure how much that would impact
> > the implementation at this point.
> > 
> > For the shared=>private conversion, this would be something like so:
> > 
> >  - the guest issues a hypercall to unshare a page;
> > 
> >  - the hypervisor forwards the request to the host;
> > 
> >  - the host kernel forwards the request to userspace;
> > 
> >  - userspace then munmap()s the shared page;
> > 
> >  - KVM then tries to take a reference to the page. If it succeeds, it
> >    re-enters the guest with a flag of some sort saying that the share
> >    succeeded, and the hypervisor will adjust pgtables accordingly. If
> >    KVM failed to take a reference, it flags this and the hypervisor will
> >    be responsible for communicating that back to the guest. This means
> >    the guest must handle failures (possibly fatal).
> > 
> > (There are probably many ways in which we can optimize this, e.g. by
> > having the host proactively munmap() pages it no longer needs so that
> > the unshare hypercall from the guest doesn't need to exit all the way
> > back to host userspace.)
> 
> ...
> 
> > > Maybe there could be a special mode for the private memory fds in which
> > > specific pages are marked as "managed by this fd but actually shared".
> > > pread() and pwrite() would work on those pages, but not mmap().  (Or maybe
> > > mmap() but the resulting mappings would not permit GUP.)
> 
> Unless I misunderstand what you intend by pread()/pwrite(), I think we'd need to
> allow mmap(), otherwise e.g. uaccess from the kernel wouldn't work.
> 
> > > And transitioning them would be a special operation on the fd that is
> > > specific to pKVM and wouldn't work on TDX or SEV.
> 
> To keep things feature agnostic (IMO, baking TDX vs SEV vs pKVM info into private-fd
> is a really bad idea), this could be handled by adding a flag and/or callback into
> the notifier/client stating whether or not it supports mapping a private-fd, and then
> mapping would be allowed if and only if all consumers support/allow mapping.
> 
> > > Hmm.  Sean and Chao, are we making a bit of a mistake by making these fds
> > > technology-agnostic?  That is, would we want to distinguish between a TDX
> > > backing fd, a SEV backing fd, a software-based backing fd, etc?  API-wise
> > > this could work by requiring the fd to be bound to a KVM VM instance and
> > > possibly even configured a bit before any other operations would be
> > > allowed.
> 
> I really don't want to distinguish between each exact feature, but I've
> no objection to adding flags/callbacks to track specific properties of the
> downstream consumers, e.g. "can this memory be accessed by userspace" is a fine
> abstraction.  It also scales to multiple consumers (see above).

Thanks a lot for the discussion. I have summarized the requirements/gaps and
the potential changes for the next step. Please help review.


Terminologies:
--------------
  - memory conversion: the action of converting guest memory between private
    and shared.
  - explicit conversion: an enlightened guest uses a hypercall to explicitly
    request a memory conversion from the VMM.
  - implicit conversion: the conversion done when the VMM reacts to a page fault due
    to different guest/host memory attributes (private/shared).
  - destructive conversion: the memory content is lost/destroyed during
    conversion.
  - non-destructive conversion: the memory content is preserved during
    conversion.


Requirements & Gaps
-------------------------------------
  - Confidential computing (CC): TDX/SEV/CCA
    * Needs to support both explicit and implicit conversions.
    * Needs to support only destructive conversion at runtime.
    * The current patches should just work, but it would be preferable to have
      pre-boot guest payload/firmware population into private memory for
      performance.

  - pKVM
    * Supports explicit conversion only. Implicit conversion is hard to
      achieve since the page fault does not record the guest access info
      (private/shared), and it also makes little sense.
    * Expects to support non-destructive conversion at runtime. Additionally,
      in-place conversion (the underlying physical page is unchanged) is
      desirable since copying is undesirable. The current destructive
      conversion does not fit well.
    * The current callbacks between mm/KVM are useful and reusable for pKVM.
    * Pre-boot guest payload population is nice to have.


Change Proposal
---------------
Since pKVM diverges from the CC usages in several ways, and at this point it
is still not quite clear whether and how we will support pKVM with this
private memory patchset, this proposal does not imply any particular detailed
pKVM implementation. But at the API level, we want it to be possible to extend
this in the future for pKVM or other potential usages.

  - No new user APIs are introduced for the memory backing store, e.g. remove
    the current MFD_INACCESSIBLE. This info will be communicated from the
    memfile_notifier consumers to the backing store via the new 'flags' field
    in memfile_notifier described below. At creation time, the fd is a normal
    shared fd. At runtime CC usages will keep using the current
    fallocate/FALLOC_FL_PUNCH_HOLE to do the conversion, but pKVM may possibly
    use a different way (e.g. rely on mmap/munmap or mprotect as discussed).
    None of these are new APIs anyway.

  - Add a flags field to memfile_notifier so its consumers can state their requirements.

        struct memfile_notifier {
                struct list_head list;
                unsigned long flags;     /* consumer states its requirements here */
                struct memfile_notifier_ops *ops; /* future function may also extend ops when necessary */
        };

    For the current CC usage, we can define and set the flags below from KVM.

        /* memfile notifier flags */
        #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */
        #define MFN_F_UNMOVABLE           0x0002  /* memory allocated in the file is unmovable */
        #define MFN_F_UNRECLAIMABLE       0x0003  /* memory allocated in the file is unreclaimable (e.g. via kswapd or any other paths) */

    When memfile_notifier is being registered, memfile_register_notifier will
    need to check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
    a previous mmap-ed mapping exists on the fd (I'm still unclear on how to do
    this); a rough sketch of such a check follows at the end of this list. When
    multiple consumers are supported it also needs to check all registered
    consumers to see if there is any conflict (e.g. all consumers should have
    MFN_F_USER_INACCESSIBLE set). Only when the registration succeeds is the fd
    converted into a private fd; before that, the fd is just a normal (shared)
    one. During this conversion, the previous data is preserved so you can put
    some initial data in guest pages (whether the architecture allows this is
    architecture-specific and out of the scope of this patch).

  - Pre-boot guest payload population is done by normal mmap/munmap on the fd
    before it is converted into a private fd, which happens when KVM registers
    itself with the backing store.

  - Implicit conversion: maybe it's worth discussing again: how about removing
    implicit conversion support entirely? TDX should be OK; I'm unsure about
    SEV/CCA. pKVM should be happy to see it go. Removing it also makes the work
    much easier and helps catch guest bugs/unintended behaviors early. If it
    turns out there is a reason to keep it, then for pKVM we can make it an
    optional feature (e.g. via a new module param). But that can be added when
    pKVM really gets supported.

  - Non-destructive in-place conversion: out of scope for this series; pKVM
    can invent other pKVM-specific interfaces (either extend memfile_notifier
    and use mmap/mprotect, or use totally different ways like access through a
    vmfd as Sean suggested).
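
For the MFN_F_USER_INACCESSIBLE check mentioned above, here is a rough,
untested sketch of what the registration-time validation could look like. It
assumes the proposed memfile_notifier layout, plus a 'head' list_head inside
memfile_notifier_list (not spelled out above), and uses mapping_mapped() as a
stand-in for "a previous mmap-ed mapping exists on the fd":

        static int memfile_check_flags(struct memfile_notifier_list *list,
                                       struct address_space *mapping,
                                       struct memfile_notifier *new)
        {
                struct memfile_notifier *cur;

                /* Refuse user-inaccessible if userspace already mmap-ed the fd. */
                if ((new->flags & MFN_F_USER_INACCESSIBLE) &&
                    mapping_mapped(mapping))
                        return -EBUSY;

                /* All registered consumers must agree on this property. */
                list_for_each_entry(cur, &list->head, list)
                        if ((cur->flags ^ new->flags) & MFN_F_USER_INACCESSIBLE)
                                return -EINVAL;

                return 0;
        }

memfile_register_notifier would call this before linking the new notifier in,
and only treat the fd as private after it returns 0.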

Thanks,
Chao


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-22 10:56                                 ` Chao Peng
@ 2022-04-22 11:06                                   ` Paolo Bonzini
  2022-04-24  8:07                                     ` Chao Peng
  2022-04-24 16:59                                   ` Andy Lutomirski
  2022-05-09 22:30                                   ` Michael Roth
  2 siblings, 1 reply; 118+ messages in thread
From: Paolo Bonzini @ 2022-04-22 11:06 UTC (permalink / raw)
  To: Chao Peng, Sean Christopherson
  Cc: Quentin Perret, Andy Lutomirski, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, the arch/x86 maintainers, H. Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Mike Rapoport, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A. Shutemov, Nakajima, Jun,
	Dave Hansen, Andi Kleen, David Hildenbrand, Marc Zyngier,
	Will Deacon

On 4/22/22 12:56, Chao Peng wrote:
>          /* memfile notifier flags */
>          #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */
>          #define MFN_F_UNMOVABLE           0x0002  /* memory allocated in the file is unmovable */
>          #define MFN_F_UNRECLAIMABLE       0x0003  /* memory allocated in the file is unreclaimable (e.g. via kswapd or any other pathes) */

You probably mean BIT(0/1/2) here.

Paolo

>      When memfile_notifier is being registered, memfile_register_notifier will
>      need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
>      previous mmap-ed mapping exists on the fd (I'm still unclear on how to do
>      this). When multiple consumers are supported it also need check all
>      registered consumers to see if any conflict (e.g. all consumers should have
>      MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd is
>      converted into a private fd, before that, the fd is just a normal (shared)
>      one. During this conversion, the previous data is preserved so you can put
>      some initial data in guest pages (whether the architecture allows this is
>      architecture-specific and out of the scope of this patch).



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-03-10 14:08 ` [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
  2022-04-11 15:10   ` Kirill A. Shutemov
@ 2022-04-23  5:43   ` Vishal Annapurve
  2022-04-24  8:15     ` Chao Peng
  1 sibling, 1 reply; 118+ messages in thread
From: Vishal Annapurve @ 2022-04-23  5:43 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david

On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this flag set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>
> Since page migration/swapping is not yet supported for such usages,
> these pages are currently marked as UNMOVABLE and UNEVICTABLE,
> which makes them behave like long-term pinned pages.
>
> The flag cannot coexist with MFD_ALLOW_SEALING; future sealing is
> also impossible for a memfd created with this flag.
>
> At this time only shmem implements this flag.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/shmem_fs.h   |  7 +++++
>  include/uapi/linux/memfd.h |  1 +
>  mm/memfd.c                 | 26 +++++++++++++++--
>  mm/shmem.c                 | 57 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 88 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index e65b80ed09e7..2dde843f28ef 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -12,6 +12,9 @@
>
>  /* inode in-kernel data */
>
> +/* shmem extended flags */
> +#define SHM_F_INACCESSIBLE     0x0001  /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
> +
>  struct shmem_inode_info {
>         spinlock_t              lock;
>         unsigned int            seals;          /* shmem seals */
> @@ -24,6 +27,7 @@ struct shmem_inode_info {
>         struct shared_policy    policy;         /* NUMA memory alloc policy */
>         struct simple_xattrs    xattrs;         /* list of xattrs */
>         atomic_t                stop_eviction;  /* hold when working on inode */
> +       unsigned int            xflags;         /* shmem extended flags */
>         struct inode            vfs_inode;
>  };
>
> @@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name,
>                                         loff_t size, unsigned long flags);
>  extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
>                                             unsigned long flags);
> +extern struct file *shmem_file_setup_xflags(const char *name, loff_t size,
> +                                           unsigned long flags,
> +                                           unsigned int xflags);
>  extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
>                 const char *name, loff_t size, unsigned long flags);
>  extern int shmem_zero_setup(struct vm_area_struct *);
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
>  #define MFD_CLOEXEC            0x0001U
>  #define MFD_ALLOW_SEALING      0x0002U
>  #define MFD_HUGETLB            0x0004U
> +#define MFD_INACCESSIBLE       0x0008U
>
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 9f80f162791a..74d45a26cf5d 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +                      MFD_INACCESSIBLE)
>
>  SYSCALL_DEFINE2(memfd_create,
>                 const char __user *, uname,
>                 unsigned int, flags)
>  {
> +       struct address_space *mapping;
>         unsigned int *file_seals;
> +       unsigned int xflags;
>         struct file *file;
>         int fd, error;
>         char *name;
> +       gfp_t gfp;
>         long len;
>
>         if (!(flags & MFD_HUGETLB)) {
> @@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create,
>                         return -EINVAL;
>         }
>
> +       /* Disallow sealing when MFD_INACCESSIBLE is set. */
> +       if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> +               return -EINVAL;
> +
>         /* length includes terminating zero */
>         len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>         if (len <= 0)
> @@ -301,8 +309,11 @@ SYSCALL_DEFINE2(memfd_create,
>                                         HUGETLB_ANONHUGE_INODE,
>                                         (flags >> MFD_HUGE_SHIFT) &
>                                         MFD_HUGE_MASK);

Should hugetlbfs also be modified to be a backing store for private
memory like shmem when hugepages are to be used?
As of now, this series doesn't seem to support using private memfds
with backing hugepages.



> -       } else
> -               file = shmem_file_setup(name, 0, VM_NORESERVE);
> +       } else {
> +               xflags = flags & MFD_INACCESSIBLE ? SHM_F_INACCESSIBLE : 0;
> +               file = shmem_file_setup_xflags(name, 0, VM_NORESERVE, xflags);
> +       }
> +
>         if (IS_ERR(file)) {
>                 error = PTR_ERR(file);
>                 goto err_fd;
> @@ -313,6 +324,15 @@ SYSCALL_DEFINE2(memfd_create,
>         if (flags & MFD_ALLOW_SEALING) {
>                 file_seals = memfd_file_seals_ptr(file);
>                 *file_seals &= ~F_SEAL_SEAL;
> +       } else if (flags & MFD_INACCESSIBLE) {
> +               mapping = file_inode(file)->i_mapping;
> +               gfp = mapping_gfp_mask(mapping);
> +               gfp &= ~__GFP_MOVABLE;
> +               mapping_set_gfp_mask(mapping, gfp);
> +               mapping_set_unevictable(mapping);
> +
> +               file_seals = memfd_file_seals_ptr(file);
> +               *file_seals = F_SEAL_SEAL;
>         }
>
>         fd_install(fd, file);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a09b29ec2b45..9b31a7056009 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1084,6 +1084,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
>                     (newsize > oldsize && (info->seals & F_SEAL_GROW)))
>                         return -EPERM;
>
> +               if (info->xflags & SHM_F_INACCESSIBLE) {
> +                       if(oldsize)
> +                               return -EPERM;
> +                       if (!PAGE_ALIGNED(newsize))
> +                               return -EINVAL;
> +               }
> +
>                 if (newsize != oldsize) {
>                         error = shmem_reacct_size(SHMEM_I(inode)->flags,
>                                         oldsize, newsize);
> @@ -1331,6 +1338,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>                 goto redirty;
>         if (!total_swap_pages)
>                 goto redirty;
> +       if (info->xflags & SHM_F_INACCESSIBLE)
> +               goto redirty;
>
>         /*
>          * Our capabilities prevent regular writeback or sync from ever calling
> @@ -2228,6 +2237,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>         if (ret)
>                 return ret;
>
> +       if (info->xflags & SHM_F_INACCESSIBLE)
> +               return -EPERM;
> +
>         /* arm64 - allow memory tagging on RAM-based files */
>         vma->vm_flags |= VM_MTE_ALLOWED;
>
> @@ -2433,6 +2445,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>                 if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
>                         return -EPERM;
>         }
> +       if (unlikely(info->xflags & SHM_F_INACCESSIBLE))
> +               return -EPERM;
>
>         ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
>
> @@ -2517,6 +2531,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>                 end_index = i_size >> PAGE_SHIFT;
>                 if (index > end_index)
>                         break;
> +
> +               /*
> +                * inode_lock protects setting up seals as well as write to
> +                * i_size. Setting SHM_F_INACCESSIBLE only allowed with
> +                * i_size == 0.
> +                *
> +                * Check SHM_F_INACCESSIBLE after i_size. It effectively
> +                * serialize read vs. setting SHM_F_INACCESSIBLE without
> +                * taking inode_lock in read path.
> +                */
> +               if (SHMEM_I(inode)->xflags & SHM_F_INACCESSIBLE) {
> +                       error = -EPERM;
> +                       break;
> +               }
> +
>                 if (index == end_index) {
>                         nr = i_size & ~PAGE_MASK;
>                         if (nr <= offset)
> @@ -2648,6 +2677,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>                         goto out;
>                 }
>
> +               if ((info->xflags & SHM_F_INACCESSIBLE) &&
> +                   (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) {
> +                       error = -EINVAL;
> +                       goto out;
> +               }
> +
>                 shmem_falloc.waitq = &shmem_falloc_waitq;
>                 shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
>                 shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
> @@ -4082,6 +4117,28 @@ struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned lon
>         return __shmem_file_setup(shm_mnt, name, size, flags, S_PRIVATE);
>  }
>
> +/**
> + * shmem_file_setup_xflags - get an unlinked file living in tmpfs with
> + *      additional xflags.
> + * @name: name for dentry (to be seen in /proc/<pid>/maps
> + * @size: size to be set for the file
> + * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size
> + * @xflags: SHM_F_INACCESSIBLE prevents ordinary MMU access to the file content
> + */
> +
> +struct file *shmem_file_setup_xflags(const char *name, loff_t size,
> +                                    unsigned long flags, unsigned int xflags)
> +{
> +       struct shmem_inode_info *info;
> +       struct file *res = __shmem_file_setup(shm_mnt, name, size, flags, 0);
> +
> +       if(!IS_ERR(res)) {
> +               info = SHMEM_I(file_inode(res));
> +               info->xflags = xflags & SHM_F_INACCESSIBLE;
> +       }
> +       return res;
> +}
> +
>  /**
>   * shmem_file_setup - get an unlinked file living in tmpfs
>   * @name: name for dentry (to be seen in /proc/<pid>/maps
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-22 11:06                                   ` Paolo Bonzini
@ 2022-04-24  8:07                                     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-24  8:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Quentin Perret, Andy Lutomirski,
	Steven Price, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, qemu-devel, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon

On Fri, Apr 22, 2022 at 01:06:25PM +0200, Paolo Bonzini wrote:
> On 4/22/22 12:56, Chao Peng wrote:
> >          /* memfile notifier flags */
> >          #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */
> >          #define MFN_F_UNMOVABLE           0x0002  /* memory allocated in the file is unmovable */
> >          #define MFN_F_UNRECLAIMABLE       0x0003  /* memory allocated in the file is unreclaimable (e.g. via kswapd or any other pathes) */
> 
> You probably mean BIT(0/1/2) here.

Right, it should be BIT(n), thanks.
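
I.e. something like the below, as a sketch (keeping the MFN_F_* names from my
earlier mail):

        /* memfile notifier flags */
        #define MFN_F_USER_INACCESSIBLE   BIT(0)  /* inaccessible from userspace (e.g. read/write/mmap) */
        #define MFN_F_UNMOVABLE           BIT(1)  /* memory allocated in the file is unmovable */
        #define MFN_F_UNRECLAIMABLE       BIT(2)  /* memory allocated in the file is unreclaimable */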

Chao
> 
> Paolo
> 
> >      When memfile_notifier is being registered, memfile_register_notifier will
> >      need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
> >      previous mmap-ed mapping exists on the fd (I'm still unclear on how to do
> >      this). When multiple consumers are supported it also need check all
> >      registered consumers to see if any conflict (e.g. all consumers should have
> >      MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd is
> >      converted into a private fd, before that, the fd is just a normal (shared)
> >      one. During this conversion, the previous data is preserved so you can put
> >      some initial data in guest pages (whether the architecture allows this is
> >      architecture-specific and out of the scope of this patch).


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-04-23  5:43   ` Vishal Annapurve
@ 2022-04-24  8:15     ` Chao Peng
  0 siblings, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-24  8:15 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, Andy Lutomirski,
	Jun Nakajima, dave.hansen, ak, david

On Fri, Apr 22, 2022 at 10:43:50PM -0700, Vishal Annapurve wrote:
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> >
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this flag set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> >
> > Since page migration/swapping is not yet supported for such usages,
> > these pages are currently marked as UNMOVABLE and UNEVICTABLE,
> > which makes them behave like long-term pinned pages.
> >
> > The flag cannot coexist with MFD_ALLOW_SEALING; future sealing is
> > also impossible for a memfd created with this flag.
> >
> > At this time only shmem implements this flag.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/shmem_fs.h   |  7 +++++
> >  include/uapi/linux/memfd.h |  1 +
> >  mm/memfd.c                 | 26 +++++++++++++++--
> >  mm/shmem.c                 | 57 ++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 88 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index e65b80ed09e7..2dde843f28ef 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -12,6 +12,9 @@
> >
> >  /* inode in-kernel data */
> >
> > +/* shmem extended flags */
> > +#define SHM_F_INACCESSIBLE     0x0001  /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
> > +
> >  struct shmem_inode_info {
> >         spinlock_t              lock;
> >         unsigned int            seals;          /* shmem seals */
> > @@ -24,6 +27,7 @@ struct shmem_inode_info {
> >         struct shared_policy    policy;         /* NUMA memory alloc policy */
> >         struct simple_xattrs    xattrs;         /* list of xattrs */
> >         atomic_t                stop_eviction;  /* hold when working on inode */
> > +       unsigned int            xflags;         /* shmem extended flags */
> >         struct inode            vfs_inode;
> >  };
> >
> > @@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name,
> >                                         loff_t size, unsigned long flags);
> >  extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
> >                                             unsigned long flags);
> > +extern struct file *shmem_file_setup_xflags(const char *name, loff_t size,
> > +                                           unsigned long flags,
> > +                                           unsigned int xflags);
> >  extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
> >                 const char *name, loff_t size, unsigned long flags);
> >  extern int shmem_zero_setup(struct vm_area_struct *);
> > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> > index 7a8a26751c23..48750474b904 100644
> > --- a/include/uapi/linux/memfd.h
> > +++ b/include/uapi/linux/memfd.h
> > @@ -8,6 +8,7 @@
> >  #define MFD_CLOEXEC            0x0001U
> >  #define MFD_ALLOW_SEALING      0x0002U
> >  #define MFD_HUGETLB            0x0004U
> > +#define MFD_INACCESSIBLE       0x0008U
> >
> >  /*
> >   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 9f80f162791a..74d45a26cf5d 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> >  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> >  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> > +                      MFD_INACCESSIBLE)
> >
> >  SYSCALL_DEFINE2(memfd_create,
> >                 const char __user *, uname,
> >                 unsigned int, flags)
> >  {
> > +       struct address_space *mapping;
> >         unsigned int *file_seals;
> > +       unsigned int xflags;
> >         struct file *file;
> >         int fd, error;
> >         char *name;
> > +       gfp_t gfp;
> >         long len;
> >
> >         if (!(flags & MFD_HUGETLB)) {
> > @@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create,
> >                         return -EINVAL;
> >         }
> >
> > +       /* Disallow sealing when MFD_INACCESSIBLE is set. */
> > +       if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> > +               return -EINVAL;
> > +
> >         /* length includes terminating zero */
> >         len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> >         if (len <= 0)
> > @@ -301,8 +309,11 @@ SYSCALL_DEFINE2(memfd_create,
> >                                         HUGETLB_ANONHUGE_INODE,
> >                                         (flags >> MFD_HUGE_SHIFT) &
> >                                         MFD_HUGE_MASK);
> 
> Should hugetlbfs also be modified to be a backing store for private
> memory like shmem when hugepages are to be used?
> As of now, this series doesn't seem to support using private memfds
> with backing hugepages.
> 

Right, as the first step, tmpfs is the first backing store supported;
hugetlbfs would potentially be the second one to support once the user
semantics and the kAPIs exposed to KVM are well understood.

Thanks,
Chao


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-22 10:56                                 ` Chao Peng
  2022-04-22 11:06                                   ` Paolo Bonzini
@ 2022-04-24 16:59                                   ` Andy Lutomirski
  2022-04-25 13:40                                     ` Chao Peng
  2022-05-09 22:30                                   ` Michael Roth
  2 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-24 16:59 UTC (permalink / raw)
  To: Chao Peng, Sean Christopherson
  Cc: Quentin Perret, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon



On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> On Tue, Apr 05, 2022 at 06:03:21PM +0000, Sean Christopherson wrote:
>> On Tue, Apr 05, 2022, Quentin Perret wrote:
>> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
>     Only when the register succeeds, the fd is
>     converted into a private fd, before that, the fd is just a normal (shared)
>     one. During this conversion, the previous data is preserved so you can put
>     some initial data in guest pages (whether the architecture allows this is
>     architecture-specific and out of the scope of this patch).

I think this can be made to work, but it will be awkward.  On TDX, for example, what exactly are the semantics supposed to be?  An error code if the memory isn't all zero?  An error code if it has ever been written?

Fundamentally, I think this is because your proposed lifecycle for these memfiles results in a lightweight API but is awkward for the intended use cases.  You're proposing, roughly:

1. Create a memfile. 

Now it's in a shared state with an unknown virt technology.  It can be read and written.  Let's call this state BRAND_NEW.

2. Bind to a VM.

Now it's in a bound state.  For TDX, for example, let's call the new state BOUND_TDX.  In this state, the TDX rules are followed (private memory can't be converted, etc).

The problem here is that the BRAND_NEW state allows things that are nonsensical in TDX, and the binding step needs to invent some kind of semantics for what happens when binding a nonempty memfile.


So I would propose a somewhat different order:

1. Create a memfile.  It's in the UNBOUND state and no operations whatsoever are allowed except binding or closing.

2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the initial state appropriate for that VM.

For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.
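
To make the ordering concrete, a rough sketch from userspace's point of view.
Every flag, ioctl and struct name below is invented purely for illustration --
none of them exists in the series today:

        /* 1. Create the memfile in the UNBOUND state: only bind/close allowed
         * (MFD_GUEST_UNBOUND is a made-up flag). */
        int fd = memfd_create("guest-mem", MFD_GUEST_UNBOUND);

        /* 2. Bind it to a VM (or at least a VM technology); only now does it
         * take on the initial state appropriate for TDX/SEV/pKVM, and only
         * now can it back the private side of a memslot.  vm_fd and the
         * ioctl/struct are likewise hypothetical. */
        struct kvm_bind_memfile bind = { .fd = fd };
        ioctl(vm_fd, KVM_BIND_MEMFILE, &bind);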


----------------------------------------------

Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space?  It sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth.  IO and even real physical DMA could be done on that memfile.  Am I understanding correctly?

If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different.  For TDX, private memory lives in a logically separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect this straightforwardly.

And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile?  If so, that would be the most direct mapping to what the hardware actually does.

--Andy


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-24 16:59                                   ` Andy Lutomirski
@ 2022-04-25 13:40                                     ` Chao Peng
  2022-04-25 14:52                                       ` Andy Lutomirski
  0 siblings, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-04-25 13:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Quentin Perret, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> 
> 
> On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> > On Tue, Apr 05, 2022 at 06:03:21PM +0000, Sean Christopherson wrote:
> >> On Tue, Apr 05, 2022, Quentin Perret wrote:
> >> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> >     Only when the register succeeds, the fd is
> >     converted into a private fd, before that, the fd is just a normal (shared)
> >     one. During this conversion, the previous data is preserved so you can put
> >     some initial data in guest pages (whether the architecture allows this is
> >     architecture-specific and out of the scope of this patch).
> 
> I think this can be made to work, but it will be awkward.  On TDX, for example, what exactly are the semantics supposed to be?  An error code if the memory isn't all zero?  An error code if it has ever been written?
> 
> Fundamentally, I think this is because your proposed lifecycle for these memfiles results in a lightweight API but is awkward for the intended use cases.  You're proposing, roughly:
> 
> 1. Create a memfile. 
> 
> Now it's in a shared state with an unknown virt technology.  It can be read and written.  Let's call this state BRAND_NEW.
> 
> 2. Bind to a VM.
> 
> Now it's an a bound state.  For TDX, for example, let's call the new state BOUND_TDX.  In this state, the TDX rules are followed (private memory can't be converted, etc).
> 
> The problem here is that the BOUND_NEW state allows things that are nonsensical in TDX, and the binding step needs to invent some kind of semantics for what happens when binding a nonempty memfile.
> 
> 
> So I would propose a somewhat different order:
> 
> 1. Create a memfile.  It's in the UNBOUND state and no operations whatsoever are allowed except binding or closing.

OK, so we need to invent a new user API to indicate the UNBOUND state. For
the memfd-based approach, it can be a new feature-neutral flag at creation
time.

> 
> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the initial state appropriate for that VM.
> 
> For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.

This sounds like a stricter rule to avoid unclear semantics.

So userspace needs to know what exactly happens for a 'bind' operation.
This differs when binding to different technologies. E.g. for SEV, it may
imply that after this call the memfile can be accessed (through mmap or
whatever) from userspace, while for current TDX this should not be allowed.

And I feel we still need a third flow/operation to indicate the completion
of the initialization of the memfile before the guest's first-time launch.
SEV needs to check that previously mmap-ed areas are munmap-ed and prevent
future userspace access. After this point, the memfile becomes a truly
private fd.

> 
> 
> ----------------------------------------------
> 
> Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space?  I sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth.  IO and even real physical DMA could be done on that memfile.  Am I understanding correctly?

For the TDX case, and probably SEV as well, this memfile contains private
memory only. But this design at least makes it possible for use cases like
pKVM, which wants both private/shared memory in the same memfile and relies
on other ways like mmap/munmap or mprotect to toggle private/shared instead
of fallocate/hole punching.

> 
> If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different.  For TDX, private memory lives in a logically separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect this straightforwardly.

I believe so. The flow should be similar but we do need to pass different
flags during the 'bind' to the backing store for different usages. That
would be some new flags for pKVM, but the callbacks (the API here) between
memfile_notifier and its consumers can be reused.

> 
> And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile?  If so, that would be the most direct mapping to what the hardware actually does.

Exactly. TDX will still use fallocate/hole punching to turn on/off the
private page. Once off, the traditional shared page will become
effective in KVM.
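
To make that concrete, roughly (offset/size being the fd offsets that back
the converted guest range; error handling omitted):

        /* shared -> private: back the range with pages in the private fd */
        fallocate(private_fd, 0, offset, size);

        /* private -> shared: punch a hole so the traditional shared mapping
         * becomes effective again */
        fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, size);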

Chao
> 
> --Andy


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
  2022-04-13 17:52               ` Jason Gunthorpe
@ 2022-04-25 14:07                 ` David Hildenbrand
  0 siblings, 0 replies; 118+ messages in thread
From: David Hildenbrand @ 2022-04-25 14:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sean Christopherson, Andy Lutomirski, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen

On 13.04.22 19:52, Jason Gunthorpe wrote:
> On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
>> On 12.04.22 16:36, Jason Gunthorpe wrote:
>>> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>>>
>>>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
>>>> past already with secretmem, it's not 100% that good of a fit (unmovable
>>>> is worse than mlocked). But it gets the job done for now at least.
>>>
>>> No, it doesn't. There are too many different interpretations of how
>>> MEMLOCK is supposed to work
>>>
>>> eg VFIO accounts per-process so hostile users can just fork to go past
>>> it.
>>>
>>> RDMA is per-process but uses a different counter, so you can double up
>>>
>>> iouring is per-user and uses a 3rd counter, so it can triple up on
>>> the above two
>>
>> Thanks for that summary, very helpful.
> 
> I kicked off a big discussion when I suggested to change vfio to use
> the same as io_uring
> 
> We may still end up trying it, but the major concern is that libvirt
> sets the RLIMIT_MEMLOCK and if we touch anything here - including
> fixing RDMA, or anything really, it becomes a uAPI break for libvirt..
> 

Okay, so we have to introduce a second mechanism, don't use
RLIMIT_MEMLOCK for new unmovable memory, and then eventually phase out
RLIMIT_MEMLOCK usage for existing unmovable memory consumers (which, as
you say, will be difficult).

>>>> So I'm open for alternative to limit the amount of unmovable memory we
>>>> might allocate for user space, and then we could convert seretmem as well.
>>>
>>> I think it has to be cgroup based considering where we are now :\
>>
>> Most probably. I think the important lessons we learned are that
>>
>> * mlocked != unmovable.
>> * RLIMIT_MEMLOCK should most probably never have been abused for
>>   unmovable memory (especially, long-term pinning)
> 
> The trouble is I'm not sure how anything can correctly/meaningfully
> set a limit.
> 
> Consider qemu where we might have 3 different things all pinning the
> same page (rdma, iouring, vfio) - should the cgroup give 3x the limit?
> What use is that really?

I think you're tackling a related problem: that we double-account
unmovable/mlocked memory due to the lack of a way to track that a page is
already pinned by the same user/cgroup/whatever. Not easy to solve.

The problem also becomes interesting if iouring with fixed buffers
doesn't work on guest RAM, but on some other QEMU buffers.

> 
> IMHO there are only two meaningful scenarios - either you are unpriv
> and limited to a very small number for your user/cgroup - or you are
> priv and you can do whatever you want.
> 
> The idea we can fine tune this to exactly the right amount for a
> workload does not seem realistic and ends up exporting internal kernel
> decisions into a uAPI..


IMHO, there are three use cases:

* App that conditionally uses selected mechanisms that end up requiring
  unmovable, long-term allocations. Secretmem, iouring, rdma. We want
  some sane, small default. Apps have a backup path in case any such
  mechanism fails because we're out of allowed unmovable resources.
* App that relies on a selected mechanism that ends up requiring unmovable,
  long-term allocations. E.g., vfio with known memory consumption, such
  as the VM size. It's fairly easy to come up with the right value.
* App that relies on multiple mechanisms that end up requiring unmovable,
  long-term allocations. QEMU with rdma, iouring, vfio, ... I agree that
  coming up with something good is problematic.

Then, there are privileged/unprivileged apps. There might be admins that
just don't care. There might be admins that even want to set some limit
instead of configuring "unlimited" for QEMU.

Long story short, it should be an admin choice what to configure,
especially:
* What the default is for random apps
* What the maximum is for selected apps
* Which apps don't have a maximum

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-25 13:40                                     ` Chao Peng
@ 2022-04-25 14:52                                       ` Andy Lutomirski
  2022-04-25 20:30                                         ` Sean Christopherson
  2022-04-28 12:29                                         ` Chao Peng
  0 siblings, 2 replies; 118+ messages in thread
From: Andy Lutomirski @ 2022-04-25 14:52 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, Quentin Perret, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon



On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
>> 

>> 
>> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the initial state appropriate for that VM.
>> 
>> For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.
>
> This sounds a more strict rule to avoid semantics unclear.
>
> So userspace needs to know what excatly happens for a 'bind' operation.
> This is different when binds to different technologies. E.g. for SEV, it
> may imply after this call, the memfile can be accessed (through mmap or
> what ever) from userspace, while for current TDX this should be not allowed.

I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve similar goals and have broadly similar ways of achieving them, they really are different, and having userspace be aware of the differences seems okay to me.

(Although I don't think that allowing userspace to mmap SEV shared pages is particularly wise -- it will result in faults or cache incoherence depending on the variant of SEV in use.)

>
> And I feel we still need a third flow/operation to indicate the
> completion of the initialization on the memfile before the guest's 
> first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> and prevent future userspace access. After this point, then the memfile
> becomes truely private fd.

Even that is technology-dependent.  For TDX, this operation doesn't really exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough detail).  For pKVM, I guess it does exist and isn't quite the same as a shared->private conversion.

Maybe this could be generalized a bit as an operation "measure and make private" that would be supported by the technologies for which it's useful.


>
>> 
>> 
>> ----------------------------------------------
>> 
>> Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space?  It sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth.  IO and even real physical DMA could be done on that memfile.  Am I understanding correctly?
>
> For the TDX case, and probably SEV as well, this memfile contains private
> memory only. But this design at least makes it possible for use cases like
> pKVM, which wants both private and shared memory in the same memfile and
> relies on other ways like mmap/munmap or mprotect to toggle private/shared
> instead of fallocate/hole punching.

Hmm.  Then we still need some way to get KVM to generate the correct SEV pagetables.  For TDX, there are private memslots and shared memslots, and they can overlap.  If they overlap and both contain valid pages at the same address, then the results may not be what the guest-side ABI expects, but everything will work.  So, when a single logical guest page transitions between shared and private, no change to the memslots is needed.  For SEV, this is not the case: everything is in one set of pagetables, and there isn't a natural way to resolve overlaps.

If the memslot code becomes efficient enough, then the memslots could be fragmented.  Or the memfile could support private and shared data in the same memslot.  And if pKVM does this, I don't see why SEV couldn't also do it and hopefully reuse the same code.

>
>> 
>> If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different.  For TDX, private memory lives in a logically separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect this straightforwardly.
>
> I believe so. The flow should be similar, but we do need to pass different
> flags during the 'bind' to the backing store for different usages. That
> would mean some new flags for pKVM, but the callbacks (the API here) between
> memfile_notifier and its consumers can be reused.

And also some different flag in the operation that installs the fd as a memslot?

>
>> 
>> And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile?  If so, that would be the most direct mapping to what the hardware actually does.
>
> Exactly. TDX will still use fallocate/hole punching to turn on/off the
> private page. Once off, the traditional shared page will become
> effective in KVM.

Works for me.

For what it's worth, I still think it should be fine to land all the TDX memfile bits upstream as long as we're confident that SEV, pKVM, etc can be added on without issues.

I think we can increase confidence in this by either getting one other technology's maintainers to get far enough along in the design to be confident and/or by having a pure-kernel-software implementation that serves as a testbed.  For the latter, maybe it could support two different models with little overhead:

Pure software "interleaved" model: pages are shared or private and a hypercall converts them.  The access mode is entirely determined by the state programmed by hypercall.  I think this is essentially what Vishal implemented, but with the "HACK" replaced by something permanent and (if they're not already in the series) appropriate access checks implemented to actually protect the private memory.

Pure software "separate" mode: one GPA bit is set aside as the shared vs private bit.  The normal memslots are restricted to the shared half of GPA space.  Private memslots use the private half.  This works a lot like TDX.  This would be new code.  We don't *really* need this for testing, since TDX itself exercises the same programming model, but it would let people without TDX hardware exercise the interesting bits of the memory management.

Paolo, etc: what do you think?

>
> Chao
>> 
>> --Andy


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-25 14:52                                       ` Andy Lutomirski
@ 2022-04-25 20:30                                         ` Sean Christopherson
  2022-06-10 19:18                                           ` Andy Lutomirski
  2022-04-28 12:29                                         ` Chao Peng
  1 sibling, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-04-25 20:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, Quentin Perret, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Mon, Apr 25, 2022, Andy Lutomirski wrote:
> 
> 
> On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> >> 
> 
> >> 
> >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in
> >> the initial state appropriate for that VM.
> >> 
> >> For TDX, this completely bypasses the cases where the data is prepopulated
> >> and TDX can't handle it cleanly.

I believe TDX can handle this cleanly; TDH.MEM.PAGE.ADD doesn't require that the
source and destination have different HPAs.  There's just no pressing need to
support such behavior because userspace is highly motivated to keep the initial
image small for performance reasons, i.e. burning a few extra pages while building
the guest is a non-issue.

> >> For SEV, it bypasses a situation in which data might be written to the
> >> memory before we find out whether that data will be unreclaimable or
> >> unmovable.
> >
> > This sounds a more strict rule to avoid semantics unclear.
> >
> > So userspace needs to know what excatly happens for a 'bind' operation.
> > This is different when binds to different technologies. E.g. for SEV, it
> > may imply after this call, the memfile can be accessed (through mmap or
> > what ever) from userspace, while for current TDX this should be not allowed.
> 
> I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve
> similar goals and have broadly similar ways of achieving them, they really
> are different, and having userspace be aware of the differences seems okay to
> me.

I agree, _if_ the properties of the memory are enumerated in a technology-agnostic
way.  The underlying mechanisms are different, but conceptually the set of sane
operations that userspace can perform/initiate are the same.  E.g. TDX and SNP can
support swap, they just don't because no one has requested Intel/AMD to provide
that support (no use cases for oversubscribing confidential VMs).  SNP does support
page migration, and TDX can add that support without too much fuss.

SEV "allows" the host to access guest private memory, but that doesn't mean it
should be deliberately supported by the kernel.  It's a bit of a moot point for
SEV/SEV-ES, as the host doesn't get any kind of notification that the guest has
"converted" a page, but the kernel shouldn't allow userspace to map memory that
is _known_ to be private.

> (Although I don't think that allowing userspace to mmap SEV shared pages is

s/shared/private?

> particularly wise -- it will result in faults or cache incoherence depending
> on the variant of SEV in use.)
>
> > And I feel we still need a third flow/operation to indicate the
> > completion of the initialization on the memfile before the guest's 
> > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> > and prevent future userspace access. After this point, then the memfile
> > becomes truely private fd.
> 
> Even that is technology-dependent.  For TDX, this operation doesn't really
> exist.

As above, I believe this is TDH.MEM.PAGE.ADD.

> For SEV, I'm not sure (I haven't read the specs in nearly enough detail).

QEMU+KVM does in-place conversion for SEV/SEV-ES via KVM_SEV_LAUNCH_UPDATE_DATA;
AFAICT in-place conversion is still allowed for SNP (via SNP_LAUNCH_UPDATE).

> For pKVM, I guess it does exist and isn't quite the same as a
> shared->private conversion.
> 
> Maybe this could be generalized a bit as an operation "measure and make
> private" that would be supported by the technologies for which it's useful.
> 
> 
> >> Now I have a question, since I don't think anyone has really answered it:
> >> how does this all work with SEV- or pKVM-like technologies in which
> >> private and shared pages share the same address space?

The current proposal is to have both a private fd and a shared hva for a memslot
that can be mapped private.  A GPA is considered private by KVM if the memslot
has a private fd and that corresponding page in the private fd is populated.  KVM
will always and only map the current flavor of shared/private based on that
definition.  If userspace punches a hole in the private fd, KVM will unmap any
relevant private GPAs.  If userspace populates a range in the private fd, KVM will
unmap any relevant shared GPAs.
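
Roughly, in pseudo-code (the helper and field names below are illustrative, not
the actual patches):

    /*
     * Sketch only: a GPA is private iff the memslot has a private fd and the
     * backing store currently has a page allocated at the matching offset.
     */
    static bool kvm_gpa_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
    {
            pgoff_t index;

            if (!(slot->flags & KVM_MEM_PRIVATE))
                    return false;

            index = gfn - slot->base_gfn + (slot->private_offset >> PAGE_SHIFT);
            /* memfile_page_populated() is a made-up helper for this sketch. */
            return memfile_page_populated(slot->private_file, index);
    }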

> >> I sounds like you're proposing to have a big memfile that contains private
> >> and shared pages and to use that same memfile as pages are converted back
> >> and forth.  IO and even real physical DMA could be done on that memfile.
> >> Am I understanding correctly?
> >
> > For TDX case, and probably SEV as well, this memfile contains private memory
> > only. But this design at least makes it possible for usage cases like
> > pKVM which wants both private/shared memory in the same memfile and rely
> > on other ways like mmap/munmap or mprotect to toggle private/shared instead
> > of fallocate/hole punching.
> 
> Hmm.  Then we still need some way to get KVM to generate the correct SEV
> pagetables.  For TDX, there are private memslots and shared memslots, and
> they can overlap.  If they overlap and both contain valid pages at the same
> address, then the results may not be what the guest-side ABI expects, but
> everything will work.

Absolutely not, KVM is not concurrently mapping both private and shared variants
of a single GPA into the guest.  The Shared bit is a glorified attribute/permission
bit; the fact that it's carved out of the GPA space is a hack to minimize Intel's hardware
development cost.

> So, when a single logical guest page transitions between shared and private,
> no change to the memslots is needed.

Hard no, KVM is not supporting different memslot semantics for TDX versus everything
else.

> For SEV, this is not the case: everything is in one set of pagetables, and
> there isn't a natural way to resolve overlaps.
> 
> If the memslot code becomes efficient enough, then the memslots could be
> fragmented.

No, because the only way to not artificially limit the amount of fragmentation is
to turn the memslots into a tree structure, i.e. to make them effectively multi-level
page tables as opposed to the single-level "tables" that they are today.

> Or the memfile could support private and shared data in the same memslot.
> And if pKVM does this, I don't see why SEV couldn't also do it and hopefully
> reuse the same code.
> 
> >
> >> 
> >> If so, I think this makes sense, but I'm wondering if the actual memslot
> >> setup should be different.  For TDX, private memory lives in a logically
> >> separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API
> >> can reflect this straightforwardly.

Again, no.  KVM is not going to give special treatment to TDX.

> > I believe so. The flow should be similar but we do need pass different
> > flags during the 'bind' to the backing store for different usages. That
> > should be some new flags for pKVM but the callbacks (API here) between
> > memfile_notifile and its consumers can be reused.
> 
> And also some different flag in the operation that installs the fd as a memslot?

No, memslots updates need to come directly from userspace.

> >> And the corresponding TDX question: is the intent still that shared pages
> >> aren't allowed at all in a TDX memfile?  If so, that would be the most
> >> direct mapping to what the hardware actually does.
> >
> > Exactly. TDX will still use fallocate/hole punching to turn on/off the
> > private page. Once off, the traditional shared page will become
> > effective in KVM.
> 
> Works for me.
> 
> For what it's worth, I still think it should be fine to land all the TDX
> memfile bits upstream as long as we're confident that SEV, pKVM, etc can be
> added on without issues.
> 
> I think we can increase confidence in this by either getting one other
> technology's maintainers to get far enough along in the design to be
> confident and/or by having a pure-kernel-software implementation that serves
> as a testbed.  For the latter, maybe it could support two different models
> with little overhead:
> 
> Pure software "interleaved" model: pages are shared or private and a
> hypercall converts them.  The access mode is entirely determined by the state
> programmed by hypercall.  I think this is essentially what Vishal
> implemented, but with the "HACK" replaced by something permanent and (if
> they're not already in the series) appropriate access checks implemented to
> actually protect the private memory.
> 
> Pure software "separate" mode: one GPA bit is set aside as the shared vs
> private bit.  The normal memslots are restricted to the shared half of GPA
> space.  Private memslots use the private half.  This works a lot like TDX.
> This would be new code.  We don't *really* need this for testing, since TDX
> itself exercises the same programming model, but it would let people without
> TDX hardware exercise the interesting bits of the memory management.

No, KVM is not bifurcating the memslots into shared and private.  Except for TDX,
hardware can't support mapping both variants.  That means KVM has to define some
semantic for which memslot "wins", and that puts userspace/KVM back to square one
in having to punch a hole into a memslot in order to allow mapping the "loser" into
the guest.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-25 14:52                                       ` Andy Lutomirski
  2022-04-25 20:30                                         ` Sean Christopherson
@ 2022-04-28 12:29                                         ` Chao Peng
  2022-05-03 11:12                                           ` Quentin Perret
  1 sibling, 1 reply; 118+ messages in thread
From: Chao Peng @ 2022-04-28 12:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Quentin Perret, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon, Michael Roth


+ Michael in case he has comments from the SEV side.

On Mon, Apr 25, 2022 at 07:52:38AM -0700, Andy Lutomirski wrote:
> 
> 
> On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> >> 
> 
> >> 
> >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the initial state appropriate for that VM.
> >> 
> >> For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.
> >
> > This sounds a more strict rule to avoid semantics unclear.
> >
> > So userspace needs to know what excatly happens for a 'bind' operation.
> > This is different when binds to different technologies. E.g. for SEV, it
> > may imply after this call, the memfile can be accessed (through mmap or
> > what ever) from userspace, while for current TDX this should be not allowed.
> 
> I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve similar goals and have broadly similar ways of achieving them, they really are different, and having userspace be aware of the differences seems okay to me.
> 
> (Although I don't think that allowing userspace to mmap SEV shared pages is particularly wise -- it will result in faults or cache incoherence depending on the variant of SEV in use.)
> 
> >
> > And I feel we still need a third flow/operation to indicate the
> > completion of the initialization on the memfile before the guest's 
> > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> > and prevent future userspace access. After this point, then the memfile
> > becomes truely private fd.
> 
> Even that is technology-dependent.  For TDX, this operation doesn't really exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough detail).  For pKVM, I guess it does exist and isn't quite the same as a shared->private conversion.
> 
> Maybe this could be generalized a bit as an operation "measure and make private" that would be supported by the technologies for which it's useful.

Then I think we need a callback instead of a static flag field. The backing
store implements this callback and consumers change the flags dynamically
through it, which effectively implements a state-machine flow.
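
Something like this, purely as a sketch of the idea (all names below are
hypothetical):

    /*
     * Hypothetical sketch: instead of a one-shot 'flags' field, the backing
     * store exposes a callback that the consumer (KVM) invokes whenever its
     * requirements change, e.g. at the "measure and make private" step.
     */
    struct memfile_backing_store_ops {
            /* Returns 0 on success, -EBUSY if e.g. existing mmaps conflict. */
            int (*update_flags)(struct memfile_notifier *notifier,
                                unsigned long new_flags);
    };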

> 
> 
> >
> >> 
> >> 
> >> ----------------------------------------------
> >> 
> >> Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space?  I sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth.  IO and even real physical DMA could be done on that memfile.  Am I understanding correctly?
> >
> > For TDX case, and probably SEV as well, this memfile contains private memory
> > only. But this design at least makes it possible for usage cases like
> > pKVM which wants both private/shared memory in the same memfile and rely
> > on other ways like mmap/munmap or mprotect to toggle private/shared instead
> > of fallocate/hole punching.
> 
> Hmm.  Then we still need some way to get KVM to generate the correct SEV pagetables.  For TDX, there are private memslots and shared memslots, and they can overlap.  If they overlap and both contain valid pages at the same address, then the results may not be what the guest-side ABI expects, but everything will work.  So, when a single logical guest page transitions between shared and private, no change to the memslots is needed.  For SEV, this is not the case: everything is in one set of pagetables, and there isn't a natural way to resolve overlaps.

I don't see a problem for SEV. Note that in all cases, both private and
shared memory live in the same memslot. For a given GPA, if there is no
private page, then the shared page will be used to establish the KVM page
tables, so this guarantees there are no overlaps.

> 
> If the memslot code becomes efficient enough, then the memslots could be fragmented.  Or the memfile could support private and shared data in the same memslot.  And if pKVM does this, I don't see why SEV couldn't also do it and hopefully reuse the same code.

For pKVM, that might be the case. For SEV, I don't think we require
private/shared data in the same memfile. The same model that works for
TDX should also work for SEV. Or maybe I misunderstood something here?

> 
> >
> >> 
> >> If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different.  For TDX, private memory lives in a logically separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect this straightforwardly.
> >
> > I believe so. The flow should be similar but we do need pass different
> > flags during the 'bind' to the backing store for different usages. That
> > should be some new flags for pKVM but the callbacks (API here) between
> > memfile_notifile and its consumers can be reused.
> 
> And also some different flag in the operation that installs the fd as a memslot?
> 
> >
> >> 
> >> And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile?  If so, that would be the most direct mapping to what the hardware actually does.
> >
> > Exactly. TDX will still use fallocate/hole punching to turn on/off the
> > private page. Once off, the traditional shared page will become
> > effective in KVM.
> 
> Works for me.
> 
> For what it's worth, I still think it should be fine to land all the TDX memfile bits upstream as long as we're confident that SEV, pKVM, etc can be added on without issues.
> 
> I think we can increase confidence in this by either getting one other technology's maintainers to get far enough along in the design to be confident

AFAICS, SEV shouldn't have any problem, but I would like to see AMD folks
comment. pKVM definitely needs more work, but it isn't undoable. It would
also be good if the pKVM people could comment.

Thanks,
Chao

> and/or by having a pure-kernel-software implementation that serves as a testbed.  For the latter, maybe it could support two different models with little overhead:
> 
> Pure software "interleaved" model: pages are shared or private and a hypercall converts them.  The access mode is entirely determined by the state programmed by hypercall.  I think this is essentially what Vishal implemented, but with the "HACK" replaced by something permanent and (if they're not already in the series) appropriate access checks implemented to actually protect the private memory.
> 
> Pure software "separate" mode: one GPA bit is set aside as the shared vs private bit.  The normal memslots are restricted to the shared half of GPA space.  Private memslots use the private half.  This works a lot like TDX.  This would be new code.  We don't *really* need this for testing, since TDX itself exercises the same programming model, but it would let people without TDX hardware exercise the interesting bits of the memory management.
> 
> Paolo, etc: what do you think?
> 
> >
> > Chao
> >> 
> >> --Andy


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages
  2022-03-28 23:56   ` Sean Christopherson
  2022-04-08 14:07     ` Chao Peng
@ 2022-04-28 12:37     ` Chao Peng
  1 sibling, 0 replies; 118+ messages in thread
From: Chao Peng @ 2022-04-28 12:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david

On Mon, Mar 28, 2022 at 11:56:06PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > @@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >  
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
> > +				       int *order)
> > +{
> > +	pgoff_t index = gfn - slot->base_gfn +
> > +			(slot->private_offset >> PAGE_SHIFT);
> 
> This is broken for 32-bit kernels, where gfn_t is a 64-bit value but pgoff_t is a
> 32-bit value.  There's no reason to support this for 32-bit kernels, so...
> 
> The easiest fix, and likely most maintainable for other code too, would be to
> add a dedicated CONFIG for private memory, and then have KVM check that for all
> the memfile stuff.  x86 can then select it only for 64-bit kernels, and in turn
> select MEMFILE_NOTIFIER iff private memory is supported.
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ca7b2a6a452a..ee9c8c155300 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -48,7 +48,9 @@ config KVM
>         select SRCU
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
> -       select MEMFILE_NOTIFIER
> +       select HAVE_KVM_PRIVATE_MEM if X86_64
> +       select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
> +
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> 
> And in addition to replacing checks on CONFIG_MEMFILE_NOTIFIER, the probing of
> whether or not KVM_MEM_PRIVATE is allowed can be:
> 
> @@ -1499,23 +1499,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
> 
> -bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
> -{
> -       return false;
> -}
> -
>  static int check_memory_region_flags(struct kvm *kvm,
>                                 const struct kvm_userspace_memory_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> 
> -       if (kvm_arch_private_memory_supported(kvm))
> -               valid_flags |= KVM_MEM_PRIVATE;
> -
>  #ifdef __KVM_HAVE_READONLY_MEM
>         valid_flags |= KVM_MEM_READONLY;
>  #endif
> 
> +#ifdef CONFIG_KVM_HAVE_PRIVATE_MEM
> +       valid_flags |= KVM_MEM_PRIVATE;
> +#endif

One thing to mention is that CONFIG_KVM_HAVE_PRIVATE_MEM is a build-time thing.
Do you think we should also gate this at runtime? E.g. expose it based on
vm_type so that KVM_MEM_PRIVATE is only exposed when TDX is enabled.
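
E.g. something along these lines (a sketch only; kvm_arch_has_private_mem() is
a made-up name for whatever the vm_type check ends up being):

    static int check_memory_region_flags(struct kvm *kvm,
                            const struct kvm_userspace_memory_region *mem)
    {
            u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

    #ifdef __KVM_HAVE_READONLY_MEM
            valid_flags |= KVM_MEM_READONLY;
    #endif

    #ifdef CONFIG_KVM_HAVE_PRIVATE_MEM
            /* Only advertise KVM_MEM_PRIVATE for VM types that can use it, e.g. TDX. */
            if (kvm_arch_has_private_mem(kvm))
                    valid_flags |= KVM_MEM_PRIVATE;
    #endif

            if (mem->flags & ~valid_flags)
                    return -EINVAL;

            return 0;
    }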

Chao


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-28 12:29                                         ` Chao Peng
@ 2022-05-03 11:12                                           ` Quentin Perret
  0 siblings, 0 replies; 118+ messages in thread
From: Quentin Perret @ 2022-05-03 11:12 UTC (permalink / raw)
  To: Chao Peng
  Cc: Andy Lutomirski, Sean Christopherson, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon, Michael Roth

On Thursday 28 Apr 2022 at 20:29:52 (+0800), Chao Peng wrote:
> 
> + Michael in case he has comment from SEV side.
> 
> On Mon, Apr 25, 2022 at 07:52:38AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > >> 
> > 
> > >> 
> > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the initial state appropriate for that VM.
> > >> 
> > >> For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.
> > >
> > > This sounds a more strict rule to avoid semantics unclear.
> > >
> > > So userspace needs to know what excatly happens for a 'bind' operation.
> > > This is different when binds to different technologies. E.g. for SEV, it
> > > may imply after this call, the memfile can be accessed (through mmap or
> > > what ever) from userspace, while for current TDX this should be not allowed.
> > 
> > I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve similar goals and have broadly similar ways of achieving them, they really are different, and having userspace be aware of the differences seems okay to me.
> > 
> > (Although I don't think that allowing userspace to mmap SEV shared pages is particularly wise -- it will result in faults or cache incoherence depending on the variant of SEV in use.)
> > 
> > >
> > > And I feel we still need a third flow/operation to indicate the
> > > completion of the initialization on the memfile before the guest's 
> > > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> > > and prevent future userspace access. After this point, then the memfile
> > > becomes truely private fd.
> > 
> > Even that is technology-dependent.  For TDX, this operation doesn't really exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough detail).  For pKVM, I guess it does exist and isn't quite the same as a shared->private conversion.
> > 
> > Maybe this could be generalized a bit as an operation "measure and make private" that would be supported by the technologies for which it's useful.
> 
> Then I think we need callback instead of static flag field. Backing
> store implements this callback and consumers change the flags
> dynamically with this callback. This implements kind of state machine
> flow.
> 
> > 
> > 
> > >
> > >> 
> > >> 
> > >> ----------------------------------------------
> > >> 
> > >> Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space?  I sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth.  IO and even real physical DMA could be done on that memfile.  Am I understanding correctly?
> > >
> > > For TDX case, and probably SEV as well, this memfile contains private memory
> > > only. But this design at least makes it possible for usage cases like
> > > pKVM which wants both private/shared memory in the same memfile and rely
> > > on other ways like mmap/munmap or mprotect to toggle private/shared instead
> > > of fallocate/hole punching.
> > 
> > Hmm.  Then we still need some way to get KVM to generate the correct SEV pagetables.  For TDX, there are private memslots and shared memslots, and they can overlap.  If they overlap and both contain valid pages at the same address, then the results may not be what the guest-side ABI expects, but everything will work.  So, when a single logical guest page transitions between shared and private, no change to the memslots is needed.  For SEV, this is not the case: everything is in one set of pagetables, and there isn't a natural way to resolve overlaps.
> 
> I don't see SEV has problem. Note for all the cases, both private/shared
> memory are in the same memslot. For a given GPA, if there is no private
> page, then shared page will be used to establish KVM pagetables, so this
> can guarantee there is no overlaps.
> 
> > 
> > If the memslot code becomes efficient enough, then the memslots could be fragmented.  Or the memfile could support private and shared data in the same memslot.  And if pKVM does this, I don't see why SEV couldn't also do it and hopefully reuse the same code.
> 
> For pKVM, that might be the case. For SEV, I don't think we require
> private/shared data in the same memfile. The same model that works for
> TDX should also work for SEV. Or maybe I misunderstood something here?
> 
> > 
> > >
> > >> 
> > >> If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different.  For TDX, private memory lives in a logically separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect this straightforwardly.
> > >
> > > I believe so. The flow should be similar but we do need pass different
> > > flags during the 'bind' to the backing store for different usages. That
> > > should be some new flags for pKVM but the callbacks (API here) between
> > > memfile_notifile and its consumers can be reused.
> > 
> > And also some different flag in the operation that installs the fd as a memslot?
> > 
> > >
> > >> 
> > >> And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile?  If so, that would be the most direct mapping to what the hardware actually does.
> > >
> > > Exactly. TDX will still use fallocate/hole punching to turn on/off the
> > > private page. Once off, the traditional shared page will become
> > > effective in KVM.
> > 
> > Works for me.
> > 
> > For what it's worth, I still think it should be fine to land all the TDX memfile bits upstream as long as we're confident that SEV, pKVM, etc can be added on without issues.
> > 
> > I think we can increase confidence in this by either getting one other technology's maintainers to get far enough along in the design to be confident
> 
> AFAICS, SEV shouldn't have any problem, But would like to see AMD people
> can comment. For pKVM, definitely need more work, but isn't totally
> undoable. Also would be good if pKVM people can comment.

Merging things incrementally sounds good to me if we can indeed get some
time to make sure it'll be a workable solution for other technologies.
I'm happy to prototype a pKVM extension to the proposed series to see if
there are any major blockers.

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-22 10:56                                 ` Chao Peng
  2022-04-22 11:06                                   ` Paolo Bonzini
  2022-04-24 16:59                                   ` Andy Lutomirski
@ 2022-05-09 22:30                                   ` Michael Roth
  2022-05-09 23:29                                     ` Sean Christopherson
  2 siblings, 1 reply; 118+ messages in thread
From: Michael Roth @ 2022-05-09 22:30 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, Quentin Perret, Andy Lutomirski,
	Steven Price, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon, nikunj,
	ashish.kalra

On Fri, Apr 22, 2022 at 06:56:12PM +0800, Chao Peng wrote:
> Great thanks for the discussions. I summarized the requirements/gaps and the
> potential changes for next step. Please help to review.

Hi Chao,

Thanks for writing this up. I've been meaning to respond, but wanted to
make a bit more progress with the SNP+UPM prototype to get a better idea of
what's needed on that end. I've needed to make some changes on the KVM
and QEMU side to get things working, so hopefully with your proposed
rework those changes can be dropped.

> 
> 
> Terminologies:
> --------------
>   - memory conversion: the action of converting guest memory between private
>     and shared.
>   - explicit conversion: an enlightened guest uses a hypercall to explicitly
>     request a memory conversion to VMM.
>   - implicit conversion: the conversion when VMM reacts to a page fault due
>     to different guest/host memory attributes (private/shared).
>   - destructive conversion: the memory content is lost/destroyed during
>     conversion.
>   - non-destructive conversion: the memory content is preserved during
>     conversion.
> 
> 
> Requirements & Gaps
> -------------------------------------
>   - Confidential computing(CC): TDX/SEV/CCA
>     * Need support both explicit/implicit conversions.
>     * Need support only destructive conversion at runtime.
>     * The current patch should just work, but prefer to have pre-boot guest
>       payload/firmware population into private memory for performance.

It's not just performance in the case of SEV; it's needed there because the
firmware only supports in-place encryption of guest memory, and there's no
mechanism to provide a separate buffer to load into guest memory at pre-boot
time. I think you're aware of this, but wanted to point that out just in case.

> 
>   - pKVM
>     * Support explicit conversion only. Hard to achieve implicit conversion,
>       does not record the guest access info (private/shared) in page fault,
>       also makes little sense.
>     * Expect to support non-destructive conversion at runtime. Additionally,
>       in-place conversion (the underlying physical page is unchanged) is
>       desirable since copying is undesirable. The current destructive conversion
>       does not fit well.
>     * The current callbacks between mm/KVM are useful and reusable for pKVM.
>     * Pre-boot guest payload population is nice to have.
> 
> 
> Change Proposal
> ---------------
> Since there are some divergences between pKVM and the CC usages, and at this
> time it is still not quite clear whether and how we will support pKVM with
> this private memory patchset, this proposal does not imply any particular
> detailed pKVM implementation. But at the API level, we want it to be possible
> to extend this in the future for pKVM or other potential usages.
> 
>   - No new user APIs introduced for the memory backing store, e.g. remove the
>     current MFD_INACCESSIBLE. This info will be communicated from memfile_notifier
>     consumers to the backing store via the new 'flags' field in memfile_notifier
>     described below. At creation time, the fd is a normal shared fd. At runtime CC
>     usages will keep using the current fallocate/FALLOC_FL_PUNCH_HOLE to do the
>     conversion, but pKVM may possibly use a different way (e.g. rely on
>     mmap/munmap or mprotect as discussed). None of these are new APIs anyway.

For SNP most of the explicit conversions are via GHCB page-state change
requests. Each of these PSC requests can request shared/private
conversions for up to 252 individual pages, along with whether or not
they should be treated as 4K or 2M pages. Currently, like with
KVM_EXIT_MEMORY_ERROR, these requests get handled in userspace and call
back into the kernel via fallocate/PUNCH_HOLE calls.

For each fallocate(), we need to update the RMP table to mark a page as
private, and for PUNCH_HOLE we need to mark it as shared (otherwise it
would be freed back to the host as guest-owned/private and cause a crash if
the host tries to re-use it for something). I needed to add some callbacks
to the memfile_notifier to handle these RMP table updates. There might be
some other bits of book-keeping like clflush's, and adding/removing guest
pages from the kernel's direct map.
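
For reference, the shape of those hooks in my prototype is roughly the
following (the callback and helper names here are placeholders rather than the
actual prototype code):

    /* Placeholder names; sketch of the extra hooks an SNP backend needs. */
    struct memfile_notifier_ops {
            void (*invalidate)(struct memfile_notifier *notifier,
                               pgoff_t start, pgoff_t end); /* PUNCH_HOLE path */
            void (*populate)(struct memfile_notifier *notifier,
                             pgoff_t start, pgoff_t end);   /* fallocate() path */
    };

    /* Consumer side: mark the RMP entries private when pages get allocated. */
    static void snp_populate_cb(struct memfile_notifier *notifier,
                                pgoff_t start, pgoff_t end)
    {
            pgoff_t index;

            for (index = start; index < end; index++) {
                    /* clflush and direct-map removal elided for brevity. */
                    snp_rmp_make_private(notifier, index); /* placeholder helper */
            }
    }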

Not currently implemented, but the guest can also issue requests to
"smash"/"unsmash" a 2M private range into individual 4K private ranges
(generally in advance of flipping one of the pages to shared, or
vice-versa) in the RMP table. Hypervisor code tries to handle this
automatically, by determining when to smash/unsmash on its own, but...

I'm wondering how all these things can be properly conveyed through this
fallocate/PUNCH_HOLE interface if we ever needed to add support for all
of this, as it seems a bit restrictive as-is. For instance, with the
current approach, one possible scheme is:

  - explicit conversion of shared->private for 252 4K pages:
    - we could do 252 individual fallocate()'s of 4K each, and make sure the
      kernel code will do notifier callbacks / RMP updates for each individual
      4K page

  - shared->private for 252 2M pages:
    - we could do 252 individual fallocate()'s of 2M each, and make sure the
      kernel code will do notifier callbacks / RMP updates for each individual
      2M page

But for SNP most of these bulk PSC changes happen when the guest switches
*all* of its pages from shared->private during early boot, when it
validates all of its memory. So these pages tend to be contiguous
ranges, and a nice optimization would be to coalesce these 252
fallocate() calls into a single fallocate() that spans the whole range.
But there's no good way to do that without losing information like
whether these should be treated as individual 4K vs. 2M ranges.

So I wonder, since there's talk of the "binding" of this memfd to KVM
being what actually enables all the private/shared operations, whether we
should introduce some sort of new KVM ioctl, like
KVM_UPM_SET_PRIVATE/SHARED, that could handle all the
fallocate/hole-punching on the kernel side for larger GFN ranges to reduce
the kernel<->userspace transitions, allow the 4K/2M granularity to be
specified as an argument, and maybe provide better backward compatibility
vs. future changes to the memfd backend interface.
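
For example, the argument struct for such an ioctl might look something like
this (entirely hypothetical, ioctl numbers invented, just to make the idea
concrete):

    /* Hypothetical uAPI, only to illustrate the proposal above. */
    struct kvm_upm_set_attr {
            __u64 gfn_start;
            __u64 gfn_end;
            __u32 granularity;  /* 4K or 2M, mirroring the PSC size hint */
            __u32 flags;
    };

    #define KVM_UPM_SET_PRIVATE  _IOW(KVMIO, 0xd0, struct kvm_upm_set_attr)
    #define KVM_UPM_SET_SHARED   _IOW(KVMIO, 0xd1, struct kvm_upm_set_attr)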

> 
>   - Add a flag to memfile_notifier so its consumers can state the requirements.
> 
>         struct memfile_notifier {
>                 struct list_head list;
>                 unsigned long flags;     /* consumer states its requirements here */
>                 struct memfile_notifier_ops *ops; /* future function may also extend ops when necessary */
>         };
> 
>     For current CC usage, we can define and set below flags from KVM.
> 
>         /* memfile notifier flags */
>         #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */
>         #define MFN_F_UNMOVABLE           0x0002  /* memory allocated in the file is unmovable */
>         #define MFN_F_UNRECLAIMABLE       0x0004  /* memory allocated in the file is unreclaimable (e.g. via kswapd or any other paths) */
> 
>     When a memfile_notifier is being registered, memfile_register_notifier will
>     need to check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
>     a previous mmap-ed mapping exists on the fd (I'm still unclear on how to do
>     this). When multiple consumers are supported it also needs to check all
>     registered consumers to see if there is any conflict (e.g. all consumers
>     should have MFN_F_USER_INACCESSIBLE set). Only when the registration
>     succeeds is the fd converted into a private fd; before that, the fd is just
>     a normal (shared) one. During this conversion, the previous data is preserved so you can put
>     some initial data in guest pages (whether the architecture allows this is
>     architecture-specific and out of the scope of this patch).
> 
>   - Pre-boot guest payload populating is done by normal mmap/munmap on the fd
>     before it's converted into private fd when KVM registers itself to the
>     backing store.

Is that registration still intended to be triggered by
KVM_SET_USER_MEMORY_REGION, or is there a new ioctl you're considering?

I ask because in the case of SNP (and QEMU in general, maybe other VMMs),
the regions are generally registered before the guest contents are
initialized. So if KVM_SET_USER_MEMORY_REGION kicks off the conversion then
it's too late for the SNP code in QEMU to populate the pre-conversion data.

Maybe, building on the above approach, we could have something like:

KVM_SET_USER_MEMORY_REGION
KVM_UPM_BIND(TYPE_TDX|SEV|SNP, gfn_start, gfn_end)
<populate guest memory>
KVM_UPM_INIT(gfn_start, gfn_end) //not sure if needed
KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity)
<launch guest>
...
KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity)
...
KVM_UPM_SET_SHARED(gfn_start, gfn_end, granularity)
etc.

Just some rough ideas, but I think addressing these in some form would help
a lot with getting SNP covered with reasonable performance.

> 
>   - Implicit conversion: maybe it's worth discussing again: how about totally
>     removing implicit conversion support? TDX should be OK, unsure about SEV/CCA.
>     pKVM should be happy to see it go. Removing it also makes the work much easier
>     and prevents guest bugs/unintended behaviors early. If it turns out that there
>     is a reason to keep it, then for pKVM we can make it an optional feature (e.g.
>     via a new module param). But that can be added when pKVM really gets supported.

SEV sort of relies on implicit conversion since the guest is free to turn
on/off the encryption bit during run-time. But in the context of UPM that
wouldn't be supported anyway since, IIUC, the idea is that SEV/SEV-ES would
only be supported for guests that do explicit conversions via MAP_GPA_RANGE
hypercall. And for SNP these would similarly be done via explicit page-state
change requests via GHCB requests issued by the guest.

But if possible, it would be nice if we could leave implicit conversion
as an optional feature/flag, as it's something that we considered
harmless for the guest SNP support (now upstream), and planned to allow
in the hypervisor implementation. I don't think we intentionally relied on
it in the guest kernel/uefi support, but I need to audit that code to be
sure that dropping it wouldn't cause a regression in the guest support.
I'll try to confirm this soon once I get things running under UPM a bit more
reliably.

> 
>   - non-destructive in-place conversion: Out of scope for this series; pKVM can
>     invent other pKVM-specific interfaces (either extend memfile_notifier and use
>     mmap/mprotect, or use totally different ways like access through a vmfd as Sean
>     suggested).
> 
> Thanks,
> Chao

Also, happy to help with testing things on the SNP side going forward, just
let me know.

Thanks!

-Mike


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-05-09 22:30                                   ` Michael Roth
@ 2022-05-09 23:29                                     ` Sean Christopherson
  2022-07-21 20:05                                       ` Gupta, Pankaj
  0 siblings, 1 reply; 118+ messages in thread
From: Sean Christopherson @ 2022-05-09 23:29 UTC (permalink / raw)
  To: Michael Roth
  Cc: Chao Peng, Quentin Perret, Andy Lutomirski, Steven Price,
	kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	Linux API, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon, nikunj,
	ashish.kalra

On Mon, May 09, 2022, Michael Roth wrote:
> On Fri, Apr 22, 2022 at 06:56:12PM +0800, Chao Peng wrote:
> > Requirements & Gaps
> > -------------------------------------
> >   - Confidential computing(CC): TDX/SEV/CCA
> >     * Need support both explicit/implicit conversions.
> >     * Need support only destructive conversion at runtime.
> >     * The current patch should just work, but prefer to have pre-boot guest
> >       payload/firmware population into private memory for performance.
> 
> Not just performance in the case of SEV, it's needed there because firmware
> only supports in-place encryption of guest memory, there's no mechanism to
> provide a separate buffer to load into guest memory at pre-boot time. I
> think you're aware of this but wanted to point that out just in case.

I view it as a performance problem because nothing stops KVM from copying from
userspace into the private fd during the SEV ioctl().  What's missing is the
ability for userspace to directly initialize the private fd, which may or may not
avoid an extra memcpy() depending on how clever userspace is.

> 
> > 
> >   - pKVM
> >     * Support explicit conversion only. Hard to achieve implicit conversion,
> >       does not record the guest access info (private/shared) in page fault,
> >       also makes little sense.
> >     * Expect to support non-destructive conversion at runtime. Additionally
> >       in-place conversion (the underlying physical page is unchanged) is
> >       desirable since copy is not disirable. The current destructive conversion
> >       does not fit well.
> >     * The current callbacks between mm/KVM is useful and reusable for pKVM.
> >     * Pre-boot guest payload population is nice to have.
> > 
> > 
> > Change Proposal
> > ---------------
> > Since there are some divergences for pKVM from CC usages and at this time looks
> > whether we will and how we will support pKVM with this private memory patchset
> > is still not quite clear, so this proposal does not imply certain detailed pKVM
> > implementation. But from the API level, we want this can be possible to be future
> > extended for pKVM or other potential usages.
> > 
> >   - No new user APIs introduced for memory backing store, e.g. remove the
> >     current MFD_INACCESSIBLE. This info will be communicated from memfile_notifier
> >     consumers to backing store via the new 'flag' field in memfile_notifier
> >     described below. At creation time, the fd is normal shared fd. At rumtime CC
> >     usages will keep using current fallocate/FALLOC_FL_PUNCH_HOLE to do the
> >     conversion, but pKVM may also possible use a different way (e.g. rely on
> >     mmap/munmap or mprotect as discussed). These are all not new APIs anyway.
> 
> For SNP most of the explicit conversions are via GHCB page-state change
> requests. Each of these PSC requests can request shared/private
> conversions for up to 252 individual pages, along with whether or not
> they should be treated as 4K or 2M pages. Currently, like with
> KVM_EXIT_MEMORY_ERROR, these requests get handled in userspace and call
> back into the kernel via fallocate/PUNCH_HOLE calls.
> 
> For each fallocate(), we need to update the RMP table to mark a page as
> private, and for PUNCH_HOLE we need to mark it as shared (otherwise it
> would be freed back to the host as guest-owned/private and cause a crash if
> the host tries to re-use it for something). I needed to add some callbacks
> to the memfile_notifier to handle these RMP table updates. There might be
> some other bits of book-keeping like clflush's, and adding/removing guest
> pages from the kernel's direct map.
> 
> Not currently implemented, but the guest can also issue requests to
> "smash"/"unsmash" a 2M private range into individual 4K private ranges
> (generally in advance of flipping one of the pages to shared, or
> vice-versa) in the RMP table. Hypervisor code tries to handle this
> automatically, by determining when to smash/unsmash on it's own, but...
> 
> I'm wondering how all these things can be properly conveyed through this
> fallocate/PUNCH_HOLE interface if we ever needed to add support for all
> of this, as it seems a bit restrictive as-is. For instance, with the
> current approach, one possible scheme is:
> 
>   - explicit conversion of shared->private for 252 4K pages:
>     - we could do 252 individual fallocate()'s of 4K each, and make sure the
>       kernel code will do notifier callbacks / RMP updates for each individual
>       4K page
> 
>   - shared->private for 252 2M pages:
>     - we could do 252 individual fallocate()'s of 2M each, and make sure the
>       kernel code will do notifier callbacks / RMP updates for each individual
>       2M page
> 
> But for SNP most of these bulk PSC changes are when the guest switches
> *all* of it's pages from shared->private during early boot when it
> validates all of it's memory. So these pages tend to be contiguous
> ranges, and a nice optimization would be to coalesce these 252
> fallocate() calls into a single fallocate() that spans the whole range.
> But there's no good way to do that without losing information like
> whether these should be treated as individual 4K vs. 2M ranges.

Eh, the smash/unsmash hint from the guest is just that, a hint.  If the guest
hints at 4kb pages and then bulk converts a contiguous 2mb chunk (or 252 2mb chunks),
then the guest is being dumb because it either (a) doesn't realize it can/should use
2mb pages, or (b) is doing an unnecessary shared->private (assuming the hint was sane
and intended to hint that a private->shared split+conversion is coming).

> So I wonder, since there's talk of the "binding" of this memfd to KVM
> being what actually enabled all the private/shared operations, if we
> should introduce some sort of new KVM ioctl, like
> KVM_UPM_SET_PRIVATE/SHARED, that could handle all the
> fallocate/hole-punching on the kernel side for larger GFN ranges to reduce
> the kernel<->userspace transitions, and allow for 4K/2M granularity to be
> specified as arguments, and maybe provide for better
> backward-compatibility vs. future changes to memfd backend interface.

At this point, I don't think we need anything new.  When SNP is merged, KVM can
coalesce contiguous pages into a single KVM_HC_MAP_GPA_RANGE so that userspace can
batch those into a single fallocate().  That does leave a gap in that KVM_HC_MAP_GPA_RANGE
will require multiple roundtrips for discontiguous ranges, but I would be very surprised
if that ends up being the long pole for boot performance.
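
In userspace that could look roughly like the sketch below (assuming the VMM
has enabled KVM_CAP_EXIT_HYPERCALL for KVM_HC_MAP_GPA_RANGE and keeps a simple
1:1 gpa-to-fd-offset layout; error handling elided):

    /* Needs <linux/kvm.h>, <linux/kvm_para.h>, <fcntl.h>. */
    /* VMM side: one coalesced range -> one fallocate() on the private fd. */
    static void handle_map_gpa_range(struct kvm_run *run, int private_fd)
    {
            uint64_t gpa    = run->hypercall.args[0];
            uint64_t npages = run->hypercall.args[1];
            uint64_t attrs  = run->hypercall.args[2];
            uint64_t psize  = (attrs & KVM_MAP_GPA_RANGE_PAGE_SZ_2M) ?
                              (2ULL << 20) : 4096;    /* 1G ignored for brevity */
            off_t    offset = gpa;                    /* assumes 1:1 gpa->offset */
            off_t    len    = npages * psize;

            if (attrs & KVM_MAP_GPA_RANGE_ENCRYPTED)
                    /* shared -> private: populate the range in the private fd */
                    fallocate(private_fd, 0, offset, len);
            else
                    /* private -> shared: punch a hole, the shared hva takes over */
                    fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              offset, len);
    }
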

> >   - Add a flag to memfile_notifier so its consumers can state the requirements.
> > 
> >         struct memfile_notifier {
> >                 struct list_head list;
> >                 unsigned long flags;     /* consumer states its requirements here */
> >                 struct memfile_notifier_ops *ops; /* future function may also extend ops when necessary */
> >         };
> > 
> >     For current CC usage, we can define and set below flags from KVM.
> > 
> >         /* memfile notifier flags */
> >         #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */
> >         #define MFN_F_UNMOVABLE           0x0002  /* memory allocated in the file is unmovable */
> >         #define MFN_F_UNRECLAIMABLE       0x0003  /* memory allocated in the file is unreclaimable (e.g. via kswapd or any other pathes) */
> > 
> >     When memfile_notifier is being registered, memfile_register_notifier will
> >     need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
> >     previous mmap-ed mapping exists on the fd (I'm still unclear on how to do
> >     this). When multiple consumers are supported it also need check all
> >     registered consumers to see if any conflict (e.g. all consumers should have
> >     MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd is
> >     converted into a private fd, before that, the fd is just a normal (shared)
> >     one. During this conversion, the previous data is preserved so you can put
> >     some initial data in guest pages (whether the architecture allows this is
> >     architecture-specific and out of the scope of this patch).
> > 
> >   - Pre-boot guest payload populating is done by normal mmap/munmap on the fd
> >     before it's converted into private fd when KVM registers itself to the
> >     backing store.
> 
> Is that registration still intended to be triggered by
> KVM_SET_USER_MEMORY_REGION, or is there a new ioctl you're considering?
> 
> I ask because in the case of SNP (and QEMU in general, maybe other VMMs),
> the regions are generally registered before the guest contents are
> initialized. So if KVM_SET_USER_MEMORY_REGION kicks of the conversion then
> it's too late for the SNP code in QEMU to populate the pre-conversion data.
> 
> Maybe, building on the above approach, we could have something like:
> 
> KVM_SET_USER_MEMORY_REGION
> KVM_UPM_BIND(TYPE_TDX|SEV|SNP, gfn_start, gfn_end)
> <populate guest memory>
> KVM_UPM_INIT(gfn_start, gfn_end) //not sure if needed
> KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity)
> <launch guest>
> ...
> KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity)
> ...
> KVM_UPM_SET_SHARED(gfn_start, gfn_end, granularity)
> etc.
> 
> Just some rough ideas, but I think addressing these in some form would help
> a lot with getting SNP covered with reasonable performance.

TDX also needs to populate some amount of guest memory with non-zero data, and to
do so must set up TDP page tables in KVM.  So for starters, I think a single ioctl
to copy data into a private fd is the way to go.  That does leave a performance
gap (the extra memcpy() I mentioned earlier), but it at least ensures KVM can boot
an SNP guest.

I'm certainly not opposed to directly pre-populating private fd memory from
userspace; I just want to point out that this can be handled in KVM without too
much fuss and without any additional support in the private fd implementation. 
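
Purely as a strawman (not something being proposed in this series, and every
name below is made up), the userspace-visible argument for such a
copy-into-private-memory ioctl could be as simple as:

  #include <linux/types.h>

  /*
   * Hypothetical ioctl argument: copy 'size' bytes from 'src_uaddr' into
   * guest private memory starting at 'gpa'.  KVM would resolve gpa to the
   * private fd + offset via the memslot and do the memcpy() (plus the
   * architecture-specific conversion, e.g. TDH.MEM.PAGE.ADD for TDX or
   * in-place SEV encryption) on the guest's behalf.
   */
  struct kvm_private_mem_populate {
          __u64 gpa;        /* destination guest physical address */
          __u64 src_uaddr;  /* source buffer in userspace */
          __u64 size;       /* bytes to copy, page-granularity */
          __u64 flags;      /* reserved, must be zero */
  };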

> >   - Implicit conversion: maybe it's worth discussing again: how about totally
> >     removing implicit conversion support? TDX should be OK; unsure about SEV/CCA.
> >     pKVM should be happy to see it go. Removing it also makes the work much easier
> >     and prevents guest bugs/unintended behaviors early. If it turns out that there
> >     is a reason to keep it, then for pKVM we can make it an optional feature (e.g.
> >     via a new module param). But that can be added when pKVM really gets supported.
> 
> SEV sort of relies on implicit conversion since the guest is free to turn
> on/off the encryption bit during run-time. But in the context of UPM that
> wouldn't be supported anyway since, IIUC, the idea is that SEV/SEV-ES would
> only be supported for guests that do explicit conversions via MAP_GPA_RANGE
> hypercall. And for SNP these would similarly be done via explicit page-state
> change requests via GHCB requests issued by the guest.
> 
> But if possible, it would be nice if we could leave implicit conversion
> as an optional feature/flag, as it's something that we considered
> harmless for the guest SNP support (now upstream), and planned to allow
> in the hypervisor implementation. I don't think we intentionally relied on
> it in the guest kernel/uefi support, but I need to audit that code to be
> sure that dropping it wouldn't cause a regression in the guest support.
> I'll try to confirm this soon once I get things running under UPM a bit more
> reliably.

The plan is to support implicit conversions, albeit with a lot of whining :-)
Both SNP's GHCB and TDX's GHCI specs allow for implicit conversion, so barring a
spec change KVM needs to support it.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-04-25 20:30                                         ` Sean Christopherson
@ 2022-06-10 19:18                                           ` Andy Lutomirski
  2022-06-10 19:27                                             ` Sean Christopherson
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-06-10 19:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Chao Peng, Quentin Perret, Steven Price,
	kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	Linux API, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon

On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 25, 2022, Andy Lutomirski wrote:
> >
> >
> > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > >>
> >
> > >>
> > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in
> > >> the initial state appropriate for that VM.
> > >>
> > >> For TDX, this completely bypasses the cases where the data is prepopulated
> > >> and TDX can't handle it cleanly.
>
> I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that the
> source and destination have different HPAs.  There's just no pressing need to
> support such behavior because userspace is highly motivated to keep the initial
> image small for performance reasons, i.e. burning a few extra pages while building
> the guest is a non-issue.

Following up on this, rather belatedly.  After re-reading the docs,
TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel®
TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2
here:

https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf

For each TD page:

1. The host VMM specifies a TDR as a parameter and calls the
TDH.MEM.PAGE.ADD function. It copies the contents from the TD
image page into the target TD page which is encrypted with the TD
ephemeral key. TDH.MEM.PAGE.ADD also extends the TD
measurement with the page GPA.

2. The host VMM extends the TD measurement with the contents of
the new page by calling the TDH.MR.EXTEND function on each 256-
byte chunk of the new TD page.

So this is a bit like SGX.  There is a specific series of operations
that have to be done in precisely the right order to reproduce the
intended TD measurement.  Otherwise the guest will boot and run until
it tries to get a report and then it will have a hard time getting
anyone to believe its report.

So I don't think the host kernel can get away with host userspace just
providing pre-populated memory.  Userspace needs to tell the host
kernel exactly what sequence of adds, extends, etc to perform and in
what order, and the host kernel needs to do precisely what userspace
asks it to do.  "Here's the contents of memory" doesn't cut it unless
the tooling that builds the guest image matches the exact semantics
that the host kernel provides.
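
To put the ordering requirement in pseudo-code (the helpers here are
placeholders for the real SEAMCALL wrappers, not actual TDX-module or KVM
interfaces), the host would have to replay something like:

  #include <stddef.h>
  #include <stdint.h>

  /* Placeholder prototypes standing in for the SEAMCALL wrappers. */
  void td_page_add(void *tdr, uint64_t gpa, const void *src_page);
  void td_mr_extend(void *tdr, uint64_t gpa);

  /*
   * Replay the measured build of the initial TD image, in exactly the order
   * chosen by the guest image tooling; any reordering changes the final
   * measurement.
   */
  static void build_td_image(void *tdr, const uint64_t *gpa,
                             const void *const *src_page, size_t nr_pages)
  {
          size_t i, off;

          for (i = 0; i < nr_pages; i++) {
                  /* TDH.MEM.PAGE.ADD: copy the page in and extend the
                   * measurement with its GPA. */
                  td_page_add(tdr, gpa[i], src_page[i]);

                  /* TDH.MR.EXTEND: extend the measurement with the page
                   * contents, one 256-byte chunk at a time. */
                  for (off = 0; off < 4096; off += 256)
                          td_mr_extend(tdr, gpa[i] + off);
          }
  }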

--Andy


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-06-10 19:18                                           ` Andy Lutomirski
@ 2022-06-10 19:27                                             ` Sean Christopherson
  0 siblings, 0 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-06-10 19:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, Quentin Perret, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon

On Fri, Jun 10, 2022, Andy Lutomirski wrote:
> On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Apr 25, 2022, Andy Lutomirski wrote:
> > >
> > >
> > > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > > >>
> > >
> > > >>
> > > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in
> > > >> the initial state appropriate for that VM.
> > > >>
> > > >> For TDX, this completely bypasses the cases where the data is prepopulated
> > > >> and TDX can't handle it cleanly.
> >
> > I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that the
> > source and destination have different HPAs.  There's just no pressing need to
> > support such behavior because userspace is highly motivated to keep the initial
> > image small for performance reasons, i.e. burning a few extra pages while building
> > the guest is a non-issue.
> 
> Following up on this, rather belatedly.  After re-reading the docs,
> TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel®
> TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2
> here:
> 
> https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf
> 
> For each TD page:
> 
> 1. The host VMM specifies a TDR as a parameter and calls the
> TDH.MEM.PAGE.ADD function. It copies the contents from the TD
> image page into the target TD page which is encrypted with the TD
> ephemeral key. TDH.MEM.PAGE.ADD also extends the TD
> measurement with the page GPA.
> 
> 2. The host VMM extends the TD measurement with the contents of
> the new page by calling the TDH.MR.EXTEND function on each 256-
> byte chunk of the new TD page.
> 
> So this is a bit like SGX.  There is a specific series of operations
> that have to be done in precisely the right order to reproduce the
> intended TD measurement.  Otherwise the guest will boot and run until
> it tries to get a report and then it will have a hard time getting
> anyone to believe its report.
> 
> So I don't think the host kernel can get away with host userspace just
> providing pre-populated memory.  Userspace needs to tell the host
> kernel exactly what sequence of adds, extends, etc to perform and in
> what order, and the host kernel needs to do precisely what userspace
> asks it to do.  "Here's the contents of memory" doesn't cut it unless
> the tooling that builds the guest image matches the exact semantics
> that the host kernel provides.

For TDX, yes, a KVM ioctl() is mandatory for all intents and purposes since adding
non-zero memory into the guest requires a SEAMCALL.  My "idea", which I'm not sure
would actually work and is more than a bit contrived (and not remotely critical to
support), is to let userspace fill the guest private memory directly and then use
the private page as both the source and the target for TDH.MEM.PAGE.ADD.

That would avoid having to double allocate memory for the initial guest image.  But
like I said, contrived and low priority.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-05-09 23:29                                     ` Sean Christopherson
@ 2022-07-21 20:05                                       ` Gupta, Pankaj
  2022-07-21 21:19                                         ` Sean Christopherson
  0 siblings, 1 reply; 118+ messages in thread
From: Gupta, Pankaj @ 2022-07-21 20:05 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: Quentin Perret, Michael Roth, Andy Lutomirski, Steven Price,
	kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	Linux API, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon, nikunj,
	ashish.kalra


Hi Sean, Chao,

While attempting to solve the pre-boot guest payload/firmware population
into private memory for SEV SNP, retrieved this thread. Have question below:

>>> Requirements & Gaps
>>> -------------------------------------
>>>    - Confidential computing(CC): TDX/SEV/CCA
>>>      * Need support both explicit/implicit conversions.
>>>      * Need support only destructive conversion at runtime.
>>>      * The current patch should just work, but prefer to have pre-boot guest
>>>        payload/firmware population into private memory for performance.
>>
>> Not just performance in the case of SEV, it's needed there because firmware
>> only supports in-place encryption of guest memory, there's no mechanism to
>> provide a separate buffer to load into guest memory at pre-boot time. I
>> think you're aware of this but wanted to point that out just in case.
> 
> I view it as a performance problem because nothing stops KVM from copying from
> userspace into the private fd during the SEV ioctl().  What's missing is the
> ability for userspace to directly initialize the private fd, which may or may not
> avoid an extra memcpy() depending on how clever userspace is.
Can you please elaborate more on what you see as a performance problem? And 
possible ways to solve it?

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-21 20:05                                       ` Gupta, Pankaj
@ 2022-07-21 21:19                                         ` Sean Christopherson
  2022-07-21 21:36                                           ` Gupta, Pankaj
  2022-07-23  3:09                                           ` Andy Lutomirski
  0 siblings, 2 replies; 118+ messages in thread
From: Sean Christopherson @ 2022-07-21 21:19 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Chao Peng, Quentin Perret, Michael Roth, Andy Lutomirski,
	Steven Price, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon, nikunj,
	ashish.kalra

On Thu, Jul 21, 2022, Gupta, Pankaj wrote:
> 
> Hi Sean, Chao,
> 
> While attempting to solve the pre-boot guest payload/firmware population
> into private memory for SEV SNP, retrieved this thread. Have question below:
> 
> > > > Requirements & Gaps
> > > > -------------------------------------
> > > >    - Confidential computing(CC): TDX/SEV/CCA
> > > >      * Need support both explicit/implicit conversions.
> > > >      * Need support only destructive conversion at runtime.
> > > >      * The current patch should just work, but prefer to have pre-boot guest
> > > >        payload/firmware population into private memory for performance.
> > > 
> > > Not just performance in the case of SEV, it's needed there because firmware
> > > only supports in-place encryption of guest memory, there's no mechanism to
> > > provide a separate buffer to load into guest memory at pre-boot time. I
> > > think you're aware of this but wanted to point that out just in case.
> > 
> > I view it as a performance problem because nothing stops KVM from copying from
> > userspace into the private fd during the SEV ioctl().  What's missing is the
> > ability for userspace to directly initialize the private fd, which may or may not
> > avoid an extra memcpy() depending on how clever userspace is.
> Can you please elaborate more on what you see as a performance problem? And
> possible ways to solve it?

Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying is
that in-place encryption is not a functional requirement, which means it's purely
an optimization, and thus we should only bother supporting in-place encryption
_if_ it would solve a performance bottleneck.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-21 21:19                                         ` Sean Christopherson
@ 2022-07-21 21:36                                           ` Gupta, Pankaj
  2022-07-23  3:09                                           ` Andy Lutomirski
  1 sibling, 0 replies; 118+ messages in thread
From: Gupta, Pankaj @ 2022-07-21 21:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, Quentin Perret, Michael Roth, Andy Lutomirski,
	Steven Price, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Mike Rapoport,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon, nikunj,
	ashish.kalra


>>>>>       * The current patch should just work, but prefer to have pre-boot guest
>>>>>         payload/firmware population into private memory for performance.
>>>>
>>>> Not just performance in the case of SEV, it's needed there because firmware
>>>> only supports in-place encryption of guest memory, there's no mechanism to
>>>> provide a separate buffer to load into guest memory at pre-boot time. I
>>>> think you're aware of this but wanted to point that out just in case.
>>>
>>> I view it as a performance problem because nothing stops KVM from copying from
>>> userspace into the private fd during the SEV ioctl().  What's missing is the
>>> ability for userspace to directly initialize the private fd, which may or may not
>>> avoid an extra memcpy() depending on how clever userspace is.
>> Can you please elaborate more on what you see as a performance problem? And
>> possible ways to solve it?
> 
> Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying is
> that in-place encryption is not a functional requirement, which means it's purely
> an optimization, and thus we should only bother supporting in-place encryption
> _if_ it would solve a performance bottleneck.

Understood. Thank you!

Best regards,
Pankaj



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-21 21:19                                         ` Sean Christopherson
  2022-07-21 21:36                                           ` Gupta, Pankaj
@ 2022-07-23  3:09                                           ` Andy Lutomirski
  2022-07-25  9:19                                             ` Gupta, Pankaj
  1 sibling, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2022-07-23  3:09 UTC (permalink / raw)
  To: Sean Christopherson, Gupta, Pankaj
  Cc: Chao Peng, Quentin Perret, Michael Roth, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon, nikunj, ashish.kalra

On 7/21/22 14:19, Sean Christopherson wrote:
> On Thu, Jul 21, 2022, Gupta, Pankaj wrote:
>>

>>> I view it as a performance problem because nothing stops KVM from copying from
>>> userspace into the private fd during the SEV ioctl().  What's missing is the
>>> ability for userspace to directly initialize the private fd, which may or may not
>>> avoid an extra memcpy() depending on how clever userspace is.
>> Can you please elaborate more on what you see as a performance problem? And
>> possible ways to solve it?
> 
> Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying is
> that in-place encryption is not a functional requirement, which means it's purely
> an optimization, and thus we should only bother supporting in-place encryption
> _if_ it would solve a performance bottleneck.

Even if we end up having a performance problem, I think we need to 
understand the workloads that we want to optimize before getting too 
excited about designing a speedup.

In particular, there's (depending on the specific technology, perhaps, 
and also architecture) a possible tradeoff between trying to reduce 
copying and trying to reduce unmapping and the associated flushes.  If a 
user program maps an fd, populates it, and then converts it in place 
into private memory (especially if it doesn't do it in a single shot), 
then that memory needs to get unmapped both from the user mm and 
probably from the kernel direct map.  On the flip side, it's possible to 
imagine an ioctl that does copy-and-add-to-private-fd that uses a 
private mm and doesn't need any TLB IPIs.

All of this is to say that trying to optimize right now seems quite 
premature to me.


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-23  3:09                                           ` Andy Lutomirski
@ 2022-07-25  9:19                                             ` Gupta, Pankaj
  0 siblings, 0 replies; 118+ messages in thread
From: Gupta, Pankaj @ 2022-07-25  9:19 UTC (permalink / raw)
  To: Andy Lutomirski, Sean Christopherson
  Cc: Chao Peng, Quentin Perret, Michael Roth, Steven Price, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Mike Rapoport, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	Marc Zyngier, Will Deacon, nikunj, ashish.kalra


>>>> I view it as a performance problem because nothing stops KVM from 
>>>> copying from
>>>> userspace into the private fd during the SEV ioctl().  What's 
>>>> missing is the
>>>> ability for userspace to directly initialize the private fd, which 
>>>> may or may not
>>>> avoid an extra memcpy() depending on how clever userspace is.
>>> Can you please elaborate more on what you see as a performance problem? And
>>> possible ways to solve it?
>>
>> Oh, I'm not saying there actually _is_ a performance problem.  What 
>> I'm saying is
>> that in-place encryption is not a functional requirement, which means 
>> it's purely
>> an optimization, and thus we should only bother supporting in-place 
>> encryption
>> _if_ it would solve a performance bottleneck.
> 
> Even if we end up having a performance problem, I think we need to 
> understand the workloads that we want to optimize before getting too 
> excited about designing a speedup.
> 
> In particular, there's (depending on the specific technology, perhaps, 
> and also architecture) a possible tradeoff between trying to reduce 
> copying and trying to reduce unmapping and the associated flushes.  If a 
> user program maps an fd, populates it, and then converts it in place 
> into private memory (especially if it doesn't do it in a single shot), 
> then that memory needs to get unmapped both from the user mm and 
> probably from the kernel direct map.  On the flip side, it's possible to 
> imagine an ioctl that does copy-and-add-to-private-fd that uses a 
> private mm and doesn't need any TLB IPIs.
> 
> All of this is to say that trying to optimize right now seems quite 
> premature to me.

Agree to it. Thank you for explaining!

Thanks,
Pankaj





^ permalink raw reply	[flat|nested] 118+ messages in thread

end of thread, other threads:[~2022-07-25  9:19 UTC | newest]

Thread overview: 118+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-10 14:08 [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
2022-03-10 14:08 ` [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
2022-04-11 15:10   ` Kirill A. Shutemov
2022-04-12 13:11     ` Chao Peng
2022-04-23  5:43   ` Vishal Annapurve
2022-04-24  8:15     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 02/13] mm: Introduce memfile_notifier Chao Peng
2022-03-29 18:45   ` Sean Christopherson
2022-04-08 12:54     ` Chao Peng
2022-04-12 14:36   ` Hillf Danton
2022-04-13  6:47     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 03/13] mm/shmem: Support memfile_notifier Chao Peng
2022-03-10 23:08   ` Dave Chinner
2022-03-11  8:42     ` Chao Peng
2022-04-11 15:26   ` Kirill A. Shutemov
2022-04-12 13:12     ` Chao Peng
2022-04-19 22:40   ` Vishal Annapurve
2022-04-20  3:24     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK Chao Peng
2022-04-07 16:05   ` Sean Christopherson
2022-04-07 17:09     ` Andy Lutomirski
2022-04-08 17:56       ` Sean Christopherson
2022-04-08 18:54         ` David Hildenbrand
2022-04-12 14:36           ` Jason Gunthorpe
2022-04-12 21:27             ` Andy Lutomirski
2022-04-13 16:30               ` David Hildenbrand
2022-04-13 16:24             ` David Hildenbrand
2022-04-13 17:52               ` Jason Gunthorpe
2022-04-25 14:07                 ` David Hildenbrand
2022-04-08 13:02     ` Chao Peng
2022-04-11 15:34       ` Kirill A. Shutemov
2022-04-12  5:14         ` Hugh Dickins
2022-04-11 15:32     ` Kirill A. Shutemov
2022-04-12 13:39       ` Chao Peng
2022-04-12 19:28         ` Kirill A. Shutemov
2022-04-13  9:15           ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-03-28 21:27   ` Sean Christopherson
2022-04-08 13:21     ` Chao Peng
2022-03-28 21:56   ` Sean Christopherson
2022-04-08 13:46     ` Chao Peng
2022-04-08 17:45       ` Sean Christopherson
2022-03-10 14:09 ` [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext Chao Peng
2022-03-28 22:26   ` Sean Christopherson
2022-04-08 13:58     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit Chao Peng
2022-03-28 22:33   ` Sean Christopherson
2022-04-08 13:59     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages Chao Peng
2022-03-28 23:56   ` Sean Christopherson
2022-04-08 14:07     ` Chao Peng
2022-04-28 12:37     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 09/13] KVM: Handle page fault for private memory Chao Peng
2022-03-29  1:07   ` Sean Christopherson
2022-04-12 12:10     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 10/13] KVM: Register private memslot to memory backing store Chao Peng
2022-03-29 19:01   ` Sean Christopherson
2022-04-12 12:40     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd Chao Peng
2022-03-29 19:23   ` Sean Christopherson
2022-04-12 12:43     ` Chao Peng
2022-04-05 23:45   ` Michael Roth
2022-04-08  3:06     ` Sean Christopherson
2022-04-19 22:43   ` Vishal Annapurve
2022-04-20  3:17     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE Chao Peng
2022-03-29 19:13   ` Sean Christopherson
2022-04-12 12:56     ` Chao Peng
2022-03-10 14:09 ` [PATCH v5 13/13] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
2022-03-24 15:51 ` [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory Quentin Perret
2022-03-28 17:13   ` Sean Christopherson
2022-03-28 18:00     ` Quentin Perret
2022-03-28 18:58       ` Sean Christopherson
2022-03-29 17:01         ` Quentin Perret
2022-03-30  8:58           ` Steven Price
2022-03-30 10:39             ` Quentin Perret
2022-03-30 17:58               ` Sean Christopherson
2022-03-31 16:04                 ` Andy Lutomirski
2022-04-01 14:59                   ` Quentin Perret
2022-04-01 17:14                     ` Sean Christopherson
2022-04-01 18:03                       ` Quentin Perret
2022-04-01 18:24                         ` Sean Christopherson
2022-04-01 19:56                     ` Andy Lutomirski
2022-04-04 15:01                       ` Quentin Perret
2022-04-04 17:06                         ` Sean Christopherson
2022-04-04 22:04                           ` Andy Lutomirski
2022-04-05 10:36                             ` Quentin Perret
2022-04-05 17:51                               ` Andy Lutomirski
2022-04-05 18:30                                 ` Sean Christopherson
2022-04-06 18:42                                   ` Andy Lutomirski
2022-04-06 13:05                                 ` Quentin Perret
2022-04-05 18:03                               ` Sean Christopherson
2022-04-06 10:34                                 ` Quentin Perret
2022-04-22 10:56                                 ` Chao Peng
2022-04-22 11:06                                   ` Paolo Bonzini
2022-04-24  8:07                                     ` Chao Peng
2022-04-24 16:59                                   ` Andy Lutomirski
2022-04-25 13:40                                     ` Chao Peng
2022-04-25 14:52                                       ` Andy Lutomirski
2022-04-25 20:30                                         ` Sean Christopherson
2022-06-10 19:18                                           ` Andy Lutomirski
2022-06-10 19:27                                             ` Sean Christopherson
2022-04-28 12:29                                         ` Chao Peng
2022-05-03 11:12                                           ` Quentin Perret
2022-05-09 22:30                                   ` Michael Roth
2022-05-09 23:29                                     ` Sean Christopherson
2022-07-21 20:05                                       ` Gupta, Pankaj
2022-07-21 21:19                                         ` Sean Christopherson
2022-07-21 21:36                                           ` Gupta, Pankaj
2022-07-23  3:09                                           ` Andy Lutomirski
2022-07-25  9:19                                             ` Gupta, Pankaj
2022-03-30 16:18             ` Sean Christopherson
2022-03-28 20:16 ` Andy Lutomirski
2022-03-28 22:48   ` Nakajima, Jun
2022-03-29  0:04     ` Sean Christopherson
2022-04-08 21:35   ` Vishal Annapurve
2022-04-12 13:00     ` Chao Peng
2022-04-12 19:58   ` Kirill A. Shutemov
