* [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
@ 2022-07-06  8:20 Chao Peng
  2022-07-06  8:20 ` [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Chao Peng
                   ` (18 more replies)
  0 siblings, 19 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

This is v7 of this series, which implements fd-based KVM guest private
memory. The patches are based on the latest kvm/queue branch commit:

  b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
split_desc_cache only by default capacity

Introduction
------------
In general, this patch series introduces fd-based memslots which provide
guest memory through a memory file descriptor fd[offset, size] instead of
hva/size. The fd can be created from a supported memory filesystem like
tmpfs/hugetlbfs etc., which we refer to as the memory backing store. KVM
and the memory backing store exchange callbacks when such a memslot gets
created. At runtime KVM calls into the callbacks provided by the backing
store to get the pfn for a given fd+offset. The memory backing store in
turn calls into KVM callbacks when userspace punches a hole in the fd, to
notify KVM to unmap the secondary MMU page table entries.

Compared to the existing hva-based memslots, this new type of memslot
allows guest memory to be unmapped from host userspace (e.g. QEMU) and
even from the kernel itself, thereby reducing the attack surface and
preventing bugs.

Based on this fd-based memslot, we can build guest private memory that
is going to be used in confidential computing environments such as Intel
TDX and AMD SEV. When supported, the memory backing store can provide
stronger enforcement on the fd, and KVM can use a single memslot to hold
both the private and shared parts of the guest memory.

mm extension
------------
Introduces a new MFD_INACCESSIBLE flag for memfd_create(); a file created
with this flag cannot be read(), written or mmap()ed etc. via normal MMU
operations. The file content can only be used with the newly introduced
memfile_notifier extension.

The memfile_notifier extension provides two sets of callbacks for KVM to
interact with the memory backing store:
  - memfile_notifier_ops: callbacks for the memory backing store to notify
    KVM when memory gets invalidated.
  - backing store callbacks: callbacks for KVM to call into the memory
    backing store to request memory pages for guest private memory.

The memfile_notifier extension also provides APIs for the memory backing
store to register/unregister itself and to trigger the notifiers when the
bookmarked memory gets invalidated.

The patchset also introduces a new memfd seal, F_SEAL_AUTO_ALLOCATE, to
prevent double allocation caused by unintentional guest accesses when
only one side of the shared/private memfd pair is effective.
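
For illustration, below is a minimal userspace sketch of the intended
two-memfd setup (not part of this series; error handling trimmed).
MFD_INACCESSIBLE and F_SEAL_AUTO_ALLOCATE are introduced by this series,
so the fallback defines simply carry the values used in the patches:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE        0x0008U /* from this series */
#endif
#ifndef F_SEAL_AUTO_ALLOCATE
#define F_SEAL_AUTO_ALLOCATE    0x0020  /* from this series */
#endif

/* size must be page aligned for the inaccessible (private) memfd. */
static int create_guest_memfds(size_t size, int *priv_fd, int *shared_fd)
{
        /* Private side: no read()/write()/mmap() from host userspace. */
        *priv_fd = memfd_create("guest-private", MFD_INACCESSIBLE);
        if (*priv_fd < 0 || ftruncate(*priv_fd, size) < 0)
                return -1;

        /*
         * Shared side: sealed with F_SEAL_AUTO_ALLOCATE so a stray write
         * cannot allocate memory while the private side is the live one;
         * intentional allocation goes through an explicit fallocate().
         */
        *shared_fd = memfd_create("guest-shared", MFD_ALLOW_SEALING);
        if (*shared_fd < 0 || ftruncate(*shared_fd, size) < 0)
                return -1;
        return fcntl(*shared_fd, F_ADD_SEALS, F_SEAL_AUTO_ALLOCATE);
}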

memslot extension
-----------------
Add a private fd and an fd offset to the existing 'shared' memslot so
that both private and shared guest memory can live in one single memslot.
A page in the memslot is either private or shared. Whether a guest page
is private or shared is maintained by reusing the existing SEV ioctls
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
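
For illustration, a sketch of how userspace might then wire the two
halves into one memslot. The extended region struct, its
private_fd/private_offset fields and the KVM_MEM_PRIVATE flag are
introduced by the "KVM: Extend the memslot to support fd-based private
memory" patch later in this series; the exact names and layout there are
authoritative, so treat the below as illustrative only:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* vm_fd: from KVM_CREATE_VM; priv_fd: the MFD_INACCESSIBLE memfd. */
static int set_mixed_memslot(int vm_fd, int priv_fd, void *shared_hva,
                             __u64 gpa, __u64 size)
{
        struct kvm_userspace_memory_region_ext ext = {
                .region = {
                        .slot            = 0,
                        .flags           = KVM_MEM_PRIVATE,
                        .guest_phys_addr = gpa,
                        .memory_size     = size,
                        /* shared half, as before */
                        .userspace_addr  = (__u64)(unsigned long)shared_hva,
                },
                /* private half, served by the memory backing store */
                .private_fd     = priv_fd,
                .private_offset = 0,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}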

Test
----
To test the new functionality of this patchset, the TDX patchset is
needed. Since the TDX patchset has not been merged yet, I did two kinds
of tests:

-  Regression test on kvm/queue (this patchset)
   Most of the new code is not covered. The code is also available at:
   https://github.com/chao-p/linux/tree/privmem-v7

-  New functional test on the latest TDX code
   The patches were rebased onto the latest TDX code and the new
   functionality was tested. See the repos below:
   Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
   QEMU: https://github.com/chao-p/qemu/tree/privmem-v7

An example QEMU command line for the TDX test:
-object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
-machine confidential-guest-support=tdx \
-object memory-backend-memfd-private,id=ram1,size=${mem} \
-machine memory-backend=ram1

Changelog
---------
v7:
  - Moved the private/shared info from the backing store to KVM.
  - Introduced F_SEAL_AUTO_ALLOCATE to avoid double allocation.
  - Reworked the sync mechanism between the zap and page fault paths.
  - Addressed other comments in v6.
v6:
  - Re-organized the patches for both the mm and KVM parts.
  - Added flags for memfile_notifier so its consumers can state their
    features and the memory backing store can check against these flags.
  - Put a backing store reference in the memfile_notifier and moved
    pfn_ops into the backing store.
  - Only support boot-time backing store registration.
  - Overall KVM part improvements suggested by Sean and some others.
v5:
  - Removed the userspace-visible F_SEAL_INACCESSIBLE; instead use an
    in-kernel flag (SHM_F_INACCESSIBLE for shmem). A private fd can only
    be created with MFD_INACCESSIBLE.
  - Introduced new APIs for the backing store to register itself with
    memfile_notifier instead of using direct function calls.
  - Added accounting and restrictions for MFD_INACCESSIBLE memory.
  - Added KVM API documentation for the new memslot extensions and a man
    page for the new MFD_INACCESSIBLE flag.
  - Removed the overlap check for mapping the same file+offset into
    multiple gfns due to performance considerations; this is warned about
    in the documentation.
  - Addressed other comments in v4.
v4:
  - Decoupled the callbacks between KVM/mm from memfd and used the new
    name 'memfile_notifier'.
  - Supported registering multiple memslots to the same backing store.
  - Added per-memslot pfn_ops instead of per-system ones.
  - Reworked the invalidation part.
  - Improved the new KVM uAPIs (private memslot extension and memory
    error) per Sean's suggestions.
  - Addressed many other minor fixes for comments from v3.
v3:
  - Added locking protection when calling
    invalidate_page_range/fallocate callbacks.
  - Changed the memslot structure to keep using useraddr for shared memory.
  - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
  - Added the MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
  - Commit message improvements.
  - Many small fixes for comments from the last version.

Links to previous discussions
-----------------------------
[1] Original design proposal:
https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
[2] Updated proposal and RFC patch v1:
https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
[3] Patch v5: https://lkml.org/lkml/2022/5/19/861

Chao Peng (12):
  mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
  mm: Introduce memfile_notifier
  mm/memfd: Introduce MFD_INACCESSIBLE flag
  KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Rename mmu_notifier_*
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Register/unregister the guest private memory regions
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm/shmem: Support memfile_notifier

 Documentation/virt/kvm/api.rst             |  77 +++++-
 arch/arm64/kvm/mmu.c                       |   8 +-
 arch/mips/include/asm/kvm_host.h           |   2 +-
 arch/mips/kvm/mmu.c                        |  10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
 arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
 arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
 arch/riscv/kvm/mmu.c                       |   4 +-
 arch/x86/include/asm/kvm_host.h            |   3 +-
 arch/x86/kvm/Kconfig                       |   3 +
 arch/x86/kvm/mmu.h                         |   2 -
 arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
 arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
 arch/x86/kvm/mmu/mmutrace.h                |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
 arch/x86/kvm/x86.c                         |   2 +-
 include/linux/kvm_host.h                   | 105 +++++---
 include/linux/memfile_notifier.h           |  91 +++++++
 include/linux/shmem_fs.h                   |   2 +
 include/uapi/linux/fcntl.h                 |   1 +
 include/uapi/linux/kvm.h                   |  37 +++
 include/uapi/linux/memfd.h                 |   1 +
 mm/Kconfig                                 |   4 +
 mm/Makefile                                |   1 +
 mm/memfd.c                                 |  18 +-
 mm/memfile_notifier.c                      | 123 ++++++++++
 mm/shmem.c                                 | 125 +++++++++-
 tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
 virt/kvm/Kconfig                           |   3 +
 virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
 virt/kvm/pfncache.c                        |  14 +-
 35 files changed, 1074 insertions(+), 127 deletions(-)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

-- 
2.25.1



* [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-21  9:44   ` David Hildenbrand
  2022-08-26 15:19   ` Fuad Tabba
  2022-07-06  8:20 ` [PATCH v7 02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE Chao Peng
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

Normally, a write to unallocated space of a file or to a hole of a sparse
file automatically causes space allocation; for a memfd this equals
memory allocation. This new seal prevents such automatic allocation,
whether it comes from a direct write() or from a write to a previously
mmap-ed area. The seal does not prevent fallocate(), so an explicit
fallocate() can still cause allocation and can be used to reserve
memory.

This is used to prevent unintentional allocation from userspace on a
stray or careless write; any intentional allocation should use an
explicit fallocate(). One of the main use cases is avoiding double memory
allocation for confidential computing, where two memfds back the guest
memory, only one memfd is live at any given time, and we want to prevent
memory allocation through the other memfd, which may have been mmap-ed
previously. More discussion can be found at:

  https://lkml.org/lkml/2022/6/14/1255
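
A hypothetical userspace snippet showing the intended pattern (error
handling omitted; F_SEAL_AUTO_ALLOCATE is defined by this patch, the
fallback define below carries its value):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_AUTO_ALLOCATE
#define F_SEAL_AUTO_ALLOCATE    0x0020  /* from this patch */
#endif

/* size must be a multiple of the page size; buf/len come from the caller. */
static void auto_allocate_demo(size_t size, const void *buf, size_t len)
{
        int fd = memfd_create("demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);

        ftruncate(fd, size);
        fcntl(fd, F_ADD_SEALS, F_SEAL_AUTO_ALLOCATE);

        fallocate(fd, 0, 0, size);      /* explicit allocation still works */
        pwrite(fd, buf, len, 0);        /* ok: the range is allocated */

        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size);
        pwrite(fd, buf, len, 0);        /* expected to fail: would allocate */

        close(fd);
}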

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/fcntl.h |  1 +
 mm/memfd.c                 |  3 ++-
 mm/shmem.c                 | 16 ++++++++++++++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..98bdabc8e309 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW	0x0004	/* prevent file from growing */
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
 #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
+#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..2afd898798e4 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 		     F_SEAL_SHRINK | \
 		     F_SEAL_GROW | \
 		     F_SEAL_WRITE | \
-		     F_SEAL_FUTURE_WRITE)
+		     F_SEAL_FUTURE_WRITE | \
+		     F_SEAL_AUTO_ALLOCATE)
 
 static int memfd_add_seals(struct file *file, unsigned int seals)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..6c8aef15a17d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	enum sgp_type sgp;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 		spin_unlock(&inode->i_lock);
 	}
 
-	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
+	if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
+		sgp = SGP_NOALLOC;
+	else
+		sgp = SGP_CACHE;
+
+	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
@@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_SHIFT;
+	enum sgp_type sgp;
 	int ret = 0;
 
 	/* i_rwsem is held by caller */
@@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			return -EPERM;
 	}
 
-	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+	if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
+		sgp = SGP_NOALLOC;
+	else
+		sgp = SGP_WRITE;
+	ret = shmem_getpage(inode, index, pagep, sgp);
 
 	if (ret)
 		return ret;
-- 
2.25.1



* [PATCH v7 02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-07-06  8:20 ` [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-08-05 13:11   ` David Hildenbrand
  2022-07-06  8:20 ` [PATCH v7 03/14] mm: Introduce memfile_notifier Chao Peng
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

Add tests to verify that sealing a memfd with F_SEAL_AUTO_ALLOCATE works
as expected.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++++++++++
 1 file changed, 166 insertions(+)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 94df2692e6e4..b849ece295fd 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -9,6 +9,7 @@
 #include <fcntl.h>
 #include <linux/memfd.h>
 #include <sched.h>
+#include <setjmp.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <signal.h>
@@ -232,6 +233,31 @@ static void mfd_fail_open(int fd, int flags, mode_t mode)
 	}
 }
 
+static void mfd_assert_fallocate(int fd)
+{
+	int r;
+
+	r = fallocate(fd, 0, 0, mfd_def_size);
+	if (r < 0) {
+		printf("fallocate(ALLOC) failed: %m\n");
+		abort();
+	}
+}
+
+static void mfd_assert_punch_hole(int fd)
+{
+	int r;
+
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      mfd_def_size);
+	if (r < 0) {
+		printf("fallocate(PUNCH_HOLE) failed: %m\n");
+		abort();
+	}
+}
+
 static void mfd_assert_read(int fd)
 {
 	char buf[16];
@@ -594,6 +620,94 @@ static void mfd_fail_grow_write(int fd)
 	}
 }
 
+static void mfd_assert_hole_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	char *p1;
+
+	/*
+	 * hugetlbfs does not support write, but we want to
+	 * verify everything else here.
+	 */
+	if (!hugetlbfs_test) {
+		/* verify direct write() succeeds */
+		l = write(fd, "\0\0\0\0", 4);
+		if (l != 4) {
+			printf("write() failed: %m\n");
+			abort();
+		}
+	}
+
+	/* verify mmaped write succeeds */
+	p = mmap(NULL,
+		 mfd_def_size,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	p1 = (char *)p + mfd_def_size - 1;
+	*p1 = 'H';
+	if (*p1 != 'H') {
+		printf("mmaped write failed: %m\n");
+		abort();
+
+	}
+	munmap(p, mfd_def_size);
+}
+
+sigjmp_buf jbuf, *sigbuf;
+static void sig_handler(int sig, siginfo_t *siginfo, void *ptr)
+{
+	if (sig == SIGBUS) {
+		if (sigbuf)
+			siglongjmp(*sigbuf, 1);
+		abort();
+	}
+}
+
+static void mfd_fail_hole_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	char *p1;
+
+	/* verify direct write() fails */
+	l = write(fd, "data", 4);
+	if (l > 0) {
+		printf("expected failure on write(), but got %d: %m\n", (int)l);
+		abort();
+	}
+
+	/* verify mmaped write fails */
+	p = mmap(NULL,
+		 mfd_def_size,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	sigbuf = &jbuf;
+	if (sigsetjmp(*sigbuf, 1))
+		goto out;
+
+	/* Below write should trigger SIGBUS signal */
+	p1 = (char *)p + mfd_def_size - 1;
+	*p1 = 'H';
+	printf("failed to receive SIGBUS for mmaped write: %m\n");
+	abort();
+out:
+	munmap(p, mfd_def_size);
+}
+
 static int idle_thread_fn(void *arg)
 {
 	sigset_t set;
@@ -880,6 +994,57 @@ static void test_seal_resize(void)
 	close(fd);
 }
 
+/*
+ * Test F_SEAL_AUTO_ALLOCATE
+ * Test whether F_SEAL_AUTO_ALLOCATE actually prevents allocation.
+ */
+static void test_seal_auto_allocate(void)
+{
+	struct sigaction act;
+	int fd;
+
+	printf("%s SEAL-AUTO-ALLOCATE\n", memfd_str);
+
+	memset(&act, 0, sizeof(act));
+	act.sa_sigaction = sig_handler;
+	act.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGBUS, &act, 0)) {
+		printf("sigaction() failed: %m\n");
+		abort();
+	}
+
+	fd = mfd_assert_new("kern_memfd_seal_auto_allocate",
+			    mfd_def_size,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+	/* read/write should pass if F_SEAL_AUTO_ALLOCATE not set */
+	mfd_assert_read(fd);
+	mfd_assert_hole_write(fd);
+
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_AUTO_ALLOCATE);
+	mfd_assert_has_seals(fd, F_SEAL_AUTO_ALLOCATE);
+
+	/* read/write should pass for pre-allocated area */
+	mfd_assert_read(fd);
+	mfd_assert_hole_write(fd);
+
+	mfd_assert_punch_hole(fd);
+
+	/* read should pass, write should fail in hole */
+	mfd_assert_read(fd);
+	mfd_fail_hole_write(fd);
+
+	mfd_assert_fallocate(fd);
+
+	/* read/write should pass after fallocate */
+	mfd_assert_read(fd);
+	mfd_assert_hole_write(fd);
+
+	close(fd);
+}
+
+
 /*
  * Test sharing via dup()
  * Test that seals are shared between dupped FDs and they're all equal.
@@ -1059,6 +1224,7 @@ int main(int argc, char **argv)
 	test_seal_shrink();
 	test_seal_grow();
 	test_seal_resize();
+	test_seal_auto_allocate();
 
 	test_share_dup("SHARE-DUP", "");
 	test_share_mmap("SHARE-MMAP", "");
-- 
2.25.1



* [PATCH v7 03/14] mm: Introduce memfile_notifier
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-07-06  8:20 ` [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Chao Peng
  2022-07-06  8:20 ` [PATCH v7 02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-08-05 13:22   ` David Hildenbrand
  2022-07-06  8:20 ` [PATCH v7 04/14] mm/shmem: Support memfile_notifier Chao Peng
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

This patch introduces the memfile_notifier facility so that existing
memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to
a third kernel component, which makes use of the memory bookmarked in the
memory file and gets notified when pages in the memory file become
invalidated.

It will be used by KVM to use a file descriptor as the guest memory
backing store, and KVM will use this memfile_notifier interface to
interact with the memory file subsystems. In the future there might be
other consumers (e.g. VFIO with encrypted device memory).

It consists of the components below:
 - memfile_backing_store: Each supported memory file subsystem can be
   implemented as a memory backing store which bookmarks memory and
   provides callbacks for other kernel systems (memfile_notifier
   consumers) to interact with.
 - memfile_notifier: memfile_notifier consumers define callbacks and
   associate them with a file using memfile_register_notifier().
 - memfile_node: A memfile_node is associated with the file (inode) from
   the backing store and includes feature flags and a list of registered
   memfile_notifiers for notification.

In the KVM usage, userspace is in charge of the guest memory lifecycle:
it first allocates pages in the memory backing store, then passes the fd
to KVM and lets KVM register the memory slot with the memory backing
store via memfile_register_notifier().
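
As a rough sketch of the consumer side (KVM is the real consumer later in
this series; everything here except the memfile_* API added by this patch
is illustrative):

#include <linux/memfile_notifier.h>

struct demo_consumer {
        struct memfile_notifier notifier;
        struct file *file;
};

static void demo_invalidate(struct memfile_notifier *notifier,
                            pgoff_t start, pgoff_t end)
{
        /* Tear down anything derived from pages in [start, end). */
}

static struct memfile_notifier_ops demo_notifier_ops = {
        .invalidate = demo_invalidate,
};

static int demo_attach(struct demo_consumer *c, struct file *file)
{
        c->file = file;
        c->notifier.ops = &demo_notifier_ops;

        /* State the features this consumer relies on. */
        return memfile_register_notifier(file, MEMFILE_F_USER_INACCESSIBLE |
                                               MEMFILE_F_UNMOVABLE |
                                               MEMFILE_F_UNRECLAIMABLE,
                                         &c->notifier);
}

static int demo_get_pfn(struct demo_consumer *c, pgoff_t offset,
                        pfn_t *pfn, int *order)
{
        /* Ask the backing store for the page backing fd + offset. */
        return c->notifier.bs->get_pfn(c->file, offset, pfn, order);
}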

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfile_notifier.h |  93 ++++++++++++++++++++++++
 mm/Kconfig                       |   4 +
 mm/Makefile                      |   1 +
 mm/memfile_notifier.c            | 121 +++++++++++++++++++++++++++++++
 4 files changed, 219 insertions(+)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h
new file mode 100644
index 000000000000..c5d66fd8ba53
--- /dev/null
+++ b/include/linux/memfile_notifier.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMFILE_NOTIFIER_H
+#define _LINUX_MEMFILE_NOTIFIER_H
+
+#include <linux/pfn_t.h>
+#include <linux/rculist.h>
+#include <linux/spinlock.h>
+#include <linux/srcu.h>
+#include <linux/fs.h>
+
+/* memory in the file is inaccessible from userspace (e.g. read/write/mmap) */
+#define MEMFILE_F_USER_INACCESSIBLE	BIT(0)
+/* memory in the file is unmovable (e.g. via page migration) */
+#define MEMFILE_F_UNMOVABLE		BIT(1)
+/* memory in the file is unreclaimable (e.g. via kswapd) */
+#define MEMFILE_F_UNRECLAIMABLE		BIT(2)
+
+#define MEMFILE_F_ALLOWED_MASK		(MEMFILE_F_USER_INACCESSIBLE | \
+					MEMFILE_F_UNMOVABLE | \
+					MEMFILE_F_UNRECLAIMABLE)
+
+struct memfile_node {
+	struct list_head	notifiers;	/* registered notifiers */
+	unsigned long		flags;		/* MEMFILE_F_* flags */
+};
+
+struct memfile_backing_store {
+	struct list_head list;
+	spinlock_t lock;
+	struct memfile_node* (*lookup_memfile_node)(struct file *file);
+	int (*get_pfn)(struct file *file, pgoff_t offset, pfn_t *pfn,
+		       int *order);
+	void (*put_pfn)(pfn_t pfn);
+};
+
+struct memfile_notifier;
+struct memfile_notifier_ops {
+	void (*invalidate)(struct memfile_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct memfile_notifier {
+	struct list_head list;
+	struct memfile_notifier_ops *ops;
+	struct memfile_backing_store *bs;
+};
+
+static inline void memfile_node_init(struct memfile_node *node)
+{
+	INIT_LIST_HEAD(&node->notifiers);
+	node->flags = 0;
+}
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+/* APIs for backing stores */
+extern void memfile_register_backing_store(struct memfile_backing_store *bs);
+extern int memfile_node_set_flags(struct file *file, unsigned long flags);
+extern void memfile_notifier_invalidate(struct memfile_node *node,
+					pgoff_t start, pgoff_t end);
+/* APIs for notifier consumers */
+extern int memfile_register_notifier(struct file *file, unsigned long flags,
+				     struct memfile_notifier *notifier);
+extern void memfile_unregister_notifier(struct memfile_notifier *notifier);
+
+#else /* !CONFIG_MEMFILE_NOTIFIER */
+static inline void memfile_register_backing_store(struct memfile_backing_store *bs)
+{
+}
+
+static inline int memfile_node_set_flags(struct file *file, unsigned long flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void memfile_notifier_invalidate(struct memfile_node *node,
+					       pgoff_t start, pgoff_t end)
+{
+}
+
+static inline int memfile_register_notifier(struct file *file,
+					    unsigned long flags,
+					    struct memfile_notifier *notifier)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void memfile_unregister_notifier(struct memfile_notifier *notifier)
+{
+}
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
+#endif /* _LINUX_MEMFILE_NOTIFIER_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..19ab9350f5cb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1130,6 +1130,10 @@ config PTE_MARKER_UFFD_WP
 	  purposes.  It is required to enable userfaultfd write protection on
 	  file-backed memory types like shmem and hugetlbfs.
 
+config MEMFILE_NOTIFIER
+	bool
+	select SRCU
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..b7e3fb5fa85b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -133,3 +133,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
+obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o
diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
new file mode 100644
index 000000000000..799d3197903e
--- /dev/null
+++ b/mm/memfile_notifier.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Copyright (C) 2022  Intel Corporation.
+ *             Chao Peng <chao.p.peng@linux.intel.com>
+ */
+
+#include <linux/memfile_notifier.h>
+#include <linux/pagemap.h>
+#include <linux/srcu.h>
+
+DEFINE_STATIC_SRCU(memfile_srcu);
+static __ro_after_init LIST_HEAD(backing_store_list);
+
+
+void memfile_notifier_invalidate(struct memfile_node *node,
+				 pgoff_t start, pgoff_t end)
+{
+	struct memfile_notifier *notifier;
+	int id;
+
+	id = srcu_read_lock(&memfile_srcu);
+	list_for_each_entry_srcu(notifier, &node->notifiers, list,
+				 srcu_read_lock_held(&memfile_srcu)) {
+		if (notifier->ops->invalidate)
+			notifier->ops->invalidate(notifier, start, end);
+	}
+	srcu_read_unlock(&memfile_srcu, id);
+}
+
+void __init memfile_register_backing_store(struct memfile_backing_store *bs)
+{
+	spin_lock_init(&bs->lock);
+	list_add_tail(&bs->list, &backing_store_list);
+}
+
+static void memfile_node_update_flags(struct file *file, unsigned long flags)
+{
+	struct address_space *mapping = file_inode(file)->i_mapping;
+	gfp_t gfp;
+
+	gfp = mapping_gfp_mask(mapping);
+	if (flags & MEMFILE_F_UNMOVABLE)
+		gfp &= ~__GFP_MOVABLE;
+	else
+		gfp |= __GFP_MOVABLE;
+	mapping_set_gfp_mask(mapping, gfp);
+
+	if (flags & MEMFILE_F_UNRECLAIMABLE)
+		mapping_set_unevictable(mapping);
+	else
+		mapping_clear_unevictable(mapping);
+}
+
+int memfile_node_set_flags(struct file *file, unsigned long flags)
+{
+	struct memfile_backing_store *bs;
+	struct memfile_node *node;
+
+	if (flags & ~MEMFILE_F_ALLOWED_MASK)
+		return -EINVAL;
+
+	list_for_each_entry(bs, &backing_store_list, list) {
+		node = bs->lookup_memfile_node(file);
+		if (node) {
+			spin_lock(&bs->lock);
+			node->flags = flags;
+			spin_unlock(&bs->lock);
+			memfile_node_update_flags(file, flags);
+			return 0;
+		}
+	}
+
+	return -EOPNOTSUPP;
+}
+
+int memfile_register_notifier(struct file *file, unsigned long flags,
+			      struct memfile_notifier *notifier)
+{
+	struct memfile_backing_store *bs;
+	struct memfile_node *node;
+	struct list_head *list;
+
+	if (!file || !notifier || !notifier->ops)
+		return -EINVAL;
+	if (flags & ~MEMFILE_F_ALLOWED_MASK)
+		return -EINVAL;
+
+	list_for_each_entry(bs, &backing_store_list, list) {
+		node = bs->lookup_memfile_node(file);
+		if (node) {
+			list = &node->notifiers;
+			notifier->bs = bs;
+
+			spin_lock(&bs->lock);
+			if (list_empty(list))
+				node->flags = flags;
+			else if (node->flags ^ flags) {
+				spin_unlock(&bs->lock);
+				return -EINVAL;
+			}
+
+			list_add_rcu(&notifier->list, list);
+			spin_unlock(&bs->lock);
+			memfile_node_update_flags(file, flags);
+			return 0;
+		}
+	}
+
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(memfile_register_notifier);
+
+void memfile_unregister_notifier(struct memfile_notifier *notifier)
+{
+	spin_lock(&notifier->bs->lock);
+	list_del_rcu(&notifier->list);
+	spin_unlock(&notifier->bs->lock);
+
+	synchronize_srcu(&memfile_srcu);
+}
+EXPORT_SYMBOL_GPL(memfile_unregister_notifier);
-- 
2.25.1



* [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (2 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 03/14] mm: Introduce memfile_notifier Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-12 18:02   ` Gupta, Pankaj
  2022-08-05 13:26   ` David Hildenbrand
  2022-07-06  8:20 ` [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Implement shmem as a memfile_notifier backing store. Essentially it
honors the memfile_notifier feature flags for userspace access, page
migration and page reclaim, and implements the necessary
memfile_backing_store callbacks.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/shmem_fs.h |   2 +
 mm/shmem.c               | 109 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a68f982f22d1..6031c0b08d26 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
 #include <linux/percpu_counter.h>
 #include <linux/xattr.h>
 #include <linux/fs_parser.h>
+#include <linux/memfile_notifier.h>
 
 /* inode in-kernel data */
 
@@ -25,6 +26,7 @@ struct shmem_inode_info {
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
 	struct timespec64	i_crtime;	/* file creation time */
+	struct memfile_node	memfile_node;	/* memfile node */
 	struct inode		vfs_inode;
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 6c8aef15a17d..627e315c3b4d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -905,6 +905,17 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	return page ? page_folio(page) : NULL;
 }
 
+static void notify_invalidate(struct inode *inode, struct folio *folio,
+				   pgoff_t start, pgoff_t end)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	start = max(start, folio->index);
+	end = min(end, folio->index + folio_nr_pages(folio));
+
+	memfile_notifier_invalidate(&info->memfile_node, start, end);
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -948,6 +959,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			index += folio_nr_pages(folio) - 1;
 
+			notify_invalidate(inode, folio, start, end);
+
 			if (!unfalloc || !folio_test_uptodate(folio))
 				truncate_inode_folio(mapping, folio);
 			folio_unlock(folio);
@@ -1021,6 +1034,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					index--;
 					break;
 				}
+
+				notify_invalidate(inode, folio, start, end);
+
 				VM_BUG_ON_FOLIO(folio_test_writeback(folio),
 						folio);
 				truncate_inode_folio(mapping, folio);
@@ -1092,6 +1108,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
 			return -EPERM;
 
+		if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) {
+			if (oldsize)
+				return -EPERM;
+			if (!PAGE_ALIGNED(newsize))
+				return -EINVAL;
+		}
+
 		if (newsize != oldsize) {
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
 					oldsize, newsize);
@@ -1336,6 +1359,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		goto redirty;
 	if (!total_swap_pages)
 		goto redirty;
+	if (info->memfile_node.flags & MEMFILE_F_UNRECLAIMABLE)
+		goto redirty;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -2271,6 +2296,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE)
+		return -EPERM;
+
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
 
@@ -2306,6 +2334,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		info->i_crtime = inode->i_mtime;
 		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
+		memfile_node_init(&info->memfile_node);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
 		mapping_set_large_folios(inode->i_mapping);
@@ -2477,6 +2506,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
 			return -EPERM;
 	}
+	if (unlikely(info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE))
+		return -EPERM;
 
 	if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
 		sgp = SGP_NOALLOC;
@@ -2556,6 +2587,13 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
 			break;
+
+		if (SHMEM_I(inode)->memfile_node.flags &
+				MEMFILE_F_USER_INACCESSIBLE) {
+			error = -EPERM;
+			break;
+		}
+
 		if (index == end_index) {
 			nr = i_size & ~PAGE_MASK;
 			if (nr <= offset)
@@ -2697,6 +2735,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto out;
 		}
 
+		if ((info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) &&
+		    (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) {
+			error = -EINVAL;
+			goto out;
+		}
+
 		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -3806,6 +3850,20 @@ static int shmem_error_remove_page(struct address_space *mapping,
 	return 0;
 }
 
+#ifdef CONFIG_MIGRATION
+static int shmem_migrate_page(struct address_space *mapping,
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
+		return -EOPNOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.dirty_folio	= noop_dirty_folio,
@@ -3814,7 +3872,7 @@ const struct address_space_operations shmem_aops = {
 	.write_end	= shmem_write_end,
 #endif
 #ifdef CONFIG_MIGRATION
-	.migratepage	= migrate_page,
+	.migratepage	= shmem_migrate_page,
 #endif
 	.error_remove_page = shmem_error_remove_page,
 };
@@ -3931,6 +3989,51 @@ static struct file_system_type shmem_fs_type = {
 	.fs_flags	= FS_USERNS_MOUNT,
 };
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static struct memfile_node *shmem_lookup_memfile_node(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	if (!shmem_mapping(inode->i_mapping))
+		return NULL;
+
+	return &SHMEM_I(inode)->memfile_node;
+}
+
+
+static int shmem_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	unlock_page(page);
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	return 0;
+}
+
+static void shmem_put_pfn(pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+
+	if (!page)
+		return;
+
+	put_page(page);
+}
+
+static struct memfile_backing_store shmem_backing_store = {
+	.lookup_memfile_node = shmem_lookup_memfile_node,
+	.get_pfn = shmem_get_pfn,
+	.put_pfn = shmem_put_pfn,
+};
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 void __init shmem_init(void)
 {
 	int error;
@@ -3956,6 +4059,10 @@ void __init shmem_init(void)
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	memfile_register_backing_store(&shmem_backing_store);
+#endif
 	return;
 
 out1:
-- 
2.25.1



* [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (3 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 04/14] mm/shmem: Support memfile_notifier Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-08-05 13:28   ` David Hildenbrand
  2022-07-06  8:20 ` [PATCH v7 06/14] KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS Chao Peng
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

Introduce a new memfd_create() flag indicating that the content of the
created memfd is inaccessible from userspace through ordinary MMU
access (e.g., read/write/mmap). However, the file content can still be
accessed indirectly via a different mechanism (e.g. the KVM MMU).

It provides the semantics required for KVM guest private memory support:
a file descriptor with this flag set is going to be used as the source
of guest memory in confidential computing environments such as Intel
TDX/AMD SEV, but may not be accessible from host userspace.

The flag cannot coexist with MFD_ALLOW_SEALING; future sealing is also
impossible for a memfd created with this flag.
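
A hypothetical userspace snippet of the expected behaviour (the flag
value below is taken from this patch; the EPERM results come from the
shmem patch earlier in this series):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE        0x0008U /* from this patch */
#endif

static int inaccessible_demo(size_t size)       /* size: page aligned */
{
        char buf[16] = {};
        int fd = memfd_create("guest-private", MFD_INACCESSIBLE);

        ftruncate(fd, size);            /* ok: grow from 0, page aligned */
        fallocate(fd, 0, 0, size);      /* ok: reserve the memory */

        read(fd, buf, sizeof(buf));     /* expected to fail (EPERM) */
        write(fd, buf, sizeof(buf));    /* expected to fail (EPERM) */
        /* expected to fail (EPERM) as well: */
        mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        return fd;              /* content only usable via e.g. the KVM MMU */
}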

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 2afd898798e4..72d7139ccced 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -18,6 +18,7 @@
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
 #include <linux/memfd.h>
+#include <linux/memfile_notifier.h>
 #include <uapi/linux/memfd.h>
 
 /*
@@ -262,7 +263,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -284,6 +286,10 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -330,12 +336,19 @@ SYSCALL_DEFINE2(memfd_create,
 	if (flags & MFD_ALLOW_SEALING) {
 		file_seals = memfd_file_seals_ptr(file);
 		*file_seals &= ~F_SEAL_SEAL;
+	} else if (flags & MFD_INACCESSIBLE) {
+		error = memfile_node_set_flags(file,
+					       MEMFILE_F_USER_INACCESSIBLE);
+		if (error)
+			goto err_file;
 	}
 
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
-- 
2.25.1



* [PATCH v7 06/14] KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (4 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-06  8:20 ` [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

KVM_INTERNAL_MEM_SLOTS better reflects the fact that those slots are not
exposed to userspace and avoids confusion with the real private slots
that are going to be added.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/mips/include/asm/kvm_host.h | 2 +-
 arch/x86/include/asm/kvm_host.h  | 2 +-
 include/linux/kvm_host.h         | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 717716cc51c5..45a978c805bc 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -85,7 +85,7 @@
 
 #define KVM_MAX_VCPUS		16
 /* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS	0
+#define KVM_INTERNAL_MEM_SLOTS	0
 
 #define KVM_HALT_POLL_NS_DEFAULT 500000
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index de5a149d0971..dae190e19fce 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -53,7 +53,7 @@
 #define KVM_MAX_VCPU_IDS (KVM_MAX_VCPUS * KVM_VCPU_ID_RATIO)
 
 /* memory slots that are not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 3
+#define KVM_INTERNAL_MEM_SLOTS 3
 
 #define KVM_HALT_POLL_NS_DEFAULT 200000
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b40f8d68fbb..0bdb6044e316 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -656,12 +656,12 @@ struct kvm_irq_routing_table {
 };
 #endif
 
-#ifndef KVM_PRIVATE_MEM_SLOTS
-#define KVM_PRIVATE_MEM_SLOTS 0
+#ifndef KVM_INTERNAL_MEM_SLOTS
+#define KVM_INTERNAL_MEM_SLOTS 0
 #endif
 
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
-#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
+#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
 #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
-- 
2.25.1



* [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (5 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 06/14] KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-15 11:36   ` Gupta, Pankaj
  2022-08-04  7:10   ` Isaku Yamahata
  2022-07-06  8:20 ` [PATCH v7 08/14] KVM: Rename mmu_notifier_* Chao Peng
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

Currently in the mmu_notifier invalidate path, the hva range is recorded
and then checked by mmu_notifier_retry_hva() in the page fault path.
However, for the to-be-introduced private memory, a page fault may not
have an hva associated with it, so checking the gfn (gpa) makes more
sense. For the existing non-private memory case, gfn is expected to
continue to work.

The patch also fixes a potential bug in kvm_zap_gfn_range(), which has
already been passing gfns when calling kvm_inc/dec_notifier_count() in
the current code.
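
For reference, the retry pattern in a fault path then looks roughly like
the sketch below (helper names are those after this patch; the function
itself is illustrative and x86-specific, where mmu_lock is an rwlock):

#include <linux/kvm_host.h>

/* Returns false if an invalidation raced and the fault should be retried. */
static bool demo_map_gfn(struct kvm *kvm, gfn_t gfn)
{
        unsigned long mmu_seq = kvm->mmu_notifier_seq;
        bool installed = false;

        /* Pairs with the barrier on the invalidation side. */
        smp_rmb();

        /* ... resolve the pfn here; may sleep and race with invalidation ... */

        write_lock(&kvm->mmu_lock);
        if (!mmu_notifier_retry_gfn(kvm, mmu_seq, gfn)) {
                /* Safe to install the mapping for gfn. */
                installed = true;
        }
        write_unlock(&kvm->mmu_lock);

        return installed;
}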

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 18 ++++++++----------
 virt/kvm/kvm_main.c      |  6 +++---
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f7fa4c31b7c5..0d882fad4bc1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+	       mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0bdb6044e316..e9153b54e2a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -767,8 +767,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_notifier_seq;
 	long mmu_notifier_count;
-	unsigned long mmu_notifier_range_start;
-	unsigned long mmu_notifier_range_end;
+	gfn_t mmu_notifier_range_start;
+	gfn_t mmu_notifier_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-				   unsigned long end);
-void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
-				   unsigned long end);
+void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1923,9 +1921,9 @@ static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_notifier_retry_hva(struct kvm *kvm,
+static inline int mmu_notifier_retry_gfn(struct kvm *kvm,
 					 unsigned long mmu_seq,
-					 unsigned long hva)
+					 gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1935,8 +1933,8 @@ static inline int mmu_notifier_retry_hva(struct kvm *kvm,
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
 	if (unlikely(kvm->mmu_notifier_count) &&
-	    hva >= kvm->mmu_notifier_range_start &&
-	    hva < kvm->mmu_notifier_range_end)
+	    gfn >= kvm->mmu_notifier_range_start &&
+	    gfn < kvm->mmu_notifier_range_end)
 		return 1;
 	if (kvm->mmu_notifier_seq != mmu_seq)
 		return 1;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index da263c370d00..4d7f0e72366f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -536,8 +536,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
+typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
@@ -624,7 +623,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm, gfn_range.start,
+							    gfn_range.end);
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
-- 
2.25.1



* [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (6 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-29 19:02   ` Sean Christopherson
  2023-05-23  7:19   ` Kautuk Consul
  2022-07-06  8:20 ` [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory Chao Peng
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

The sync mechanism between the mmu_notifier and the page fault handler
employs the fields mmu_notifier_seq/count and
mmu_notifier_range_start/end. The to-be-added private memory needs the
same mechanism, but it does not rely on mmu_notifier (it uses the newly
introduced memfile_notifier instead). This patch renames the existing
fields and related helper functions to the neutral name mmu_updating_*
so that private memory can reuse them.

No functional change intended.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/arm64/kvm/mmu.c                     |  8 ++---
 arch/mips/kvm/mmu.c                      | 10 +++---
 arch/powerpc/include/asm/kvm_book3s_64.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c    |  4 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  4 +--
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  6 ++--
 arch/powerpc/kvm/book3s_hv_nested.c      |  2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  8 ++---
 arch/powerpc/kvm/e500_mmu_host.c         |  4 +--
 arch/riscv/kvm/mmu.c                     |  4 +--
 arch/x86/kvm/mmu/mmu.c                   | 14 ++++----
 arch/x86/kvm/mmu/paging_tmpl.h           |  4 +--
 include/linux/kvm_host.h                 | 38 ++++++++++-----------
 virt/kvm/kvm_main.c                      | 42 +++++++++++-------------
 virt/kvm/pfncache.c                      | 14 ++++----
 15 files changed, 81 insertions(+), 83 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 87f1cd0df36e..7ee6fafc24ee 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -993,7 +993,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		 * THP doesn't start to split while we are adjusting the
 		 * refcounts.
 		 *
-		 * We are sure this doesn't happen, because mmu_notifier_retry
+		 * We are sure this doesn't happen, because mmu_updating_retry
 		 * was successful and we are holding the mmu_lock, so if this
 		 * THP is trying to split, it will be blocked in the mmu
 		 * notifier before touching any of the pages, specifically
@@ -1188,9 +1188,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			return ret;
 	}
 
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	mmu_seq = vcpu->kvm->mmu_updating_seq;
 	/*
-	 * Ensure the read of mmu_notifier_seq happens before we call
+	 * Ensure the read of mmu_updating_seq happens before we call
 	 * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
 	 * the page we just got a reference to gets unmapped before we have a
 	 * chance to grab the mmu_lock, which ensure that if the page gets
@@ -1246,7 +1246,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	else
 		write_lock(&kvm->mmu_lock);
 	pgt = vcpu->arch.hw_mmu->pgt;
-	if (mmu_notifier_retry(kvm, mmu_seq))
+	if (mmu_updating_retry(kvm, mmu_seq))
 		goto out_unlock;
 
 	/*
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 1bfd1b501d82..abd468c6a749 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -615,17 +615,17 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
 	 * Used to check for invalidations in progress, of the pfn that is
 	 * returned by pfn_to_pfn_prot below.
 	 */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	/*
-	 * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
+	 * Ensure the read of mmu_updating_seq isn't reordered with PTE reads in
 	 * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
 	 * risk the page we get a reference to getting unmapped before we have a
-	 * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
+	 * chance to grab the mmu_lock without mmu_updating_retry() noticing.
 	 *
 	 * This smp_rmb() pairs with the effective smp_wmb() of the combination
 	 * of the pte_unmap_unlock() after the PTE is zapped, and the
 	 * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
-	 * mmu_notifier_seq is incremented.
+	 * mmu_updating_seq is incremented.
 	 */
 	smp_rmb();
 
@@ -638,7 +638,7 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
 
 	spin_lock(&kvm->mmu_lock);
 	/* Check if an invalidation has taken place since we got pfn */
-	if (mmu_notifier_retry(kvm, mmu_seq)) {
+	if (mmu_updating_retry(kvm, mmu_seq)) {
 		/*
 		 * This can happen when mappings are changed asynchronously, but
 		 * also synchronously if a COW is triggered by
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 4def2bd17b9b..4d35fb913de5 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -666,7 +666,7 @@ static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq,
 	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
 		"%s called with kvm mmu_lock not held \n", __func__);
 
-	if (mmu_notifier_retry(kvm, mmu_seq))
+	if (mmu_updating_retry(kvm, mmu_seq))
 		return NULL;
 
 	pte = __find_linux_pte(kvm->mm->pgd, ea, NULL, hshift);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 1ae09992c9ea..78f1aae8cb60 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -90,7 +90,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 	unsigned long pfn;
 
 	/* used to check for invalidations in progress */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	/* Get host physical address for gpa */
@@ -151,7 +151,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 	cpte = kvmppc_mmu_hpte_cache_next(vcpu);
 
 	spin_lock(&kvm->mmu_lock);
-	if (!cpte || mmu_notifier_retry(kvm, mmu_seq)) {
+	if (!cpte || mmu_updating_retry(kvm, mmu_seq)) {
 		r = -EAGAIN;
 		goto out_unlock;
 	}
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 514fd45c1994..bcdec6a6f2a7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -578,7 +578,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 		return -EFAULT;
 
 	/* used to check for invalidations in progress */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	ret = -EFAULT;
@@ -693,7 +693,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 
 	/* Check if we might have been invalidated; let the guest retry if so */
 	ret = RESUME_GUEST;
-	if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
+	if (mmu_updating_retry(vcpu->kvm, mmu_seq)) {
 		unlock_rmap(rmap);
 		goto out_unlock;
 	}
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 42851c32ff3b..c8890ccc3f40 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -639,7 +639,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 	/* Check if we might have been invalidated; let the guest retry if so */
 	spin_lock(&kvm->mmu_lock);
 	ret = -EAGAIN;
-	if (mmu_notifier_retry(kvm, mmu_seq))
+	if (mmu_updating_retry(kvm, mmu_seq))
 		goto out_unlock;
 
 	/* Now traverse again under the lock and change the tree */
@@ -829,7 +829,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 	bool large_enable;
 
 	/* used to check for invalidations in progress */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	/*
@@ -1190,7 +1190,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
 	 * Increase the mmu notifier sequence number to prevent any page
 	 * fault that read the memslot earlier from writing a PTE.
 	 */
-	kvm->mmu_notifier_seq++;
+	kvm->mmu_updating_seq++;
 	spin_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 0644732d1a25..09f841f730da 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1579,7 +1579,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
 	/* 2. Find the host pte for this L1 guest real address */
 
 	/* Used to check for invalidations in progress */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	/* See if can find translation in our partition scoped tables for L1 */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 2257fb18cb72..952b504dc98a 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -219,7 +219,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	g_ptel = ptel;
 
 	/* used later to detect if we might have been invalidated */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	/* Find the memslot (if any) for this address */
@@ -366,7 +366,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 			rmap = real_vmalloc_addr(rmap);
 		lock_rmap(rmap);
 		/* Check for pending invalidations under the rmap chain lock */
-		if (mmu_notifier_retry(kvm, mmu_seq)) {
+		if (mmu_updating_retry(kvm, mmu_seq)) {
 			/* inval in progress, write a non-present HPTE */
 			pteh |= HPTE_V_ABSENT;
 			pteh &= ~HPTE_V_VALID;
@@ -932,7 +932,7 @@ static long kvmppc_do_h_page_init_zero(struct kvm_vcpu *vcpu,
 	int i;
 
 	/* Used later to detect if we might have been invalidated */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
@@ -960,7 +960,7 @@ static long kvmppc_do_h_page_init_copy(struct kvm_vcpu *vcpu,
 	long ret = H_SUCCESS;
 
 	/* Used later to detect if we might have been invalidated */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 7f16afc331ef..d7636b926f25 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -339,7 +339,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	unsigned long flags;
 
 	/* used to check for invalidations in progress */
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 	smp_rmb();
 
 	/*
@@ -460,7 +460,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	}
 
 	spin_lock(&kvm->mmu_lock);
-	if (mmu_notifier_retry(kvm, mmu_seq)) {
+	if (mmu_updating_retry(kvm, mmu_seq)) {
 		ret = -EAGAIN;
 		goto out;
 	}
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 081f8d2b9cf3..a7db374d3861 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -654,7 +654,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
 		return ret;
 	}
 
-	mmu_seq = kvm->mmu_notifier_seq;
+	mmu_seq = kvm->mmu_updating_seq;
 
 	hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable);
 	if (hfn == KVM_PFN_ERR_HWPOISON) {
@@ -674,7 +674,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
 
 	spin_lock(&kvm->mmu_lock);
 
-	if (mmu_notifier_retry(kvm, mmu_seq))
+	if (mmu_updating_retry(kvm, mmu_seq))
 		goto out_unlock;
 
 	if (writeable) {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d882fad4bc1..545eb74305fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2908,7 +2908,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	 * If addresses are being invalidated, skip prefetching to avoid
 	 * accidentally prefetching those addresses.
 	 */
-	if (unlikely(vcpu->kvm->mmu_notifier_count))
+	if (unlikely(vcpu->kvm->mmu_updating_count))
 		return;
 
 	__direct_pte_prefetch(vcpu, sp, sptep);
@@ -2950,7 +2950,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	/*
 	 * Lookup the mapping level in the current mm.  The information
 	 * may become stale soon, but it is safe to use as long as
-	 * 1) mmu_notifier_retry was checked after taking mmu_lock, and
+	 * 1) mmu_updating_retry was checked after taking mmu_lock, and
 	 * 2) mmu_lock is taken now.
 	 *
 	 * We still need to disable IRQs to prevent concurrent tear down
@@ -3035,7 +3035,7 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		return;
 
 	/*
-	 * mmu_notifier_retry() was successful and mmu_lock is held, so
+	 * mmu_updating_retry was successful and mmu_lock is held, so
 	 * the pmd can't be split from under us.
 	 */
 	fault->goal_level = fault->req_level;
@@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
+	       mmu_updating_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -4206,7 +4206,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (r)
 		return r;
 
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	mmu_seq = vcpu->kvm->mmu_updating_seq;
 	smp_rmb();
 
 	r = kvm_faultin_pfn(vcpu, fault);
@@ -6023,7 +6023,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
+	kvm_mmu_updating_begin(kvm, gfn_start, gfn_end);
 
 	flush = __kvm_zap_rmaps(kvm, gfn_start, gfn_end);
 
@@ -6037,7 +6037,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
 						   gfn_end - gfn_start);
 
-	kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
+	kvm_mmu_updating_end(kvm, gfn_start, gfn_end);
 
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 2448fa8d8438..acf7e41aa02b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -589,7 +589,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 	 * If addresses are being invalidated, skip prefetching to avoid
 	 * accidentally prefetching those addresses.
 	 */
-	if (unlikely(vcpu->kvm->mmu_notifier_count))
+	if (unlikely(vcpu->kvm->mmu_updating_count))
 		return;
 
 	if (sp->role.direct)
@@ -838,7 +838,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	else
 		fault->max_level = walker.level;
 
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	mmu_seq = vcpu->kvm->mmu_updating_seq;
 	smp_rmb();
 
 	r = kvm_faultin_pfn(vcpu, fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e9153b54e2a4..c262ebb168a7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -765,10 +765,10 @@ struct kvm {
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	struct mmu_notifier mmu_notifier;
-	unsigned long mmu_notifier_seq;
-	long mmu_notifier_count;
-	gfn_t mmu_notifier_range_start;
-	gfn_t mmu_notifier_range_end;
+	unsigned long mmu_updating_seq;
+	long mmu_updating_count;
+	gfn_t mmu_updating_range_start;
+	gfn_t mmu_updating_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1362,8 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
-void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1901,42 +1901,42 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
+static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
-	if (unlikely(kvm->mmu_notifier_count))
+	if (unlikely(kvm->mmu_updating_count))
 		return 1;
 	/*
-	 * Ensure the read of mmu_notifier_count happens before the read
-	 * of mmu_notifier_seq.  This interacts with the smp_wmb() in
+	 * Ensure the read of mmu_updating_count happens before the read
+	 * of mmu_updating_seq.  This interacts with the smp_wmb() in
 	 * mmu_notifier_invalidate_range_end to make sure that the caller
-	 * either sees the old (non-zero) value of mmu_notifier_count or
-	 * the new (incremented) value of mmu_notifier_seq.
+	 * either sees the old (non-zero) value of mmu_updating_count or
+	 * the new (incremented) value of mmu_updating_seq.
 	 * PowerPC Book3s HV KVM calls this under a per-page lock
 	 * rather than under kvm->mmu_lock, for scalability, so
 	 * can't rely on kvm->mmu_lock to keep things ordered.
 	 */
 	smp_rmb();
-	if (kvm->mmu_notifier_seq != mmu_seq)
+	if (kvm->mmu_updating_seq != mmu_seq)
 		return 1;
 	return 0;
 }
 
-static inline int mmu_notifier_retry_gfn(struct kvm *kvm,
+static inline int mmu_updating_retry_gfn(struct kvm *kvm,
 					 unsigned long mmu_seq,
 					 gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
-	 * If mmu_notifier_count is non-zero, then the range maintained by
+	 * If mmu_updating_count is non-zero, then the range maintained by
 	 * kvm_mmu_notifier_invalidate_range_start contains all addresses that
 	 * might be being invalidated. Note that it may include some false
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
-	if (unlikely(kvm->mmu_notifier_count) &&
-	    gfn >= kvm->mmu_notifier_range_start &&
-	    gfn < kvm->mmu_notifier_range_end)
+	if (unlikely(kvm->mmu_updating_count) &&
+	    gfn >= kvm->mmu_updating_range_start &&
+	    gfn < kvm->mmu_updating_range_end)
 		return 1;
-	if (kvm->mmu_notifier_seq != mmu_seq)
+	if (kvm->mmu_updating_seq != mmu_seq)
 		return 1;
 	return 0;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4d7f0e72366f..3ae4944b9f15 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -698,30 +698,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 	/*
 	 * .change_pte() must be surrounded by .invalidate_range_{start,end}().
-	 * If mmu_notifier_count is zero, then no in-progress invalidations,
+	 * If mmu_updating_count is zero, then no in-progress invalidations,
 	 * including this one, found a relevant memslot at start(); rechecking
 	 * memslots here is unnecessary.  Note, a false positive (count elevated
 	 * by a different invalidation) is sub-optimal but functionally ok.
 	 */
 	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
-	if (!READ_ONCE(kvm->mmu_notifier_count))
+	if (!READ_ONCE(kvm->mmu_updating_count))
 		return;
 
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-				   unsigned long end)
+void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end)
 {
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
 	 * count is also read inside the mmu_lock critical section.
 	 */
-	kvm->mmu_notifier_count++;
-	if (likely(kvm->mmu_notifier_count == 1)) {
-		kvm->mmu_notifier_range_start = start;
-		kvm->mmu_notifier_range_end = end;
+	kvm->mmu_updating_count++;
+	if (likely(kvm->mmu_updating_count == 1)) {
+		kvm->mmu_updating_range_start = start;
+		kvm->mmu_updating_range_end = end;
 	} else {
 		/*
 		 * Fully tracking multiple concurrent ranges has diminishing
@@ -732,10 +731,10 @@ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
 		 * accumulate and persist until all outstanding invalidates
 		 * complete.
 		 */
-		kvm->mmu_notifier_range_start =
-			min(kvm->mmu_notifier_range_start, start);
-		kvm->mmu_notifier_range_end =
-			max(kvm->mmu_notifier_range_end, end);
+		kvm->mmu_updating_range_start =
+			min(kvm->mmu_updating_range_start, start);
+		kvm->mmu_updating_range_end =
+			max(kvm->mmu_updating_range_end, end);
 	}
 }
 
@@ -748,7 +747,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.end		= range->end,
 		.pte		= __pte(0),
 		.handler	= kvm_unmap_gfn_range,
-		.on_lock	= kvm_inc_notifier_count,
+		.on_lock	= kvm_mmu_updating_begin,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
@@ -759,7 +758,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	/*
 	 * Prevent memslot modification between range_start() and range_end()
 	 * so that conditionally locking provides the same result in both
-	 * functions.  Without that guarantee, the mmu_notifier_count
+	 * functions.  Without that guarantee, the mmu_updating_count
 	 * adjustments will be imbalanced.
 	 *
 	 * Pairs with the decrement in range_end().
@@ -775,7 +774,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * any given time, and the caches themselves can check for hva overlap,
 	 * i.e. don't need to rely on memslot overlap checks for performance.
 	 * Because this runs without holding mmu_lock, the pfn caches must use
-	 * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
+	 * mn_active_invalidate_count (see above) instead of mmu_updating_count.
 	 */
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
 					  hva_range.may_block);
@@ -785,22 +784,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
-				   unsigned long end)
+void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
 	 * been freed.
 	 */
-	kvm->mmu_notifier_seq++;
+	kvm->mmu_updating_seq++;
 	smp_wmb();
 	/*
 	 * The above sequence increase must be visible before the
 	 * below count decrease, which is ensured by the smp_wmb above
-	 * in conjunction with the smp_rmb in mmu_notifier_retry().
+	 * in conjunction with the smp_rmb in mmu_updating_retry().
 	 */
-	kvm->mmu_notifier_count--;
+	kvm->mmu_updating_count--;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -812,7 +810,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		.end		= range->end,
 		.pte		= __pte(0),
 		.handler	= (void *)kvm_null_fn,
-		.on_lock	= kvm_dec_notifier_count,
+		.on_lock	= kvm_mmu_updating_end,
 		.on_unlock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= mmu_notifier_range_blockable(range),
@@ -833,7 +831,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	if (wake)
 		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
 
-	BUG_ON(kvm->mmu_notifier_count < 0);
+	BUG_ON(kvm->mmu_updating_count < 0);
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index ab519f72f2cd..aa6d24966a76 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,27 +112,27 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
 {
 	/*
 	 * mn_active_invalidate_count acts for all intents and purposes
-	 * like mmu_notifier_count here; but the latter cannot be used
+	 * like mmu_updating_count here; but the latter cannot be used
 	 * here because the invalidation of caches in the mmu_notifier
-	 * event occurs _before_ mmu_notifier_count is elevated.
+	 * event occurs _before_ mmu_updating_count is elevated.
 	 *
 	 * Note, it does not matter that mn_active_invalidate_count
 	 * is not protected by gpc->lock.  It is guaranteed to
 	 * be elevated before the mmu_notifier acquires gpc->lock, and
-	 * isn't dropped until after mmu_notifier_seq is updated.
+	 * isn't dropped until after mmu_updating_seq is updated.
 	 */
 	if (kvm->mn_active_invalidate_count)
 		return true;
 
 	/*
 	 * Ensure mn_active_invalidate_count is read before
-	 * mmu_notifier_seq.  This pairs with the smp_wmb() in
+	 * mmu_updating_seq.  This pairs with the smp_wmb() in
 	 * mmu_notifier_invalidate_range_end() to guarantee either the
 	 * old (non-zero) value of mn_active_invalidate_count or the
-	 * new (incremented) value of mmu_notifier_seq is observed.
+	 * new (incremented) value of mmu_updating_seq is observed.
 	 */
 	smp_rmb();
-	return kvm->mmu_notifier_seq != mmu_seq;
+	return kvm->mmu_updating_seq != mmu_seq;
 }
 
 static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
@@ -155,7 +155,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
 	gpc->valid = false;
 
 	do {
-		mmu_seq = kvm->mmu_notifier_seq;
+		mmu_seq = kvm->mmu_updating_seq;
 		smp_rmb();
 
 		write_unlock_irq(&gpc->lock);
-- 
2.25.1



* [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (7 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 08/14] KVM: Rename mmu_notifier_* Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-29 19:51   ` Sean Christopherson
  2022-07-06  8:20 ` [PATCH v7 10/14] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

Extend the memslot definition to provide guest private memory through a
file descriptor (fd) instead of userspace_addr (hva). Such guest private
memory (fd) may never be mapped into userspace, so no userspace_addr
(hva) can be used. Instead, add two new fields (private_fd/private_offset)
which, together with the existing memory_size, represent the private
memory range. Such a memslot can still have the existing userspace_addr
(hva): when in use, a single memslot can maintain both private memory
through the private fd (private_fd/private_offset) and shared memory
through the hva (userspace_addr). Whether the private or the shared part
is effective for a guest GPA is maintained by other KVM code.

Since there is no userspace mapping for the private fd, we cannot rely
on get_user_pages() to get the pfn in KVM. Instead we add a new
memfile_notifier to the memslot and rely on it to get the pfn by
invoking the callbacks provided by the memory backing store with the
fd/offset.

This new extension is indicated by a new flag KVM_MEM_PRIVATE. At
compile time, a new config HAVE_KVM_PRIVATE_MEM is added; right now it
is selected on X86_64 for Intel TDX usage.

To keep the KVM code simple, internally we use a binary compatible alias
struct kvm_user_mem_region to handle both the normal and the '_ext'
variants.
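
For illustration only, a userspace VMM could fill in the '_ext' variant
roughly as below once KVM_MEM_PRIVATE is accepted (it is wired up later
in this series). The fd, hva and address values are hypothetical:

  struct kvm_userspace_memory_region_ext ext = {
          .region = {
                  .slot            = 0,
                  .flags           = KVM_MEM_PRIVATE,
                  .guest_phys_addr = 0x100000000ULL,
                  .memory_size     = 0x80000000ULL,
                  .userspace_addr  = (__u64)shared_hva, /* shared part */
          },
          .private_fd     = private_memfd, /* fd from the backing store */
          .private_offset = 0,             /* offset into the private fd */
  };

  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);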

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 38 ++++++++++++++++----
 arch/x86/kvm/Kconfig           |  2 ++
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       | 13 +++++--
 include/uapi/linux/kvm.h       | 28 +++++++++++++++
 virt/kvm/Kconfig               |  3 ++
 virt/kvm/kvm_main.c            | 64 +++++++++++++++++++++++++++++-----
 7 files changed, 132 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bafaeedd455c..4f27c973a952 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields. It also includes additional fields for some specific features. See
+the description of the flags field below for more information. It is
+recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports below flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only.  In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
+  a file descriptor (fd) and the content of the private memory is invisible to
+  userspace. In this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  the guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failure
+  to do so may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..1f160801e2a7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,8 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_PRIVATE_MEM if X86_64
+	select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 567d13405445..77d16b90045c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12154,7 +12154,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c262ebb168a7..1b203c8aa696 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/memfile_notifier.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -576,8 +577,16 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *private_file;
+	loff_t private_offset;
+	struct memfile_notifier notifier;
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -1109,9 +1118,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a36e78710382..c467c69b7ad7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index a8c5c9f06b3c..ccaff13cc5b8 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
        bool
+
+config HAVE_KVM_PRIVATE_MEM
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3ae4944b9f15..230c8ff9659c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1508,7 +1508,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1902,7 +1902,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2006,7 +2006,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -2018,7 +2018,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4608,6 +4608,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
+		     sizeof_field(struct kvm_userspace_memory_region, field));	\
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
+do {											\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
+		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
+		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size;
+		u32 flags;
+
+		kvm_sanity_check_user_mem_region_alias();
+
+		memset(&mem, 0, sizeof(mem));
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+
+		if (get_user(flags,
+			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
+			goto out;
+
+		if (flags & KVM_MEM_PRIVATE) {
+			r = -EINVAL;
+			goto out;
+		}
+
+		size = sizeof(struct kvm_userspace_memory_region);
+
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.25.1



* [PATCH v7 10/14] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (8 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-06  8:20 ` [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Chao Peng
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at the guest memory range
[gpa, gpa+size). The flags field carries additional information to help
userspace handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates the error happened due to a shared memory
access.

After private memory is enabled, this new exit will be used by KVM to
exit to userspace for shared memory <-> private memory conversion in
memory encryption usage.

In such usage, there are typically two kinds of memory conversion:
  - explicit conversion: happens when the guest explicitly calls into
    KVM to map a range (as private or shared); KVM then exits to
    userspace to do the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler.
    * If the fault is due to a private memory access, it causes a
      userspace exit for a shared->private conversion when the page is
      recognized as shared by KVM.
    * If the fault is due to a shared memory access, it causes a
      userspace exit for a private->shared conversion when the page is
      recognized as private by KVM.
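
As a rough sketch of the intended flow, a VMM run loop could consume the
exit as below. The convert_to_private()/convert_to_shared() helpers are
hypothetical VMM functions (e.g. punching a hole in one backing fd and
populating the other):

  struct kvm_run *run;    /* mmap()ed from vcpu_fd */

  ioctl(vcpu_fd, KVM_RUN, 0);
  switch (run->exit_reason) {
  case KVM_EXIT_MEMORY_FAULT:
          if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
                  /* Guest accessed the range as private: shared->private. */
                  convert_to_private(run->memory.gpa, run->memory.size);
          else
                  /* Guest accessed the range as shared: private->shared. */
                  convert_to_shared(run->memory.gpa, run->memory.size);
          break;          /* re-enter the guest to retry the access */
  /* ... other exit reasons ... */
  }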

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++++
 2 files changed, 31 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 4f27c973a952..5ecfc7fbe0ee 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6583,6 +6583,28 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+If the exit reason is KVM_EXIT_MEMORY_FAULT, it indicates that the VCPU has
+encountered a memory error which is not handled by the KVM kernel module and
+which userspace may choose to handle. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - when set, indicates the memory error was
+   caused by a private memory access; when clear, indicates the memory error
+   was caused by a shared memory access.
+
+'gpa' and 'size' indicate the memory range the error occurred at. Userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c467c69b7ad7..83c278f284dd 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -299,6 +299,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_NOTIFY           36
+#define KVM_EXIT_MEMORY_FAULT     37
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -530,6 +531,14 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1



* [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (9 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 10/14] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-19  8:00   ` Gupta, Pankaj
                     ` (3 more replies)
  2022-07-06  8:20 ` [PATCH v7 12/14] KVM: Handle page fault for private memory Chao Peng
                   ` (7 subsequent siblings)
  18 siblings, 4 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister
guest private memory regions through the
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls. The patch reuses the existing
SEV ioctls but differs in that the address in the region is a gpa for
private memory, while in the SEV case it is an hva.

The private memory regions are stored in an xarray in KVM for memory
efficiency in normal usages, and zapping the existing memory mappings is
also a side effect of these two ioctls.
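
For illustration, under these semantics a VMM would mark a GPA range as
private (and later convert it back to shared) roughly as below; vm_fd
and the exact range are hypothetical:

  struct kvm_enc_region region = {
          .addr = 0x100000000ULL, /* guest physical address (GPA) */
          .size = 0x200000,       /* must be page aligned */
  };

  /* Mark the range private; existing mappings in KVM are zapped. */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);

  /* Convert the same range back to shared. */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);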

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst  | 17 +++++++---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu.h              |  2 --
 include/linux/kvm_host.h        |  8 +++++
 virt/kvm/kvm_main.c             | 57 +++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5ecfc7fbe0ee..dfb4caecab73 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4715,10 +4715,19 @@ Documentation/virt/kvm/amd-memory-encryption.rst.
 This ioctl can be used to register a guest memory region which may
 contain encrypted data (e.g. guest RAM, SMRAM etc).
 
-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
+Currently this ioctl supports registering memory regions for two usages:
+private memory and SEV-encrypted memory.
+
+When private memory is enabled, this ioctl is used to register a guest private
+memory region and the addr/size of kvm_enc_region represents a guest physical
+address (GPA). In this usage, this ioctl zaps the existing guest memory
+mappings in KVM that fall into the region.
+
+When SEV-encrypted memory is enabled, this ioctl is used to register guest
+memory region which may contain encrypted data for a SEV-enabled guest. The
+addr/size of kvm_enc_region represents userspace address (HVA). The SEV
+memory encryption engine uses a tweak such that two identical plaintext pages,
+each at different locations will have differing ciphertexts. So swapping or
 moving ciphertext of those pages will not result in plaintext being
 swapped. So relocating (or migrating) physical backing pages for the SEV
 guest will require some additional steps.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dae190e19fce..92120e3a224e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 #include <asm/hyperv-tlfs.h>
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ZAP_GFN_RANGE
 
 #define KVM_MAX_VCPUS 1024
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1f160801e2a7..05861b9656a4 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,7 @@ config KVM
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select HAVE_KVM_PRIVATE_MEM if X86_64
 	select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
+	select XARRAY_MULTI if HAVE_KVM_PRIVATE_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a99acec925eb..428cd2e88cbd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -209,8 +209,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return -(u32)fault & errcode;
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b203c8aa696..da33f8828456 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 #endif
 
+#ifdef __KVM_HAVE_ZAP_GFN_RANGE
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+#endif
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
@@ -795,6 +799,9 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	struct xarray mem_attr_array;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
@@ -1459,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_mem_supported(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 230c8ff9659c..bb714c2a4b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_PRIVATE	0x0001
+static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
+					     struct kvm_enc_region *region)
+{
+	unsigned long start, end;
+	void *entry;
+	int r;
+
+	if (region->size == 0 || region->addr + region->size < region->addr)
+		return -EINVAL;
+	if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
+		return -EINVAL;
+
+	start = region->addr >> PAGE_SHIFT;
+	end = (region->addr + region->size - 1) >> PAGE_SHIFT;
+
+	entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
+				xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
+
+	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
+					entry, GFP_KERNEL_ACCOUNT));
+
+	kvm_zap_gfn_range(kvm, start, end + 1);
+
+	return r;
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -1138,6 +1167,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	xa_init(&kvm->mem_attr_array);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -1305,6 +1337,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	xa_destroy(&kvm->mem_attr_array);
+#endif
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_free_vm(kvm);
@@ -1508,6 +1543,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
+{
+	return false;
+}
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+
+		if (!kvm_arch_private_mem_supported(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
+		break;
+	}
+#endif
 	case KVM_GET_DIRTY_LOG: {
 		struct kvm_dirty_log log;
 
@@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
 	default:
+arch_vm_ioctl:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
 out:
-- 
2.25.1



* [PATCH v7 12/14] KVM: Handle page fault for private memory
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (10 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-29 20:58   ` Sean Christopherson
  2022-07-06  8:20 ` [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

A page fault can carry the private/shared access information for a
KVM_MEM_PRIVATE memslot; this can be filled in by architecture code
(like TDX code). To handle a page fault for such an access, KVM maps the
page only when this private property matches the host's view of the
page.

For a successful match, the private pfn is obtained with the
memfile_notifier callbacks from the private fd, and the shared pfn is
obtained with the existing get_user_pages().

For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
userspace. Userspace can then convert the memory between private/shared
from the host's view and retry the access.
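
Condensed from the kvm_faultin_pfn_private() hunk below, the decision
logic is roughly the following (the code that fills in the rest of the
exit record is elided here):

  if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
          /* Host and fault views disagree: exit to userspace to convert. */
          vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
          return RET_PF_USER;
  }

  if (fault->is_private) {
          /* Private access: pfn comes from the backing store callbacks. */
          if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
                  return RET_PF_RETRY;
          return RET_PF_FIXED;
  }

  /* Shared access: fall through to the existing get_user_pages() path. */
  return RET_PF_CONTINUE;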

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 60 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++
 arch/x86/kvm/mmu/mmutrace.h     |  1 +
 include/linux/kvm_host.h        | 35 ++++++++++++++++++-
 4 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 545eb74305fe..27dbdd4fe8d1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (kvm_mem_is_private(kvm, gfn))
+		return max_level;
+
 	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
 	return min(host_level, max_level);
 }
@@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
+static inline u8 order_to_level(int order)
+{
+	enum pg_level level;
+
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)
+		if (order >= page_level_shift(level) - PAGE_SHIFT)
+			return level;
+	return level;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				   struct kvm_page_fault *fault)
+{
+	int order;
+	struct kvm_memory_slot *slot = fault->slot;
+	bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn);
+
+	if (fault->is_private != private_exist) {
+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+		if (fault->is_private)
+			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+		else
+			vcpu->run->memory.flags = 0;
+		vcpu->run->memory.padding = 0;
+		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+		vcpu->run->memory.size = PAGE_SIZE;
+		return RET_PF_USER;
+	}
+
+	if (fault->is_private) {
+		if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+			return RET_PF_RETRY;
+		fault->max_level = min(order_to_level(order), fault->max_level);
+		fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+		return RET_PF_FIXED;
+	}
+
+	/* Fault is shared, fallthrough. */
+	return RET_PF_CONTINUE;
+}
+
 static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
+	int r;
 
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
@@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			return RET_PF_EMULATE;
 	}
 
+	if (kvm_slot_can_be_private(slot)) {
+		r = kvm_faultin_pfn_private(vcpu, fault);
+		if (r != RET_PF_CONTINUE)
+			return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
+	}
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
 					  fault->write, &fault->map_writable,
@@ -4241,7 +4292,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+
+	if (fault->is_private)
+		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
@@ -5518,6 +5573,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 			return -EIO;
 	}
 
+	if (r == RET_PF_USER)
+		return 0;
+
 	if (r < 0)
 		return r;
 	if (r != RET_PF_EMULATE)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ae2d660e2dab..fb9c298abcf0 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -188,6 +188,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
+	const bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
@@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
@@ -252,6 +254,7 @@ enum {
 	RET_PF_RETRY,
 	RET_PF_EMULATE,
 	RET_PF_INVALID,
+	RET_PF_USER,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
 };
@@ -318,4 +321,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#ifndef CONFIG_HAVE_KVM_PRIVATE_MEM
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+					   kvm_pfn_t pfn)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
 TRACE_DEFINE_ENUM(RET_PF_RETRY);
 TRACE_DEFINE_ENUM(RET_PF_EMULATE);
 TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
 TRACE_DEFINE_ENUM(RET_PF_FIXED);
 TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index da33f8828456..8f56426aa1e3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -778,6 +778,10 @@ struct kvm {
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	struct mmu_notifier mmu_notifier;
+#endif
+
+#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) || \
+	defined(CONFIG_MEMFILE_NOTIFIER)
 	unsigned long mmu_updating_seq;
 	long mmu_updating_count;
 	gfn_t mmu_updating_range_start;
@@ -1917,7 +1921,8 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
 extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) || \
+	defined(CONFIG_MEMFILE_NOTIFIER)
 static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
 	if (unlikely(kvm->mmu_updating_count))
@@ -2266,4 +2271,32 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	int ret;
+	pfn_t pfnt;
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->private_offset >> PAGE_SHIFT);
+
+	ret = slot->notifier.bs->get_pfn(slot->private_file, index, &pfnt,
+					 order);
+	*pfn = pfn_t_to_pfn(pfnt);
+	return ret;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+					   kvm_pfn_t pfn)
+{
+	slot->notifier.bs->put_pfn(pfn_to_pfn_t(pfn));
+}
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return !!xa_load(&kvm->mem_attr_array, gfn);
+}
+
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (11 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 12/14] KVM: Handle page fault for private memory Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-07-19  9:55   ` Gupta, Pankaj
  2022-07-06  8:20 ` [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

Register the private memslot with the fd-based memory backing store and
handle the memfile notifiers to zap the existing mappings.

Currently the registration happens at memslot creation time and the
initial support does not include page migration/swap.

KVM_MEM_PRIVATE is not exposed by default; architecture code can turn
it on by implementing kvm_arch_private_mem_supported().

A 'kvm' reference is added to the memslot structure since the
memfile_notifier callbacks only provide a memslot reference, while the
kvm pointer is needed to do the zapping.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
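Not part of the patch: an illustrative userspace sketch of how such a
private memslot could be configured. struct kvm_userspace_memory_region_ext,
KVM_MEM_PRIVATE and MFD_INACCESSIBLE are all introduced by this series, so
the snippet only builds against the series' uapi headers, and the field
layout and names should be read as assumptions rather than stable ABI:

#define _GNU_SOURCE
#include <linux/memfd.h>        /* MFD_INACCESSIBLE, from this series */
#include <linux/kvm.h>          /* KVM_MEM_PRIVATE, *_region_ext, ditto */
#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: back the private half of a slot with an
 * inaccessible memfd while the shared half keeps using an hva.
 */
static int add_private_memslot(int vm_fd, uint32_t slot, uint64_t gpa,
                               uint64_t size, void *shared_hva)
{
        int priv_fd = memfd_create("guest-private", MFD_INACCESSIBLE);

        if (priv_fd < 0)
                err(1, "memfd_create");

        struct kvm_userspace_memory_region_ext ext = {
                .region = {
                        .slot            = slot,
                        .flags           = KVM_MEM_PRIVATE,
                        .guest_phys_addr = gpa,
                        .memory_size     = size,
                        /* the hva still backs the shared part of the slot */
                        .userspace_addr  = (uint64_t)(uintptr_t)shared_hva,
                },
                .private_fd     = priv_fd,
                .private_offset = 0,
        };

        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext) < 0)
                err(1, "KVM_SET_USER_MEMORY_REGION");

        return priv_fd;
}
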
 include/linux/kvm_host.h |   1 +
 virt/kvm/kvm_main.c      | 117 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 109 insertions(+), 9 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f56426aa1e3..4e5a0db68799 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -584,6 +584,7 @@ struct kvm_memory_slot {
 	struct file *private_file;
 	loff_t private_offset;
 	struct memfile_notifier notifier;
+	struct kvm *kvm;
 };
 
 static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bb714c2a4b06..d6f7e074cab2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -941,6 +941,63 @@ static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl
 
 	return r;
 }
+
+static void kvm_memfile_notifier_invalidate(struct memfile_notifier *notifier,
+					    pgoff_t start, pgoff_t end)
+{
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
+	gfn_t start_gfn = slot->base_gfn;
+	gfn_t end_gfn = slot->base_gfn + slot->npages;
+
+
+	if (start > base_pgoff)
+		start_gfn = slot->base_gfn + start - base_pgoff;
+
+	if (end < base_pgoff + slot->npages)
+		end_gfn = slot->base_gfn + end - base_pgoff;
+
+	if (start_gfn >= end_gfn)
+		return;
+
+	kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
+}
+
+static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
+	.invalidate = kvm_memfile_notifier_invalidate,
+};
+
+#define KVM_MEMFILE_FLAGS (MEMFILE_F_USER_INACCESSIBLE | \
+			   MEMFILE_F_UNMOVABLE | \
+			   MEMFILE_F_UNRECLAIMABLE)
+
+static inline int kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+	slot->notifier.ops = &kvm_memfile_notifier_ops;
+	return memfile_register_notifier(slot->private_file, KVM_MEMFILE_FLAGS,
+					 &slot->notifier);
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+	memfile_unregister_notifier(&slot->notifier);
+}
+
+#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
+
+static inline int kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
 #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
@@ -987,6 +1044,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE) {
+		kvm_private_mem_unregister(slot);
+		fput(slot->private_file);
+	}
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1548,10 +1610,16 @@ bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
 	return false;
 }
 
-static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	if (kvm_arch_private_mem_supported(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+#endif
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1627,6 +1695,12 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 {
 	int r;
 
+	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) {
+		r = kvm_private_mem_register(new);
+		if (r)
+			return r;
+	}
+
 	/*
 	 * If dirty logging is disabled, nullify the bitmap; the old bitmap
 	 * will be freed on "commit".  If logging is enabled in both old and
@@ -1655,6 +1729,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 	if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
 		kvm_destroy_dirty_bitmap(new);
 
+	if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+		kvm_private_mem_unregister(new);
+
 	return r;
 }
 
@@ -1952,7 +2029,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -1971,6 +2048,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_PRIVATE &&
+		(mem->private_offset & (PAGE_SIZE - 1) ||
+		 mem->private_offset > U64_MAX - mem->memory_size))
+		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2009,6 +2090,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2037,10 +2121,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		new->private_file = fget(mem->private_fd);
+		if (!new->private_file) {
+			r = -EINVAL;
+			goto out;
+		}
+		new->private_offset = mem->private_offset;
+	}
+
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out;
+
+	return 0;
+
+out:
+	if (new->private_file)
+		fput(new->private_file);
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -4712,12 +4813,10 @@ static long kvm_vm_ioctl(struct file *filp,
 			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
 			goto out;
 
-		if (flags & KVM_MEM_PRIVATE) {
-			r = -EINVAL;
-			goto out;
-		}
-
-		size = sizeof(struct kvm_userspace_memory_region);
+		if (flags & KVM_MEM_PRIVATE)
+			size = sizeof(struct kvm_userspace_memory_region_ext);
+		else
+			size = sizeof(struct kvm_userspace_memory_region);
 
 		if (copy_from_user(&mem, argp, size))
 			goto out;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (12 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-07-06  8:20 ` Chao Peng
  2022-08-01 14:40   ` Dave Hansen
  2022-07-13  3:58 ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Gupta, Pankaj
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-06  8:20 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
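Not part of the patch: a tiny illustrative program exercising the
semantics documented above. MFD_INACCESSIBLE is introduced by this
series, so this only builds and runs against a tree with the series
applied, and the exact errno values are implementation details rather
than documented ABI:

#define _GNU_SOURCE
#include <linux/memfd.h>        /* MFD_INACCESSIBLE, from this series */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        char buf[16];
        int fd = memfd_create("private", MFD_INACCESSIBLE);

        if (fd < 0) {
                perror("memfd_create"); /* kernel without this series */
                return 1;
        }

        /* The size can be established once, e.g. via fallocate()... */
        if (fallocate(fd, 0, 0, 4096))
                perror("fallocate");

        /* ...but ordinary userspace access is refused. */
        if (read(fd, buf, sizeof(buf)) < 0)
                printf("read: %s (expected)\n", strerror(errno));
        if (mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0) == MAP_FAILED)
                printf("mmap: %s (expected)\n", strerror(errno));

        close(fd);
        return 0;
}
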
 man2/memfd_create.2 | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
index 89e9c4136..2698222ae 100644
--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -101,6 +101,19 @@ meaning that no other seals can be set on the file.
 .\" FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default?
 .\" Is it worth adding some text explaining this?
 .TP
+.BR MFD_INACCESSIBLE
+Disallow userspace access through ordinary MMU accesses via
+.BR read (2),
+.BR write (2)
+and
+.BR mmap (2).
+The file size cannot be changed once initialized.
+This flag cannot coexist with
+.B MFD_ALLOW_SEALING
+and when this flag is set, the initial set of seals will be
+.B F_SEAL_SEAL,
+meaning that no other seals can be set on the file.
+.TP
 .BR MFD_HUGETLB " (since Linux 4.14)"
 .\" commit 749df87bd7bee5a79cef073f5d032ddb2b211de8
 The anonymous file will be created in the hugetlbfs filesystem using
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-07-06  8:20 ` [PATCH v7 04/14] mm/shmem: Support memfile_notifier Chao Peng
@ 2022-07-12 18:02   ` Gupta, Pankaj
  2022-07-13  7:44     ` Chao Peng
  2022-08-05 13:26   ` David Hildenbrand
  1 sibling, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-12 18:02 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On 7/6/2022 10:20 AM, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Implement shmem as a memfile_notifier backing store. Essentially it
> interacts with the memfile_notifier feature flags for userspace
> access/page migration/page reclaiming and implements the necessary
> memfile_backing_store callbacks.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>   include/linux/shmem_fs.h |   2 +
>   mm/shmem.c               | 109 ++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index a68f982f22d1..6031c0b08d26 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>   #include <linux/percpu_counter.h>
>   #include <linux/xattr.h>
>   #include <linux/fs_parser.h>
> +#include <linux/memfile_notifier.h>
>   
>   /* inode in-kernel data */
>   
> @@ -25,6 +26,7 @@ struct shmem_inode_info {
>   	struct simple_xattrs	xattrs;		/* list of xattrs */
>   	atomic_t		stop_eviction;	/* hold when working on inode */
>   	struct timespec64	i_crtime;	/* file creation time */
> +	struct memfile_node	memfile_node;	/* memfile node */
>   	struct inode		vfs_inode;
>   };
>   
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 6c8aef15a17d..627e315c3b4d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -905,6 +905,17 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>   	return page ? page_folio(page) : NULL;
>   }
>   
> +static void notify_invalidate(struct inode *inode, struct folio *folio,
> +				   pgoff_t start, pgoff_t end)
> +{
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	start = max(start, folio->index);
> +	end = min(end, folio->index + folio_nr_pages(folio));
> +
> +	memfile_notifier_invalidate(&info->memfile_node, start, end);
> +}
> +
>   /*
>    * Remove range of pages and swap entries from page cache, and free them.
>    * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -948,6 +959,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>   			}
>   			index += folio_nr_pages(folio) - 1;
>   
> +			notify_invalidate(inode, folio, start, end);
> +
>   			if (!unfalloc || !folio_test_uptodate(folio))
>   				truncate_inode_folio(mapping, folio);
>   			folio_unlock(folio);
> @@ -1021,6 +1034,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>   					index--;
>   					break;
>   				}
> +
> +				notify_invalidate(inode, folio, start, end);
> +
>   				VM_BUG_ON_FOLIO(folio_test_writeback(folio),
>   						folio);
>   				truncate_inode_folio(mapping, folio);
> @@ -1092,6 +1108,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
>   		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
>   			return -EPERM;
>   
> +		if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) {
> +			if (oldsize)
> +				return -EPERM;
> +			if (!PAGE_ALIGNED(newsize))
> +				return -EINVAL;
> +		}
> +
>   		if (newsize != oldsize) {
>   			error = shmem_reacct_size(SHMEM_I(inode)->flags,
>   					oldsize, newsize);
> @@ -1336,6 +1359,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>   		goto redirty;
>   	if (!total_swap_pages)
>   		goto redirty;
> +	if (info->memfile_node.flags & MEMFILE_F_UNRECLAIMABLE)
> +		goto redirty;
>   
>   	/*
>   	 * Our capabilities prevent regular writeback or sync from ever calling
> @@ -2271,6 +2296,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>   	if (ret)
>   		return ret;
>   
> +	if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE)
> +		return -EPERM;
> +
>   	/* arm64 - allow memory tagging on RAM-based files */
>   	vma->vm_flags |= VM_MTE_ALLOWED;
>   
> @@ -2306,6 +2334,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>   		info->i_crtime = inode->i_mtime;
>   		INIT_LIST_HEAD(&info->shrinklist);
>   		INIT_LIST_HEAD(&info->swaplist);
> +		memfile_node_init(&info->memfile_node);
>   		simple_xattrs_init(&info->xattrs);
>   		cache_no_acl(inode);
>   		mapping_set_large_folios(inode->i_mapping);
> @@ -2477,6 +2506,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>   		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
>   			return -EPERM;
>   	}
> +	if (unlikely(info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE))
> +		return -EPERM;
>   
>   	if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
>   		sgp = SGP_NOALLOC;
> @@ -2556,6 +2587,13 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>   		end_index = i_size >> PAGE_SHIFT;
>   		if (index > end_index)
>   			break;
> +
> +		if (SHMEM_I(inode)->memfile_node.flags &
> +				MEMFILE_F_USER_INACCESSIBLE) {
> +			error = -EPERM;
> +			break;
> +		}
> +
>   		if (index == end_index) {
>   			nr = i_size & ~PAGE_MASK;
>   			if (nr <= offset)
> @@ -2697,6 +2735,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>   			goto out;
>   		}
>   
> +		if ((info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) &&
> +		    (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +
>   		shmem_falloc.waitq = &shmem_falloc_waitq;
>   		shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
>   		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
> @@ -3806,6 +3850,20 @@ static int shmem_error_remove_page(struct address_space *mapping,
>   	return 0;
>   }
>   
> +#ifdef CONFIG_MIGRATION
> +static int shmem_migrate_page(struct address_space *mapping,
> +			      struct page *newpage, struct page *page,
> +			      enum migrate_mode mode)
> +{
> +	struct inode *inode = mapping->host;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
> +		return -EOPNOTSUPP;
> +	return migrate_page(mapping, newpage, page, mode);

Wondering how well page migrate would work for private pages
on shmem memfd based backend?

> +}
> +#endif
> +
>   const struct address_space_operations shmem_aops = {
>   	.writepage	= shmem_writepage,
>   	.dirty_folio	= noop_dirty_folio,
> @@ -3814,7 +3872,7 @@ const struct address_space_operations shmem_aops = {
>   	.write_end	= shmem_write_end,
>   #endif
>   #ifdef CONFIG_MIGRATION
> -	.migratepage	= migrate_page,
> +	.migratepage	= shmem_migrate_page,
>   #endif
>   	.error_remove_page = shmem_error_remove_page,
>   };
> @@ -3931,6 +3989,51 @@ static struct file_system_type shmem_fs_type = {
>   	.fs_flags	= FS_USERNS_MOUNT,
>   };
>   
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static struct memfile_node *shmem_lookup_memfile_node(struct file *file)
> +{
> +	struct inode *inode = file_inode(file);
> +
> +	if (!shmem_mapping(inode->i_mapping))
> +		return NULL;
> +
> +	return  &SHMEM_I(inode)->memfile_node;
> +}
> +
> +
> +static int shmem_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order)
> +{
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	unlock_page(page);
> +	*pfn = page_to_pfn_t(page);
> +	*order = thp_order(compound_head(page));
> +	return 0;
> +}
> +
> +static void shmem_put_pfn(pfn_t pfn)
> +{
> +	struct page *page = pfn_t_to_page(pfn);
> +
> +	if (!page)
> +		return;
> +
> +	put_page(page);
> +}
> +
> +static struct memfile_backing_store shmem_backing_store = {
> +	.lookup_memfile_node = shmem_lookup_memfile_node,
> +	.get_pfn = shmem_get_pfn,
> +	.put_pfn = shmem_put_pfn,
> +};
> +#endif /* CONFIG_MEMFILE_NOTIFIER */
> +
>   void __init shmem_init(void)
>   {
>   	int error;
> @@ -3956,6 +4059,10 @@ void __init shmem_init(void)
>   	else
>   		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>   #endif
> +
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +	memfile_register_backing_store(&shmem_backing_store);
> +#endif
>   	return;
>   
>   out1:


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (13 preceding siblings ...)
  2022-07-06  8:20 ` [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
@ 2022-07-13  3:58 ` Gupta, Pankaj
  2022-07-13  7:57   ` Chao Peng
  2022-08-11 10:02 ` Nikunj A. Dadhania
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-13  3:58 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
> 
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM

Thinking a bit, As host side fd on tmpfs or shmem will store memory on 
host page cache instead of mapping pages into userspace address space. 
Can we hit double (un-coordinated) page cache problem with this when 
guest page cache is also used?

Thanks,
Pankaj

> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory.
> 
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> created with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>    - memfile_notifier_ops: callbacks for memory backing store to notify
>      KVM when memory gets invalidated.
>    - backing store callbacks: callbacks for KVM to call into memory
>      backing store to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets invalidated.
> 
> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> prevent double allocation caused by unintentional guest when we only
> have a single side of the shared/private memfds effective.
> 
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so
> that both private/shared guest memory can live in one single memslot.
> A page in the memslot is either private or shared. Whether a guest page
> is private or shared is maintained through reusing existing SEV ioctls
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> 
> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
> 
> -  Regresion test on kvm/queue (this patchset)
>     Most new code are not covered. Code also in below repo:
>     https://github.com/chao-p/linux/tree/privmem-v7
> 
> -  New Funational test on latest TDX code
>     The patch is rebased to latest TDX code and tested the new
>     funcationalities. See below repos:
>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> 
> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> -machine confidential-guest-support=tdx \
> -object memory-backend-memfd-private,id=ram1,size=${mem} \
> -machine memory-backend=ram1
> 
> Changelog
> ----------
> v7:
>    - Move the private/shared info from backing store to KVM.
>    - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
>    - Rework on the sync mechanism between zap/page fault paths.
>    - Addressed other comments in v6.
> v6:
>    - Re-organzied patch for both mm/KVM parts.
>    - Added flags for memfile_notifier so its consumers can state their
>      features and memory backing store can check against these flags.
>    - Put a backing store reference in the memfile_notifier and move pfn_ops
>      into backing store.
>    - Only support boot time backing store register.
>    - Overall KVM part improvement suggested by Sean and some others.
> v5:
>    - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
>      in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
>      be created by MFD_INACCESSIBLE.
>    - Introduced new APIs for backing store to register itself to
>      memfile_notifier instead of direct function call.
>    - Added the accounting and restriction for MFD_INACCESSIBLE memory.
>    - Added KVM API doc for new memslot extensions and man page for the new
>      MFD_INACCESSIBLE flag.
>    - Removed the overlap check for mapping the same file+offset into
>      multiple gfns due to perf consideration, warned in document.
>    - Addressed other comments in v4.
> v4:
>    - Decoupled the callbacks between KVM/mm from memfd and use new
>      name 'memfile_notifier'.
>    - Supported register multiple memslots to the same backing store.
>    - Added per-memslot pfn_ops instead of per-system.
>    - Reworked the invalidation part.
>    - Improved new KVM uAPIs (private memslot extension and memory
>      error) per Sean's suggestions.
>    - Addressed many other minor fixes for comments from v3.
> v3:
>    - Added locking protection when calling
>      invalidate_page_range/fallocate callbacks.
>    - Changed memslot structure to keep use useraddr for shared memory.
>    - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
>    - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
>    - Commit message improvement.
>    - Many small fixes for comments from the last version.
> 
> Links to previous discussions
> -----------------------------
> [1] Original design proposal:
> https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> [2] Updated proposal and RFC patch v1:
> https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
> 
> Chao Peng (12):
>    mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
>    selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
>    mm: Introduce memfile_notifier
>    mm/memfd: Introduce MFD_INACCESSIBLE flag
>    KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
>    KVM: Use gfn instead of hva for mmu_notifier_retry
>    KVM: Rename mmu_notifier_*
>    KVM: Extend the memslot to support fd-based private memory
>    KVM: Add KVM_EXIT_MEMORY_FAULT exit
>    KVM: Register/unregister the guest private memory regions
>    KVM: Handle page fault for private memory
>    KVM: Enable and expose KVM_MEM_PRIVATE
> 
> Kirill A. Shutemov (1):
>    mm/shmem: Support memfile_notifier
> 
>   Documentation/virt/kvm/api.rst             |  77 +++++-
>   arch/arm64/kvm/mmu.c                       |   8 +-
>   arch/mips/include/asm/kvm_host.h           |   2 +-
>   arch/mips/kvm/mmu.c                        |  10 +-
>   arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
>   arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
>   arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
>   arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
>   arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
>   arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
>   arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
>   arch/riscv/kvm/mmu.c                       |   4 +-
>   arch/x86/include/asm/kvm_host.h            |   3 +-
>   arch/x86/kvm/Kconfig                       |   3 +
>   arch/x86/kvm/mmu.h                         |   2 -
>   arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
>   arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
>   arch/x86/kvm/mmu/mmutrace.h                |   1 +
>   arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
>   arch/x86/kvm/x86.c                         |   2 +-
>   include/linux/kvm_host.h                   | 105 +++++---
>   include/linux/memfile_notifier.h           |  91 +++++++
>   include/linux/shmem_fs.h                   |   2 +
>   include/uapi/linux/fcntl.h                 |   1 +
>   include/uapi/linux/kvm.h                   |  37 +++
>   include/uapi/linux/memfd.h                 |   1 +
>   mm/Kconfig                                 |   4 +
>   mm/Makefile                                |   1 +
>   mm/memfd.c                                 |  18 +-
>   mm/memfile_notifier.c                      | 123 ++++++++++
>   mm/shmem.c                                 | 125 +++++++++-
>   tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
>   virt/kvm/Kconfig                           |   3 +
>   virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
>   virt/kvm/pfncache.c                        |  14 +-
>   35 files changed, 1074 insertions(+), 127 deletions(-)
>   create mode 100644 include/linux/memfile_notifier.h
>   create mode 100644 mm/memfile_notifier.c
> 


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-07-12 18:02   ` Gupta, Pankaj
@ 2022-07-13  7:44     ` Chao Peng
  2022-07-13 10:01       ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-13  7:44 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Tue, Jul 12, 2022 at 08:02:34PM +0200, Gupta, Pankaj wrote:
> On 7/6/2022 10:20 AM, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Implement shmem as a memfile_notifier backing store. Essentially it
> > interacts with the memfile_notifier feature flags for userspace
> > access/page migration/page reclaiming and implements the necessary
> > memfile_backing_store callbacks.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >   include/linux/shmem_fs.h |   2 +
> >   mm/shmem.c               | 109 ++++++++++++++++++++++++++++++++++++++-
> >   2 files changed, 110 insertions(+), 1 deletion(-)
...

> > +#ifdef CONFIG_MIGRATION
> > +static int shmem_migrate_page(struct address_space *mapping,
> > +			      struct page *newpage, struct page *page,
> > +			      enum migrate_mode mode)
> > +{
> > +	struct inode *inode = mapping->host;
> > +	struct shmem_inode_info *info = SHMEM_I(inode);
> > +
> > +	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
> > +		return -EOPNOTSUPP;
> > +	return migrate_page(mapping, newpage, page, mode);
> 
> Wondering how well page migrate would work for private pages
> on shmem memfd based backend?

From high level:
  - KVM unset MEMFILE_F_UNMOVABLE bit to indicate it capable of
    migrating a page.
  - Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM
    to register.
  - The callback is hooked to migrate_page() here.
  - Once page migration requested, shmem calls into the 'migrate'
    callback(s) to perform additional steps for encrypted memory (For
    TDX we will call TDH.MEM.PAGE.RELOCATE).
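
In (very rough) code the idea could look like below; note that the
'migrate' callback does not exist in this series and all of the names
here are illustrative only:

struct memfile_notifier_ops {
        void (*invalidate)(struct memfile_notifier *notifier,
                           pgoff_t start, pgoff_t end);
        /* new: let the consumer relocate encrypted pages first */
        int (*migrate)(struct memfile_notifier *notifier,
                       struct page *oldpage, struct page *newpage,
                       enum migrate_mode mode);
};

static int shmem_migrate_page(struct address_space *mapping,
                              struct page *newpage, struct page *page,
                              enum migrate_mode mode)
{
        struct shmem_inode_info *info = SHMEM_I(mapping->host);
        int ret;

        if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
                return -EOPNOTSUPP;

        /* e.g. KVM/TDX would issue TDH.MEM.PAGE.RELOCATE from here */
        ret = memfile_notifier_migrate(&info->memfile_node, page, newpage,
                                       mode);
        if (ret)
                return ret;

        return migrate_page(mapping, newpage, page, mode);
}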

Chao
> 
> > +}
> > +#endif
> > +
> >   const struct address_space_operations shmem_aops = {
> >   	.writepage	= shmem_writepage,
> >   	.dirty_folio	= noop_dirty_folio,
> > @@ -3814,7 +3872,7 @@ const struct address_space_operations shmem_aops = {
> >   	.write_end	= shmem_write_end,
> >   #endif
> >   #ifdef CONFIG_MIGRATION
> > -	.migratepage	= migrate_page,
> > +	.migratepage	= shmem_migrate_page,
> >   #endif
> >   	.error_remove_page = shmem_error_remove_page,
> >   };
> > @@ -3931,6 +3989,51 @@ static struct file_system_type shmem_fs_type = {
> >   	.fs_flags	= FS_USERNS_MOUNT,
> >   };
 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-13  3:58 ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Gupta, Pankaj
@ 2022-07-13  7:57   ` Chao Peng
  2022-07-13 10:35     ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-13  7:57 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 13, 2022 at 05:58:32AM +0200, Gupta, Pankaj wrote:
> 
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> > 
> >    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > split_desc_cache only by default capacity
> > 
> > Introduction
> > ------------
> > In general this patch series introduce fd-based memslot which provides
> > guest memory through memory file descriptor fd[offset,size] instead of
> > hva/size. The fd can be created from a supported memory filesystem
> > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> 
> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
> page cache instead of mapping pages into userspace address space. Can we hit
> double (un-coordinated) page cache problem with this when guest page cache
> is also used?

This is my understanding: in host it will be indeed in page cache (in
current shmem implementation) but that's just the way it allocates and
provides the physical memory for the guest. In guest, guest OS will not
see this fd (absolutely), it only sees guest memory, on top of which it
can build its own page cache system for its own file-mapped content but
that is unrelated to host page cache.

Chao
> 
> Thanks,
> Pankaj
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-07-13  7:44     ` Chao Peng
@ 2022-07-13 10:01       ` Gupta, Pankaj
  2022-07-13 23:49         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-13 10:01 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


>>> +#ifdef CONFIG_MIGRATION
>>> +static int shmem_migrate_page(struct address_space *mapping,
>>> +			      struct page *newpage, struct page *page,
>>> +			      enum migrate_mode mode)
>>> +{
>>> +	struct inode *inode = mapping->host;
>>> +	struct shmem_inode_info *info = SHMEM_I(inode);
>>> +
>>> +	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
>>> +		return -EOPNOTSUPP;
>>> +	return migrate_page(mapping, newpage, page, mode);
>>
>> Wondering how well page migrate would work for private pages
>> on shmem memfd based backend?
> 
>  From high level:
>    - KVM unset MEMFILE_F_UNMOVABLE bit to indicate it capable of
>      migrating a page.
>    - Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM
>      to register.
>    - The callback is hooked to migrate_page() here.
>    - Once page migration requested, shmem calls into the 'migrate'
>      callback(s) to perform additional steps for encrypted memory (For
>      TDX we will call TDH.MEM.PAGE.RELOCATE).

Yes, that would require additional (protocol specific) handling for 
private pages. Was trying to find where "MEMFILE_F_UNMOVABLE" flag is 
set currently?

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-13  7:57   ` Chao Peng
@ 2022-07-13 10:35     ` Gupta, Pankaj
  2022-07-13 23:59       ` Chao Peng
  2022-07-14  4:29       ` Andy Lutomirski
  0 siblings, 2 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-13 10:35 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>> split_desc_cache only by default capacity
>>>
>>> Introduction
>>> ------------
>>> In general this patch series introduce fd-based memslot which provides
>>> guest memory through memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>
>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>> page cache instead of mapping pages into userspace address space. Can we hit
>> double (un-coordinated) page cache problem with this when guest page cache
>> is also used?
> 
> This is my understanding: in host it will be indeed in page cache (in
> current shmem implementation) but that's just the way it allocates and
> provides the physical memory for the guest. In guest, guest OS will not
> see this fd (absolutely), it only sees guest memory, on top of which it
> can build its own page cache system for its own file-mapped content but
> that is unrelated to host page cache.

yes. If guest fills its page cache with file backed memory, this at host 
side(on shmem fd backend) will also fill the host page cache fast. This 
can have an impact on performance of guest VM's if host goes to memory 
pressure situation sooner. Or else we end up utilizing way less System 
RAM.

Thanks,
Pankaj



^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-07-13 10:01       ` Gupta, Pankaj
@ 2022-07-13 23:49         ` Chao Peng
  2022-07-14  4:15           ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-13 23:49 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 13, 2022 at 12:01:13PM +0200, Gupta, Pankaj wrote:
> 
> > > > +#ifdef CONFIG_MIGRATION
> > > > +static int shmem_migrate_page(struct address_space *mapping,
> > > > +			      struct page *newpage, struct page *page,
> > > > +			      enum migrate_mode mode)
> > > > +{
> > > > +	struct inode *inode = mapping->host;
> > > > +	struct shmem_inode_info *info = SHMEM_I(inode);
> > > > +
> > > > +	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
> > > > +		return -EOPNOTSUPP;
> > > > +	return migrate_page(mapping, newpage, page, mode);
> > > 
> > > Wondering how well page migrate would work for private pages
> > > on shmem memfd based backend?
> > 
> >  From high level:
> >    - KVM unset MEMFILE_F_UNMOVABLE bit to indicate it capable of
> >      migrating a page.
> >    - Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM
> >      to register.
> >    - The callback is hooked to migrate_page() here.
> >    - Once page migration requested, shmem calls into the 'migrate'
> >      callback(s) to perform additional steps for encrypted memory (For
> >      TDX we will call TDH.MEM.PAGE.RELOCATE).
> 
> Yes, that would require additional (protocol specific) handling for private
> pages. Was trying to find where "MEMFILE_F_UNMOVABLE" flag is set currently?

It's set with memfile_register_notifier() in patch 13.

> 
> Thanks,
> Pankaj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-13 10:35     ` Gupta, Pankaj
@ 2022-07-13 23:59       ` Chao Peng
  2022-07-14  4:39         ` Gupta, Pankaj
  2022-07-14  4:29       ` Andy Lutomirski
  1 sibling, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-13 23:59 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 13, 2022 at 12:35:56PM +0200, Gupta, Pankaj wrote:
> 
> > > > This is the v7 of this series which tries to implement the fd-based KVM
> > > > guest private memory. The patches are based on latest kvm/queue branch
> > > > commit:
> > > > 
> > > >     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > > > split_desc_cache only by default capacity
> > > > 
> > > > Introduction
> > > > ------------
> > > > In general this patch series introduce fd-based memslot which provides
> > > > guest memory through memory file descriptor fd[offset,size] instead of
> > > > hva/size. The fd can be created from a supported memory filesystem
> > > > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > > 
> > > Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
> > > page cache instead of mapping pages into userspace address space. Can we hit
> > > double (un-coordinated) page cache problem with this when guest page cache
> > > is also used?
> > 
> > This is my understanding: in host it will be indeed in page cache (in
> > current shmem implementation) but that's just the way it allocates and
> > provides the physical memory for the guest. In guest, guest OS will not
> > see this fd (absolutely), it only sees guest memory, on top of which it
> > can build its own page cache system for its own file-mapped content but
> > that is unrelated to host page cache.
> 
> yes. If guest fills its page cache with file backed memory, this at host
> side(on shmem fd backend) will also fill the host page cache fast. This can
> have an impact on performance of guest VM's if host goes to memory pressure
> situation sooner. Or else we end up utilizing way less System RAM.

(Currently), the file backed guest private memory is long-term pinned
and not reclaimable, it's in page cache anyway once we allocated it for
guest. This does not depend on how guest use it (e.g. use it for guest
page cache or not). 

Chao
> 
> Thanks,
> Pankaj
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-07-13 23:49         ` Chao Peng
@ 2022-07-14  4:15           ` Gupta, Pankaj
  0 siblings, 0 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-14  4:15 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


>>>>> +#ifdef CONFIG_MIGRATION
>>>>> +static int shmem_migrate_page(struct address_space *mapping,
>>>>> +			      struct page *newpage, struct page *page,
>>>>> +			      enum migrate_mode mode)
>>>>> +{
>>>>> +	struct inode *inode = mapping->host;
>>>>> +	struct shmem_inode_info *info = SHMEM_I(inode);
>>>>> +
>>>>> +	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
>>>>> +		return -EOPNOTSUPP;
>>>>> +	return migrate_page(mapping, newpage, page, mode);
>>>>
>>>> Wondering how well page migrate would work for private pages
>>>> on shmem memfd based backend?
>>>
>>>   From high level:
>>>     - KVM unset MEMFILE_F_UNMOVABLE bit to indicate it capable of
>>>       migrating a page.
>>>     - Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM
>>>       to register.
>>>     - The callback is hooked to migrate_page() here.
>>>     - Once page migration requested, shmem calls into the 'migrate'
>>>       callback(s) to perform additional steps for encrypted memory (For
>>>       TDX we will call TDH.MEM.PAGE.RELOCATE).
>>
>> Yes, that would require additional (protocol specific) handling for private
>> pages. Was trying to find where "MEMFILE_F_UNMOVABLE" flag is set currently?
> 
> It's set with memfile_register_notifier() in patch 13.

o.k.

Thanks,

Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-13 10:35     ` Gupta, Pankaj
  2022-07-13 23:59       ` Chao Peng
@ 2022-07-14  4:29       ` Andy Lutomirski
  2022-07-14  5:13         ` Gupta, Pankaj
  1 sibling, 1 reply; 398+ messages in thread
From: Andy Lutomirski @ 2022-07-14  4:29 UTC (permalink / raw)
  To: Gupta, Pankaj, Chao Peng
  Cc: kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	Linux API, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, Michal Hocko, Muchun Song



On Wed, Jul 13, 2022, at 3:35 AM, Gupta, Pankaj wrote:
>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>> commit:
>>>>
>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>> split_desc_cache only by default capacity
>>>>
>>>> Introduction
>>>> ------------
>>>> In general this patch series introduce fd-based memslot which provides
>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>> hva/size. The fd can be created from a supported memory filesystem
>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>
>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>>> page cache instead of mapping pages into userspace address space. Can we hit
>>> double (un-coordinated) page cache problem with this when guest page cache
>>> is also used?
>> 
>> This is my understanding: in host it will be indeed in page cache (in
>> current shmem implementation) but that's just the way it allocates and
>> provides the physical memory for the guest. In guest, guest OS will not
>> see this fd (absolutely), it only sees guest memory, on top of which it
>> can build its own page cache system for its own file-mapped content but
>> that is unrelated to host page cache.
>
> yes. If guest fills its page cache with file backed memory, this at host 
> side(on shmem fd backend) will also fill the host page cache fast. This 
> can have an impact on performance of guest VM's if host goes to memory 
> pressure situation sooner. Or else we end up utilizing way less System 
> RAM.

Is this in any meaningful way different from a regular VM?

--Andy

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-13 23:59       ` Chao Peng
@ 2022-07-14  4:39         ` Gupta, Pankaj
  2022-07-14  5:06           ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-14  4:39 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


>>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>>> commit:
>>>>>
>>>>>      b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>> split_desc_cache only by default capacity
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>> In general this patch series introduce fd-based memslot which provides
>>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>
>>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>>>> page cache instead of mapping pages into userspace address space. Can we hit
>>>> double (un-coordinated) page cache problem with this when guest page cache
>>>> is also used?
>>>
>>> This is my understanding: in host it will be indeed in page cache (in
>>> current shmem implementation) but that's just the way it allocates and
>>> provides the physical memory for the guest. In guest, guest OS will not
>>> see this fd (absolutely), it only sees guest memory, on top of which it
>>> can build its own page cache system for its own file-mapped content but
>>> that is unrelated to host page cache.
>>
>> yes. If guest fills its page cache with file backed memory, this at host
>> side(on shmem fd backend) will also fill the host page cache fast. This can
>> have an impact on performance of guest VM's if host goes to memory pressure
>> situation sooner. Or else we end up utilizing way less System RAM.
> 
> (Currently), the file backed guest private memory is long-term pinned
> and not reclaimable, it's in page cache anyway once we allocated it for
> guest. This does not depend on how guest use it (e.g. use it for guest
> page cache or not).

Even if the host shmem backed memory is always un-reclaimable, do we end
up utilizing double RAM (both in guest & host page cache) for guest disk
accesses?

I am considering this a serious design decision before we commit to this 
approach.

Happy to be enlightened on this and know the thoughts from others as well.

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-14  4:39         ` Gupta, Pankaj
@ 2022-07-14  5:06           ` Gupta, Pankaj
  0 siblings, 0 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-14  5:06 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


>>>>>> This is the v7 of this series which tries to implement the 
>>>>>> fd-based KVM
>>>>>> guest private memory. The patches are based on latest kvm/queue 
>>>>>> branch
>>>>>> commit:
>>>>>>
>>>>>>      b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>>> split_desc_cache only by default capacity
>>>>>>
>>>>>> Introduction
>>>>>> ------------
>>>>>> In general this patch series introduce fd-based memslot which 
>>>>>> provides
>>>>>> guest memory through memory file descriptor fd[offset,size] 
>>>>>> instead of
>>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>>
>>>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory 
>>>>> on host
>>>>> page cache instead of mapping pages into userspace address space. 
>>>>> Can we hit
>>>>> double (un-coordinated) page cache problem with this when guest 
>>>>> page cache
>>>>> is also used?
>>>>
>>>> This is my understanding: in host it will be indeed in page cache (in
>>>> current shmem implementation) but that's just the way it allocates and
>>>> provides the physical memory for the guest. In guest, guest OS will not
>>>> see this fd (absolutely), it only sees guest memory, on top of which it
>>>> can build its own page cache system for its own file-mapped content but
>>>> that is unrelated to host page cache.
>>>
>>> yes. If guest fills its page cache with file backed memory, this at host
>>> side(on shmem fd backend) will also fill the host page cache fast. 
>>> This can
>>> have an impact on performance of guest VM's if host goes to memory 
>>> pressure
>>> situation sooner. Or else we end up utilizing way less System RAM.
>>
>> (Currently), the file backed guest private memory is long-term pinned
>> and not reclaimable, it's in page cache anyway once we allocated it for
>> guest. This does not depend on how guest use it (e.g. use it for guest
>> page cache or not).
> 
> Even if host shmem backed memory always be always un-reclaimable, we end 
> up utilizing double RAM (both in guest & host page cache) for guest disk 
> accesses?

Answering my own question:

We won't use double RAM; only the view of the guest & host structures 
would change depending on the code path taken. If we don't care about 
reclaim situations we should be good, else we have to think of a way to 
coordinate the page cache between guest & host (that could be an 
optimization for later).

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-14  4:29       ` Andy Lutomirski
@ 2022-07-14  5:13         ` Gupta, Pankaj
  0 siblings, 0 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-14  5:13 UTC (permalink / raw)
  To: Andy Lutomirski, Chao Peng
  Cc: kvm list, Linux Kernel Mailing List, linux-mm, linux-fsdevel,
	Linux API, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A. Shutemov, Nakajima, Jun, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, Michal Hocko, Muchun Song


>>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>>> commit:
>>>>>
>>>>>      b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>> split_desc_cache only by default capacity
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>> In general this patch series introduce fd-based memslot which provides
>>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>
>>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host
>>>> page cache instead of mapping pages into userspace address space. Can we hit
>>>> double (un-coordinated) page cache problem with this when guest page cache
>>>> is also used?
>>>
>>> This is my understanding: in host it will be indeed in page cache (in
>>> current shmem implementation) but that's just the way it allocates and
>>> provides the physical memory for the guest. In guest, guest OS will not
>>> see this fd (absolutely), it only sees guest memory, on top of which it
>>> can build its own page cache system for its own file-mapped content but
>>> that is unrelated to host page cache.
>>
>> yes. If guest fills its page cache with file backed memory, this at host
>> side(on shmem fd backend) will also fill the host page cache fast. This
>> can have an impact on performance of guest VM's if host goes to memory
>> pressure situation sooner. Or else we end up utilizing way less System
>> RAM.
> 
> Is this in any meaningful way different from a regular VM?

After thinking a bit, it seems 'No', except for the reclaim decisions the 
system would take under memory pressure; we will also have to see how well 
this gets stitched together with memory tiers in the future. But all these 
are future topics.

Sorry for the noise!

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-07-06  8:20 ` [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-07-15 11:36   ` Gupta, Pankaj
  2022-07-18 13:29     ` Chao Peng
  2022-08-04  7:10   ` Isaku Yamahata
  1 sibling, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-15 11:36 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

> Currently in mmu_notifier validate path, hva range is recorded and then
> checked in the mmu_notifier_retry_hva() from page fault path. However
> for the to be introduced private memory, a page fault may not have a hva

As this patch appeared in v7, I am just wondering: did you see an actual 
bug because of it? And does the missing corresponding 'hva' occur only 
with private memory because it's not mapped to host userspace?

Thanks,
Pankaj

> associated, checking gfn(gpa) makes more sense. For existing non private
> memory case, gfn is expected to continue to work.
> 
> The patch also fixes a potential bug in kvm_zap_gfn_range() which has
> already been using gfn when calling kvm_inc/dec_notifier_count() in
> current code.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>   arch/x86/kvm/mmu/mmu.c   |  2 +-
>   include/linux/kvm_host.h | 18 ++++++++----------
>   virt/kvm/kvm_main.c      |  6 +++---
>   3 files changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f7fa4c31b7c5..0d882fad4bc1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>   		return true;
>   
>   	return fault->slot &&
> -	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +	       mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
>   }
>   
>   static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0bdb6044e316..e9153b54e2a4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -767,8 +767,8 @@ struct kvm {
>   	struct mmu_notifier mmu_notifier;
>   	unsigned long mmu_notifier_seq;
>   	long mmu_notifier_count;
> -	unsigned long mmu_notifier_range_start;
> -	unsigned long mmu_notifier_range_end;
> +	gfn_t mmu_notifier_range_start;
> +	gfn_t mmu_notifier_range_end;
>   #endif
>   	struct list_head devices;
>   	u64 manual_dirty_log_protect;
> @@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>   void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>   #endif
>   
> -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
> -				   unsigned long end);
> -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
> -				   unsigned long end);
> +void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
>   
>   long kvm_arch_dev_ioctl(struct file *filp,
>   			unsigned int ioctl, unsigned long arg);
> @@ -1923,9 +1921,9 @@ static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
>   	return 0;
>   }
>   
> -static inline int mmu_notifier_retry_hva(struct kvm *kvm,
> +static inline int mmu_notifier_retry_gfn(struct kvm *kvm,
>   					 unsigned long mmu_seq,
> -					 unsigned long hva)
> +					 gfn_t gfn)
>   {
>   	lockdep_assert_held(&kvm->mmu_lock);
>   	/*
> @@ -1935,8 +1933,8 @@ static inline int mmu_notifier_retry_hva(struct kvm *kvm,
>   	 * positives, due to shortcuts when handing concurrent invalidations.
>   	 */
>   	if (unlikely(kvm->mmu_notifier_count) &&
> -	    hva >= kvm->mmu_notifier_range_start &&
> -	    hva < kvm->mmu_notifier_range_end)
> +	    gfn >= kvm->mmu_notifier_range_start &&
> +	    gfn < kvm->mmu_notifier_range_end)
>   		return 1;
>   	if (kvm->mmu_notifier_seq != mmu_seq)
>   		return 1;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index da263c370d00..4d7f0e72366f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -536,8 +536,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>   
>   typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>   
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -			     unsigned long end);
> +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
>   
>   typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>   
> @@ -624,7 +623,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>   				locked = true;
>   				KVM_MMU_LOCK(kvm);
>   				if (!IS_KVM_NULL_FN(range->on_lock))
> -					range->on_lock(kvm, range->start, range->end);
> +					range->on_lock(kvm, gfn_range.start,
> +							    gfn_range.end);
>   				if (IS_KVM_NULL_FN(range->handler))
>   					break;
>   			}


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-07-15 11:36   ` Gupta, Pankaj
@ 2022-07-18 13:29     ` Chao Peng
  2022-07-18 15:26       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-18 13:29 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Fri, Jul 15, 2022 at 01:36:15PM +0200, Gupta, Pankaj wrote:
> > Currently in mmu_notifier validate path, hva range is recorded and then
> > checked in the mmu_notifier_retry_hva() from page fault path. However
> > for the to be introduced private memory, a page fault may not have a hva
> 
> As this patch appeared in v7, just wondering did you see an actual bug
> because of it? And not having corresponding 'hva' occurs only with private
> memory because its not mapped to host userspace?

The addressed problem is not new in this version; previous versions
also had code to handle it (just in a different way). But the problem
is: mmu_notifier/memfile_notifier may be in the process of invalidating
a pfn that was obtained earlier in the page fault handler, and when
that happens, we should retry the fault. In v6 I used the global
mmu_notifier_retry() for memfile_notifier, but that can block unrelated
mmu_notifier invalidations which have an hva range specified.

Sean gave a comment at https://lkml.org/lkml/2022/6/17/1001 to separate
memfile_notifier from mmu_notifier, but during the implementation I
realized we can actually reuse the same code for shared and private
memory if both use a gpa range, and that can simplify the code handling
in kvm_zap_gfn_range and some other code (e.g. we don't need two
versions for memfile_notifier/mmu_notifier).

Adding a gpa range for private memory invalidation also relieves the
above blocking issue between the private memory page fault and the
mmu_notifier.
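
To illustrate, a rough sketch of the retry pattern in the page fault
path (simplified, not the exact code in this series):

	/* record the notifier sequence before looking up the pfn */
	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();

	/* ... pfn lookup via gup() or the backing store callback ... */

	write_lock(&vcpu->kvm->mmu_lock);
	if (mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn)) {
		/* a racing invalidation covers this gfn, retry the fault */
		write_unlock(&vcpu->kvm->mmu_lock);
		goto retry;
	}
	/* otherwise it is safe to install the mapping under mmu_lock */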

Chao
> 
> Thanks,
> Pankaj
> 
> > associated, checking gfn(gpa) makes more sense. For existing non private
> > memory case, gfn is expected to continue to work.
> > 
> > The patch also fixes a potential bug in kvm_zap_gfn_range() which has
> > already been using gfn when calling kvm_inc/dec_notifier_count() in
> > current code.
> > 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >   arch/x86/kvm/mmu/mmu.c   |  2 +-
> >   include/linux/kvm_host.h | 18 ++++++++----------
> >   virt/kvm/kvm_main.c      |  6 +++---
> >   3 files changed, 12 insertions(+), 14 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f7fa4c31b7c5..0d882fad4bc1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> >   		return true;
> >   	return fault->slot &&
> > -	       mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > +	       mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> >   }
> >   static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 0bdb6044e316..e9153b54e2a4 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -767,8 +767,8 @@ struct kvm {
> >   	struct mmu_notifier mmu_notifier;
> >   	unsigned long mmu_notifier_seq;
> >   	long mmu_notifier_count;
> > -	unsigned long mmu_notifier_range_start;
> > -	unsigned long mmu_notifier_range_end;
> > +	gfn_t mmu_notifier_range_start;
> > +	gfn_t mmu_notifier_range_end;
> >   #endif
> >   	struct list_head devices;
> >   	u64 manual_dirty_log_protect;
> > @@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> >   void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >   #endif
> > -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
> > -				   unsigned long end);
> > -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
> > -				   unsigned long end);
> > +void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> >   long kvm_arch_dev_ioctl(struct file *filp,
> >   			unsigned int ioctl, unsigned long arg);
> > @@ -1923,9 +1921,9 @@ static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
> >   	return 0;
> >   }
> > -static inline int mmu_notifier_retry_hva(struct kvm *kvm,
> > +static inline int mmu_notifier_retry_gfn(struct kvm *kvm,
> >   					 unsigned long mmu_seq,
> > -					 unsigned long hva)
> > +					 gfn_t gfn)
> >   {
> >   	lockdep_assert_held(&kvm->mmu_lock);
> >   	/*
> > @@ -1935,8 +1933,8 @@ static inline int mmu_notifier_retry_hva(struct kvm *kvm,
> >   	 * positives, due to shortcuts when handing concurrent invalidations.
> >   	 */
> >   	if (unlikely(kvm->mmu_notifier_count) &&
> > -	    hva >= kvm->mmu_notifier_range_start &&
> > -	    hva < kvm->mmu_notifier_range_end)
> > +	    gfn >= kvm->mmu_notifier_range_start &&
> > +	    gfn < kvm->mmu_notifier_range_end)
> >   		return 1;
> >   	if (kvm->mmu_notifier_seq != mmu_seq)
> >   		return 1;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index da263c370d00..4d7f0e72366f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -536,8 +536,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> >   typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > -			     unsigned long end);
> > +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
> >   typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> > @@ -624,7 +623,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> >   				locked = true;
> >   				KVM_MMU_LOCK(kvm);
> >   				if (!IS_KVM_NULL_FN(range->on_lock))
> > -					range->on_lock(kvm, range->start, range->end);
> > +					range->on_lock(kvm, gfn_range.start,
> > +							    gfn_range.end);
> >   				if (IS_KVM_NULL_FN(range->handler))
> >   					break;
> >   			}

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-07-18 13:29     ` Chao Peng
@ 2022-07-18 15:26       ` Sean Christopherson
  2022-07-19 14:02         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-07-18 15:26 UTC (permalink / raw)
  To: Chao Peng
  Cc: Gupta, Pankaj, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Mon, Jul 18, 2022, Chao Peng wrote:
> On Fri, Jul 15, 2022 at 01:36:15PM +0200, Gupta, Pankaj wrote:
> > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > checked in the mmu_notifier_retry_hva() from page fault path. However
> > > for the to be introduced private memory, a page fault may not have a hva
> > 
> > As this patch appeared in v7, just wondering did you see an actual bug
> > because of it? And not having corresponding 'hva' occurs only with private
> > memory because its not mapped to host userspace?
> 
> The addressed problem is not new in this version, previous versions I
> also had code to handle it (just in different way). But the problem is:
> mmu_notifier/memfile_notifier may be in the progress of invalidating a
> pfn that obtained earlier in the page fault handler, when happens, we
> should retry the fault. In v6 I used global mmu_notifier_retry() for
> memfile_notifier but that can block unrelated mmu_notifer invalidation
> which has hva range specified.
> 
> Sean gave a comment at https://lkml.org/lkml/2022/6/17/1001 to separate
> memfile_notifier from mmu_notifier but during the implementation I
> realized we actually can reuse the same code for shared and private
> memory if both using gpa range and that can simplify the code handling
> in kvm_zap_gfn_range and some other code (e.g. we don't need two
> versions for memfile_notifier/mmu_notifier).

This should work, though I'm undecided as to whether or not it's a good idea.  KVM
allows aliasing multiple gfns to a single hva, and so using the gfn could result
in a much larger range being rejected given the simplistic algorithm for handling
multiple ranges in kvm_inc_notifier_count().  But I assume such aliasing is uncommon,
so I'm not sure it's worth optimizing for.
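
For context, the "simplistic algorithm" is roughly the following
(paraphrasing kvm_inc_notifier_count(), not a verbatim copy): concurrent
ranges are collapsed into a single min/max envelope, so two aliased gfns
that are far apart can block faults on everything in between.

	kvm->mmu_notifier_count++;
	if (likely(kvm->mmu_notifier_count == 1)) {
		kvm->mmu_notifier_range_start = start;
		kvm->mmu_notifier_range_end = end;
	} else {
		/* collapse concurrent ranges into [min(start), max(end)) */
		kvm->mmu_notifier_range_start =
			min(kvm->mmu_notifier_range_start, start);
		kvm->mmu_notifier_range_end =
			max(kvm->mmu_notifier_range_end, end);
	}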

> Adding gpa range for private memory invalidation also relieves the
> above blocking issue between private memory page fault and mmu_notifier.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-06  8:20 ` [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Chao Peng
@ 2022-07-19  8:00   ` Gupta, Pankaj
  2022-07-19 14:08     ` Chao Peng
  2022-07-20 16:44   ` Sean Christopherson
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-19  8:00 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

Hi Chao,

Some comments below:

> If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> ioctls. The patch reuses existing SEV ioctl but differs that the
> address in the region for private memory is gpa while SEV case it's hva.
> 
> The private memory region is stored as xarray in KVM for memory
> efficiency in normal usages and zapping existing memory mappings is also
> a side effect of these two ioctls.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>   Documentation/virt/kvm/api.rst  | 17 +++++++---
>   arch/x86/include/asm/kvm_host.h |  1 +
>   arch/x86/kvm/Kconfig            |  1 +
>   arch/x86/kvm/mmu.h              |  2 --
>   include/linux/kvm_host.h        |  8 +++++
>   virt/kvm/kvm_main.c             | 57 +++++++++++++++++++++++++++++++++
>   6 files changed, 80 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5ecfc7fbe0ee..dfb4caecab73 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/amd-memory-encryption.rst.
>   This ioctl can be used to register a guest memory region which may
>   contain encrypted data (e.g. guest RAM, SMRAM etc).
>   
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register guest private
> +memory region and the addr/size of kvm_enc_region represents guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fallen into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
>   moving ciphertext of those pages will not result in plaintext being
>   swapped. So relocating (or migrating) physical backing pages for the SEV
>   guest will require some additional steps.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dae190e19fce..92120e3a224e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -37,6 +37,7 @@
>   #include <asm/hyperv-tlfs.h>
>   
>   #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ZAP_GFN_RANGE
>   
>   #define KVM_MAX_VCPUS 1024
>   
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 1f160801e2a7..05861b9656a4 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,7 @@ config KVM
>   	select HAVE_KVM_PM_NOTIFIER if PM
>   	select HAVE_KVM_PRIVATE_MEM if X86_64
>   	select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
> +	select XARRAY_MULTI if HAVE_KVM_PRIVATE_MEM
>   	help
>   	  Support hosting fully virtualized guest machines using hardware
>   	  virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index a99acec925eb..428cd2e88cbd 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -209,8 +209,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   	return -(u32)fault & errcode;
>   }
>   
> -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> -
>   int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>   
>   int kvm_mmu_post_init_vm(struct kvm *kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1b203c8aa696..da33f8828456 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,6 +260,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>   #endif
>   
> +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> +#endif
> +
>   enum {
>   	OUTSIDE_GUEST_MODE,
>   	IN_GUEST_MODE,
> @@ -795,6 +799,9 @@ struct kvm {
>   	struct notifier_block pm_notifier;
>   #endif
>   	char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +	struct xarray mem_attr_array;
> +#endif
>   };
>   
>   #define kvm_err(fmt, ...) \
> @@ -1459,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>   int kvm_arch_post_init_vm(struct kvm *kvm);
>   void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>   int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_private_mem_supported(struct kvm *kvm);
>   
>   #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>   /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 230c8ff9659c..bb714c2a4b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>   
>   #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>   
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_PRIVATE	0x0001
> +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
> +					     struct kvm_enc_region *region)
> +{
> +	unsigned long start, end;
> +	void *entry;
> +	int r;
> +
> +	if (region->size == 0 || region->addr + region->size < region->addr)
> +		return -EINVAL;
> +	if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
> +		return -EINVAL;
> +
> +	start = region->addr >> PAGE_SHIFT;
> +	end = (region->addr + region->size - 1) >> PAGE_SHIFT;
> +
> +	entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
> +				xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
> +
> +	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
> +					entry, GFP_KERNEL_ACCOUNT));
> +
> +	kvm_zap_gfn_range(kvm, start, end + 1);
> +
> +	return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
>   #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>   static int kvm_pm_notifier_call(struct notifier_block *bl,
>   				unsigned long state,
> @@ -1138,6 +1167,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   	spin_lock_init(&kvm->mn_invalidate_lock);
>   	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>   	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +	xa_init(&kvm->mem_attr_array);
> +#endif
>   
>   	INIT_LIST_HEAD(&kvm->gpc_list);
>   	spin_lock_init(&kvm->gpc_lock);
> @@ -1305,6 +1337,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>   		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>   		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>   	}
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +	xa_destroy(&kvm->mem_attr_array);
> +#endif
>   	cleanup_srcu_struct(&kvm->irq_srcu);
>   	cleanup_srcu_struct(&kvm->srcu);
>   	kvm_arch_free_vm(kvm);
> @@ -1508,6 +1543,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
>   	}
>   }
>   
> +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
> +{
> +	return false;
> +}

Does this function have to be overridden by SEV and TDX to support the 
private regions?

> +
>   static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>   {
>   	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
>   		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>   		break;
>   	}
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +		struct kvm_enc_region region;
> +
> +		if (!kvm_arch_private_mem_supported(kvm))
> +			goto arch_vm_ioctl;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&region, argp, sizeof(region)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);

this is to store private region metadata, not only the encrypted region?

Also, it seems the same ioctl can be used to put other regions 
(e.g. firmware, later maybe a DAX backend, etc.) into private memory?

> +		break;
> +	}
> +#endif
>   	case KVM_GET_DIRTY_LOG: {
>   		struct kvm_dirty_log log;
>   
> @@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp,
>   		r = kvm_vm_ioctl_get_stats_fd(kvm);
>   		break;
>   	default:
> +arch_vm_ioctl:
>   		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>   	}
>   out:


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-07-06  8:20 ` [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-07-19  9:55   ` Gupta, Pankaj
  2022-07-19 14:12     ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-19  9:55 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

> Register private memslot to fd-based memory backing store and handle the
> memfile notifiers to zap the existing mappings.
> 
> Currently the register is happened at memslot creating time and the
> initial support does not include page migration/swap.
> 
> KVM_MEM_PRIVATE is not exposed by default, architecture code can turn
> on it by implementing kvm_arch_private_mem_supported().
> 
> A 'kvm' reference is added in memslot structure since in
> memfile_notifier callbacks we can only obtain a memslot reference while
> kvm is need to do the zapping.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>   include/linux/kvm_host.h |   1 +
>   virt/kvm/kvm_main.c      | 117 ++++++++++++++++++++++++++++++++++++---
>   2 files changed, 109 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f56426aa1e3..4e5a0db68799 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -584,6 +584,7 @@ struct kvm_memory_slot {
>   	struct file *private_file;
>   	loff_t private_offset;
>   	struct memfile_notifier notifier;
> +	struct kvm *kvm;
>   };
>   
>   static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index bb714c2a4b06..d6f7e074cab2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -941,6 +941,63 @@ static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl
>   
>   	return r;
>   }
> +
> +static void kvm_memfile_notifier_invalidate(struct memfile_notifier *notifier,
> +					    pgoff_t start, pgoff_t end)
> +{
> +	struct kvm_memory_slot *slot = container_of(notifier,
> +						    struct kvm_memory_slot,
> +						    notifier);
> +	unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
> +	gfn_t start_gfn = slot->base_gfn;
> +	gfn_t end_gfn = slot->base_gfn + slot->npages;
> +
> +
> +	if (start > base_pgoff)
> +		start_gfn = slot->base_gfn + start - base_pgoff;
> +
> +	if (end < base_pgoff + slot->npages)
> +		end_gfn = slot->base_gfn + end - base_pgoff;
> +
> +	if (start_gfn >= end_gfn)
> +		return;
> +
> +	kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
> +}
> +
> +static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
> +	.invalidate = kvm_memfile_notifier_invalidate,
> +};
> +
> +#define KVM_MEMFILE_FLAGS (MEMFILE_F_USER_INACCESSIBLE | \
> +			   MEMFILE_F_UNMOVABLE | \
> +			   MEMFILE_F_UNRECLAIMABLE)
> +
> +static inline int kvm_private_mem_register(struct kvm_memory_slot *slot)
> +{
> +	slot->notifier.ops = &kvm_memfile_notifier_ops;
> +	return memfile_register_notifier(slot->private_file, KVM_MEMFILE_FLAGS,
> +					 &slot->notifier);
> +}
> +
> +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> +{
> +	memfile_unregister_notifier(&slot->notifier);
> +}
> +
> +#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
> +static inline int kvm_private_mem_register(struct kvm_memory_slot *slot)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> +{
> +	WARN_ON_ONCE(1);
> +}
> +
>   #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
>   
>   #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> @@ -987,6 +1044,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>   /* This does not remove the slot from struct kvm_memslots data structures */
>   static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>   {
> +	if (slot->flags & KVM_MEM_PRIVATE) {
> +		kvm_private_mem_unregister(slot);
> +		fput(slot->private_file);
> +	}
> +
>   	kvm_destroy_dirty_bitmap(slot);
>   
>   	kvm_arch_free_memslot(kvm, slot);
> @@ -1548,10 +1610,16 @@ bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
>   	return false;
>   }
>   
> -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> +				     const struct kvm_user_mem_region *mem)
>   {
>   	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>   
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +	if (kvm_arch_private_mem_supported(kvm))
> +		valid_flags |= KVM_MEM_PRIVATE;
> +#endif
> +
>   #ifdef __KVM_HAVE_READONLY_MEM
>   	valid_flags |= KVM_MEM_READONLY;
>   #endif
> @@ -1627,6 +1695,12 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>   {
>   	int r;
>   
> +	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) {
> +		r = kvm_private_mem_register(new);
> +		if (r)
> +			return r;
> +	}
> +
>   	/*
>   	 * If dirty logging is disabled, nullify the bitmap; the old bitmap
>   	 * will be freed on "commit".  If logging is enabled in both old and
> @@ -1655,6 +1729,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>   	if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
>   		kvm_destroy_dirty_bitmap(new);
>   
> +	if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +		kvm_private_mem_unregister(new);
> +
>   	return r;
>   }
>   
> @@ -1952,7 +2029,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	int as_id, id;
>   	int r;
>   
> -	r = check_memory_region_flags(mem);
> +	r = check_memory_region_flags(kvm, mem);
>   	if (r)
>   		return r;
>   
> @@ -1971,6 +2048,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>   			mem->memory_size))
>   		return -EINVAL;
> +	if (mem->flags & KVM_MEM_PRIVATE &&
> +		(mem->private_offset & (PAGE_SIZE - 1) ||
> +		 mem->private_offset > U64_MAX - mem->memory_size))
> +		return -EINVAL;
>   	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
>   		return -EINVAL;
>   	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2009,6 +2090,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>   			return -EINVAL;
>   	} else { /* Modify an existing slot. */
> +		/* Private memslots are immutable, they can only be deleted. */
> +		if (mem->flags & KVM_MEM_PRIVATE)
> +			return -EINVAL;
>   		if ((mem->userspace_addr != old->userspace_addr) ||
>   		    (npages != old->npages) ||
>   		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2037,10 +2121,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	new->npages = npages;
>   	new->flags = mem->flags;
>   	new->userspace_addr = mem->userspace_addr;
> +	if (mem->flags & KVM_MEM_PRIVATE) {
> +		new->private_file = fget(mem->private_fd);
> +		if (!new->private_file) {
> +			r = -EINVAL;
> +			goto out;
> +		}
> +		new->private_offset = mem->private_offset;
> +	}
> +
> +	new->kvm = kvm;
>   
>   	r = kvm_set_memslot(kvm, old, new, change);
>   	if (r)
> -		kfree(new);
> +		goto out;
> +
> +	return 0;
> +
> +out:
> +	if (new->private_file)
> +		fput(new->private_file);
> +	kfree(new);
>   	return r;
>   }
>   EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> @@ -4712,12 +4813,10 @@ static long kvm_vm_ioctl(struct file *filp,
>   			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
>   			goto out;
>   
> -		if (flags & KVM_MEM_PRIVATE) {
> -			r = -EINVAL;
> -			goto out;
> -		}
> -
> -		size = sizeof(struct kvm_userspace_memory_region);
> +		if (flags & KVM_MEM_PRIVATE)
> +			size = sizeof(struct kvm_userspace_memory_region_ext);

Not sure whether we should use kvm_userspace_memory_region_ext or 
kvm_user_mem_region here, just for readability.

> +		else
> +			size = sizeof(struct kvm_userspace_memory_region);
>   
>   		if (copy_from_user(&mem, argp, size))
>   			goto out;


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-07-18 15:26       ` Sean Christopherson
@ 2022-07-19 14:02         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-19 14:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Gupta, Pankaj, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Mon, Jul 18, 2022 at 03:26:34PM +0000, Sean Christopherson wrote:
> On Mon, Jul 18, 2022, Chao Peng wrote:
> > On Fri, Jul 15, 2022 at 01:36:15PM +0200, Gupta, Pankaj wrote:
> > > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > > checked in the mmu_notifier_retry_hva() from page fault path. However
> > > > for the to be introduced private memory, a page fault may not have a hva
> > > 
> > > As this patch appeared in v7, just wondering did you see an actual bug
> > > because of it? And not having corresponding 'hva' occurs only with private
> > > memory because its not mapped to host userspace?
> > 
> > The addressed problem is not new in this version, previous versions I
> > also had code to handle it (just in different way). But the problem is:
> > mmu_notifier/memfile_notifier may be in the progress of invalidating a
> > pfn that obtained earlier in the page fault handler, when happens, we
> > should retry the fault. In v6 I used global mmu_notifier_retry() for
> > memfile_notifier but that can block unrelated mmu_notifer invalidation
> > which has hva range specified.
> > 
> > Sean gave a comment at https://lkml.org/lkml/2022/6/17/1001 to separate
> > memfile_notifier from mmu_notifier but during the implementation I
> > realized we actually can reuse the same code for shared and private
> > memory if both using gpa range and that can simplify the code handling
> > in kvm_zap_gfn_range and some other code (e.g. we don't need two
> > versions for memfile_notifier/mmu_notifier).
> 
> This should work, though I'm undecided as to whether or not it's a good idea.  KVM
> allows aliasing multiple gfns to a single hva, and so using the gfn could result
> in a much larger range being rejected given the simplistic algorithm for handling
> multiple ranges in kvm_inc_notifier_count().  But I assume such aliasing is uncommon,
> so I'm not sure it's worth optimizing for.

That can be a real problem for the current v7 code:
__kvm_handle_hva_range() loops over all possible gfn_ranges for a given
hva_range, but on_lock/on_unlock is invoked only once. This works for an
hva_range, but not for gfn_ranges, since we can have multiple of them.
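
The relevant shape of the loop (simplified, not verbatim) is roughly:

	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
		slots = __kvm_memslots(kvm, i);
		kvm_for_each_memslot_in_hva_range(node, slots,
						  range->start, range->end - 1) {
			/* per-memslot translation, differs on every iteration */
			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);

			if (!locked) {
				locked = true;
				KVM_MMU_LOCK(kvm);
				/* runs at most once, sees only the first gfn range */
				if (!IS_KVM_NULL_FN(range->on_lock))
					range->on_lock(kvm, gfn_range.start,
							    gfn_range.end);
			}
			ret |= range->handler(kvm, &gfn_range);
		}
	}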

> 
> > Adding gpa range for private memory invalidation also relieves the
> > above blocking issue between private memory page fault and mmu_notifier.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-19  8:00   ` Gupta, Pankaj
@ 2022-07-19 14:08     ` Chao Peng
  2022-07-19 14:23       ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-19 14:08 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Tue, Jul 19, 2022 at 10:00:23AM +0200, Gupta, Pankaj wrote:

...

> > +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
> > +{
> > +	return false;
> > +}
> 
> Does this function has to be overriden by SEV and TDX to support the private
> regions?

Yes, it should be overridden by architectures which want to support it.
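
For example, an architecture could opt in with something like the
following (just a sketch; kvm_is_protected_vm() is a made-up helper,
not something from this series):

/* Hypothetical arch override, for illustration only. */
bool kvm_arch_private_mem_supported(struct kvm *kvm)
{
	return IS_ENABLED(CONFIG_HAVE_KVM_PRIVATE_MEM) &&
	       kvm_is_protected_vm(kvm);	/* e.g. a TDX/SEV-SNP guest */
}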

> 
> > +
> >   static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >   {
> >   	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
> >   		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >   		break;
> >   	}
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > +		struct kvm_enc_region region;
> > +
> > +		if (!kvm_arch_private_mem_supported(kvm))
> > +			goto arch_vm_ioctl;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&region, argp, sizeof(region)))
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
> 
> this is to store private region metadata not only the encrypted region?

Correct.

> 
> Also, seems same ioctl can be used to put other regions (e.g firmware, later
> maybe DAX backend etc) into private memory?

Possibly. It depends on what exactly the semantics is. If you just want
to set those regions as private, the current code already supports that.
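
On the userspace side it is just another call on the same ioctl, e.g. (a
sketch; the GPA and size values are made up):

	struct kvm_enc_region region = {
		.addr = 0x100000000ULL,	/* guest physical address, not hva */
		.size = 0x200000,
	};

	/* mark the range private ... */
	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
	/* ... or convert it back to shared */
	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);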

Chao
> 
> > +		break;
> > +	}
> > +#endif
> >   	case KVM_GET_DIRTY_LOG: {
> >   		struct kvm_dirty_log log;
> > @@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp,
> >   		r = kvm_vm_ioctl_get_stats_fd(kvm);
> >   		break;
> >   	default:
> > +arch_vm_ioctl:
> >   		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> >   	}
> >   out:
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-07-19  9:55   ` Gupta, Pankaj
@ 2022-07-19 14:12     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-19 14:12 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Tue, Jul 19, 2022 at 11:55:24AM +0200, Gupta, Pankaj wrote:

...

> > @@ -4712,12 +4813,10 @@ static long kvm_vm_ioctl(struct file *filp,
> >   			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
> >   			goto out;
> > -		if (flags & KVM_MEM_PRIVATE) {
> > -			r = -EINVAL;
> > -			goto out;
> > -		}
> > -
> > -		size = sizeof(struct kvm_userspace_memory_region);
> > +		if (flags & KVM_MEM_PRIVATE)
> > +			size = sizeof(struct kvm_userspace_memory_region_ext);
> 
> Not sure if we use kvm_userspace_memory_region_ext or kvm_user_mem_region,
> just for readability.

Somewhat, but mainly for code maintainability: kvm_user_mem_region is
designed to be an alias of kvm_userspace_memory_region_ext, so in the
code we can access the 'unpacked' fields using something like
'mem.userspace_addr' instead of 'mem.region.userspace_addr'.
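
Roughly (quoting the layout from memory, so treat this as a sketch
rather than the exact definition in the series):

	struct kvm_userspace_memory_region_ext {
		struct kvm_userspace_memory_region region;
		__u64 private_offset;
		__u32 private_fd;
		__u32 pad1;
		__u64 pad2[14];
	};

	/* Same memory layout, with the inner struct flattened. */
	struct kvm_user_mem_region {
		__u32 slot;
		__u32 flags;
		__u64 guest_phys_addr;
		__u64 memory_size;
		__u64 userspace_addr;
		__u64 private_offset;
		__u32 private_fd;
		__u32 pad1;
		__u64 pad2[14];
	};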

Chao
> 
> > +		else
> > +			size = sizeof(struct kvm_userspace_memory_region);
> >   		if (copy_from_user(&mem, argp, size))
> >   			goto out;

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-19 14:08     ` Chao Peng
@ 2022-07-19 14:23       ` Gupta, Pankaj
  2022-07-20 15:07         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-19 14:23 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


>>> +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
>>> +{
>>> +	return false;
>>> +}
>>
>> Does this function has to be overriden by SEV and TDX to support the private
>> regions?
> 
> Yes it should be overridden by architectures which want to support it.

o.k
> 
>>
>>> +
>>>    static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>>>    {
>>>    	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>>> @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
>>>    		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>>>    		break;
>>>    	}
>>> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
>>> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
>>> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
>>> +		struct kvm_enc_region region;
>>> +
>>> +		if (!kvm_arch_private_mem_supported(kvm))
>>> +			goto arch_vm_ioctl;
>>> +
>>> +		r = -EFAULT;
>>> +		if (copy_from_user(&region, argp, sizeof(region)))
>>> +			goto out;
>>> +
>>> +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
>>
>> this is to store private region metadata not only the encrypted region?
> 
> Correct.

Sorry for not being clear; I was suggesting a name change of this function from
"kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".

> 
>>
>> Also, seems same ioctl can be used to put other regions (e.g firmware, later
>> maybe DAX backend etc) into private memory?
> 
> Possibly. Depends on what exactly the semantics is. If just want to set
> those regions as private current code already support that.

Agree. Sure!


Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-19 14:23       ` Gupta, Pankaj
@ 2022-07-20 15:07         ` Chao Peng
  2022-07-20 15:31           ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-20 15:07 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Tue, Jul 19, 2022 at 04:23:52PM +0200, Gupta, Pankaj wrote:
> 
> > > > +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
> > > > +{
> > > > +	return false;
> > > > +}
> > > 
> > > Does this function has to be overriden by SEV and TDX to support the private
> > > regions?
> > 
> > Yes it should be overridden by architectures which want to support it.
> 
> o.k
> > 
> > > 
> > > > +
> > > >    static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > >    {
> > > >    	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > > @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
> > > >    		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > >    		break;
> > > >    	}
> > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > > +		struct kvm_enc_region region;
> > > > +
> > > > +		if (!kvm_arch_private_mem_supported(kvm))
> > > > +			goto arch_vm_ioctl;
> > > > +
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&region, argp, sizeof(region)))
> > > > +			goto out;
> > > > +
> > > > +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
> > > 
> > > this is to store private region metadata not only the encrypted region?
> > 
> > Correct.
> 
> Sorry for not being clear, was suggesting name change of this function from:
> "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region"

Though I don't have a strong reason to change it, I'm fine with this, and
this name matches the above kvm_arch_private_mem_supported perfectly.

Thanks,
Chao
> 
> > 
> > > 
> > > Also, seems same ioctl can be used to put other regions (e.g firmware, later
> > > maybe DAX backend etc) into private memory?
> > 
> > Possibly. Depends on what exactly the semantics is. If just want to set
> > those regions as private current code already support that.
> 
> Agree. Sure!
> 
> 
> Thanks,
> Pankaj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-20 15:07         ` Chao Peng
@ 2022-07-20 15:31           ` Gupta, Pankaj
  2022-07-20 16:21             ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-20 15:31 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


>>>>> +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
>>>>> +{
>>>>> +	return false;
>>>>> +}
>>>>
>>>> Does this function has to be overriden by SEV and TDX to support the private
>>>> regions?
>>>
>>> Yes it should be overridden by architectures which want to support it.
>>
>> o.k
>>>
>>>>
>>>>> +
>>>>>     static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>>>>>     {
>>>>>     	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>>>>> @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
>>>>>     		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>>>>>     		break;
>>>>>     	}
>>>>> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
>>>>> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
>>>>> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
>>>>> +		struct kvm_enc_region region;
>>>>> +
>>>>> +		if (!kvm_arch_private_mem_supported(kvm))
>>>>> +			goto arch_vm_ioctl;
>>>>> +
>>>>> +		r = -EFAULT;
>>>>> +		if (copy_from_user(&region, argp, sizeof(region)))
>>>>> +			goto out;
>>>>> +
>>>>> +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
>>>>
>>>> this is to store private region metadata not only the encrypted region?
>>>
>>> Correct.
>>
>> Sorry for not being clear, was suggesting name change of this function from:
>> "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region"
> 
> Though I don't have strong reason to change it, I'm fine with this and

Yes, no strong reason; I just thought "kvm_vm_ioctl_set_private_region" 
would better depict the actual functionality :)

> this name matches the above kvm_arch_private_mem_supported perfectly.
BTW, I could not understand this: how does "kvm_vm_ioctl_set_encrypted_region"
match "kvm_arch_private_mem_supported"?

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-20 15:31           ` Gupta, Pankaj
@ 2022-07-20 16:21             ` Sean Christopherson
  2022-07-20 17:41               ` Gupta, Pankaj
  2022-07-21  7:34               ` Wei Wang
  0 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2022-07-20 16:21 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 20, 2022, Gupta, Pankaj wrote:
> 
> > > > > > +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)

Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking
a flag of sorts, and to align with other helpers of this nature (and with
CONFIG_HAVE_KVM_PRIVATE_MEM).

  $ git grep kvm_arch | grep supported | wc -l
  0
  $ git grep kvm_arch | grep has | wc -l
  26
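
I.e. the weak default would simply be renamed, with no functional change
(sketch only):

bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
{
	/* Same default as the current kvm_arch_private_mem_supported(). */
	return false;
}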

> > > > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > > > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > > > > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > > > > +		struct kvm_enc_region region;
> > > > > > +
> > > > > > +		if (!kvm_arch_private_mem_supported(kvm))
> > > > > > +			goto arch_vm_ioctl;
> > > > > > +
> > > > > > +		r = -EFAULT;
> > > > > > +		if (copy_from_user(&region, argp, sizeof(region)))
> > > > > > +			goto out;
> > > > > > +
> > > > > > +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
> > > > > 
> > > > > this is to store private region metadata not only the encrypted region?
> > > > 
> > > > Correct.
> > > 
> > > Sorry for not being clear, was suggesting name change of this function from:
> > > "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region"
> > 
> > Though I don't have strong reason to change it, I'm fine with this and
> 
> Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would
> depict the actual functionality :)
> 
> > this name matches the above kvm_arch_private_mem_supported perfectly.
> BTW could not understand this, how "kvm_vm_ioctl_set_encrypted_region"
> matches "kvm_arch_private_mem_supported"?

Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with
kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.

I also like using "private" instead of "encrypted", though we should probably
find a different verb than "set", because calling "set_private" when making the
region shared is confusing.  I'm struggling to come up with a good alternative
though.

kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION,
and that also means that anything with "memory_region" in the name is bound to be
confusing.

Hmm, and if we move away from "encrypted", it probably makes sense to pass in
addr+size instead of a kvm_enc_region.

Maybe this?

static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa,
					         gpa_t size, bool set_private)

and then:

#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
	case KVM_MEMORY_ENCRYPT_REG_REGION:
	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
		struct kvm_enc_region region;

		if (!kvm_arch_private_mem_supported(kvm))
			goto arch_vm_ioctl;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr,
							  region.size, set);
		break;
	}
#endif

I don't love it, so if someone has a better idea...

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-06  8:20 ` [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Chao Peng
  2022-07-19  8:00   ` Gupta, Pankaj
@ 2022-07-20 16:44   ` Sean Christopherson
  2022-07-21  9:37     ` Chao Peng
  2022-08-19 19:37   ` Vishal Annapurve
  2022-08-26 15:19   ` Fuad Tabba
  3 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-07-20 16:44 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 06, 2022, Chao Peng wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 230c8ff9659c..bb714c2a4b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>  
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_PRIVATE	0x0001
> +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
> +					     struct kvm_enc_region *region)
> +{
> +	unsigned long start, end;

As alluded to in a different reply, because this will track GPAs instead of HVAs,
the type needs to be "gpa_t", not "unsigned long".  Oh, actually, they need to
be gfn_t, since those are what gets shoved into the xarray.

> +	void *entry;
> +	int r;
> +
> +	if (region->size == 0 || region->addr + region->size < region->addr)
> +		return -EINVAL;
> +	if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
> +		return -EINVAL;
> +
> +	start = region->addr >> PAGE_SHIFT;
> +	end = (region->addr + region->size - 1) >> PAGE_SHIFT;
> +
> +	entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
> +				xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
> +
> +	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
> +					entry, GFP_KERNEL_ACCOUNT));

IIUC, this series treats memory as shared by default.  I think we should invert
that and have KVM's ABI be that all guest memory is private by default, i.e.
require the guest to opt into sharing memory instead of opting out of sharing memory.

And then the xarray would track which regions are shared.

Regarding mem_attr_array, it probably makes sense to explicitly include what it's
tracking in the name, i.e. name it {private,shared}_mem_array depending on whether
it's used to track private vs. shared memory.  If we ever need to track metadata
beyond shared/private then we can tweak the name as needed, e.g. if hardware ever
supports secondary non-ephemeral encryption keys.
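
Rough sketch of both ideas combined, i.e. gfn_t types plus tracking the
shared ranges instead (KVM_MEM_ATTR_SHARED and shared_mem_array are purely
illustrative names, they don't exist in this series):

	bool shared = ioctl == KVM_MEMORY_ENCRYPT_UNREG_REGION;
	gfn_t start = gpa_to_gfn(region->addr);
	gfn_t end = gpa_to_gfn(region->addr + region->size - 1);
	/* Store an entry only for ranges the guest explicitly shares. */
	void *entry = shared ? xa_mk_value(KVM_MEM_ATTR_SHARED) : NULL;

	r = xa_err(xa_store_range(&kvm->shared_mem_array, start, end,
				  entry, GFP_KERNEL_ACCOUNT));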

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-20 16:21             ` Sean Christopherson
@ 2022-07-20 17:41               ` Gupta, Pankaj
  2022-07-21  7:34               ` Wei Wang
  1 sibling, 0 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-20 17:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song


> Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking
> a flag of sorts, and to align with other helpers of this nature (and with
> CONFIG_HAVE_KVM_PRIVATE_MEM).
> 
>    $ git grep kvm_arch | grep supported | wc -l
>    0
>    $ git grep kvm_arch | grep has | wc -l
>    26
> 
>>>>>>> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
>>>>>>> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
>>>>>>> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
>>>>>>> +		struct kvm_enc_region region;
>>>>>>> +
>>>>>>> +		if (!kvm_arch_private_mem_supported(kvm))
>>>>>>> +			goto arch_vm_ioctl;
>>>>>>> +
>>>>>>> +		r = -EFAULT;
>>>>>>> +		if (copy_from_user(&region, argp, sizeof(region)))
>>>>>>> +			goto out;
>>>>>>> +
>>>>>>> +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
>>>>>>
>>>>>> this is to store private region metadata not only the encrypted region?
>>>>>
>>>>> Correct.
>>>>
>>>> Sorry for not being clear, was suggesting name change of this function from:
>>>> "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region"
>>>
>>> Though I don't have strong reason to change it, I'm fine with this and
>>
>> Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would
>> depict the actual functionality :)
>>
>>> this name matches the above kvm_arch_private_mem_supported perfectly.
>> BTW could not understand this, how "kvm_vm_ioctl_set_encrypted_region"
>> matches "kvm_arch_private_mem_supported"?
> 
> Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with
> kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.
> 
> I also like using "private" instead of "encrypted", though we should probably
> find a different verb than "set", because calling "set_private" when making the
> region shared is confusing.  I'm struggling to come up with a good alternative
> though.
> 
> kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION,
> and that also means that anything with "memory_region" in the name is bound to be
> confusing.
> 
> Hmm, and if we move away from "encrypted", it probably makes sense to pass in
> addr+size instead of a kvm_enc_region.
> 
> Maybe this?
> 
> static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa,
> 					         gpa_t size, bool set_private)
> 
> and then:
> 
> #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> 	case KVM_MEMORY_ENCRYPT_REG_REGION:
> 	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> 		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> 		struct kvm_enc_region region;
> 
> 		if (!kvm_arch_private_mem_supported(kvm))
> 			goto arch_vm_ioctl;
> 
> 		r = -EFAULT;
> 		if (copy_from_user(&region, argp, sizeof(region)))
> 			goto out;
> 
> 		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr,
> 							  region.size, set);
> 		break;
> 	}
> #endif
> 
> I don't love it, so if someone has a better idea...

Both the suggestions look good to me; they bring more clarity.

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-20 16:21             ` Sean Christopherson
  2022-07-20 17:41               ` Gupta, Pankaj
@ 2022-07-21  7:34               ` Wei Wang
  2022-07-21  9:29                 ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Wei Wang @ 2022-07-21  7:34 UTC (permalink / raw)
  To: Sean Christopherson, Gupta, Pankaj
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song



On 7/21/22 00:21, Sean Christopherson wrote:
> On Wed, Jul 20, 2022, Gupta, Pankaj wrote:
>>>>>>> +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
> Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking
> a flag of sorts, and to align with other helpers of this nature (and with
> CONFIG_HAVE_KVM_PRIVATE_MEM).
>
>    $ git grep kvm_arch | grep supported | wc -l
>    0
>    $ git grep kvm_arch | grep has | wc -l
>    26
>
>>>>>>> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
>>>>>>> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
>>>>>>> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
>>>>>>> +		struct kvm_enc_region region;
>>>>>>> +
>>>>>>> +		if (!kvm_arch_private_mem_supported(kvm))
>>>>>>> +			goto arch_vm_ioctl;
>>>>>>> +
>>>>>>> +		r = -EFAULT;
>>>>>>> +		if (copy_from_user(&region, argp, sizeof(region)))
>>>>>>> +			goto out;
>>>>>>> +
>>>>>>> +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
>>>>>> this is to store private region metadata not only the encrypted region?
>>>>> Correct.
>>>> Sorry for not being clear, was suggesting name change of this function from:
>>>> "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region"
>>> Though I don't have strong reason to change it, I'm fine with this and
>> Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would
>> depict the actual functionality :)
>>
>>> this name matches the above kvm_arch_private_mem_supported perfectly.
>> BTW could not understand this, how "kvm_vm_ioctl_set_encrypted_region"
>> matches "kvm_arch_private_mem_supported"?
> Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with
> kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.
>
> I also like using "private" instead of "encrypted", though we should probably
> find a different verb than "set", because calling "set_private" when making the
> region shared is confusing.  I'm struggling to come up with a good alternative
> though.
>
> kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION,
> and that also means that anything with "memory_region" in the name is bound to be
> confusing.
>
> Hmm, and if we move away from "encrypted", it probably makes sense to pass in
> addr+size instead of a kvm_enc_region.
>
> Maybe this?
>
> static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa,
> 					         gpa_t size, bool set_private)
>
> and then:
>
> #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> 	case KVM_MEMORY_ENCRYPT_REG_REGION:
> 	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> 		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> 		struct kvm_enc_region region;
>
> 		if (!kvm_arch_private_mem_supported(kvm))
> 			goto arch_vm_ioctl;
>
> 		r = -EFAULT;
> 		if (copy_from_user(&region, argp, sizeof(region)))
> 			goto out;
>
> 		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr,
> 							  region.size, set);
> 		break;
> 	}
> #endif
>
> I don't love it, so if someone has a better idea...
>
Maybe you could tag it with cgs for all the confidential guest support 
related stuff:
e.g. kvm_vm_ioctl_set_cgs_mem()

bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
...
kvm_vm_ioctl_set_cgs_mem(, is_private)


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-21  7:34               ` Wei Wang
@ 2022-07-21  9:29                 ` Chao Peng
  2022-07-21 17:58                   ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-21  9:29 UTC (permalink / raw)
  To: Wei Wang
  Cc: Sean Christopherson, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> 
> 
> On 7/21/22 00:21, Sean Christopherson wrote:
> > On Wed, Jul 20, 2022, Gupta, Pankaj wrote:
> > > > > > > > +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
> > Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking
> > a flag of sorts, and to align with other helpers of this nature (and with
> > CONFIG_HAVE_KVM_PRIVATE_MEM).
> > 
> >    $ git grep kvm_arch | grep supported | wc -l
> >    0
> >    $ git grep kvm_arch | grep has | wc -l
> >    26

Makes sense. kvm_arch_has_private_mem is actually better.

> > 
> > > > > > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > > > > > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > > > > > > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > > > > > > +		struct kvm_enc_region region;
> > > > > > > > +
> > > > > > > > +		if (!kvm_arch_private_mem_supported(kvm))
> > > > > > > > +			goto arch_vm_ioctl;
> > > > > > > > +
> > > > > > > > +		r = -EFAULT;
> > > > > > > > +		if (copy_from_user(&region, argp, sizeof(region)))
> > > > > > > > +			goto out;
> > > > > > > > +
> > > > > > > > +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
> > > > > > > this is to store private region metadata not only the encrypted region?
> > > > > > Correct.
> > > > > Sorry for not being clear, was suggesting name change of this function from:
> > > > > "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region"
> > > > Though I don't have strong reason to change it, I'm fine with this and
> > > Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would
> > > depict the actual functionality :)
> > > 
> > > > this name matches the above kvm_arch_private_mem_supported perfectly.
> > > BTW could not understand this, how "kvm_vm_ioctl_set_encrypted_region"
> > > matches "kvm_arch_private_mem_supported"?
> > Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with
> > kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.
> > 
> > I also like using "private" instead of "encrypted", though we should probably
> > find a different verb than "set", because calling "set_private" when making the
> > region shared is confusing.  I'm struggling to come up with a good alternative
> > though.
> > 
> > kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION,
> > and that also means that anything with "memory_region" in the name is bound to be
> > confusing.
> > 
> > Hmm, and if we move away from "encrypted", it probably makes sense to pass in
> > addr+size instead of a kvm_enc_region.

This makes sense.

> > 
> > Maybe this?
> > 
> > static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa,
> > 					         gpa_t size, bool set_private)

Currently this should work.

> > 
> > and then:
> > 
> > #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > 	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > 	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > 		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > 		struct kvm_enc_region region;
> > 
> > 		if (!kvm_arch_private_mem_supported(kvm))
> > 			goto arch_vm_ioctl;
> > 
> > 		r = -EFAULT;
> > 		if (copy_from_user(&region, argp, sizeof(region)))
> > 			goto out;
> > 
> > 		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr,
> > 							  region.size, set);
> > 		break;
> > 	}
> > #endif
> > 
> > I don't love it, so if someone has a better idea...
> > 
> Maybe you could tag it with cgs for all the confidential guest support
> related stuff:
> e.g. kvm_vm_ioctl_set_cgs_mem()
> 
> bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> ...
> kvm_vm_ioctl_set_cgs_mem(, is_private)

If we plan to use such an abbreviation widely throughout KVM (i.e. it
becomes well known), I'm fine.

I actually use mem_attr in this patch: https://lkml.org/lkml/2022/7/20/610
But I don't quite like it either; it's so generic that it says almost nothing.

But I do want a name that can cover future usages other than just
private/shared (pKVM for example may have a third state).

Thanks,
Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-20 16:44   ` Sean Christopherson
@ 2022-07-21  9:37     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-21  9:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 20, 2022 at 04:44:32PM +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Chao Peng wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 230c8ff9659c..bb714c2a4b06 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >  
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >  
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +#define KVM_MEM_ATTR_PRIVATE	0x0001
> > +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
> > +					     struct kvm_enc_region *region)
> > +{
> > +	unsigned long start, end;
> 
> As alluded to in a different reply, because this will track GPAs instead of HVAs,
> the type needs to be "gpa_t", not "unsigned long".  Oh, actually, they need to
> be gfn_t, since those are what gets shoved into the xarray.

It's gfn_t actually. My original intention was that 32-bit architectures
(if any) could also work with it, since the xarray index is 32-bit on those
architectures. But kvm_enc_region is u64, so that's not even possible.

> 
> > +	void *entry;
> > +	int r;
> > +
> > +	if (region->size == 0 || region->addr + region->size < region->addr)
> > +		return -EINVAL;
> > +	if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
> > +		return -EINVAL;
> > +
> > +	start = region->addr >> PAGE_SHIFT;
> > +	end = (region->addr + region->size - 1) >> PAGE_SHIFT;
> > +
> > +	entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
> > +				xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
> > +
> > +	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
> > +					entry, GFP_KERNEL_ACCOUNT));
> 
> IIUC, this series treats memory as shared by default.  I think we should invert
> that and have KVM's ABI be that all guest memory as private by default, i.e.
> require the guest to opt into sharing memory instead of opt out of sharing memory.
> 
> And then the xarray would track which regions are shared.

Maybe I missed some information discussed elsewhere? I followed
https://lkml.org/lkml/2022/5/23/772: KVM treats memory as shared by default,
but userspace should set all guest memory to private before the guest
launches, so the guest then sees all memory as private. Defaulting to
private also sounds good; as long as we only talk about private/shared in
the private memory context (which I think we do), there is no ambiguity.

> 
> Regarding mem_attr_array, it probably makes sense to explicitly include what it's
> tracking in the name, i.e. name it {private,shared}_mem_array depending on whether
> it's used to track private vs. shared memory.  If we ever need to track metadata
> beyond shared/private then we can tweak the name as needed, e.g. if hardware ever
> supports secondary non-ephemeral encryption keys.

I think there may be other states beyond that. But I'm fine with only
considering private/shared for now, and it also sounds reasonable for
people who want to support more states to change the name then.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-06  8:20 ` [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Chao Peng
@ 2022-07-21  9:44   ` David Hildenbrand
  2022-07-21  9:50     ` David Hildenbrand
                       ` (3 more replies)
  2022-08-26 15:19   ` Fuad Tabba
  1 sibling, 4 replies; 398+ messages in thread
From: David Hildenbrand @ 2022-07-21  9:44 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 06.07.22 10:20, Chao Peng wrote:
> Normally, a write to unallocated space of a file or the hole of a sparse
> file automatically causes space allocation, for memfd, this equals to
> memory allocation. This new seal prevents such automatically allocating,
> either this is from a direct write() or a write on the previously
> mmap-ed area. The seal does not prevent fallocate() so an explicit
> fallocate() can still cause allocating and can be used to reserve
> memory.
> 
> This is used to prevent unintentional allocation from userspace on a
> stray or careless write and any intentional allocation should use an
> explicit fallocate(). One of the main usecases is to avoid memory double
> allocation for confidential computing usage where we use two memfds to
> back guest memory and at a single point only one memfd is alive and we
> want to prevent memory allocation for the other memfd which may have
> been mmap-ed previously. More discussion can be found at:
> 
>   https://lkml.org/lkml/2022/6/14/1255
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/uapi/linux/fcntl.h |  1 +
>  mm/memfd.c                 |  3 ++-
>  mm/shmem.c                 | 16 ++++++++++++++--
>  3 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 2f86b2ad6d7e..98bdabc8e309 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -43,6 +43,7 @@
>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */

Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
shared zeropage, so you'll simply allocate a new page via read() or on
read faults.


Also, I *think* you can place pages via userfaultfd into shmem. Not sure
if that would count "auto alloc", but it would certainly bypass fallocate().
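
(Placing pages would go through UFFDIO_COPY, which allocates the destination
page and fills it in one go; rough userspace sketch, where uffd, dst_addr and
src_buf are placeholders:

	struct uffdio_copy copy = {
		.dst = (unsigned long)dst_addr,	/* address in the shmem mapping */
		.src = (unsigned long)src_buf,	/* source buffer to copy from */
		.len = PAGE_SIZE,
		.mode = 0,
	};
	ioctl(uffd, UFFDIO_COPY, &copy);

so it never goes near write() or fallocate().)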

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-21  9:44   ` David Hildenbrand
@ 2022-07-21  9:50     ` David Hildenbrand
  2022-07-21 15:05       ` Sean Christopherson
  2022-07-21 10:27     ` Gupta, Pankaj
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-07-21  9:50 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 21.07.22 11:44, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
>> Normally, a write to unallocated space of a file or the hole of a sparse
>> file automatically causes space allocation, for memfd, this equals to
>> memory allocation. This new seal prevents such automatically allocating,
>> either this is from a direct write() or a write on the previously
>> mmap-ed area. The seal does not prevent fallocate() so an explicit
>> fallocate() can still cause allocating and can be used to reserve
>> memory.
>>
>> This is used to prevent unintentional allocation from userspace on a
>> stray or careless write and any intentional allocation should use an
>> explicit fallocate(). One of the main usecases is to avoid memory double
>> allocation for confidential computing usage where we use two memfds to
>> back guest memory and at a single point only one memfd is alive and we
>> want to prevent memory allocation for the other memfd which may have
>> been mmap-ed previously. More discussion can be found at:
>>
>>   https://lkml.org/lkml/2022/6/14/1255
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> ---
>>  include/uapi/linux/fcntl.h |  1 +
>>  mm/memfd.c                 |  3 ++-
>>  mm/shmem.c                 | 16 ++++++++++++++--
>>  3 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index 2f86b2ad6d7e..98bdabc8e309 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -43,6 +43,7 @@
>>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
>>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
>> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> 
> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> shared zeropage, so you'll simply allocate a new page via read() or on
> read faults.

Correction: on read() we don't allocate a fresh page. But on read faults
we would. So this comment here needs clarification.

> 
> 
> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> if that would count "auto alloc", but it would certainly bypass fallocate().
> 


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-21  9:44   ` David Hildenbrand
  2022-07-21  9:50     ` David Hildenbrand
@ 2022-07-21 10:27     ` Gupta, Pankaj
  2022-07-25 13:54       ` Chao Peng
  2022-07-25 13:42     ` Chao Peng
  2022-08-05 17:55     ` Paolo Bonzini
  3 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-21 10:27 UTC (permalink / raw)
  To: David Hildenbrand, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song


>> Normally, a write to unallocated space of a file or the hole of a sparse
>> file automatically causes space allocation, for memfd, this equals to
>> memory allocation. This new seal prevents such automatically allocating,
>> either this is from a direct write() or a write on the previously
>> mmap-ed area. The seal does not prevent fallocate() so an explicit
>> fallocate() can still cause allocating and can be used to reserve
>> memory.
>>
>> This is used to prevent unintentional allocation from userspace on a
>> stray or careless write and any intentional allocation should use an
>> explicit fallocate(). One of the main usecases is to avoid memory double
>> allocation for confidential computing usage where we use two memfds to
>> back guest memory and at a single point only one memfd is alive and we
>> want to prevent memory allocation for the other memfd which may have
>> been mmap-ed previously. More discussion can be found at:
>>
>>    https://lkml.org/lkml/2022/6/14/1255
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> ---
>>   include/uapi/linux/fcntl.h |  1 +
>>   mm/memfd.c                 |  3 ++-
>>   mm/shmem.c                 | 16 ++++++++++++++--
>>   3 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index 2f86b2ad6d7e..98bdabc8e309 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -43,6 +43,7 @@
>>   #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>>   #define F_SEAL_WRITE	0x0008	/* prevent writes */
>>   #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
>> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> 
> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> shared zeropage, so you'll simply allocate a new page via read() or on
> read faults.
> 
> 
> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> if that would count "auto alloc", but it would certainly bypass fallocate().

I was also thinking about this at the same time, but for a different reason:

"Want to populate private preboot memory with firmware payload", so I was
thinking userfaultfd could be an option since direct writes are restricted?

Thanks,
Pankaj






^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-21  9:50     ` David Hildenbrand
@ 2022-07-21 15:05       ` Sean Christopherson
  2022-07-25 13:46         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-07-21 15:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Thu, Jul 21, 2022, David Hildenbrand wrote:
> On 21.07.22 11:44, David Hildenbrand wrote:
> > On 06.07.22 10:20, Chao Peng wrote:
> >> Normally, a write to unallocated space of a file or the hole of a sparse
> >> file automatically causes space allocation, for memfd, this equals to
> >> memory allocation. This new seal prevents such automatically allocating,
> >> either this is from a direct write() or a write on the previously
> >> mmap-ed area. The seal does not prevent fallocate() so an explicit
> >> fallocate() can still cause allocating and can be used to reserve
> >> memory.
> >>
> >> This is used to prevent unintentional allocation from userspace on a
> >> stray or careless write and any intentional allocation should use an
> >> explicit fallocate(). One of the main usecases is to avoid memory double
> >> allocation for confidential computing usage where we use two memfds to
> >> back guest memory and at a single point only one memfd is alive and we
> >> want to prevent memory allocation for the other memfd which may have
> >> been mmap-ed previously. More discussion can be found at:
> >>
> >>   https://lkml.org/lkml/2022/6/14/1255
> >>
> >> Suggested-by: Sean Christopherson <seanjc@google.com>
> >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> ---
> >>  include/uapi/linux/fcntl.h |  1 +
> >>  mm/memfd.c                 |  3 ++-
> >>  mm/shmem.c                 | 16 ++++++++++++++--
> >>  3 files changed, 17 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> >> index 2f86b2ad6d7e..98bdabc8e309 100644
> >> --- a/include/uapi/linux/fcntl.h
> >> +++ b/include/uapi/linux/fcntl.h
> >> @@ -43,6 +43,7 @@
> >>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> >>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
> >>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> >> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> > 
> > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> > shared zeropage, so you'll simply allocate a new page via read() or on
> > read faults.
> 
> Correction: on read() we don't allocate a fresh page. But on read faults
> we would. So this comment here needs clarification.

Not just the comment, the code too.  The intent of F_SEAL_AUTO_ALLOCATE is very
much to block _all_ implicit allocations (or maybe just fault-based allocations
if "implicit" is too broad of a description).

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-21  9:29                 ` Chao Peng
@ 2022-07-21 17:58                   ` Sean Christopherson
  2022-07-25 13:04                     ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-07-21 17:58 UTC (permalink / raw)
  To: Chao Peng
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Thu, Jul 21, 2022, Chao Peng wrote:
> On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > 
> > 
> > On 7/21/22 00:21, Sean Christopherson wrote:
> > Maybe you could tag it with cgs for all the confidential guest support
> > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > 
> > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > ...
> > kvm_vm_ioctl_set_cgs_mem(, is_private)
> 
> If we plan to widely use such abbr. through KVM (e.g. it's well known),
> I'm fine.

I'd prefer to stay away from "confidential guest", and away from any VM-scoped
name for that matter.  User-unmappable memory has use cases beyond hiding guest
state from the host, e.g. userspace could use inaccessible/unmappable memory to
harden itself against unintentional access to guest memory.

> I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> But I also don't quite like it, it's so generic and sounds say nothing.
> 
> But I do want a name can cover future usages other than just 
> private/shared (pKVM for example may have a third state).

I don't think there can be a third top-level state.  Memory is either private to
the guest or it's not.  There can be sub-states, e.g. memory could be selectively
shared or encrypted with a different key, in which case we'd need metadata to
track that state.

Though that begs the question of whether or not private_fd is the correct
terminology.  E.g. if guest memory is backed by a memfd that can't be mapped by
userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs
that memory into a device or another VM, then arguably that memory is shared,
especially the multi-VM scenario.

For TDX and SNP "private vs. shared" is likely the correct terminology given the
current specs, but for generic KVM it's probably better to align with whatever
terminology is used for memfd.  "inaccessible_fd" and "user_inaccessible_fd" are
a bit odd since the fd itself is accessible.

What about "user_unmappable"?  E.g.

  F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY,
  MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...

that gives us flexibility to map the memory from within the kernel, e.g. into
other VMs or devices.

Hmm, and then keep your original "mem_attr_array" name?  And probably 

 int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 			       bool is_user_mappable)

Then the x86/mmu code for TDX/SNP private faults could be:

	is_private = !kvm_is_gpa_user_mappable();

	if (fault->is_private != is_private) {

or if we want to avoid mixing up "user_mappable" and "user_unmappable":

	is_private = kvm_is_gpa_user_unmappable();

	if (fault->is_private != is_private) {

though a helper that returns a negative (not mappable) feels kludgy.  And I like
kvm_is_gpa_user_mappable() because then when there's not "special" memory, it
defaults to true, which is more intuitive IMO.

And then if the future needs more precision, e.g. user-unmappable memory isn't
necessarily guest-exclusive, the uAPI names still work even though KVM internals
will need to be reworked, but that's unavoidable.  E.g. piggybacking
KVM_MEMORY_ENCRYPT_(UN)REG_REGION doesn't allow for further differentiation,
so we'd need to _extend_ the uAPI, but the _existing_ uAPI would still be sane.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-21 17:58                   ` Sean Christopherson
@ 2022-07-25 13:04                     ` Chao Peng
  2022-07-29 19:54                       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-25 13:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
> On Thu, Jul 21, 2022, Chao Peng wrote:
> > On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > > 
> > > 
> > > On 7/21/22 00:21, Sean Christopherson wrote:
> > > Maybe you could tag it with cgs for all the confidential guest support
> > > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > > 
> > > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > ...
> > > kvm_vm_ioctl_set_cgs_mem(, is_private)
> > 
> > If we plan to widely use such abbr. through KVM (e.g. it's well known),
> > I'm fine.
> 
> I'd prefer to stay away from "confidential guest", and away from any VM-scoped
> name for that matter.  User-unmappable memmory has use cases beyond hiding guest
> state from the host, e.g. userspace could use inaccessible/unmappable memory to
> harden itself against unintentional access to guest memory.
> 
> > I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> > But I also don't quite like it, it's so generic and sounds say nothing.
> > 
> > But I do want a name can cover future usages other than just 
> > private/shared (pKVM for example may have a third state).
> 
> I don't think there can be a third top-level state.  Memory is either private to
> the guest or it's not.  There can be sub-states, e.g. memory could be selectively
> shared or encrypted with a different key, in which case we'd need metadata to
> track that state.
> 
> Though that begs the question of whether or not private_fd is the correct
> terminology.  E.g. if guest memory is backed by a memfd that can't be mapped by
> userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs
> that memory into a device or another VM, then arguably that memory is shared,
> especially the multi-VM scenario.
> 
> For TDX and SNP "private vs. shared" is likely the correct terminology given the
> current specs, but for generic KVM it's probably better to align with whatever
> terminology is used for memfd.  "inaccessible_fd" and "user_inaccessible_fd" are
> a bit odd since the fd itself is accesible.
> 
> What about "user_unmappable"?  E.g.
> 
>   F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY,
>   MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...

For KVM I also think user_unmappable looks better than 'private', e.g.
user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sound like more
appropriate names. For memfd, however, I don't feel that strongly about
changing it from the current 'inaccessible' to 'user_unmappable'; one
reason is that it's not just about being unmappable, the memory is also
inaccessible through direct syscalls like read()/write().

> 
> that gives us flexibility to map the memory from within the kernel, e.g. into
> other VMs or devices.
> 
> Hmm, and then keep your original "mem_attr_array" name?  And probably 
> 
>  int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  			       bool is_user_mappable)
> 
> Then the x86/mmu code for TDX/SNP private faults could be:
> 
> 	is_private = !kvm_is_gpa_user_mappable();
> 
> 	if (fault->is_private != is_private) {
> 
> or if we want to avoid mixing up "user_mappable" and "user_unmappable":
> 
> 	is_private = kvm_is_gpa_user_unmappable();
> 
> 	if (fault->is_private != is_private) {
> 
> though a helper that returns a negative (not mappable) feels kludgy.  And I like
> kvm_is_gpa_user_mappable() because then when there's not "special" memory, it
> defaults to true, which is more intuitive IMO.

yes.

> 
> And then if the future needs more precision, e.g. user-unmappable memory isn't
> necessarily guest-exclusive, the uAPI names still work even though KVM internals
> will need to be reworked, but that's unavoidable.  E.g. piggybacking
> KVM_MEMORY_ENCRYPT_(UN)REG_REGION doesn't allow for further differentiation,
> so we'd need to _extend_ the uAPI, but the _existing_ uAPI would still be sane.

Right, that has to be extended.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-21  9:44   ` David Hildenbrand
  2022-07-21  9:50     ` David Hildenbrand
  2022-07-21 10:27     ` Gupta, Pankaj
@ 2022-07-25 13:42     ` Chao Peng
  2022-08-05 17:55     ` Paolo Bonzini
  3 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-25 13:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Thu, Jul 21, 2022 at 11:44:11AM +0200, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
> > Normally, a write to unallocated space of a file or the hole of a sparse
> > file automatically causes space allocation, for memfd, this equals to
> > memory allocation. This new seal prevents such automatically allocating,
> > either this is from a direct write() or a write on the previously
> > mmap-ed area. The seal does not prevent fallocate() so an explicit
> > fallocate() can still cause allocating and can be used to reserve
> > memory.
> > 
> > This is used to prevent unintentional allocation from userspace on a
> > stray or careless write and any intentional allocation should use an
> > explicit fallocate(). One of the main usecases is to avoid memory double
> > allocation for confidential computing usage where we use two memfds to
> > back guest memory and at a single point only one memfd is alive and we
> > want to prevent memory allocation for the other memfd which may have
> > been mmap-ed previously. More discussion can be found at:
> > 
> >   https://lkml.org/lkml/2022/6/14/1255
> > 
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/uapi/linux/fcntl.h |  1 +
> >  mm/memfd.c                 |  3 ++-
> >  mm/shmem.c                 | 16 ++++++++++++++--
> >  3 files changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index 2f86b2ad6d7e..98bdabc8e309 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -43,6 +43,7 @@
> >  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> >  #define F_SEAL_WRITE	0x0008	/* prevent writes */
> >  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> > +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> 
> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> shared zeropage, so you'll simply allocate a new page via read() or on
> read faults.

Right, it also prevents read faults.

> 
> 
> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> if that would count "auto alloc", but it would certainly bypass fallocate().

Userfaultfd sounds interesting; I will investigate it further. But from a
rough look it seems it only faults to userspace for read/write page faults,
not write()? It also seems to operate on VMAs, and userfaultfd_register()
takes mmap_lock, which is what we want to avoid for the frequent
register/unregister during private/shared memory conversion.
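
(For reference, registration is a per-range ioctl on the userfaultfd,
roughly the following, where uffd, addr and len are placeholders:

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

so doing that on every private/shared conversion looks expensive.)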

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-21 15:05       ` Sean Christopherson
@ 2022-07-25 13:46         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-07-25 13:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Thu, Jul 21, 2022 at 03:05:09PM +0000, Sean Christopherson wrote:
> On Thu, Jul 21, 2022, David Hildenbrand wrote:
> > On 21.07.22 11:44, David Hildenbrand wrote:
> > > On 06.07.22 10:20, Chao Peng wrote:
> > >> Normally, a write to unallocated space of a file or the hole of a sparse
> > >> file automatically causes space allocation, for memfd, this equals to
> > >> memory allocation. This new seal prevents such automatically allocating,
> > >> either this is from a direct write() or a write on the previously
> > >> mmap-ed area. The seal does not prevent fallocate() so an explicit
> > >> fallocate() can still cause allocating and can be used to reserve
> > >> memory.
> > >>
> > >> This is used to prevent unintentional allocation from userspace on a
> > >> stray or careless write and any intentional allocation should use an
> > >> explicit fallocate(). One of the main usecases is to avoid memory double
> > >> allocation for confidential computing usage where we use two memfds to
> > >> back guest memory and at a single point only one memfd is alive and we
> > >> want to prevent memory allocation for the other memfd which may have
> > >> been mmap-ed previously. More discussion can be found at:
> > >>
> > >>   https://lkml.org/lkml/2022/6/14/1255
> > >>
> > >> Suggested-by: Sean Christopherson <seanjc@google.com>
> > >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > >> ---
> > >>  include/uapi/linux/fcntl.h |  1 +
> > >>  mm/memfd.c                 |  3 ++-
> > >>  mm/shmem.c                 | 16 ++++++++++++++--
> > >>  3 files changed, 17 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > >> index 2f86b2ad6d7e..98bdabc8e309 100644
> > >> --- a/include/uapi/linux/fcntl.h
> > >> +++ b/include/uapi/linux/fcntl.h
> > >> @@ -43,6 +43,7 @@
> > >>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> > >>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
> > >>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> > >> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> > > 
> > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> > > shared zeropage, so you'll simply allocate a new page via read() or on
> > > read faults.
> > 
> > Correction: on read() we don't allocate a fresh page. But on read faults
> > we would. So this comment here needs clarification.
> 
> Not just the comment, the code too.  The intent of F_SEAL_AUTO_ALLOCATE is very
> much to block _all_ implicit allocations (or maybe just fault-based allocations
> if "implicit" is too broad of a description).

So maybe still go with your initial suggestion, F_SEAL_FAULT_ALLOCATIONS?
One reason I don't like it is that the write() syscall also causes
allocation, and we want to prevent that as well.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-21 10:27     ` Gupta, Pankaj
@ 2022-07-25 13:54       ` Chao Peng
  2022-07-25 14:49         ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-07-25 13:54 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: David Hildenbrand, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On Thu, Jul 21, 2022 at 12:27:03PM +0200, Gupta, Pankaj wrote:
> 
> > > Normally, a write to unallocated space of a file or the hole of a sparse
> > > file automatically causes space allocation, for memfd, this equals to
> > > memory allocation. This new seal prevents such automatically allocating,
> > > either this is from a direct write() or a write on the previously
> > > mmap-ed area. The seal does not prevent fallocate() so an explicit
> > > fallocate() can still cause allocating and can be used to reserve
> > > memory.
> > > 
> > > This is used to prevent unintentional allocation from userspace on a
> > > stray or careless write and any intentional allocation should use an
> > > explicit fallocate(). One of the main usecases is to avoid memory double
> > > allocation for confidential computing usage where we use two memfds to
> > > back guest memory and at a single point only one memfd is alive and we
> > > want to prevent memory allocation for the other memfd which may have
> > > been mmap-ed previously. More discussion can be found at:
> > > 
> > >    https://lkml.org/lkml/2022/6/14/1255
> > > 
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >   include/uapi/linux/fcntl.h |  1 +
> > >   mm/memfd.c                 |  3 ++-
> > >   mm/shmem.c                 | 16 ++++++++++++++--
> > >   3 files changed, 17 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > > index 2f86b2ad6d7e..98bdabc8e309 100644
> > > --- a/include/uapi/linux/fcntl.h
> > > +++ b/include/uapi/linux/fcntl.h
> > > @@ -43,6 +43,7 @@
> > >   #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> > >   #define F_SEAL_WRITE	0x0008	/* prevent writes */
> > >   #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> > > +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> > 
> > Why only "on writes" and not "on reads"? IIRC, shmem doesn't support the
> > shared zeropage, so you'll simply allocate a new page via read() or on
> > read faults.
> > 
> > 
> > Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> > if that would count as "auto alloc", but it would certainly bypass fallocate().
> 
> I was also thinking this at the same time, but for a different reason:
> 
> "Want to populate private preboot memory with firmware payload", so was
> thinking userfaulftd could be an option as direct writes are restricted?

If that can be a side effect, I'd definitely be glad to see it, though I'm
still not clear how userfaultfd can be particularly helpful for that.

Chao
> 
> Thanks,
> Pankaj
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-25 13:54       ` Chao Peng
@ 2022-07-25 14:49         ` Gupta, Pankaj
  0 siblings, 0 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-07-25 14:49 UTC (permalink / raw)
  To: Chao Peng
  Cc: David Hildenbrand, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song


>>>> Normally, a write to unallocated space of a file or to the hole of a
>>>> sparse file automatically causes space allocation; for a memfd, this
>>>> means memory allocation. This new seal prevents such automatic
>>>> allocation, whether it comes from a direct write() or from a write to
>>>> a previously mmap-ed area. The seal does not prevent fallocate(), so
>>>> an explicit fallocate() can still cause allocation and can be used to
>>>> reserve memory.
>>>>
>>>> This is used to prevent unintentional allocation from userspace on a
>>>> stray or careless write; any intentional allocation should use an
>>>> explicit fallocate(). One of the main use cases is to avoid double
>>>> memory allocation for confidential computing, where we use two memfds
>>>> to back guest memory and at any single point in time only one memfd is
>>>> live, and we want to prevent memory allocation for the other memfd,
>>>> which may have been mmap-ed previously. More discussion can be found
>>>> at:
>>>>
>>>>     https://lkml.org/lkml/2022/6/14/1255
>>>>
>>>> Suggested-by: Sean Christopherson <seanjc@google.com>
>>>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>>>> ---
>>>>    include/uapi/linux/fcntl.h |  1 +
>>>>    mm/memfd.c                 |  3 ++-
>>>>    mm/shmem.c                 | 16 ++++++++++++++--
>>>>    3 files changed, 17 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>>>> index 2f86b2ad6d7e..98bdabc8e309 100644
>>>> --- a/include/uapi/linux/fcntl.h
>>>> +++ b/include/uapi/linux/fcntl.h
>>>> @@ -43,6 +43,7 @@
>>>>    #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>>>>    #define F_SEAL_WRITE	0x0008	/* prevent writes */
>>>>    #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
>>>> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
>>>
>>> Why only "on writes" and not "on reads"? IIRC, shmem doesn't support the
>>> shared zeropage, so you'll simply allocate a new page via read() or on
>>> read faults.
>>>
>>>
>>> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
>>> if that would count as "auto alloc", but it would certainly bypass fallocate().
>>
>> I was also thinking this at the same time, but for a different reason:
>>
>> "Want to populate private preboot memory with firmware payload", so was
>> thinking userfaulftd could be an option as direct writes are restricted?
> 
> If that can be a side effect, I'd definitely be glad to see it, though I'm
> still not clear how userfaultfd can be particularly helpful for that.

I was thinking we could use userfaultfd to monitor page faults on the
virtual firmware memory range and use that to populate the private memory.

Not sure if it is a side effect. I was just thinking out loud (for now
I've set the idea aside, as these enhancements can be worked on later).
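
Roughly, what I had in mind is the sketch below (purely illustrative and
mirroring the userfaultfd(2) man page flow: register the mmap-ed firmware
range for missing faults, then resolve the first fault with UFFDIO_COPY
from a staging buffer; fw_dst/fw_src/len are placeholders, and whether a
userfaultfd-placed page counts as "auto allocation" under the new seal is
exactly the open question):

	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <poll.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* fw_dst: mmap-ed firmware range, fw_src: staged payload, len: page aligned */
	static void populate_firmware(void *fw_dst, void *fw_src, size_t len)
	{
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range = { .start = (unsigned long)fw_dst, .len = len },
			.mode  = UFFDIO_REGISTER_MODE_MISSING,
		};
		struct pollfd pfd = { .fd = uffd, .events = POLLIN };
		struct uffd_msg msg;

		ioctl(uffd, UFFDIO_API, &api);
		ioctl(uffd, UFFDIO_REGISTER, &reg);

		/* Wait for the first access to the firmware range... */
		poll(&pfd, 1, -1);
		read(uffd, &msg, sizeof(msg));

		/* ...and atomically place the whole payload, waking the faulter. */
		if (msg.event == UFFD_EVENT_PAGEFAULT) {
			struct uffdio_copy copy = {
				.dst = (unsigned long)fw_dst,
				.src = (unsigned long)fw_src,
				.len = len,
			};
			ioctl(uffd, UFFDIO_COPY, &copy);
		}
	}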

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2022-07-06  8:20 ` [PATCH v7 08/14] KVM: Rename mmu_notifier_* Chao Peng
@ 2022-07-29 19:02   ` Sean Christopherson
  2022-08-03 10:13     ` Chao Peng
  2022-08-05 19:54     ` Paolo Bonzini
  2023-05-23  7:19   ` Kautuk Consul
  1 sibling, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2022-07-29 19:02 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 06, 2022, Chao Peng wrote:
> The sync mechanism between the mmu_notifier and the page fault handler
> employs the fields mmu_notifier_seq/count and mmu_notifier_range_start/end.
> The to-be-added private memory needs the same mechanism, but it does not
> rely on mmu_notifier (it uses the newly introduced memfile_notifier). This
> patch renames the existing fields and related helper functions to the
> neutral name mmu_updating_* so private memory can reuse them.

mmu_updating_* is too broad of a term, e.g. page faults and many other operations
also update the mmu.  Although the name most definitely came from the mmu_notifier,
it's not completely inaccurate for other sources, e.g. KVM's MMU is still being
notified of something, even if the source is not the actual mmu_notifier.

If we really want a different name, I'd vote for nomenclature that captures the
invalidation aspect, which is really what the variables are all tracking, e.g.

  mmu_invalidate_seq
  mmu_invalidate_in_progress
  mmu_invalidate_range_start
  mmu_invalidate_range_end
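
For context, a rough sketch of how the page fault path uses these fields
(simplified pseudocode with the names suggested above; resolve_pfn() and
install_spte() stand in for the real fault-handling steps):

	static int fault_slow_path(struct kvm *kvm, gfn_t gfn)
	{
		unsigned long mmu_seq;
		kvm_pfn_t pfn;

	retry:
		/* Snapshot the sequence count before the sleepable part. */
		mmu_seq = kvm->mmu_invalidate_seq;
		smp_rmb();

		pfn = resolve_pfn(gfn);		/* may sleep, mmu_lock not held */

		write_lock(&kvm->mmu_lock);
		/*
		 * If an invalidation is in flight, or one completed after the
		 * snapshot, the pfn may be stale: drop it and retry the fault.
		 */
		if (kvm->mmu_invalidate_in_progress ||
		    kvm->mmu_invalidate_seq != mmu_seq) {
			write_unlock(&kvm->mmu_lock);
			goto retry;
		}
		install_spte(gfn, pfn);		/* no invalidation raced */
		write_unlock(&kvm->mmu_lock);
		return 0;
	}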


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory
  2022-07-06  8:20 ` [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-07-29 19:51   ` Sean Christopherson
  2022-08-03 10:08     ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-07-29 19:51 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 06, 2022, Chao Peng wrote:
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>  
> +  struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +};
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_PRIVATE		(1UL << 2)

Very belatedly following up on prior feedback...

  | I think a flag is still needed, the problem is private_fd can be safely
  | accessed only when this flag is set, e.g. without this flag, we can't
  | copy_from_user these new fields since they don't exist for previous
  | kvm_userspace_memory_region callers.

I forgot about that aspect of things.  We don't technically need a dedicated
PRIVATE flag to handle that, but it does seem to be the least awful solution.
We could either add a generic KVM_MEM_EXTENDED_REGION or an entirely new
ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both approaches there's a decent
chance that we'll end up needing individual "this field is valid" flags anyway.

E.g. if KVM requires pad1 and pad2 to be zero to carve out future extensions,
then we're right back here if some future extension needs to treat '0' as a legal
input.

TL;DR: adding KVM_MEM_PRIVATE still seems like the best approach.
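
For illustration, with KVM_MEM_PRIVATE the userspace side would look
roughly like the snippet below (vm_fd, shared_hva, private_fd and
mem_size are placeholders; the struct layout is the one proposed in this
patch, and err() from <err.h> is just shorthand for error handling):

	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0x100000000ULL,
			.memory_size     = mem_size,
			/* The hva still backs the shared view of this range. */
			.userspace_addr  = (__u64)shared_hva,
		},
		/* fd from memfd_create(..., MFD_INACCESSIBLE) */
		.private_fd     = private_fd,
		.private_offset = 0,
	};

	/*
	 * KVM_MEM_PRIVATE tells KVM that the extended fields are present and
	 * valid; old userspace passing only the base struct is unaffected.
	 */
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext))
		err(1, "KVM_SET_USER_MEMORY_REGION");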

> @@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  	case KVM_SET_USER_MEMORY_REGION: {
> -		struct kvm_userspace_memory_region kvm_userspace_mem;
> +		struct kvm_user_mem_region mem;
> +		unsigned long size;
> +		u32 flags;
> +
> +		kvm_sanity_check_user_mem_region_alias();
> +
> +		memset(&mem, 0, sizeof(mem));
>  
>  		r = -EFAULT;
> -		if (copy_from_user(&kvm_userspace_mem, argp,
> -						sizeof(kvm_userspace_mem)))
> +
> +		if (get_user(flags,
> +			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
> +			goto out;


Indentation is funky.  It's hard to massage this into something short and
readable.  What about capturing the offset separately?  E.g.

                struct kvm_user_mem_region mem;
                unsigned int flags_offset = offsetof(typeof(mem), flags);
                unsigned long size;
                u32 flags;

                kvm_sanity_check_user_mem_region_alias();

		memset(&mem, 0, sizeof(mem));

                r = -EFAULT;
                if (get_user(flags, (u32 __user *)(argp + flags_offset)))
                        goto out;

But this can actually be punted until KVM_MEM_PRIVATE is fully supported.  As of
this patch, KVM doesn't read the extended size, so I believe the diff for this
patch can simply be:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index da263c370d00..5194beb7b52f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4640,6 +4640,10 @@ static long kvm_vm_ioctl(struct file *filp,
                                                sizeof(kvm_userspace_mem)))
                        goto out;

+               r = -EINVAL;
+               if (mem.flags & KVM_MEM_PRIVATE)
+                       goto out;
+
                r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
                break;
        }


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-25 13:04                     ` Chao Peng
@ 2022-07-29 19:54                       ` Sean Christopherson
  2022-08-02  0:49                         ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-07-29 19:54 UTC (permalink / raw)
  To: Chao Peng
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Mon, Jul 25, 2022, Chao Peng wrote:
> On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
> > On Thu, Jul 21, 2022, Chao Peng wrote:
> > > On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > > > 
> > > > 
> > > > On 7/21/22 00:21, Sean Christopherson wrote:
> > > > Maybe you could tag it with cgs for all the confidential guest support
> > > > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > > > 
> > > > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > > ...
> > > > kvm_vm_ioctl_set_cgs_mem(, is_private)
> > > 
> > > If we plan to use such an abbreviation widely throughout KVM (i.e. it
> > > becomes well known), I'm fine.
> > 
> > I'd prefer to stay away from "confidential guest", and away from any VM-scoped
> > name for that matter.  User-unmappable memory has use cases beyond hiding guest
> > state from the host, e.g. userspace could use inaccessible/unmappable memory to
> > harden itself against unintentional access to guest memory.
> > 
> > > I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> > > But I also don't quite like it, it's so generic and says nothing.
> > > 
> > > But I do want a name that can cover future usages other than just
> > > private/shared (pKVM for example may have a third state).
> > 
> > I don't think there can be a third top-level state.  Memory is either private to
> > the guest or it's not.  There can be sub-states, e.g. memory could be selectively
> > shared or encrypted with a different key, in which case we'd need metadata to
> > track that state.
> > 
> > Though that begs the question of whether or not private_fd is the correct
> > terminology.  E.g. if guest memory is backed by a memfd that can't be mapped by
> > userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs
> > that memory into a device or another VM, then arguably that memory is shared,
> > especially the multi-VM scenario.
> > 
> > For TDX and SNP "private vs. shared" is likely the correct terminology given the
> > current specs, but for generic KVM it's probably better to align with whatever
> > terminology is used for memfd.  "inaccessible_fd" and "user_inaccessible_fd" are
> > a bit odd since the fd itself is accessible.
> > 
> > What about "user_unmappable"?  E.g.
> > 
> >   F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY,
> >   MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...
> 
> For KVM I also think user_unmappable looks better than 'private', e.g.
> user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sound like more
> appropriate names. For memfd, however, I don't feel that strongly about
> changing it from the current 'inaccessible' to 'user_unmappable'; one of
> the reasons is that it's not just about being unmappable, the memory is
> actually also inaccessible through direct syscalls like read()/write().

Heh, I _knew_ there had to be a catch.  I agree that INACCESSIBLE is better for
memfd.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 12/14] KVM: Handle page fault for private memory
  2022-07-06  8:20 ` [PATCH v7 12/14] KVM: Handle page fault for private memory Chao Peng
@ 2022-07-29 20:58   ` Sean Christopherson
  2022-08-03  9:52     ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-07-29 20:58 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Jul 06, 2022, Chao Peng wrote:
> A page fault can carry the private/shared information for a
> KVM_MEM_PRIVATE memslot; this can be filled in by architecture code (like
> TDX code). To handle a page fault for such an access, KVM maps the page
> only when the private property matches the host's view of the page.
> 
> For a successful match, the private pfn is obtained with memfile_notifier
> callbacks from the private fd and the shared pfn is obtained with the
> existing get_user_pages().
> 
> For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace can then convert the memory between private/shared
> from the host's view and retry the access.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 60 ++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++
>  arch/x86/kvm/mmu/mmutrace.h     |  1 +
>  include/linux/kvm_host.h        | 35 ++++++++++++++++++-
>  4 files changed, 112 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 545eb74305fe..27dbdd4fe8d1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  	if (max_level == PG_LEVEL_4K)
>  		return PG_LEVEL_4K;
>  
> +	if (kvm_mem_is_private(kvm, gfn))
> +		return max_level;
> +
>  	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
>  	return min(host_level, max_level);
>  }
> @@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
>  }
>  
> +static inline u8 order_to_level(int order)
> +{
> +	enum pg_level level;
> +
> +	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)

Curly braces needed for the for-loop.

And I think it makes sense to take in the fault->max_level, that way this is
slightly more performant when the guest mapping is smaller than the host, e.g.

	for (level = max_level; level > PG_LEVEL_4K; level--)
		...

	return level;

Though I think I'd vote to avoid a loop entirely and do:

	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);

	if (order > ???)
		return PG_LEVEL_1G;
	
	if (order > ???)
		return PG_LEVEL_2M;

	return PG_LEVEL_4K;


> +		if (order >= page_level_shift(level) - PAGE_SHIFT)
> +			return level;
> +	return level;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +				   struct kvm_page_fault *fault)
> +{
> +	int order;
> +	struct kvm_memory_slot *slot = fault->slot;
> +	bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn);
> +
> +	if (fault->is_private != private_exist) {
> +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +		if (fault->is_private)
> +			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +		else
> +			vcpu->run->memory.flags = 0;
> +		vcpu->run->memory.padding = 0;
> +		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> +		vcpu->run->memory.size = PAGE_SIZE;
> +		return RET_PF_USER;
> +	}
> +
> +	if (fault->is_private) {
> +		if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> +			return RET_PF_RETRY;
> +		fault->max_level = min(order_to_level(order), fault->max_level);
> +		fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> +		return RET_PF_FIXED;
> +	}
> +
> +	/* Fault is shared, fallthrough. */
> +	return RET_PF_CONTINUE;
> +}
> +
>  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
>  	bool async;
> +	int r;
>  
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> @@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			return RET_PF_EMULATE;
>  	}
>  
> +	if (kvm_slot_can_be_private(slot)) {
> +		r = kvm_faultin_pfn_private(vcpu, fault);
> +		if (r != RET_PF_CONTINUE)
> +			return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;

I apologize if I've given you conflicting feedback in the past.  Now that this
returns RET_PF_* directly, I definitely think it makes sense to do:

	if (kvm_slot_can_be_private(slot) &&
	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
		if (fault->is_private)
			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
		else
			vcpu->run->memory.flags = 0;
		vcpu->run->memory.padding = 0;
		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
		vcpu->run->memory.size = PAGE_SIZE;
		return RET_PF_USER;
	}

	if (fault->is_private)
		return kvm_faultin_pfn_private(vcpu, fault);

That way kvm_faultin_pfn_private() only handles private faults, and this doesn't
need to play games with RET_PF_FIXED.
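
Purely for context (not part of this patch), userspace's side of the
KVM_EXIT_MEMORY_FAULT exit would then look roughly like the sketch below,
assuming the series' reuse of the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION
ioctls for private/shared conversion as discussed in the patch 11 thread
(vm_fd and the err() error handling are placeholders; the exit reason,
flag and run->memory layout are the ones proposed in this series):

	/*
	 * Called from the vcpu run loop when run->exit_reason is
	 * KVM_EXIT_MEMORY_FAULT; the caller simply re-enters KVM_RUN
	 * afterwards and the faulting access is retried.
	 */
	static void handle_memory_fault(int vm_fd, struct kvm_run *run)
	{
		struct kvm_enc_region range = {
			.addr = run->memory.gpa,
			.size = run->memory.size,
		};
		int to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

		if (ioctl(vm_fd, to_private ? KVM_MEMORY_ENCRYPT_REG_REGION :
					      KVM_MEMORY_ENCRYPT_UNREG_REGION, &range))
			err(1, "private/shared conversion");
	}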


> +	}
> +
>  	async = false;
>  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
>  					  fault->write, &fault->map_writable,
> @@ -4241,7 +4292,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  	else
>  		write_unlock(&vcpu->kvm->mmu_lock);
> -	kvm_release_pfn_clean(fault->pfn);
> +
> +	if (fault->is_private)
> +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> +	else
> +		kvm_release_pfn_clean(fault->pfn);

AFAIK, we never bottomed out on whether or not this is needed[*].  Can you follow
up with Kirill to get an answer before posting v8?

[*] https://lore.kernel.org/all/20220620141647.GC2016793@chaop.bj.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag
  2022-07-06  8:20 ` [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
@ 2022-08-01 14:40   ` Dave Hansen
  2022-08-03  9:53     ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Dave Hansen @ 2022-08-01 14:40 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

This patch does not belong in this series.  It's not a patch to the
kernel.  This is a kernel series.

It would be much more appropriate to put a link to a separately posted
manpage patch in the cover letter.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-29 19:54                       ` Sean Christopherson
@ 2022-08-02  0:49                         ` Sean Christopherson
  2022-08-02 16:38                           ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-08-02  0:49 UTC (permalink / raw)
  To: Chao Peng
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Fri, Jul 29, 2022, Sean Christopherson wrote:
> On Mon, Jul 25, 2022, Chao Peng wrote:
> > On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
> > > On Thu, Jul 21, 2022, Chao Peng wrote:
> > > > On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > > > > 
> > > > > 
> > > > > On 7/21/22 00:21, Sean Christopherson wrote:
> > > > > Maybe you could tag it with cgs for all the confidential guest support
> > > > > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > > > > 
> > > > > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > > > ...
> > > > > kvm_vm_ioctl_set_cgs_mem(, is_private)
> > > > 
> > > > If we plan to use such an abbreviation widely throughout KVM (i.e. it
> > > > becomes well known), I'm fine.
> > > 
> > > I'd prefer to stay away from "confidential guest", and away from any VM-scoped
> > > name for that matter.  User-unmappable memory has use cases beyond hiding guest
> > > state from the host, e.g. userspace could use inaccessible/unmappable memory to
> > > harden itself against unintentional access to guest memory.
> > > 
> > > > I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> > > > But I also don't quite like it, it's so generic and says nothing.
> > > > 
> > > > But I do want a name that can cover future usages other than just
> > > > private/shared (pKVM for example may have a third state).
> > > 
> > > I don't think there can be a third top-level state.  Memory is either private to
> > > the guest or it's not.  There can be sub-states, e.g. memory could be selectively
> > > shared or encrypted with a different key, in which case we'd need metadata to
> > > track that state.
> > > 
> > > Though that begs the question of whether or not private_fd is the correct
> > > terminology.  E.g. if guest memory is backed by a memfd that can't be mapped by
> > > userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs
> > > that memory into a device or another VM, then arguably that memory is shared,
> > > especially the multi-VM scenario.
> > > 
> > > For TDX and SNP "private vs. shared" is likely the correct terminology given the
> > > current specs, but for generic KVM it's probably better to align with whatever
> > > terminology is used for memfd.  "inaccessible_fd" and "user_inaccessible_fd" are
> > > a bit odd since the fd itself is accessible.
> > > 
> > > What about "user_unmappable"?  E.g.
> > > 
> > >   F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY,
> > >   MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...
> > 
> > For KVM I also think user_unmappable looks better than 'private', e.g.
> > user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sound like more
> > appropriate names. For memfd, however, I don't feel that strongly about
> > changing it from the current 'inaccessible' to 'user_unmappable'; one of
> > the reasons is that it's not just about being unmappable, the memory is
> > actually also inaccessible through direct syscalls like read()/write().
> 
> Heh, I _knew_ there had to be a catch.  I agree that INACCESSIBLE is better for
> memfd.

Thought about this some more...

I think we should avoid UNMAPPABLE even on the KVM side of things for the core
memslots functionality and instead be very literal, e.g.

	KVM_HAS_FD_BASED_MEMSLOTS
	KVM_MEM_FD_VALID

We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to
the memslot.  Decoupling the two things will require a bit of extra work, but the
code impact should be quite small, e.g. explicitly query and propagate
MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private.
And unless I'm missing something, it won't require an additional memslot flag.
The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would
effectively ignore the hva for fd-based memslots for VM types that don't support
private memory, i.e. userspace can't opt out of using the fd-based backing, but that
doesn't seem like a deal breaker.

Decoupling private memory from fd-based memslots will allow using fd-based memslots
for backing VMs even if the memory is user mappable, which opens up potentially
interesting use cases.  It would also allow testing some parts of fd-based memslots
with existing VMs.

The big advantage of KVM's hva-based memslots is that KVM doesn't care what's backing
a memslot, and so (in theory) enabling new backing stores for KVM is free.  It's not
always free, but at this point I think we've eliminated most of the hiccups, e.g. x86's
MMU should no longer require additional enlightenment to support huge pages for new
backing types.

On the flip-side, a big disadvantage of hva-based memslots is that KVM doesn't
_know_ what's backing a memslot.  This is one of the major reasons, if not _the_
main reason at this point, why KVM binds a VM to a single virtual address space.
Running with different hva=>pfn mappings would either be completely unsafe or
prohibitively expensive (nearly impossible?) to ensure.

With fd-based memslots, KVM essentially binds a memslot directly to the backing
store.  This allows KVM to do a "deep" comparison of a memslot between two address
spaces simply by checking that the backing store is the same.  For intra-host/copyless
migration (to upgrade the userspace VMM), being able to do a deep comparison would
theoretically allow transferring KVM's page tables between VMs instead of forcing
the target VM to rebuild the page tables.  There are memcg complications (and probably
many others) for transferring page tables, but I'm pretty sure it could work.

I don't have a concrete use case (this is a recent idea on my end), but since we're
already adding fd-based memory, I can't think of a good reason not to make it more generic
for not much extra cost.  And there are definitely classes of VMs for which fd-based
memory would Just Work, e.g. large VMs that are never oversubscribed on memory don't
need to support reclaim, so the fact that fd-based memslots won't support page aging
(among other things) right away is a non-issue.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-08-02  0:49                         ` Sean Christopherson
@ 2022-08-02 16:38                           ` Sean Christopherson
  2022-08-03  9:48                             ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-08-02 16:38 UTC (permalink / raw)
  To: Chao Peng
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Tue, Aug 02, 2022, Sean Christopherson wrote:
> I think we should avoid UNMAPPABLE even on the KVM side of things for the core
> memslots functionality and instead be very literal, e.g.
> 
> 	KVM_HAS_FD_BASED_MEMSLOTS
> 	KVM_MEM_FD_VALID
> 
> We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to
> the memslot.  Decoupling the two things will require a bit of extra work, but the
> code impact should be quite small, e.g. explicitly query and propagate
> MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private.
> And unless I'm missing something, it won't require an additional memslot flag.
> The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would
> effectively ignore the hva for fd-based memslots for VM types that don't support
> private memory, i.e. userspace can't opt out of using the fd-based backing, but that
> doesn't seem like a deal breaker.

Hrm, but basing private memory on top of a generic FD_VALID would effectively require
shared memory to use hva-based memslots for confidential VMs.  That'd yield a very
weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots,
but confidential VMs would be forced to use hva-based memslots.

Ignore this idea for now.  If there's an actual use case for generic fd-based memory
then we'll want a separate flag, fd, and offset, i.e. that support could be added
independent of KVM_MEM_PRIVATE.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-08-02 16:38                           ` Sean Christopherson
@ 2022-08-03  9:48                             ` Chao Peng
  2022-08-03 15:51                               ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-08-03  9:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Tue, Aug 02, 2022 at 04:38:55PM +0000, Sean Christopherson wrote:
> On Tue, Aug 02, 2022, Sean Christopherson wrote:
> > I think we should avoid UNMAPPABLE even on the KVM side of things for the core
> > memslots functionality and instead be very literal, e.g.
> > 
> > 	KVM_HAS_FD_BASED_MEMSLOTS
> > 	KVM_MEM_FD_VALID
> > 
> > We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to
> > the memslot.  Decoupling the two things will require a bit of extra work, but the
> > code impact should be quite small, e.g. explicitly query and propagate
> > MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private.
> > And unless I'm missing something, it won't require an additional memslot flag.
> > The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would
> > effectively ignore the hva for fd-based memslots for VM types that don't support
> > private memory, i.e. userspace can't opt out of using the fd-based backing, but that
> > doesn't seem like a deal breaker.

I actually love this idea. I don't mind adding extra code for potential
usage other than confidential VMs if we can have a workable solution for
it.

> 
> Hrm, but basing private memory on top of a generic FD_VALID would effectively require
> shared memory to use hva-based memslots for confidential VMs.  That'd yield a very
> weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots,
> but confidential VMs would be forced to use hva-based memslots.

It would work if we can treat userspace_addr as optional for
KVM_MEM_FD_VALID, e.g. userspace can opt in to decide whether it needs
the mappable part or not for a regular VM, and KVM can enforce fd-only
for confidential VMs. But the u64 type of userspace_addr doesn't allow
us to express a 'null' value, so it sounds like we will end up needing
another flag anyway.

In concept, we could have three configurations here:
  1. hva-only: without any flag and using userspace_addr;
  2. fd-only:  another new flag is needed and fd/offset is used;
  3. hva/fd mixed: both userspace_addr and fd/offset are effective.
     KVM_MEM_PRIVATE is a subset of this for confidential VMs. Not sure
     whether a regular VM also wants this.

There is no direct relationship between unmappable and fd-based, since
even fd-based memory can also be mappable for a regular VM, right?

> 
> Ignore this idea for now.  If there's an actual use case for generic fd-based memory
> then we'll want a separate flag, fd, and offset, i.e. that support could be added
> independent of KVM_MEM_PRIVATE.

If we ignore this idea for now (which I'm also fine with), do you still
think we need to change KVM_MEM_PRIVATE to KVM_MEM_USER_UNMAPPABLE?

Thanks,
Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 12/14] KVM: Handle page fault for private memory
  2022-07-29 20:58   ` Sean Christopherson
@ 2022-08-03  9:52     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-03  9:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Fri, Jul 29, 2022 at 08:58:41PM +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Chao Peng wrote:
> > A page fault can carry the private/shared information for a
> > KVM_MEM_PRIVATE memslot; this can be filled in by architecture code (like
> > TDX code). To handle a page fault for such an access, KVM maps the page
> > only when the private property matches the host's view of the page.
> > 
> > For a successful match, the private pfn is obtained with memfile_notifier
> > callbacks from the private fd and the shared pfn is obtained with the
> > existing get_user_pages().
> > 
> > For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace can then convert the memory between private/shared
> > from the host's view and retry the access.
> > 
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 60 ++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++
> >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> >  include/linux/kvm_host.h        | 35 ++++++++++++++++++-
> >  4 files changed, 112 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 545eb74305fe..27dbdd4fe8d1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  	if (max_level == PG_LEVEL_4K)
> >  		return PG_LEVEL_4K;
> >  
> > +	if (kvm_mem_is_private(kvm, gfn))
> > +		return max_level;
> > +
> >  	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> >  	return min(host_level, max_level);
> >  }
> > @@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >  
> > +static inline u8 order_to_level(int order)
> > +{
> > +	enum pg_level level;
> > +
> > +	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)
> 
> Curly braces needed for the for-loop.
> 
> And I think it makes sense to take in the fault->max_level, that way this is
> slightly more performant when the guest mapping is smaller than the host, e.g.
> 
> 	for (level = max_level; level > PG_LEVEL_4K; level--)
> 		...
> 
> 	return level;
> 
> Though I think I'd vote to avoid a loop entirely and do:
> 
> 	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> 
> 	if (order > ???)
> 		return PG_LEVEL_1G;
> 	
> 	if (order > ???)
> 		return PG_LEVEL_2M;
> 
> 	return PG_LEVEL_4K;

Sounds good.

> 
> 
> > +		if (order >= page_level_shift(level) - PAGE_SHIFT)
> > +			return level;
> > +	return level;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +				   struct kvm_page_fault *fault)
> > +{
> > +	int order;
> > +	struct kvm_memory_slot *slot = fault->slot;
> > +	bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn);
> > +
> > +	if (fault->is_private != private_exist) {
> > +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +		if (fault->is_private)
> > +			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > +		else
> > +			vcpu->run->memory.flags = 0;
> > +		vcpu->run->memory.padding = 0;
> > +		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > +		vcpu->run->memory.size = PAGE_SIZE;
> > +		return RET_PF_USER;
> > +	}
> > +
> > +	if (fault->is_private) {
> > +		if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > +			return RET_PF_RETRY;
> > +		fault->max_level = min(order_to_level(order), fault->max_level);
> > +		fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > +		return RET_PF_FIXED;
> > +	}
> > +
> > +	/* Fault is shared, fallthrough. */
> > +	return RET_PF_CONTINUE;
> > +}
> > +
> >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >  	struct kvm_memory_slot *slot = fault->slot;
> >  	bool async;
> > +	int r;
> >  
> >  	/*
> >  	 * Retry the page fault if the gfn hit a memslot that is being deleted
> > @@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  			return RET_PF_EMULATE;
> >  	}
> >  
> > +	if (kvm_slot_can_be_private(slot)) {
> > +		r = kvm_faultin_pfn_private(vcpu, fault);
> > +		if (r != RET_PF_CONTINUE)
> > +			return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
> 
> I apologize if I've given you conflicting feedback in the past.  Now that this
> returns RET_PF_* directly, I definitely think it makes sense to do:
> 
> 	if (kvm_slot_can_be_private(slot) &&
> 	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> 		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> 		if (fault->is_private)
> 			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> 		else
> 			vcpu->run->memory.flags = 0;
> 		vcpu->run->memory.padding = 0;
> 		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> 		vcpu->run->memory.size = PAGE_SIZE;
> 		return RET_PF_USER;
> 	}
> 
> 	if (fault->is_private)
> 		return kvm_faultin_pfn_private(vcpu, fault);
> 
> That way kvm_faultin_pfn_private() only handles private faults, and this doesn't
> need to play games with RET_PF_FIXED.

Agreed, this looks much simpler.

> 
> 
> > +	}
> > +
> >  	async = false;
> >  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> >  					  fault->write, &fault->map_writable,
> > @@ -4241,7 +4292,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  		read_unlock(&vcpu->kvm->mmu_lock);
> >  	else
> >  		write_unlock(&vcpu->kvm->mmu_lock);
> > -	kvm_release_pfn_clean(fault->pfn);
> > +
> > +	if (fault->is_private)
> > +		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> > +	else
> > +		kvm_release_pfn_clean(fault->pfn);
> 
> AFAIK, we never bottomed out on whether or not this is needed[*].  Can you follow
> up with Kirill to get an answer before posting v8?

Sure.

Chao
> 
> [*] https://lore.kernel.org/all/20220620141647.GC2016793@chaop.bj.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag
  2022-08-01 14:40   ` Dave Hansen
@ 2022-08-03  9:53     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-03  9:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Mon, Aug 01, 2022 at 07:40:32AM -0700, Dave Hansen wrote:
> This patch does not belong in this series.  It's not a patch to the
> kernel.  This is a kernel series.

You are right.

> 
> It would be much more appropriate to put a link to a separately posted
> manpage patch in the cover letter.

Thanks for the suggestion.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory
  2022-07-29 19:51   ` Sean Christopherson
@ 2022-08-03 10:08     ` Chao Peng
  2022-08-03 14:42       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-08-03 10:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Fri, Jul 29, 2022 at 07:51:29PM +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Chao Peng wrote:
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +};
> > +
> >    /* for kvm_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_PRIVATE		(1UL << 2)
> 
> Very belatedly following up on prior feedback...
> 
>   | I think a flag is still needed, the problem is private_fd can be safely
>   | accessed only when this flag is set, e.g. without this flag, we can't
>   | copy_from_user these new fields since they don't exist for previous
>   | kvm_userspace_memory_region callers.
> 
> I forgot about that aspect of things.  We don't technically need a dedicated
> PRIVATE flag to handle that, but it does seem to be the least awful solution.
> We could either add a generic KVM_MEM_EXTENDED_REGION or an entirely new
> ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both approaches there's a decent
> chance that we'll end up needing individual "this field is valid" flags anyway.
> 
> E.g. if KVM requires pad1 and pad2 to be zero to carve out future extensions,
> then we're right back here if some future extension needs to treat '0' as a legal
> input.

I have followed such a practice (always rejecting non-zero 'pad' values
when introducing new user APIs) in other projects previously, but I
rarely see that in KVM.

> 
> TL;DR: adding KVM_MEM_PRIVATE still seems like the best approach.
> 
> > @@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		break;
> >  	}
> >  	case KVM_SET_USER_MEMORY_REGION: {
> > -		struct kvm_userspace_memory_region kvm_userspace_mem;
> > +		struct kvm_user_mem_region mem;
> > +		unsigned long size;
> > +		u32 flags;
> > +
> > +		kvm_sanity_check_user_mem_region_alias();
> > +
> > +		memset(&mem, 0, sizeof(mem));
> >  
> >  		r = -EFAULT;
> > -		if (copy_from_user(&kvm_userspace_mem, argp,
> > -						sizeof(kvm_userspace_mem)))
> > +
> > +		if (get_user(flags,
> > +			(u32 __user *)(argp + offsetof(typeof(mem), flags))))
> > +			goto out;
> 
> 
> Indentation is funky.  It's hard to massage this into something short and
> readable.  What about capturing the offset separately?  E.g.
> 
>                 struct kvm_user_mem_region mem;
>                 unsigned int flags_offset = offsetof(typeof(mem), flags);
>                 unsigned long size;
>                 u32 flags;
> 
>                 kvm_sanity_check_user_mem_region_alias();
> 
> 		memset(&mem, 0, sizeof(mem));
> 
>                 r = -EFAULT;
>                 if (get_user(flags, (u32 __user *)(argp + flags_offset)))
>                         goto out;
> 
> But this can actually be punted until KVM_MEM_PRIVATE is fully supported.  As of
> this patch, KVM doesn't read the extended size, so I believe the diff for this
> patch can simply be:

Looks good to me, Thanks.

Chao
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index da263c370d00..5194beb7b52f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4640,6 +4640,10 @@ static long kvm_vm_ioctl(struct file *filp,
>                                                 sizeof(kvm_userspace_mem)))
>                         goto out;
> 
> +               r = -EINVAL;
> +               if (mem.flags & KVM_MEM_PRIVATE)
> +                       goto out;
> +
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
>                 break;
>         }

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2022-07-29 19:02   ` Sean Christopherson
@ 2022-08-03 10:13     ` Chao Peng
  2022-08-05 19:54     ` Paolo Bonzini
  1 sibling, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-03 10:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Fri, Jul 29, 2022 at 07:02:12PM +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Chao Peng wrote:
> > The sync mechanism between the mmu_notifier and the page fault handler
> > employs the fields mmu_notifier_seq/count and mmu_notifier_range_start/end.
> > The to-be-added private memory needs the same mechanism, but it does not
> > rely on mmu_notifier (it uses the newly introduced memfile_notifier). This
> > patch renames the existing fields and related helper functions to the
> > neutral name mmu_updating_* so private memory can reuse them.
> 
> mmu_updating_* is too broad of a term, e.g. page faults and many other operations
> also update the mmu.  Although the name most definitely came from the mmu_notifier,
> it's not completely inaccurate for other sources, e.g. KVM's MMU is still being
> notified of something, even if the source is not the actual mmu_notifier.
> 
> If we really want a different name, I'd vote for nomenclature that captures the
> invalidation aspect, which is really what the variables are all tracking, e.g.
> 
>   mmu_invalidate_seq
>   mmu_invalidate_in_progress
>   mmu_invalidate_range_start
>   mmu_invalidate_range_end

Looks good to me. Thanks.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory
  2022-08-03 10:08     ` Chao Peng
@ 2022-08-03 14:42       ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2022-08-03 14:42 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, Aug 03, 2022, Chao Peng wrote:
> On Fri, Jul 29, 2022 at 07:51:29PM +0000, Sean Christopherson wrote:
> > On Wed, Jul 06, 2022, Chao Peng wrote:
> > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> > >    };
> > >  
> > > +  struct kvm_userspace_memory_region_ext {
> > > +	struct kvm_userspace_memory_region region;
> > > +	__u64 private_offset;
> > > +	__u32 private_fd;
> > > +	__u32 pad1;
> > > +	__u64 pad2[14];
> > > +};
> > > +
> > >    /* for kvm_memory_region::flags */
> > >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > >    #define KVM_MEM_READONLY	(1UL << 1)
> > > +  #define KVM_MEM_PRIVATE		(1UL << 2)
> > 
> > Very belatedly following up on prior feedback...
> > 
> >   | I think a flag is still needed, the problem is private_fd can be safely
> >   | accessed only when this flag is set, e.g. without this flag, we can't
> >   | copy_from_user these new fields since they don't exist for previous
> >   | kvm_userspace_memory_region callers.
> > 
> > I forgot about that aspect of things.  We don't technically need a dedicated
> > PRIVATE flag to handle that, but it does seem to be the least awful solution.
> > We could either add a generic KVM_MEM_EXTENDED_REGION or an entirely new
> > ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both approaches there's a decent
> > chance that we'll end up needing individual "this field is valid" flags anyway.
> > 
> > E.g. if KVM requires pad1 and pad2 to be zero to carve out future extensions,
> > then we're right back here if some future extension needs to treat '0' as a legal
> > input.
> 
> I have followed such a practice (always rejecting non-zero 'pad' values
> when introducing new user APIs) in other projects previously, but I
> rarely see that in KVM.

Ya, KVM often uses flags to indicate the validity of a field specifically so that
KVM doesn't misinterpret a '0' from an older userspace as an intended value.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-08-03  9:48                             ` Chao Peng
@ 2022-08-03 15:51                               ` Sean Christopherson
  2022-08-04  7:58                                 ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-08-03 15:51 UTC (permalink / raw)
  To: Chao Peng
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Wed, Aug 03, 2022, Chao Peng wrote:
> On Tue, Aug 02, 2022 at 04:38:55PM +0000, Sean Christopherson wrote:
> > On Tue, Aug 02, 2022, Sean Christopherson wrote:
> > > I think we should avoid UNMAPPABLE even on the KVM side of things for the core
> > > memslots functionality and instead be very literal, e.g.
> > > 
> > > 	KVM_HAS_FD_BASED_MEMSLOTS
> > > 	KVM_MEM_FD_VALID
> > > 
> > > We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to
> > > the memslot.  Decoupling the two things will require a bit of extra work, but the
> > > code impact should be quite small, e.g. explicitly query and propagate
> > > MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private.
> > > And unless I'm missing something, it won't require an additional memslot flag.
> > > The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would
> > > effectively ignore the hva for fd-based memslots for VM types that don't support
> > > private memory, i.e. userspace can't opt out of using the fd-based backing, but that
> > > doesn't seem like a deal breaker.
> 
> I actually love this idea. I don't mind adding extra code for potential
> usage other than confidential VMs if we can have a workable solution for
> it.
> 
> > 
> > Hrm, but basing private memory on top of a generic FD_VALID would effectively require
> > shared memory to use hva-based memslots for confidential VMs.  That'd yield a very
> > weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots,
> > but confidential VMs would be forced to use hva-based memslots.
> 
> It would work if we could treat userspace_addr as optional for
> KVM_MEM_FD_VALID, e.g. userspace could opt in to decide whether it needs
> the mappable part for a regular VM, and KVM could enforce fd-only for
> confidential VMs. But the u64 type of userspace_addr doesn't allow us to
> express a 'null' value, so it sounds like we will end up needing another
> flag anyway.
> 
> In concept, we could have three configurations here:
>   1. hva-only: without any flag and use userspace_addr;
>   2. fd-only:  another new flag is needed and use fd/offset;
>   3. hva/fd mixed: both userspace_addr and fd/offset are effective.
>      KVM_MEM_PRIVATE is a subset of it for confidential VMs. Not sure
>      a regular VM also wants this.

My mental model breaks things down slightly differently, though the end result is
more or less the same. 

After this series, there will be two types of memory: private and "regular" (I'm
trying to avoid "shared").  "Regular" memory is always hva-based (userspace_addr),
and private always fd-based (fd+offset).

In the future, if we want to support fd-based memory for "regular" memory, then
as you said we'd need to add a new flag, and a new fd+offset pair.

At that point, we'd have two new (relatively to current) flags:

  KVM_MEM_PRIVATE_FD_VALID
  KVM_MEM_FD_VALID

along with two new pairs of fd+offset (private_* and "regular").  Mapping those
to your above list:
  
  1.  Neither *_FD_VALID flag set.
  2a. Both PRIVATE_FD_VALID and FD_VALID are set
  2b. FD_VALID is set and the VM doesn't support private memory
  3.  Only PRIVATE_FD_VALID is set (which requires private memory support in the VM).

Thus, "regular" VMs can't have a mix in a single memslot because they can't use
private memory.

> There is no direct relationship between unmappable and fd-based, since
> even fd-based memory can also be mappable for a regular VM?

Yep.

> > Ignore this idea for now.  If there's an actual use case for generic fd-based memory
> > then we'll want a separate flag, fd, and offset, i.e. that support could be added
> > independent of KVM_MEM_PRIVATE.
> 
> If we ignore this idea now (which I'm also fine), do you still think we
> need to change KVM_MEM_PRIVATE to KVM_MEM_USER_UNMAPPABLE?

Hmm, no.  After working through this, I think it's safe to say KVM_MEM_USER_UNMAPPABLE
is a bad name because we could end up with "regular" memory that's backed by an
inaccessible (unmappable) file.

One alternative would be to call it KVM_MEM_PROTECTED.  That shouldn't cause
problems for the known use of "private" (TDX and SNP), and it gives us a little
wiggle room, e.g. if we ever get a use case where VMs can share memory that is
otherwise protected.

That's a pretty big "if" though, and odds are good we'd need more memslot flags and
fd+offset pairs to allow differentiating "private" vs. "protected-shared" without
forcing userspace to punch holes in memslots, so I don't know that hedging now will
buy us anything.

So I'd say that if people think KVM_MEM_PRIVATE brings additional and meaningful
clarity over KVM_MEM_PROTECTED, then let's go with PRIVATE.  But if PROTECTED is
just as good, go with PROTECTED as it gives us a wee bit of wiggle room for the
future.

Note, regardless of what name we settle on, I think it makes sense to do the
KVM_PRIVATE_MEM_SLOTS => KVM_INTERNAL_MEM_SLOTS rename.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-07-06  8:20 ` [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
  2022-07-15 11:36   ` Gupta, Pankaj
@ 2022-08-04  7:10   ` Isaku Yamahata
  2022-08-10  8:19     ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2022-08-04  7:10 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, isaku.yamahata

On Wed, Jul 06, 2022 at 04:20:09PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0bdb6044e316..e9153b54e2a4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>  
> -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
> -				   unsigned long end);
> -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
> -				   unsigned long end);
> +void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
>  
>  long kvm_arch_dev_ioctl(struct file *filp,
>  			unsigned int ioctl, unsigned long arg);

The corresponding changes in kvm_main.c are missing.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b2c79bef61bd..0184e327f6f5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -711,8 +711,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
        kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-                                  unsigned long end)
+void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end)
 {
        /*
         * The count increase must become visible at unlock time as no
@@ -786,8 +785,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
        return 0;
 }
 
-void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
-                                  unsigned long end)
+void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end)
 {
        /*
         * This sequence increase will notify the kvm page fault that


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-08-03 15:51                               ` Sean Christopherson
@ 2022-08-04  7:58                                 ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-04  7:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Wei Wang, Gupta, Pankaj, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Wed, Aug 03, 2022 at 03:51:24PM +0000, Sean Christopherson wrote:
> On Wed, Aug 03, 2022, Chao Peng wrote:
> > On Tue, Aug 02, 2022 at 04:38:55PM +0000, Sean Christopherson wrote:
> > > On Tue, Aug 02, 2022, Sean Christopherson wrote:
> > > > I think we should avoid UNMAPPABLE even on the KVM side of things for the core
> > > > memslots functionality and instead be very literal, e.g.
> > > > 
> > > > 	KVM_HAS_FD_BASED_MEMSLOTS
> > > > 	KVM_MEM_FD_VALID
> > > > 
> > > > We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to
> > > > the memslot.  Decoupling the two things will require a bit of extra work, but the
> > > > code impact should be quite small, e.g. explicitly query and propagate
> > > > MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private.
> > > > And unless I'm missing something, it won't require an additional memslot flag.
> > > > The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would
> > > > effectively ignore the hva for fd-based memslots for VM types that don't support
> > > > private memory, i.e. userspace can't opt out of using the fd-based backing, but that
> > > > doesn't seem like a deal breaker.
> > 
> > I actually love this idea. I don't mind adding extra code for potential
> > usage other than confidential VMs if we can have a workable solution for
> > it.
> > 
> > > 
> > > Hrm, but basing private memory on top of a generic FD_VALID would effectively require
> > > shared memory to use hva-based memslots for confidential VMs.  That'd yield a very
> > > weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots,
> > > but confidential VMs would be forced to use hva-based memslots.
> > 
> > It would work if we could treat userspace_addr as optional for
> > KVM_MEM_FD_VALID, e.g. userspace could opt in to decide whether it needs
> > the mappable part for a regular VM, and KVM could enforce fd-only for
> > confidential VMs. But the u64 type of userspace_addr doesn't allow us to
> > express a 'null' value, so it sounds like we will end up needing another
> > flag anyway.
> > 
> > In concept, we could have three configurations here:
> >   1. hva-only: without any flag and use userspace_addr;
> >   2. fd-only:  another new flag is needed and use fd/offset;
> >   3. hva/fd mixed: both userspace_addr and fd/offset are effective.
> >      KVM_MEM_PRIVATE is a subset of it for confidential VMs. Not sure
> >      a regular VM also wants this.
> 
> My mental model breaks things down slightly differently, though the end result is
> more or less the same. 
> 
> After this series, there will be two types of memory: private and "regular" (I'm
> trying to avoid "shared").  "Regular" memory is always hva-based (userspace_addr),
> and private always fd-based (fd+offset).
> 
> In the future, if we want to support fd-based memory for "regular" memory, then
> as you said we'd need to add a new flag, and a new fd+offset pair.
> 
> At that point, we'd have two new (relatively to current) flags:
> 
>   KVM_MEM_PRIVATE_FD_VALID
>   KVM_MEM_FD_VALID
> 
> along with two new pairs of fd+offset (private_* and "regular").  Mapping those
> to your above list:

I previously thought we could reuse the private_fd (name should be
changed) for a regular VM as well, so we would only need one pair of
fd+offset; the meaning of the fd could be decided by the flag. But
introducing two pairs of them may support extra usages, like one fd for
regular memory and another private_fd for private memory, though I'm
unsure this is a useful configuration.

>   
>   1.  Neither *_FD_VALID flag set.
>   2a. Both PRIVATE_FD_VALID and FD_VALID are set
>   2b. FD_VALID is set and the VM doesn't support private memory
>   3.  Only PRIVATE_FD_VALID is set (which requires private memory support in the VM).
> 
> Thus, "regular" VMs can't have a mix in a single memslot because they can't use
> private memory.
> 
> > There is no direct relationship between unmappable and fd-based, since
> > even fd-based memory can also be mappable for a regular VM?

Hmm, yes, for private memory we have special treatment in the page fault
handler and that is not applied to a regular VM.

> 
> Yep.
> 
> > > Ignore this idea for now.  If there's an actual use case for generic fd-based memory
> > > then we'll want a separate flag, fd, and offset, i.e. that support could be added
> > > independent of KVM_MEM_PRIVATE.
> > 
> > If we ignore this idea now (which I'm also fine), do you still think we
> > need to change KVM_MEM_PRIVATE to KVM_MEM_USER_UNMAPPABLE?
> 
> Hmm, no.  After working through this, I think it's safe to say KVM_MEM_USER_UNMAPPABLE
> is a bad name because we could end up with "regular" memory that's backed by an
> inaccessible (unmappable) file.
> 
> One alternative would be to call it KVM_MEM_PROTECTED.  That shouldn't cause
> problems for the known use of "private" (TDX and SNP), and it gives us a little
> wiggle room, e.g. if we ever get a use case where VMs can share memory that is
> otherwise protected.
> 
> That's a pretty big "if" though, and odds are good we'd need more memslot flags and
> fd+offset pairs to allow differentiating "private" vs. "protected-shared" without
> forcing userspace to punch holes in memslots, so I don't know that hedging now will
> buy us anything.
> 
> So I'd say that if people think KVM_MEM_PRIVATE brings additional and meaningful
> clarity over KVM_MEM_PROTECTED, then let's go with PRIVATE.  But if PROTECTED is
> just as good, go with PROTECTED as it gives us a wee bit of wiggle room for the
> future.

Then I'd stay with PRIVATE.

> 
> Note, regardless of what name we settle on, I think it makes sense to do the
> KVM_PRIVATE_MEM_SLOTS => KVM_INTERNAL_MEM_SLOTS rename.

Agreed.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
  2022-07-06  8:20 ` [PATCH v7 02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE Chao Peng
@ 2022-08-05 13:11   ` David Hildenbrand
  0 siblings, 0 replies; 398+ messages in thread
From: David Hildenbrand @ 2022-08-05 13:11 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 06.07.22 10:20, Chao Peng wrote:
> Add tests to verify sealing memfds with the F_SEAL_AUTO_ALLOCATE works
> as expected.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++++++++++
>  1 file changed, 166 insertions(+)
> 
> diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
> index 94df2692e6e4..b849ece295fd 100644
> --- a/tools/testing/selftests/memfd/memfd_test.c
> +++ b/tools/testing/selftests/memfd/memfd_test.c
> @@ -9,6 +9,7 @@
>  #include <fcntl.h>
>  #include <linux/memfd.h>
>  #include <sched.h>
> +#include <setjmp.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <signal.h>
> @@ -232,6 +233,31 @@ static void mfd_fail_open(int fd, int flags, mode_t mode)
>  	}
>  }
>  
> +static void mfd_assert_fallocate(int fd)
> +{
> +	int r;
> +
> +	r = fallocate(fd, 0, 0, mfd_def_size);
> +	if (r < 0) {
> +		printf("fallocate(ALLOC) failed: %m\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_punch_hole(int fd)
> +{
> +	int r;
> +
> +	r = fallocate(fd,
> +		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +		      0,
> +		      mfd_def_size);
> +	if (r < 0) {
> +		printf("fallocate(PUNCH_HOLE) failed: %m\n");
> +		abort();
> +	}
> +}
> +
>  static void mfd_assert_read(int fd)
>  {
>  	char buf[16];
> @@ -594,6 +620,94 @@ static void mfd_fail_grow_write(int fd)
>  	}
>  }
>  
> +static void mfd_assert_hole_write(int fd)
> +{
> +	ssize_t l;
> +	void *p;
> +	char *p1;
> +
> +	/*
> +	 * huegtlbfs does not support write, but we want to
> +	 * verify everything else here.
> +	 */
> +	if (!hugetlbfs_test) {
> +		/* verify direct write() succeeds */
> +		l = write(fd, "\0\0\0\0", 4);
> +		if (l != 4) {
> +			printf("write() failed: %m\n");
> +			abort();
> +		}
> +	}
> +
> +	/* verify mmaped write succeeds */
> +	p = mmap(NULL,
> +		 mfd_def_size,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	p1 = (char *)p + mfd_def_size - 1;
> +	*p1 = 'H';
> +	if (*p1 != 'H') {
> +		printf("mmaped write failed: %m\n");
> +		abort();
> +
> +	}
> +	munmap(p, mfd_def_size);
> +}
> +
> +sigjmp_buf jbuf, *sigbuf;
> +static void sig_handler(int sig, siginfo_t *siginfo, void *ptr)
> +{
> +	if (sig == SIGBUS) {
> +		if (sigbuf)
> +			siglongjmp(*sigbuf, 1);
> +		abort();
> +	}
> +}
> +
> +static void mfd_fail_hole_write(int fd)
> +{
> +	ssize_t l;
> +	void *p;
> +	char *p1;
> +
> +	/* verify direct write() fails */
> +	l = write(fd, "data", 4);
> +	if (l > 0) {
> +		printf("expected failure on write(), but got %d: %m\n", (int)l);
> +		abort();
> +	}
> +
> +	/* verify mmaped write fails */
> +	p = mmap(NULL,
> +		 mfd_def_size,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +
> +	sigbuf = &jbuf;
> +	if (sigsetjmp(*sigbuf, 1))
> +		goto out;
> +
> +	/* Below write should trigger SIGBUS signal */
> +	p1 = (char *)p + mfd_def_size - 1;
> +	*p1 = 'H';

Maybe you want to verify both write paths separately (see the suggestion below).

> +	printf("failed to receive SIGBUS for mmaped write: %m\n");
> +	abort();
> +out:
> +	munmap(p, mfd_def_size);
> +}
> +
>  static int idle_thread_fn(void *arg)
>  {
>  	sigset_t set;
> @@ -880,6 +994,57 @@ static void test_seal_resize(void)
>  	close(fd);
>  }
>  
> +/*
> + * Test F_SEAL_AUTO_ALLOCATE
> + * Test whether F_SEAL_AUTO_ALLOCATE actually prevents allocation.
> + */
> +static void test_seal_auto_allocate(void)
> +{
> +	struct sigaction act;
> +	int fd;
> +
> +	printf("%s SEAL-AUTO-ALLOCATE\n", memfd_str);
> +
> +	memset(&act, 0, sizeof(act));
> +	act.sa_sigaction = sig_handler;
> +	act.sa_flags = SA_SIGINFO;
> +	if (sigaction(SIGBUS, &act, 0)) {
> +		printf("sigaction() failed: %m\n");
> +		abort();
> +	}
> +
> +	fd = mfd_assert_new("kern_memfd_seal_auto_allocate",
> +			    mfd_def_size,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +
> +	/* read/write should pass if F_SEAL_AUTO_ALLOCATE not set */
> +	mfd_assert_read(fd);
> +	mfd_assert_hole_write(fd);
> +
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_AUTO_ALLOCATE);
> +	mfd_assert_has_seals(fd, F_SEAL_AUTO_ALLOCATE);
> +
> +	/* read/write should pass for pre-allocated area */
> +	mfd_assert_read(fd);
> +	mfd_assert_hole_write(fd);
> +
> +	mfd_assert_punch_hole(fd);
> +
> +	/* read should pass, write should fail in hole */
> +	mfd_assert_read(fd);
> +	mfd_fail_hole_write(fd);
> +
> +	mfd_assert_fallocate(fd);
> +
> +	/* read/write should pass after fallocate */
> +	mfd_assert_read(fd);
> +	mfd_assert_hole_write(fd);
> +
> +	close(fd);
> +}

What might make sense is to verify for the following operations:
* read()
* write()
* read via mmap
* write via mmap

After sealing on a hole, verify that there is *still* a hole and that only the
read() might succeed, with a comment stating that shmem optimizes reads on
holes by reading from the shared zeropage.

I'd suggest decoupling hole_write from hole_mmap_write and similarly
have hole_read and hole_mmap_read.

You should be able to use fstat() to obtain the number of allocated
blocks to check that fairly easily.
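
A rough sketch of the two checks being suggested, assuming it sits next to the
existing mfd_* helpers in memfd_test.c (mfd_def_size and most headers come
from that file); the st_blocks == 0 assertion only holds here because the test
punches the whole file:

        #include <sys/stat.h>
        #include <unistd.h>

        static void mfd_assert_hole_read(int fd)
        {
                char buf[16];

                /* read() over a hole should succeed and return zeroes;
                 * shmem serves it from the shared zeropage. */
                if (pread(fd, buf, sizeof(buf), 0) != sizeof(buf)) {
                        printf("pread() failed: %m\n");
                        abort();
                }
                if (buf[0] != 0) {
                        printf("hole read returned non-zero data\n");
                        abort();
                }
        }

        static void mfd_assert_still_hole(int fd)
        {
                struct stat st;

                if (fstat(fd, &st)) {
                        printf("fstat() failed: %m\n");
                        abort();
                }
                /* The reads above must not have allocated anything; this
                 * check only works because the test punches the whole file. */
                if (st.st_blocks != 0) {
                        printf("unexpected allocation: %ld blocks\n",
                               (long)st.st_blocks);
                        abort();
                }
        }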

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 03/14] mm: Introduce memfile_notifier
  2022-07-06  8:20 ` [PATCH v7 03/14] mm: Introduce memfile_notifier Chao Peng
@ 2022-08-05 13:22   ` David Hildenbrand
  2022-08-10  9:22     ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-08-05 13:22 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 06.07.22 10:20, Chao Peng wrote:
> This patch introduces memfile_notifier facility so existing memory file
> subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a
> third kernel component to make use of memory bookmarked in the memory
> file and gets notified when the pages in the memory file become
> invalidated.

Stupid question, but why is this called "memfile_notifier" and not
"memfd_notifier". We're only dealing with memfd's after all ... which
are anonymous files essentially. Or what am I missing? Are there any
other plans for fs than plain memfd support that I am not aware of?

> 
> It will be used for KVM to use a file descriptor as the guest memory
> backing store and KVM will use this memfile_notifier interface to
> interact with memory file subsystems. In the future there might be other
> consumers (e.g. VFIO with encrypted device memory).
> 
> It consists below components:
>  - memfile_backing_store: Each supported memory file subsystem can be
>    implemented as a memory backing store which bookmarks memory and
>    provides callbacks for other kernel systems (memfile_notifier
>    consumers) to interact with.
>  - memfile_notifier: memfile_notifier consumers defines callbacks and
>    associate them to a file using memfile_register_notifier().
>  - memfile_node: A memfile_node is associated with the file (inode) from
>    the backing store and includes feature flags and a list of registered
>    memfile_notifier for notifying.
> 
> In KVM usages, userspace is in charge of guest memory lifecycle: it first
> allocates pages in memory backing store and then passes the fd to KVM and
> lets KVM register memory slot to memory backing store via
> memfile_register_notifier.

Can we add documentation/description in any form how the different
functions exposed in linux/memfile_notifier.h are supposed to be used?

Staring at memfile_node_set_flags() and memfile_notifier_invalidate()
it's not immediately clear to me who's supposed to call that and under
which conditions.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-07-06  8:20 ` [PATCH v7 04/14] mm/shmem: Support memfile_notifier Chao Peng
  2022-07-12 18:02   ` Gupta, Pankaj
@ 2022-08-05 13:26   ` David Hildenbrand
  2022-08-10  9:25     ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-08-05 13:26 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 06.07.22 10:20, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Implement shmem as a memfile_notifier backing store. Essentially it
> interacts with the memfile_notifier feature flags for userspace
> access/page migration/page reclaiming and implements the necessary
> memfile_backing_store callbacks.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

[...]

> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static struct memfile_node *shmem_lookup_memfile_node(struct file *file)
> +{
> +	struct inode *inode = file_inode(file);
> +
> +	if (!shmem_mapping(inode->i_mapping))
> +		return NULL;
> +
> +	return  &SHMEM_I(inode)->memfile_node;
> +}
> +
> +
> +static int shmem_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order)
> +{
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	unlock_page(page);
> +	*pfn = page_to_pfn_t(page);
> +	*order = thp_order(compound_head(page));
> +	return 0;
> +}
> +
> +static void shmem_put_pfn(pfn_t pfn)
> +{
> +	struct page *page = pfn_t_to_page(pfn);
> +
> +	if (!page)
> +		return;
> +
> +	put_page(page);


Why do we export shmem_get_pfn/shmem_put_pfn and not simply

get_folio()

and let the caller deal with putting the folio? What's the reason to

a) Operate on PFNs and not folios
b) Have these get/put semantics?
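
For comparison, the folio-based shape being hinted at could look roughly like
the below; shmem_get_folio_for_memfile() is a made-up name and this is only a
sketch of the calling convention, not a proposed interface:

        /* Hypothetical, inside mm/shmem.c: hand out a folio and let the
         * caller drop the reference with folio_put() when done with it. */
        static int shmem_get_folio_for_memfile(struct file *file, pgoff_t offset,
                                               struct folio **foliop)
        {
                struct page *page;
                int ret;

                ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE);
                if (ret)
                        return ret;

                unlock_page(page);
                *foliop = page_folio(page);
                return 0;
        }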

> +}
> +
> +static struct memfile_backing_store shmem_backing_store = {
> +	.lookup_memfile_node = shmem_lookup_memfile_node,
> +	.get_pfn = shmem_get_pfn,
> +	.put_pfn = shmem_put_pfn,
> +};
> +#endif /* CONFIG_MEMFILE_NOTIFIER */
> +
>  void __init shmem_init(void)
>  {
>  	int error;
> @@ -3956,6 +4059,10 @@ void __init shmem_init(void)
>  	else
>  		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>  #endif
> +
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +	memfile_register_backing_store(&shmem_backing_store);

Can we instead provide a dummy function that does nothing without
CONFIG_MEMFILE_NOTIFIER?
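
Something like the below in <linux/memfile_notifier.h> would do, assuming the
registration helper keeps the void signature used in this series (adjust to
the actual prototype), so shmem_init() can call it unconditionally:

        /* Next to the struct definitions in <linux/memfile_notifier.h>. */
        #ifdef CONFIG_MEMFILE_NOTIFIER
        void memfile_register_backing_store(struct memfile_backing_store *bs);
        #else
        static inline void memfile_register_backing_store(struct memfile_backing_store *bs)
        {
        }
        #endif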

> +#endif
>  	return;
>  
>  out1:


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-07-06  8:20 ` [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
@ 2022-08-05 13:28   ` David Hildenbrand
  2022-08-10  9:37     ` Chao Peng
  2022-09-07 16:18     ` Kirill A. Shutemov
  0 siblings, 2 replies; 398+ messages in thread
From: David Hildenbrand @ 2022-08-05 13:28 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 06.07.22 10:20, Chao Peng wrote:
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
> 
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this flag set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
> 
> The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> also impossible for a memfd created with this flag.

It's kind of weird to have it that way. Why should the user have to
care? It's the notifier requirement to have that, no?

Why can't we handle that when registering a notifier? If anything is
already mapped, fail registering the notifier if the notifier has these
demands. If registering succeeds, block it internally.

Or what am I missing? We might not need the memfile set flag semantics
eventually and would not have to expose such a flag to user space.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-21  9:44   ` David Hildenbrand
                       ` (2 preceding siblings ...)
  2022-07-25 13:42     ` Chao Peng
@ 2022-08-05 17:55     ` Paolo Bonzini
  2022-08-05 18:06       ` David Hildenbrand
                         ` (2 more replies)
  3 siblings, 3 replies; 398+ messages in thread
From: Paolo Bonzini @ 2022-08-05 17:55 UTC (permalink / raw)
  To: David Hildenbrand, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest
  Cc: Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On 7/21/22 11:44, David Hildenbrand wrote:
> 
> > Also, I *think* you can place pages via userfaultfd into shmem. Not
> sure if that would count "auto alloc", but it would certainly bypass
> fallocate().

Yeah, userfaultfd_register would probably have to forbid this for 
F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for 
this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then 
userfault_register would do something like 
memfile_node_get_flags(vma->vm_file) and check the result.
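
A hedged sketch of that check, with MEMFILE_F_NO_AUTO_ALLOCATE and
memfile_node_get_flags() as assumed names rather than existing interfaces:

        #include <linux/errno.h>
        #include <linux/mm.h>

        /* The point is only that registration can refuse VMAs whose backing
         * file forbids implicit allocation; names above are assumptions. */
        static int example_check_no_auto_allocate(struct vm_area_struct *vma)
        {
                unsigned long flags;

                if (!vma->vm_file)
                        return 0;

                flags = memfile_node_get_flags(vma->vm_file);
                if (flags & MEMFILE_F_NO_AUTO_ALLOCATE)
                        return -EINVAL;

                return 0;
        }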

This means moving this patch later, after "mm: Introduce memfile_notifier".

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-05 17:55     ` Paolo Bonzini
@ 2022-08-05 18:06       ` David Hildenbrand
  2022-08-10  9:40         ` Chao Peng
  2022-08-10  9:38       ` Chao Peng
  2022-08-17 23:41       ` Kirill A. Shutemov
  2 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-08-05 18:06 UTC (permalink / raw)
  To: Paolo Bonzini, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest
  Cc: Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On 05.08.22 19:55, Paolo Bonzini wrote:
> On 7/21/22 11:44, David Hildenbrand wrote:
>>
>> Also, I *think* you can place pages via userfaultfd into shmem. Not
>> sure if that would count "auto alloc", but it would certainly bypass
>> fallocate().
> 
> Yeah, userfaultfd_register would probably have to forbid this for 
> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for 
> this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then 
> userfault_register would do something like 
> memfile_node_get_flags(vma->vm_file) and check the result.

An alternative is to simply have the shmem allocation fail in a similar
way. Maybe it does already, I haven't checked (don't think so).


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2022-07-29 19:02   ` Sean Christopherson
  2022-08-03 10:13     ` Chao Peng
@ 2022-08-05 19:54     ` Paolo Bonzini
  2022-08-10  8:09       ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Paolo Bonzini @ 2022-08-05 19:54 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On 7/29/22 21:02, Sean Christopherson wrote:
> If we really want a different name, I'd vote for nomenclature that captures the
> invalidation aspect, which is really what the variables are all trackng, e.g.
> 
>    mmu_invalidate_seq
>    mmu_invalidate_in_progress
>    mmu_invalidate_range_start
>    mmu_invalidate_range_end
> 

Agreed, and this can of course be committed separately if Chao Peng 
sends it outside this series.

Paolo

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2022-08-05 19:54     ` Paolo Bonzini
@ 2022-08-10  8:09       ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-10  8:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Fri, Aug 05, 2022 at 09:54:35PM +0200, Paolo Bonzini wrote:
> On 7/29/22 21:02, Sean Christopherson wrote:
> > If we really want a different name, I'd vote for nomenclature that captures the
> > invalidation aspect, which is really what the variables are all trackng, e.g.
> > 
> >    mmu_invalidate_seq
> >    mmu_invalidate_in_progress
> >    mmu_invalidate_range_start
> >    mmu_invalidate_range_end
> > 
> 
> Agreed, and this can of course be committed separately if Chao Peng sends it
> outside this series.

I will do that; it will probably also include:
  06/14 KVM: Rename KVM_PRIVATE_MEM_SLOT

Chao
> 
> Paolo

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-08-04  7:10   ` Isaku Yamahata
@ 2022-08-10  8:19     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-10  8:19 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Thu, Aug 04, 2022 at 12:10:44AM -0700, Isaku Yamahata wrote:
> On Wed, Jul 06, 2022 at 04:20:09PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 0bdb6044e316..e9153b54e2a4 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> >  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  #endif
> >  
> > -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
> > -				   unsigned long end);
> > -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
> > -				   unsigned long end);
> > +void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> >  
> >  long kvm_arch_dev_ioctl(struct file *filp,
> >  			unsigned int ioctl, unsigned long arg);
> 
> The corresponding changes in kvm_main.c are missing.

Exactly! Actually it's in the next patch while it should indeed be in
this patch.

Chao
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b2c79bef61bd..0184e327f6f5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -711,8 +711,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>  
> -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
> -                                  unsigned long end)
> +void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end)
>  {
>         /*
>          * The count increase must become visible at unlock time as no
> @@ -786,8 +785,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>         return 0;
>  }
>  
> -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
> -                                  unsigned long end)
> +void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end)
>  {
>         /*
>          * This sequence increase will notify the kvm page fault that
> 
> 
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 03/14] mm: Introduce memfile_notifier
  2022-08-05 13:22   ` David Hildenbrand
@ 2022-08-10  9:22     ` Chao Peng
  2022-08-10 10:05       ` David Hildenbrand
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-08-10  9:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Fri, Aug 05, 2022 at 03:22:58PM +0200, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
> > This patch introduces memfile_notifier facility so existing memory file
> > subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a
> > third kernel component to make use of memory bookmarked in the memory
> > file and gets notified when the pages in the memory file become
> > invalidated.
> 
> Stupid question, but why is this called "memfile_notifier" and not
> "memfd_notifier". We're only dealing with memfd's after all ... which
> are anonymous files essentially. Or what am I missing? Are there any
> other plans for fs than plain memfd support that I am not aware of?

There were some discussions on this in v3.
  https://lkml.org/lkml/2021/12/28/484
Sean commented it's OK to abstract it from memfd, but he also wants the
kAPI (name) not to be bound to memfd, to make room for future non-memfd
usages.

> 
> > 
> > It will be used for KVM to use a file descriptor as the guest memory
> > backing store and KVM will use this memfile_notifier interface to
> > interact with memory file subsystems. In the future there might be other
> > consumers (e.g. VFIO with encrypted device memory).
> > 
> > It consists below components:
> >  - memfile_backing_store: Each supported memory file subsystem can be
> >    implemented as a memory backing store which bookmarks memory and
> >    provides callbacks for other kernel systems (memfile_notifier
> >    consumers) to interact with.
> >  - memfile_notifier: memfile_notifier consumers defines callbacks and
> >    associate them to a file using memfile_register_notifier().
> >  - memfile_node: A memfile_node is associated with the file (inode) from
> >    the backing store and includes feature flags and a list of registered
> >    memfile_notifier for notifying.
> > 
> > In KVM usages, userspace is in charge of guest memory lifecycle: it first
> > allocates pages in memory backing store and then passes the fd to KVM and
> > lets KVM register memory slot to memory backing store via
> > memfile_register_notifier.
> 
> Can we add documentation/description in any form how the different
> functions exposed in linux/memfile_notifier.h are supposed to be used?

Yeah, code comments can be added.

> 
> Staring at memfile_node_set_flags() and memfile_notifier_invalidate()
> it's not immediately clear to me who's supposed to call that and under
> which conditions.

I will also amend the commit message.

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 04/14] mm/shmem: Support memfile_notifier
  2022-08-05 13:26   ` David Hildenbrand
@ 2022-08-10  9:25     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-10  9:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Fri, Aug 05, 2022 at 03:26:02PM +0200, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Implement shmem as a memfile_notifier backing store. Essentially it
> > interacts with the memfile_notifier feature flags for userspace
> > access/page migration/page reclaiming and implements the necessary
> > memfile_backing_store callbacks.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> [...]
> 
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +static struct memfile_node *shmem_lookup_memfile_node(struct file *file)
> > +{
> > +	struct inode *inode = file_inode(file);
> > +
> > +	if (!shmem_mapping(inode->i_mapping))
> > +		return NULL;
> > +
> > +	return  &SHMEM_I(inode)->memfile_node;
> > +}
> > +
> > +
> > +static int shmem_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +			 int *order)
> > +{
> > +	struct page *page;
> > +	int ret;
> > +
> > +	ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE);
> > +	if (ret)
> > +		return ret;
> > +
> > +	unlock_page(page);
> > +	*pfn = page_to_pfn_t(page);
> > +	*order = thp_order(compound_head(page));
> > +	return 0;
> > +}
> > +
> > +static void shmem_put_pfn(pfn_t pfn)
> > +{
> > +	struct page *page = pfn_t_to_page(pfn);
> > +
> > +	if (!page)
> > +		return;
> > +
> > +	put_page(page);
> 
> 
> Why do we export shmem_get_pfn/shmem_put_pfn and not simply
> 
> get_folio()
> 
> and let the caller deal with putting the folio? What's the reason to
> 
> a) Operate on PFNs and not folios
> b) Have these get/put semantics?

We have a design assumption that someday this can even support non-page-based
backing stores. There are some discussions:
  https://lkml.org/lkml/2022/3/28/1440
I should add documentation for these two callbacks.

> 
> > +}
> > +
> > +static struct memfile_backing_store shmem_backing_store = {
> > +	.lookup_memfile_node = shmem_lookup_memfile_node,
> > +	.get_pfn = shmem_get_pfn,
> > +	.put_pfn = shmem_put_pfn,
> > +};
> > +#endif /* CONFIG_MEMFILE_NOTIFIER */
> > +
> >  void __init shmem_init(void)
> >  {
> >  	int error;
> > @@ -3956,6 +4059,10 @@ void __init shmem_init(void)
> >  	else
> >  		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> >  #endif
> > +
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +	memfile_register_backing_store(&shmem_backing_store);
> 
> Can we instead provide a dummy function that does nothing without
> CONFIG_MEMFILE_NOTIFIER?

Sounds good.

Chao
> 
> > +#endif
> >  	return;
> >  
> >  out1:
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-08-05 13:28   ` David Hildenbrand
@ 2022-08-10  9:37     ` Chao Peng
  2022-08-10  9:55       ` David Hildenbrand
  2022-09-07 16:18     ` Kirill A. Shutemov
  1 sibling, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-08-10  9:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > 
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this flag set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> > 
> > The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> > also impossible for a memfd created with this flag.
> 
> It's kind of weird to have it that way. Why should the user have to
> care? It's the notifier requirement to have that, no?
> 
> > Why can't we handle that when registering a notifier? If anything is
> already mapped, fail registering the notifier if the notifier has these
> demands. If registering succeeds, block it internally.
> 
> Or what am I missing? We might not need the memfile set flag semantics
> eventually and would not have to expose such a flag to user space.

This makes sense if doable. The major concern was: is there a reliable
way to detect this (already mapped) at the time of memslot registering.

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-05 17:55     ` Paolo Bonzini
  2022-08-05 18:06       ` David Hildenbrand
@ 2022-08-10  9:38       ` Chao Peng
  2022-08-17 23:41       ` Kirill A. Shutemov
  2 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-10  9:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Hildenbrand, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
> On 7/21/22 11:44, David Hildenbrand wrote:
> > 
> > Also, I *think* you can place pages via userfaultfd into shmem. Not
> > sure if that would count "auto alloc", but it would certainly bypass
> > fallocate().
> 
> Yeah, userfaultfd_register would probably have to forbid this for
> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
> would do something like memfile_node_get_flags(vma->vm_file) and check the
> result.

Then we would need to change the userfaultfd_register uAPI for a new property
flag. Userspace should still be the decision-maker for this flag.

> 
> This means moving this patch later, after "mm: Introduce memfile_notifier".

Yes, it makes sense now.

Chao
> 
> Thanks,
> 
> Paolo

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-05 18:06       ` David Hildenbrand
@ 2022-08-10  9:40         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-10  9:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Paolo Bonzini, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On Fri, Aug 05, 2022 at 08:06:03PM +0200, David Hildenbrand wrote:
> On 05.08.22 19:55, Paolo Bonzini wrote:
> > On 7/21/22 11:44, David Hildenbrand wrote:
> >>
> >> Also, I *think* you can place pages via userfaultfd into shmem. Not
> >> sure if that would count "auto alloc", but it would certainly bypass
> >> fallocate().
> > 
> > Yeah, userfaultfd_register would probably have to forbid this for 
> > F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for 
> > this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then 
> > userfault_register would do something like 
> > memfile_node_get_flags(vma->vm_file) and check the result.
> 
> An alternative is to simply have the shmem allocation fail in a similar
> way. Maybe it does already, I haven't checked (don't think so).

This sounds like a better option. We don't need changes to the
userfaultfd_register uAPI, but I guess we will still need a KVM uAPI,
either on the memslot or on the whole VM, since Roth said this feature
should be optional because some usages may want to disable it for
performance reasons. For details please see the discussion:
  https://lkml.org/lkml/2022/6/23/1905

Chao
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-08-10  9:37     ` Chao Peng
@ 2022-08-10  9:55       ` David Hildenbrand
  2022-08-11 13:17         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-08-10  9:55 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 10.08.22 11:37, Chao Peng wrote:
> On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
>> On 06.07.22 10:20, Chao Peng wrote:
>>> Introduce a new memfd_create() flag indicating the content of the
>>> created memfd is inaccessible from userspace through ordinary MMU
>>> access (e.g., read/write/mmap). However, the file content can be
>>> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>>>
>>> It provides semantics required for KVM guest private memory support
>>> that a file descriptor with this flag set is going to be used as the
>>> source of guest memory in confidential computing environments such
>>> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>>>
>>> The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
>>> also impossible for a memfd created with this flag.
>>
>> It's kind of weird to have it that way. Why should the user have to
>> care? It's the notifier requirement to have that, no?
>>
>> Why can't we handle that when registering a notifier? If anything is
>> already mapped, fail registering the notifier if the notifier has these
>> demands. If registering succeeds, block it internally.
>>
>> Or what am I missing? We might not need the memfile set flag semantics
>> eventually and would not have to expose such a flag to user space.
> 
> This makes sense if doable. The major concern was: is there a reliable
> way to detect this (already mapped) at the time of memslot registering.

If too complicated, we could simplify to "was this ever mapped" and fail
for now. Hooking into shmem_mmap() might be sufficient for that to get
notified about the first mmap.

As an alternative, mapping_mapped() or similar *might* do what we want.
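
Roughly, the registration-time check could be shaped like the below (only a
sketch; the memfile_register_notifier() context and the "notifier demands no
mappings" policy are assumed):

        #include <linux/errno.h>
        #include <linux/fs.h>

        /* Refuse registration if the file currently has userspace mappings.
         * mapping_mapped() only reports current mappings, so a "was this
         * ever mapped" policy would need an extra flag set from the mmap
         * path. */
        static int example_check_not_mapped(struct file *file)
        {
                struct address_space *mapping = file_inode(file)->i_mapping;

                if (mapping_mapped(mapping))
                        return -EBUSY;

                return 0;
        }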



-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 03/14] mm: Introduce memfile_notifier
  2022-08-10  9:22     ` Chao Peng
@ 2022-08-10 10:05       ` David Hildenbrand
  2022-08-10 14:38         ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-08-10 10:05 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 10.08.22 11:22, Chao Peng wrote:
> On Fri, Aug 05, 2022 at 03:22:58PM +0200, David Hildenbrand wrote:
>> On 06.07.22 10:20, Chao Peng wrote:
>>> This patch introduces memfile_notifier facility so existing memory file
>>> subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a
>>> third kernel component to make use of memory bookmarked in the memory
>>> file and gets notified when the pages in the memory file become
>>> invalidated.
>>
>> Stupid question, but why is this called "memfile_notifier" and not
>> "memfd_notifier". We're only dealing with memfd's after all ... which
>> are anonymous files essentially. Or what am I missing? Are there any
>> other plans for fs than plain memfd support that I am not aware of?
> 
> There were some discussions on this in v3.
>   https://lkml.org/lkml/2021/12/28/484
> Sean commented it's OK to abstract it from memfd, but he also wants the
> kAPI (name) not to be bound to memfd, to make room for future non-memfd
> usages.

Sorry, but how is "memfile" any better? memfd abstracted to memfile?! :)

I understand Sean's suggestion about abstracting, but if the new name
makes it harder to grasp and there isn't really an alternative to memfd
in sight, I'm not so sure I enjoy the attempted abstraction here.

Otherwise we'd have to get creative now and discuss something like
"file_population_notifer" or "mapping_population_notifer" and I am not
sure that our time is well spent doing so right now.

... as this is kernel-internal, we can always adjust the name as we
please later, once we *actually* know what the abstraction should be.
Until then I'd suggest to KIS and soft-glue this to memfd.

Or am I missing something important?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 03/14] mm: Introduce memfile_notifier
  2022-08-10 10:05       ` David Hildenbrand
@ 2022-08-10 14:38         ` Sean Christopherson
  2022-08-11 12:27           ` Quentin Perret
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-08-10 14:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, Will Deacon

+Will

On Wed, Aug 10, 2022, David Hildenbrand wrote:
> On 10.08.22 11:22, Chao Peng wrote:
> > On Fri, Aug 05, 2022 at 03:22:58PM +0200, David Hildenbrand wrote:
> >> On 06.07.22 10:20, Chao Peng wrote:
> >>> This patch introduces memfile_notifier facility so existing memory file
> >>> subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a
> >>> third kernel component to make use of memory bookmarked in the memory
> >>> file and gets notified when the pages in the memory file become
> >>> invalidated.
> >>
> >> Stupid question, but why is this called "memfile_notifier" and not
> >> "memfd_notifier". We're only dealing with memfd's after all ... which
> >> are anonymous files essentially. Or what am I missing? Are there any
> >> other plans for fs than plain memfd support that I am not aware of?
> > 
> > There were some discussions on this in v3.
> >   https://lkml.org/lkml/2021/12/28/484
> > Sean commented it's OK to abstract it from memfd but he also wants the
> > kAPI (name) should not bind to memfd to make room for future non-memfd
> > usages.
> 
> Sorry, but how is "memfile" any better? memfd abstracted to memfile?! :)

FWIW, I don't really like the memfile name either.

> I understand Sean's suggestion about abstracting, but if the new name
> makes it harder to grasp and there isn't really an alternative to memfd
> in sight, I'm not so sure I enjoy the tried abstraction here.

ARM's pKVM implementation is potentially (hopefully) going to switch to this API
(as a consumer) sooner than later.  If they anticipate being able to use memfd,
then there's unlikely to be a second backing type any time soon.

Quentin, Will?
 
> Otherwise we'd have to get creative now and discuss something like
> "file_population_notifer" or "mapping_population_notifer" and I am not
> sure that our time is well spent doing so right now.
> 
> ... as this is kernel-internal, we can always adjust the name as we
> please later, once we *actually* know what the abstraction should be.
> Until then I'd suggest to KIS and soft-glue this to memfd.
> 
> Or am I missing something important?

I don't think you're missing anything.  I'd still prefer a name that doesn't couple
KVM to memfd, but it's not a sticking point, and I've never been able to come up
with a better name...

With a little bit of cleverness I think we can keep the coupling in KVM to a
minimum, which is what I really care about.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (14 preceding siblings ...)
  2022-07-13  3:58 ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Gupta, Pankaj
@ 2022-08-11 10:02 ` Nikunj A. Dadhania
  2022-08-11 11:30   ` Gupta, Pankaj
  2022-08-18  5:40 ` Hugh Dickins
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 398+ messages in thread
From: Nikunj A. Dadhania @ 2022-08-11 10:02 UTC (permalink / raw)
  To: Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel

On 06/07/22 13:50, Chao Peng wrote:
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
> 
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 
> 
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> created with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets invalidated.
>   - backing store callbacks: callbacks for KVM to call into memory
>     backing store to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets invalidated.
> 
> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> prevent double allocation caused by unintentional guest when we only
> have a single side of the shared/private memfds effective.
> 
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so
> that both private/shared guest memory can live in one single memslot.
> A page in the memslot is either private or shared. Whether a guest page
> is private or shared is maintained through reusing existing SEV ioctls
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> 
> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
> 
> -  Regresion test on kvm/queue (this patchset)
>    Most new code are not covered. Code also in below repo:
>    https://github.com/chao-p/linux/tree/privmem-v7
> 
> -  New Funational test on latest TDX code
>    The patch is rebased to latest TDX code and tested the new
>    funcationalities. See below repos:
>    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7

While debugging an issue with SEV+UPM, found that fallocate() returns 
an error in QEMU which is not handled (EINTR). With the below handling 
of EINTR subsequent fallocate() succeeds:


diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
index af8fb0c957..e8597ed28d 100644
--- a/backends/hostmem-memfd-private.c
+++ b/backends/hostmem-memfd-private.c
@@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     MachineState *machine = MACHINE(qdev_get_machine());
     uint32_t ram_flags;
     char *name;
-    int fd, priv_fd;
+    int fd, priv_fd, ret;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
@@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
                                    backend->size, ram_flags, fd, 0, errp);
     g_free(name);
 
-    fallocate(priv_fd, 0, 0, backend->size);
+again:
+    ret = fallocate(priv_fd, 0, 0, backend->size);
+    if (ret) {
+           perror("Fallocate failed: \n");
+           if (errno == EINTR)
+                   goto again;
+           else
+                   exit(1);
+    }

However, fallocate() preallocates full guest memory before starting the guest.
With this behaviour guest memory is *not* demand pinned. Is there a way to 
prevent fallocate() from reserving full guest memory?
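
For reference, a minimal standalone sketch of the EINTR retry described above
(not the actual QEMU change; the helper name and the error handling style are
illustrative):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Retry fallocate() while it is interrupted by a signal; treat any other
 * failure as fatal. perror() already appends strerror(errno) itself. */
static void preallocate_or_die(int fd, off_t size)
{
    int ret;

    do {
        ret = fallocate(fd, 0, 0, size);
    } while (ret && errno == EINTR);

    if (ret) {
        perror("fallocate");
        exit(EXIT_FAILURE);
    }
}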

> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> -machine confidential-guest-support=tdx \
> -object memory-backend-memfd-private,id=ram1,size=${mem} \
> -machine memory-backend=ram1
> 

Regards,
Nikunj


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-11 10:02 ` Nikunj A. Dadhania
@ 2022-08-11 11:30   ` Gupta, Pankaj
  2022-08-11 13:32     ` Chao Peng
  2022-08-11 17:18     ` Nikunj A. Dadhania
  0 siblings, 2 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-11 11:30 UTC (permalink / raw)
  To: Nikunj A. Dadhania, Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel


>> This is the v7 of this series which tries to implement the fd-based KVM
>> guest private memory. The patches are based on latest kvm/queue branch
>> commit:
>>
>>    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>> split_desc_cache only by default capacity
>>
>> Introduction
>> ------------
>> In general this patch series introduce fd-based memslot which provides
>> guest memory through memory file descriptor fd[offset,size] instead of
>> hva/size. The fd can be created from a supported memory filesystem
>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>> and the the memory backing store exchange callbacks when such memslot
>> gets created. At runtime KVM will call into callbacks provided by the
>> backing store to get the pfn with the fd+offset. Memory backing store
>> will also call into KVM callbacks when userspace punch hole on the fd
>> to notify KVM to unmap secondary MMU page table entries.
>>
>> Comparing to existing hva-based memslot, this new type of memslot allows
>> guest memory unmapped from host userspace like QEMU and even the kernel
>> itself, therefore reduce attack surface and prevent bugs.
>>
>> Based on this fd-based memslot, we can build guest private memory that
>> is going to be used in confidential computing environments such as Intel
>> TDX and AMD SEV. When supported, the memory backing store can provide
>> more enforcement on the fd and KVM can use a single memslot to hold both
>> the private and shared part of the guest memory.
>>
>> mm extension
>> ---------------------
>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>> created with these flags cannot read(), write() or mmap() etc via normal
>> MMU operations. The file content can only be used with the newly
>> introduced memfile_notifier extension.
>>
>> The memfile_notifier extension provides two sets of callbacks for KVM to
>> interact with the memory backing store:
>>    - memfile_notifier_ops: callbacks for memory backing store to notify
>>      KVM when memory gets invalidated.
>>    - backing store callbacks: callbacks for KVM to call into memory
>>      backing store to request memory pages for guest private memory.
>>
>> The memfile_notifier extension also provides APIs for memory backing
>> store to register/unregister itself and to trigger the notifier when the
>> bookmarked memory gets invalidated.
>>
>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>> prevent double allocation caused by unintentional guest when we only
>> have a single side of the shared/private memfds effective.
>>
>> memslot extension
>> -----------------
>> Add the private fd and the fd offset to existing 'shared' memslot so
>> that both private/shared guest memory can live in one single memslot.
>> A page in the memslot is either private or shared. Whether a guest page
>> is private or shared is maintained through reusing existing SEV ioctls
>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>
>> Test
>> ----
>> To test the new functionalities of this patch TDX patchset is needed.
>> Since TDX patchset has not been merged so I did two kinds of test:
>>
>> -  Regresion test on kvm/queue (this patchset)
>>     Most new code are not covered. Code also in below repo:
>>     https://github.com/chao-p/linux/tree/privmem-v7
>>
>> -  New Funational test on latest TDX code
>>     The patch is rebased to latest TDX code and tested the new
>>     funcationalities. See below repos:
>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> 
> While debugging an issue with SEV+UPM, found that fallocate() returns
> an error in QEMU which is not handled (EINTR). With the below handling
> of EINTR subsequent fallocate() succeeds:
> 
> 
> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
> index af8fb0c957..e8597ed28d 100644
> --- a/backends/hostmem-memfd-private.c
> +++ b/backends/hostmem-memfd-private.c
> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>       MachineState *machine = MACHINE(qdev_get_machine());
>       uint32_t ram_flags;
>       char *name;
> -    int fd, priv_fd;
> +    int fd, priv_fd, ret;
>   
>       if (!backend->size) {
>           error_setg(errp, "can't create backend with size 0");
> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>                                      backend->size, ram_flags, fd, 0, errp);
>       g_free(name);
>   
> -    fallocate(priv_fd, 0, 0, backend->size);
> +again:
> +    ret = fallocate(priv_fd, 0, 0, backend->size);
> +    if (ret) {
> +           perror("Fallocate failed: \n");
> +           if (errno == EINTR)
> +                   goto again;
> +           else
> +                   exit(1);
> +    }
> 
> However, fallocate() preallocates full guest memory before starting the guest.
> With this behaviour guest memory is *not* demand pinned. Is there a way to
> prevent fallocate() from reserving full guest memory?

Isn't the pinning being handled by the corresponding host memory backend
with mmu notifier and architecture support while doing the memory
operations, e.g. page migration and swapping/reclaim (not supported
currently AFAIU)? But yes, we need to allocate the entire guest memory with
the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE, etc}.


Thanks,
Pankaj

> 
>> An example QEMU command line for TDX test:
>> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
>> -machine confidential-guest-support=tdx \
>> -object memory-backend-memfd-private,id=ram1,size=${mem} \
>> -machine memory-backend=ram1
>>
> 
> Regards,
> Nikunj
> 
> 


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 03/14] mm: Introduce memfile_notifier
  2022-08-10 14:38         ` Sean Christopherson
@ 2022-08-11 12:27           ` Quentin Perret
  2022-08-11 13:39             ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Quentin Perret @ 2022-08-11 12:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Michael Roth, mhocko, Muchun Song, Will Deacon, Fuad Tabba

+CC Fuad

On Wednesday 10 Aug 2022 at 14:38:43 (+0000), Sean Christopherson wrote:
> > I understand Sean's suggestion about abstracting, but if the new name
> > makes it harder to grasp and there isn't really an alternative to memfd
> > in sight, I'm not so sure I enjoy the tried abstraction here.
> 
> ARM's pKVM implementation is potentially (hopefully) going to switch to this API
> (as a consumer) sooner than later.  If they anticipate being able to use memfd,
> then there's unlikely to be a second backing type any time soon.
> 
> Quentin, Will?

Yep, Fuad is currently trying to port the pKVM mm stuff on top of this
series to see how well it fits, so stay tuned. I think there is still
some room for discussion around page conversions (private->shared etc),
and we'll need a clearer idea of what the code might look like to have a
constructive discussion, but so far it does seem like using a memfd (the
new private one or perhaps just memfd_secret, to be discussed) + memfd
notifiers is a promising option.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-08-10  9:55       ` David Hildenbrand
@ 2022-08-11 13:17         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-11 13:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On Wed, Aug 10, 2022 at 11:55:19AM +0200, David Hildenbrand wrote:
> On 10.08.22 11:37, Chao Peng wrote:
> > On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
> >> On 06.07.22 10:20, Chao Peng wrote:
> >>> Introduce a new memfd_create() flag indicating the content of the
> >>> created memfd is inaccessible from userspace through ordinary MMU
> >>> access (e.g., read/write/mmap). However, the file content can be
> >>> accessed via a different mechanism (e.g. KVM MMU) indirectly.
> >>>
> >>> It provides semantics required for KVM guest private memory support
> >>> that a file descriptor with this flag set is going to be used as the
> >>> source of guest memory in confidential computing environments such
> >>> as Intel TDX/AMD SEV but may not be accessible from host userspace.
> >>>
> >>> The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> >>> also impossible for a memfd created with this flag.
> >>
> >> It's kind of weird to have it that way. Why should the user have to
> >> care? It's the notifier requirement to have that, no?
> >>
> >> Why can't we handle that when register a notifier? If anything is
> >> already mapped, fail registering the notifier if the notifier has these
> >> demands. If registering succeeds, block it internally.
> >>
> >> Or what am I missing? We might not need the memfile set flag semantics
> >> eventually and would not have to expose such a flag to user space.
> > 
> > This makes sense if doable. The major concern was: is there a reliable
> > way to detect this (already mapped) at the time of memslot registering.
> 
> If too complicated, we could simplify to "was this ever mapped" and fail
> for now. Hooking into shmem_mmap() might be sufficient for that to get
> notified about the first mmap.
> 
> As an alternative, mapping_mapped() or similar *might* do what we want.

mapping_mapped() sounds like the right one. I remember the SEV people want
to first map and then unmap, so "was this ever mapped" may not work for them.
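
To illustrate what that flow implies (an assumption about the SEV usage, not
code from this series; the helper name is made up): the VMM maps the fd once
to populate the initial image and unmaps it again before the fd is registered,
so a check against *current* mappings would still accept it, while "was this
ever mapped" would not:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: stage the initial (to-be-private) contents via a
 * temporary mapping, then drop it so no mapping remains at registration. */
static int populate_then_unmap(int memfd, const void *image, size_t len)
{
    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

    if (dst == MAP_FAILED)
        return -1;

    memcpy(dst, image, len);
    return munmap(dst, len);
}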

Thanks,
Chao
> 
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-11 11:30   ` Gupta, Pankaj
@ 2022-08-11 13:32     ` Chao Peng
  2022-08-11 17:28       ` Nikunj A. Dadhania
  2022-08-12  3:22       ` Nikunj A. Dadhania
  2022-08-11 17:18     ` Nikunj A. Dadhania
  1 sibling, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-11 13:32 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Nikunj A. Dadhania, Sean Christopherson, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, bharata, kvm, linux-kernel,
	linux-mm, linux-kselftest, linux-api, linux-doc, qemu-devel,
	linux-fsdevel

On Thu, Aug 11, 2022 at 01:30:06PM +0200, Gupta, Pankaj wrote:
> 
> > > This is the v7 of this series which tries to implement the fd-based KVM
> > > guest private memory. The patches are based on latest kvm/queue branch
> > > commit:
> > > 
> > >    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > > split_desc_cache only by default capacity
> > > 
> > > Introduction
> > > ------------
> > > In general this patch series introduce fd-based memslot which provides
> > > guest memory through memory file descriptor fd[offset,size] instead of
> > > hva/size. The fd can be created from a supported memory filesystem
> > > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > > and the the memory backing store exchange callbacks when such memslot
> > > gets created. At runtime KVM will call into callbacks provided by the
> > > backing store to get the pfn with the fd+offset. Memory backing store
> > > will also call into KVM callbacks when userspace punch hole on the fd
> > > to notify KVM to unmap secondary MMU page table entries.
> > > 
> > > Comparing to existing hva-based memslot, this new type of memslot allows
> > > guest memory unmapped from host userspace like QEMU and even the kernel
> > > itself, therefore reduce attack surface and prevent bugs.
> > > 
> > > Based on this fd-based memslot, we can build guest private memory that
> > > is going to be used in confidential computing environments such as Intel
> > > TDX and AMD SEV. When supported, the memory backing store can provide
> > > more enforcement on the fd and KVM can use a single memslot to hold both
> > > the private and shared part of the guest memory.
> > > 
> > > mm extension
> > > ---------------------
> > > Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> > > created with these flags cannot read(), write() or mmap() etc via normal
> > > MMU operations. The file content can only be used with the newly
> > > introduced memfile_notifier extension.
> > > 
> > > The memfile_notifier extension provides two sets of callbacks for KVM to
> > > interact with the memory backing store:
> > >    - memfile_notifier_ops: callbacks for memory backing store to notify
> > >      KVM when memory gets invalidated.
> > >    - backing store callbacks: callbacks for KVM to call into memory
> > >      backing store to request memory pages for guest private memory.
> > > 
> > > The memfile_notifier extension also provides APIs for memory backing
> > > store to register/unregister itself and to trigger the notifier when the
> > > bookmarked memory gets invalidated.
> > > 
> > > The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> > > prevent double allocation caused by unintentional guest when we only
> > > have a single side of the shared/private memfds effective.
> > > 
> > > memslot extension
> > > -----------------
> > > Add the private fd and the fd offset to existing 'shared' memslot so
> > > that both private/shared guest memory can live in one single memslot.
> > > A page in the memslot is either private or shared. Whether a guest page
> > > is private or shared is maintained through reusing existing SEV ioctls
> > > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> > > 
> > > Test
> > > ----
> > > To test the new functionalities of this patch TDX patchset is needed.
> > > Since TDX patchset has not been merged so I did two kinds of test:
> > > 
> > > -  Regresion test on kvm/queue (this patchset)
> > >     Most new code are not covered. Code also in below repo:
> > >     https://github.com/chao-p/linux/tree/privmem-v7
> > > 
> > > -  New Funational test on latest TDX code
> > >     The patch is rebased to latest TDX code and tested the new
> > >     funcationalities. See below repos:
> > >     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
> > >     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> > 
> > While debugging an issue with SEV+UPM, found that fallocate() returns
> > an error in QEMU which is not handled (EINTR). With the below handling
> > of EINTR subsequent fallocate() succeeds:

The QEMU code has not been well tested, so it's not strange that you hit a
problem. But per the man page, EINTR means a signal was caught; do you know
which signal it was?

Thanks for your patch, but before we change this in QEMU I want to make sure
it's indeed a QEMU issue (e.g. not a kernel issue).

> > 
> > 
> > diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
> > index af8fb0c957..e8597ed28d 100644
> > --- a/backends/hostmem-memfd-private.c
> > +++ b/backends/hostmem-memfd-private.c
> > @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> >       MachineState *machine = MACHINE(qdev_get_machine());
> >       uint32_t ram_flags;
> >       char *name;
> > -    int fd, priv_fd;
> > +    int fd, priv_fd, ret;
> >       if (!backend->size) {
> >           error_setg(errp, "can't create backend with size 0");
> > @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
> >                                      backend->size, ram_flags, fd, 0, errp);
> >       g_free(name);
> > -    fallocate(priv_fd, 0, 0, backend->size);
> > +again:
> > +    ret = fallocate(priv_fd, 0, 0, backend->size);
> > +    if (ret) {
> > +           perror("Fallocate failed: \n");
> > +           if (errno == EINTR)
> > +                   goto again;
> > +           else
> > +                   exit(1);
> > +    }
> > 
> > However, fallocate() preallocates full guest memory before starting the guest.
> > With this behaviour guest memory is *not* demand pinned. Is there a way to
> > prevent fallocate() from reserving full guest memory?
> 
> Isn't the pinning being handled by the corresponding host memory backend
> with mmu notifier and architecture support while doing the memory operations
> e.g page migration and swapping/reclaim (not supported
> currently AFAIU). But yes, we need to allocate entire guest memory with the
> new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.

Right.

> 
> 
> Thanks,
> Pankaj
> 
> > 
> > > An example QEMU command line for TDX test:
> > > -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> > > -machine confidential-guest-support=tdx \
> > > -object memory-backend-memfd-private,id=ram1,size=${mem} \
> > > -machine memory-backend=ram1
> > > 
> > 
> > Regards,
> > Nikunj
> > 
> > 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 03/14] mm: Introduce memfile_notifier
  2022-08-11 12:27           ` Quentin Perret
@ 2022-08-11 13:39             ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-11 13:39 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Michael Roth, mhocko,
	Muchun Song, Will Deacon, Fuad Tabba

On Thu, Aug 11, 2022 at 12:27:56PM +0000, Quentin Perret wrote:
> +CC Fuad
> 
> On Wednesday 10 Aug 2022 at 14:38:43 (+0000), Sean Christopherson wrote:
> > > I understand Sean's suggestion about abstracting, but if the new name
> > > makes it harder to grasp and there isn't really an alternative to memfd
> > > in sight, I'm not so sure I enjoy the tried abstraction here.
> > 
> > ARM's pKVM implementation is potentially (hopefully) going to switch to this API
> > (as a consumer) sooner than later.  If they anticipate being able to use memfd,
> > then there's unlikely to be a second backing type any time soon.
> > 
> > Quentin, Will?
> 
> Yep, Fuad is currently trying to port the pKVM mm stuff on top of this
> series to see how well it fits, so stay tuned.

Good to hear that.

>I think there is still
> some room for discussion around page conversions (private->shared etc),
> and we'll need a clearer idea of what the code might look like to have a
> constructive discussion,

That's fine. Looking forward to your feedback.

>but so far it does seem like using a memfd (the
> new private one or perhaps just memfd_secret, to be discussed) + memfd
> notifiers is a promising option.

If it's still a memfd (even memfd_secret), maybe we can use the name
memfd_notifier?

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-11 11:30   ` Gupta, Pankaj
  2022-08-11 13:32     ` Chao Peng
@ 2022-08-11 17:18     ` Nikunj A. Dadhania
  2022-08-11 23:02       ` Gupta, Pankaj
  1 sibling, 1 reply; 398+ messages in thread
From: Nikunj A. Dadhania @ 2022-08-11 17:18 UTC (permalink / raw)
  To: Gupta, Pankaj, Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel

On 11/08/22 17:00, Gupta, Pankaj wrote:
> 
>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>>    b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>> split_desc_cache only by default capacity
>>>
>>> Introduction
>>> ------------
>>> In general this patch series introduce fd-based memslot which provides
>>> guest memory through memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>> and the the memory backing store exchange callbacks when such memslot
>>> gets created. At runtime KVM will call into callbacks provided by the
>>> backing store to get the pfn with the fd+offset. Memory backing store
>>> will also call into KVM callbacks when userspace punch hole on the fd
>>> to notify KVM to unmap secondary MMU page table entries.
>>>
>>> Comparing to existing hva-based memslot, this new type of memslot allows
>>> guest memory unmapped from host userspace like QEMU and even the kernel
>>> itself, therefore reduce attack surface and prevent bugs.
>>>
>>> Based on this fd-based memslot, we can build guest private memory that
>>> is going to be used in confidential computing environments such as Intel
>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>> more enforcement on the fd and KVM can use a single memslot to hold both
>>> the private and shared part of the guest memory.
>>>
>>> mm extension
>>> ---------------------
>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>> created with these flags cannot read(), write() or mmap() etc via normal
>>> MMU operations. The file content can only be used with the newly
>>> introduced memfile_notifier extension.
>>>
>>> The memfile_notifier extension provides two sets of callbacks for KVM to
>>> interact with the memory backing store:
>>>    - memfile_notifier_ops: callbacks for memory backing store to notify
>>>      KVM when memory gets invalidated.
>>>    - backing store callbacks: callbacks for KVM to call into memory
>>>      backing store to request memory pages for guest private memory.
>>>
>>> The memfile_notifier extension also provides APIs for memory backing
>>> store to register/unregister itself and to trigger the notifier when the
>>> bookmarked memory gets invalidated.
>>>
>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>> prevent double allocation caused by unintentional guest when we only
>>> have a single side of the shared/private memfds effective.
>>>
>>> memslot extension
>>> -----------------
>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>> that both private/shared guest memory can live in one single memslot.
>>> A page in the memslot is either private or shared. Whether a guest page
>>> is private or shared is maintained through reusing existing SEV ioctls
>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>
>>> Test
>>> ----
>>> To test the new functionalities of this patch TDX patchset is needed.
>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>
>>> -  Regresion test on kvm/queue (this patchset)
>>>     Most new code are not covered. Code also in below repo:
>>>     https://github.com/chao-p/linux/tree/privmem-v7
>>>
>>> -  New Funational test on latest TDX code
>>>     The patch is rebased to latest TDX code and tested the new
>>>     funcationalities. See below repos:
>>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>>
>> While debugging an issue with SEV+UPM, found that fallocate() returns
>> an error in QEMU which is not handled (EINTR). With the below handling
>> of EINTR subsequent fallocate() succeeds:
>>
>>
>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>> index af8fb0c957..e8597ed28d 100644
>> --- a/backends/hostmem-memfd-private.c
>> +++ b/backends/hostmem-memfd-private.c
>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>       MachineState *machine = MACHINE(qdev_get_machine());
>>       uint32_t ram_flags;
>>       char *name;
>> -    int fd, priv_fd;
>> +    int fd, priv_fd, ret;
>>         if (!backend->size) {
>>           error_setg(errp, "can't create backend with size 0");
>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>                                      backend->size, ram_flags, fd, 0, errp);
>>       g_free(name);
>>   -    fallocate(priv_fd, 0, 0, backend->size);
>> +again:
>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>> +    if (ret) {
>> +           perror("Fallocate failed: \n");
>> +           if (errno == EINTR)
>> +                   goto again;
>> +           else
>> +                   exit(1);
>> +    }
>>
>> However, fallocate() preallocates full guest memory before starting the guest.
>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>> prevent fallocate() from reserving full guest memory?
> 
> Isn't the pinning being handled by the corresponding host memory backend
> with mmu notifier and architecture support while doing the memory
> operations e.g page migration and swapping/reclaim (not supported
> currently AFAIU). But yes, we need to allocate entire guest memory with
> the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.

That is correct, but the question is when the memory gets allocated; as these
flags are set, memory is neither moved nor reclaimed. In the current scenario,
if I start a 32GB guest, all 32GB is allocated.

Regards
Nikunj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-11 13:32     ` Chao Peng
@ 2022-08-11 17:28       ` Nikunj A. Dadhania
  2022-08-12  3:22       ` Nikunj A. Dadhania
  1 sibling, 0 replies; 398+ messages in thread
From: Nikunj A. Dadhania @ 2022-08-11 17:28 UTC (permalink / raw)
  To: Chao Peng, Gupta, Pankaj
  Cc: Sean Christopherson, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, bharata, kvm, linux-kernel,
	linux-mm, linux-kselftest, linux-api, linux-doc, qemu-devel,
	linux-fsdevel

On 11/08/22 19:02, Chao Peng wrote:
> On Thu, Aug 11, 2022 at 01:30:06PM +0200, Gupta, Pankaj wrote:
>>

>>>> Test
>>>> ----
>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>
>>>> -  Regresion test on kvm/queue (this patchset)
>>>>     Most new code are not covered. Code also in below repo:
>>>>     https://github.com/chao-p/linux/tree/privmem-v7
>>>>
>>>> -  New Funational test on latest TDX code
>>>>     The patch is rebased to latest TDX code and tested the new
>>>>     funcationalities. See below repos:
>>>>     Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>>>
>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>> an error in QEMU which is not handled (EINTR). With the below handling
>>> of EINTR subsequent fallocate() succeeds:
> 
> The QEMU code has not been well tested, so it's not strange that you hit a
> problem. But per the man page, EINTR means a signal was caught; do you know
> which signal it was?

I haven't checked that, but it should be fairly straightforward to get.
I presume you are referring to signal_pending() in shmem_fallocate().

> Thanks for your patch, but before we change this in QEMU I want to make sure
> it's indeed a QEMU issue (e.g. not a kernel issue).

As per the manual, fallocate() can return EINTR, and this should be handled
by user space.
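
For context, a paraphrased sketch of that kernel-side behaviour (not verbatim
mm/shmem.c, which also accounts nr_falloced/nr_unswapped and handles hole
punching; shmem_getpage() is the internal helper there and the function below
is made up):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched/signal.h>
#include <linux/shmem_fs.h>

/* The per-page allocation loop bails out with -EINTR when a signal is
 * pending, which is why a large fallocate() can surface EINTR to user space. */
static int shmem_prealloc_sketch(struct inode *inode, pgoff_t start, pgoff_t end)
{
	struct page *page;
	pgoff_t index;
	int error;

	for (index = start; index < end; index++) {
		if (signal_pending(current))
			return -EINTR;	/* fallocate(2) permits EINTR */

		error = shmem_getpage(inode, index, &page, SGP_FALLOC);
		if (error)
			return error;

		set_page_dirty(page);
		unlock_page(page);
		put_page(page);
		cond_resched();
	}
	return 0;
}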

Regards
Nikunj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-11 17:18     ` Nikunj A. Dadhania
@ 2022-08-11 23:02       ` Gupta, Pankaj
  2022-08-12  6:02         ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-11 23:02 UTC (permalink / raw)
  To: Nikunj A. Dadhania, Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel

On 8/11/2022 7:18 PM, Nikunj A. Dadhania wrote:
> On 11/08/22 17:00, Gupta, Pankaj wrote:
>>
>>>> This is the v7 of this series which tries to implement the fd-based KVM
>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>> commit:
>>>>
>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>> split_desc_cache only by default capacity
>>>>
>>>> Introduction
>>>> ------------
>>>> In general this patch series introduce fd-based memslot which provides
>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>> hva/size. The fd can be created from a supported memory filesystem
>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>> and the the memory backing store exchange callbacks when such memslot
>>>> gets created. At runtime KVM will call into callbacks provided by the
>>>> backing store to get the pfn with the fd+offset. Memory backing store
>>>> will also call into KVM callbacks when userspace punch hole on the fd
>>>> to notify KVM to unmap secondary MMU page table entries.
>>>>
>>>> Comparing to existing hva-based memslot, this new type of memslot allows
>>>> guest memory unmapped from host userspace like QEMU and even the kernel
>>>> itself, therefore reduce attack surface and prevent bugs.
>>>>
>>>> Based on this fd-based memslot, we can build guest private memory that
>>>> is going to be used in confidential computing environments such as Intel
>>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>>> more enforcement on the fd and KVM can use a single memslot to hold both
>>>> the private and shared part of the guest memory.
>>>>
>>>> mm extension
>>>> ---------------------
>>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>>> created with these flags cannot read(), write() or mmap() etc via normal
>>>> MMU operations. The file content can only be used with the newly
>>>> introduced memfile_notifier extension.
>>>>
>>>> The memfile_notifier extension provides two sets of callbacks for KVM to
>>>> interact with the memory backing store:
>>>>     - memfile_notifier_ops: callbacks for memory backing store to notify
>>>>       KVM when memory gets invalidated.
>>>>     - backing store callbacks: callbacks for KVM to call into memory
>>>>       backing store to request memory pages for guest private memory.
>>>>
>>>> The memfile_notifier extension also provides APIs for memory backing
>>>> store to register/unregister itself and to trigger the notifier when the
>>>> bookmarked memory gets invalidated.
>>>>
>>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>>> prevent double allocation caused by unintentional guest when we only
>>>> have a single side of the shared/private memfds effective.
>>>>
>>>> memslot extension
>>>> -----------------
>>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>>> that both private/shared guest memory can live in one single memslot.
>>>> A page in the memslot is either private or shared. Whether a guest page
>>>> is private or shared is maintained through reusing existing SEV ioctls
>>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>>
>>>> Test
>>>> ----
>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>
>>>> -  Regresion test on kvm/queue (this patchset)
>>>>      Most new code are not covered. Code also in below repo:
>>>>      https://github.com/chao-p/linux/tree/privmem-v7
>>>>
>>>> -  New Funational test on latest TDX code
>>>>      The patch is rebased to latest TDX code and tested the new
>>>>      funcationalities. See below repos:
>>>>      Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>      QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>>>
>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>> an error in QEMU which is not handled (EINTR). With the below handling
>>> of EINTR subsequent fallocate() succeeds:
>>>
>>>
>>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>>> index af8fb0c957..e8597ed28d 100644
>>> --- a/backends/hostmem-memfd-private.c
>>> +++ b/backends/hostmem-memfd-private.c
>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>        MachineState *machine = MACHINE(qdev_get_machine());
>>>        uint32_t ram_flags;
>>>        char *name;
>>> -    int fd, priv_fd;
>>> +    int fd, priv_fd, ret;
>>>          if (!backend->size) {
>>>            error_setg(errp, "can't create backend with size 0");
>>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>                                       backend->size, ram_flags, fd, 0, errp);
>>>        g_free(name);
>>>    -    fallocate(priv_fd, 0, 0, backend->size);
>>> +again:
>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>> +    if (ret) {
>>> +           perror("Fallocate failed: \n");
>>> +           if (errno == EINTR)
>>> +                   goto again;
>>> +           else
>>> +                   exit(1);
>>> +    }
>>>
>>> However, fallocate() preallocates full guest memory before starting the guest.
>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>> prevent fallocate() from reserving full guest memory?
>>
>> Isn't the pinning being handled by the corresponding host memory backend
>> with mmu notifier and architecture support while doing the memory
>> operations e.g page migration and swapping/reclaim (not supported
>> currently AFAIU). But yes, we need to allocate entire guest memory with
>> the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
> 
> That is correct, but the question is when does the memory allocated, as these flags are set,
> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
> allocated.

I guess so if guest memory is private by default.

Another option would be to allocate memory as shared by default and handle
on-demand allocation and RMPUPDATE with the page state change event. But
still, that would be done at guest boot time, IIUC.

I might be missing some details on this, so better to wait for someone
more familiar to answer.

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-11 13:32     ` Chao Peng
  2022-08-11 17:28       ` Nikunj A. Dadhania
@ 2022-08-12  3:22       ` Nikunj A. Dadhania
  1 sibling, 0 replies; 398+ messages in thread
From: Nikunj A. Dadhania @ 2022-08-12  3:22 UTC (permalink / raw)
  To: Chao Peng, Sean Christopherson
  Cc: Sean Christopherson, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, bharata, kvm, linux-kernel,
	linux-mm, linux-kselftest, linux-api, linux-doc, qemu-devel,
	linux-fsdevel, Gupta, Pankaj



On 11/08/22 19:02, Chao Peng wrote:
> On Thu, Aug 11, 2022 at 01:30:06PM +0200, Gupta, Pankaj wrote:
>>>
>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>> an error in QEMU which is not handled (EINTR). With the below handling
>>> of EINTR subsequent fallocate() succeeds:
> 
> The QEMU code has not been well tested, so it's not strange that you hit a
> problem. But per the man page, EINTR means a signal was caught; do you know
> which signal it was?
> 
> Thanks for your patch, but before we change this in QEMU I want to make sure
> it's indeed a QEMU issue (e.g. not a kernel issue).
> 
>>>
>>>
>>> diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
>>> index af8fb0c957..e8597ed28d 100644
>>> --- a/backends/hostmem-memfd-private.c
>>> +++ b/backends/hostmem-memfd-private.c
>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>       MachineState *machine = MACHINE(qdev_get_machine());
>>>       uint32_t ram_flags;
>>>       char *name;
>>> -    int fd, priv_fd;
>>> +    int fd, priv_fd, ret;
>>>       if (!backend->size) {
>>>           error_setg(errp, "can't create backend with size 0");
>>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>>                                      backend->size, ram_flags, fd, 0, errp);
>>>       g_free(name);
>>> -    fallocate(priv_fd, 0, 0, backend->size);
>>> +again:
>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>> +    if (ret) {
>>> +           perror("Fallocate failed: \n");
>>> +           if (errno == EINTR)
>>> +                   goto again;
>>> +           else
>>> +                   exit(1);
>>> +    }
>>>
>>> However, fallocate() preallocates full guest memory before starting the guest.
>>> With this behaviour guest memory is *not* demand pinned. 

This is with reference to the SEV demand-pinning patches that I was working on.
The understanding was that UPM will not reserve memory for an SEV/TDX guest up
front, similar to a normal guest. Here is the relevant quote from the
discussion with Sean [1]:

	"I think we should abandon this approach in favor of committing all our resources
	to fd-based private memory[*], which (if done right) will provide on-demand pinning
	for "free". "

>>> Is there a way to prevent fallocate() from reserving full guest memory?
Regards
Nikunj
[1] https://lore.kernel.org/kvm/YkIh8zM7XfhsFN8L@google.com/


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-11 23:02       ` Gupta, Pankaj
@ 2022-08-12  6:02         ` Gupta, Pankaj
  2022-08-12  7:18           ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-12  6:02 UTC (permalink / raw)
  To: Nikunj A. Dadhania, Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel


>>>>> This is the v7 of this series which tries to implement the fd-based 
>>>>> KVM
>>>>> guest private memory. The patches are based on latest kvm/queue branch
>>>>> commit:
>>>>>
>>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>> split_desc_cache only by default capacity
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>> In general this patch series introduce fd-based memslot which provides
>>>>> guest memory through memory file descriptor fd[offset,size] instead of
>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>> and the the memory backing store exchange callbacks when such memslot
>>>>> gets created. At runtime KVM will call into callbacks provided by the
>>>>> backing store to get the pfn with the fd+offset. Memory backing store
>>>>> will also call into KVM callbacks when userspace punch hole on the fd
>>>>> to notify KVM to unmap secondary MMU page table entries.
>>>>>
>>>>> Comparing to existing hva-based memslot, this new type of memslot 
>>>>> allows
>>>>> guest memory unmapped from host userspace like QEMU and even the 
>>>>> kernel
>>>>> itself, therefore reduce attack surface and prevent bugs.
>>>>>
>>>>> Based on this fd-based memslot, we can build guest private memory that
>>>>> is going to be used in confidential computing environments such as 
>>>>> Intel
>>>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>>>> more enforcement on the fd and KVM can use a single memslot to hold 
>>>>> both
>>>>> the private and shared part of the guest memory.
>>>>>
>>>>> mm extension
>>>>> ---------------------
>>>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>>>> created with these flags cannot read(), write() or mmap() etc via 
>>>>> normal
>>>>> MMU operations. The file content can only be used with the newly
>>>>> introduced memfile_notifier extension.
>>>>>
>>>>> The memfile_notifier extension provides two sets of callbacks for 
>>>>> KVM to
>>>>> interact with the memory backing store:
>>>>>     - memfile_notifier_ops: callbacks for memory backing store to 
>>>>> notify
>>>>>       KVM when memory gets invalidated.
>>>>>     - backing store callbacks: callbacks for KVM to call into memory
>>>>>       backing store to request memory pages for guest private memory.
>>>>>
>>>>> The memfile_notifier extension also provides APIs for memory backing
>>>>> store to register/unregister itself and to trigger the notifier 
>>>>> when the
>>>>> bookmarked memory gets invalidated.
>>>>>
>>>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>>>> prevent double allocation caused by unintentional guest when we only
>>>>> have a single side of the shared/private memfds effective.
>>>>>
>>>>> memslot extension
>>>>> -----------------
>>>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>>>> that both private/shared guest memory can live in one single memslot.
>>>>> A page in the memslot is either private or shared. Whether a guest 
>>>>> page
>>>>> is private or shared is maintained through reusing existing SEV ioctls
>>>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>>>
>>>>> Test
>>>>> ----
>>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>>
>>>>> -  Regresion test on kvm/queue (this patchset)
>>>>>      Most new code are not covered. Code also in below repo:
>>>>>      
>>>>> https://github.com/chao-p/linux/tree/privmem-v7
>>>>>
>>>>>
>>>>> -  New Funational test on latest TDX code
>>>>>      The patch is rebased to latest TDX code and tested the new
>>>>>      funcationalities. See below repos:
>>>>>      Linux: 
>>>>> https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>>
>>>>>      QEMU: 
>>>>> https://github.com/chao-p/qemu/tree/privmem-v7
>>>>>
>>>>
>>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>>> an error in QEMU which is not handled (EINTR). With the below handling
>>>> of EINTR subsequent fallocate() succeeds:
>>>>
>>>>
>>>> diff --git a/backends/hostmem-memfd-private.c 
>>>> b/backends/hostmem-memfd-private.c
>>>> index af8fb0c957..e8597ed28d 100644
>>>> --- a/backends/hostmem-memfd-private.c
>>>> +++ b/backends/hostmem-memfd-private.c
>>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend 
>>>> *backend, Error **errp)
>>>>        MachineState *machine = MACHINE(qdev_get_machine());
>>>>        uint32_t ram_flags;
>>>>        char *name;
>>>> -    int fd, priv_fd;
>>>> +    int fd, priv_fd, ret;
>>>>          if (!backend->size) {
>>>>            error_setg(errp, "can't create backend with size 0");
>>>> @@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend 
>>>> *backend, Error **errp)
>>>>                                       backend->size, ram_flags, fd, 
>>>> 0, errp);
>>>>        g_free(name);
>>>>    -    fallocate(priv_fd, 0, 0, backend->size);
>>>> +again:
>>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>>> +    if (ret) {
>>>> +           perror("Fallocate failed: \n");
>>>> +           if (errno == EINTR)
>>>> +                   goto again;
>>>> +           else
>>>> +                   exit(1);
>>>> +    }
>>>>
>>>> However, fallocate() preallocates full guest memory before starting 
>>>> the guest.
>>>> With this behaviour guest memory is *not* demand pinned. Is there a 
>>>> way to
>>>> prevent fallocate() from reserving full guest memory?
>>>
>>> Isn't the pinning being handled by the corresponding host memory 
>>> backend with mmu > notifier and architecture support while doing the 
>>> memory operations e.g page> migration and swapping/reclaim (not 
>>> supported currently AFAIU). But yes, we need> to allocate entire 
>>> guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE 
>>> etc}.
>>
>> That is correct, but the question is when does the memory allocated, 
>> as these flags are set,
>> memory is neither moved nor reclaimed. In current scenario, if I start 
>> a 32GB guest, all 32GB is
>> allocated.
> 
> I guess so if guest memory is private by default.
> 
> Other option would be to allocate memory as shared by default and
> handle on demand allocation and RMPUPDATE with page state change event. 
> But still that would be done at guest boot time, IIUC.

Sorry! Don't want to hijack the other thread so replying here.

I thought the question was for SEV-SNP. For SEV, maybe the hypercall with 
the page state information can be used to allocate memory as it is used, 
or something like quota-based memory allocation (just thinking).

> 
> Might be missing some details on this. So, better to wait for someone 
> more familiar to answer.

Same applies here :)

Thanks,
Pankaj


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-12  6:02         ` Gupta, Pankaj
@ 2022-08-12  7:18           ` Gupta, Pankaj
  2022-08-12  8:48             ` Nikunj A. Dadhania
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-12  7:18 UTC (permalink / raw)
  To: Nikunj A. Dadhania, Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel


>>>>>> This is the v7 of this series which tries to implement the 
>>>>>> fd-based KVM
>>>>>> guest private memory. The patches are based on latest kvm/queue 
>>>>>> branch
>>>>>> commit:
>>>>>>
>>>>>>     b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
>>>>>> split_desc_cache only by default capacity
>>>>>>
>>>>>> Introduction
>>>>>> ------------
>>>>>> In general this patch series introduce fd-based memslot which 
>>>>>> provides
>>>>>> guest memory through memory file descriptor fd[offset,size] 
>>>>>> instead of
>>>>>> hva/size. The fd can be created from a supported memory filesystem
>>>>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>>>> and the the memory backing store exchange callbacks when such memslot
>>>>>> gets created. At runtime KVM will call into callbacks provided by the
>>>>>> backing store to get the pfn with the fd+offset. Memory backing store
>>>>>> will also call into KVM callbacks when userspace punch hole on the fd
>>>>>> to notify KVM to unmap secondary MMU page table entries.
>>>>>>
>>>>>> Comparing to existing hva-based memslot, this new type of memslot 
>>>>>> allows
>>>>>> guest memory unmapped from host userspace like QEMU and even the 
>>>>>> kernel
>>>>>> itself, therefore reduce attack surface and prevent bugs.
>>>>>>
>>>>>> Based on this fd-based memslot, we can build guest private memory 
>>>>>> that
>>>>>> is going to be used in confidential computing environments such as 
>>>>>> Intel
>>>>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>>>>> more enforcement on the fd and KVM can use a single memslot to 
>>>>>> hold both
>>>>>> the private and shared part of the guest memory.
>>>>>>
>>>>>> mm extension
>>>>>> ---------------------
>>>>>> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
>>>>>> created with these flags cannot read(), write() or mmap() etc via 
>>>>>> normal
>>>>>> MMU operations. The file content can only be used with the newly
>>>>>> introduced memfile_notifier extension.
>>>>>>
>>>>>> The memfile_notifier extension provides two sets of callbacks for 
>>>>>> KVM to
>>>>>> interact with the memory backing store:
>>>>>>     - memfile_notifier_ops: callbacks for memory backing store to 
>>>>>> notify
>>>>>>       KVM when memory gets invalidated.
>>>>>>     - backing store callbacks: callbacks for KVM to call into memory
>>>>>>       backing store to request memory pages for guest private memory.
>>>>>>
>>>>>> The memfile_notifier extension also provides APIs for memory backing
>>>>>> store to register/unregister itself and to trigger the notifier 
>>>>>> when the
>>>>>> bookmarked memory gets invalidated.
>>>>>>
>>>>>> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
>>>>>> prevent double allocation caused by unintentional guest when we only
>>>>>> have a single side of the shared/private memfds effective.
>>>>>>
>>>>>> memslot extension
>>>>>> -----------------
>>>>>> Add the private fd and the fd offset to existing 'shared' memslot so
>>>>>> that both private/shared guest memory can live in one single memslot.
>>>>>> A page in the memslot is either private or shared. Whether a guest 
>>>>>> page
>>>>>> is private or shared is maintained through reusing existing SEV 
>>>>>> ioctls
>>>>>> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>>>>
>>>>>> Test
>>>>>> ----
>>>>>> To test the new functionalities of this patch TDX patchset is needed.
>>>>>> Since TDX patchset has not been merged so I did two kinds of test:
>>>>>>
>>>>>> -  Regresion test on kvm/queue (this patchset)
>>>>>>      Most new code are not covered. Code also in below repo:
>>>>>> https://github.com/chao-p/linux/tree/privmem-v7
>>>>>>
>>>>>>
>>>>>>
>>>>>> -  New Funational test on latest TDX code
>>>>>>      The patch is rebased to latest TDX code and tested the new
>>>>>>      funcationalities. See below repos:
>>>>>>      Linux: 
>>>>>> https://github.com/chao-p/linux/tree/privmem-v7-tdx
>>>>>>
>>>>>>
>>>>>>      QEMU: 
>>>>>> https://github.com/chao-p/qemu/tree/privmem-v7
>>>>>>
>>>>>>
>>>>>
>>>>> While debugging an issue with SEV+UPM, found that fallocate() returns
>>>>> an error in QEMU which is not handled (EINTR). With the below handling
>>>>> of EINTR subsequent fallocate() succeeds:
>>>>>
>>>>>
>>>>> diff --git a/backends/hostmem-memfd-private.c 
>>>>> b/backends/hostmem-memfd-private.c
>>>>> index af8fb0c957..e8597ed28d 100644
>>>>> --- a/backends/hostmem-memfd-private.c
>>>>> +++ b/backends/hostmem-memfd-private.c
>>>>> @@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend 
>>>>> *backend, Error **errp)
>>>>>        MachineState *machine = MACHINE(qdev_get_machine());
>>>>>        uint32_t ram_flags;
>>>>>        char *name;
>>>>> -    int fd, priv_fd;
>>>>> +    int fd, priv_fd, ret;
>>>>>          if (!backend->size) {
>>>>>            error_setg(errp, "can't create backend with size 0");
>>>>> @@ -65,7 +65,15 @@ 
>>>>> priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error 
>>>>> **errp)
>>>>>                                       backend->size, ram_flags, fd, 
>>>>> 0, errp);
>>>>>        g_free(name);
>>>>>    -    fallocate(priv_fd, 0, 0, backend->size);
>>>>> +again:
>>>>> +    ret = fallocate(priv_fd, 0, 0, backend->size);
>>>>> +    if (ret) {
>>>>> +           perror("Fallocate failed: \n");
>>>>> +           if (errno == EINTR)
>>>>> +                   goto again;
>>>>> +           else
>>>>> +                   exit(1);
>>>>> +    }
>>>>>
>>>>> However, fallocate() preallocates full guest memory before starting 
>>>>> the guest.
>>>>> With this behaviour guest memory is *not* demand pinned. Is there a 
>>>>> way to
>>>>> prevent fallocate() from reserving full guest memory?
>>>>
>>>> Isn't the pinning being handled by the corresponding host memory 
>>>> backend with mmu > notifier and architecture support while doing the 
>>>> memory operations e.g page> migration and swapping/reclaim (not 
>>>> supported currently AFAIU). But yes, we need> to allocate entire 
>>>> guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE 
>>>> etc}.
>>>
>>> That is correct, but the question is when does the memory allocated, 
>>> as these flags are set,
>>> memory is neither moved nor reclaimed. In current scenario, if I 
>>> start a 32GB guest, all 32GB is
>>> allocated.
>>
>> I guess so if guest memory is private by default.
>>
>> Other option would be to allocate memory as shared by default and
>> handle on demand allocation and RMPUPDATE with page state change 
>> event. But still that would be done at guest boot time, IIUC.
> 
> Sorry! Don't want to hijack the other thread so replying here.
> 
> I thought the question is for SEV SNP. For SEV, maybe the hypercall with 
> the page state information can be used to allocate memory as we use it 
> or something like quota based memory allocation (just thinking).

But all this would have considerable performance overhead (if memory is 
shared by default) and would be used mostly at boot time. So preallocating 
memory (with memory private by default) seems a better approach for both 
SEV & SEV-SNP, with later page management (pinning, reclaim) taken care of 
by the host memory backend & architecture together.

Or maybe later we can think of something like allowing direct page faults 
on host memory access for *SEV* guests, as there is no strict requirement 
for a memory integrity guarantee, and to avoid the performance overhead.

Don't know if it is feasible, just sharing my thoughts.

Thanks,
Pankaj

> 
>>
>> Might be missing some details on this. So, better to wait for someone 
>> more familiar to answer.
> 
> Same applies here :)
> 
> Thanks,
> Pankaj
> 
> 


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-12  7:18           ` Gupta, Pankaj
@ 2022-08-12  8:48             ` Nikunj A. Dadhania
  2022-08-12  9:33               ` Gupta, Pankaj
  2022-08-15 13:04               ` Chao Peng
  0 siblings, 2 replies; 398+ messages in thread
From: Nikunj A. Dadhania @ 2022-08-12  8:48 UTC (permalink / raw)
  To: Gupta, Pankaj, Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel



On 12/08/22 12:48, Gupta, Pankaj wrote:
> 
>>>>>>
>>>>>> However, fallocate() preallocates full guest memory before starting the guest.
>>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>>>>> prevent fallocate() from reserving full guest memory?
>>>>>
>>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
>>>>
>>>> That is correct, but the question is when does the memory allocated, as these flags are set,
>>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
>>>> allocated.
>>>
>>> I guess so if guest memory is private by default.
>>>
>>> Other option would be to allocate memory as shared by default and
>>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
>>
>> Sorry! Don't want to hijack the other thread so replying here.
>>
>> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
> 
> But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time. 

> So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.

I am not sure how pre-allocating memory will help; even if the guest does not use the full memory, it will all be pre-allocated, which, if I understand correctly, is not expected.

Regards
Nikunj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-12  8:48             ` Nikunj A. Dadhania
@ 2022-08-12  9:33               ` Gupta, Pankaj
  2022-08-15 13:04               ` Chao Peng
  1 sibling, 0 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-12  9:33 UTC (permalink / raw)
  To: Nikunj A. Dadhania, Chao Peng, Sean Christopherson
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	bharata, kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel


>>>>>>>
>>>>>>> However, fallocate() preallocates full guest memory before starting the guest.
>>>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>>>>>> prevent fallocate() from reserving full guest memory?
>>>>>>
>>>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
>>>>>
>>>>> That is correct, but the question is when does the memory allocated, as these flags are set,
>>>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
>>>>> allocated.
>>>>
>>>> I guess so if guest memory is private by default.
>>>>
>>>> Other option would be to allocate memory as shared by default and
>>>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
>>>
>>> Sorry! Don't want to hijack the other thread so replying here.
>>>
>>> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
>>
>> But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time.
> 
>> So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.
> 
> I am not sure how will pre-allocating memory help, even if guest would not use full memory it will be pre-allocated. Which if I understand correctly is not expected.

For SEV I am also not very sure what the best way would be.
There could be a tradeoff between memory pinning and performance,
as I was also thinking about the "Async page fault" aspect of the guest
in my previous reply. Details need to be figured out.

Quoting my previous reply here:
"Or maybe later we can think of something like allowing direct page 
fault on host memory access for *SEV* guest as there is no strict 
requirement for memory integrity guarantee and the performance overhead."

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-12  8:48             ` Nikunj A. Dadhania
  2022-08-12  9:33               ` Gupta, Pankaj
@ 2022-08-15 13:04               ` Chao Peng
  2022-08-16  4:28                 ` Nikunj A. Dadhania
  2022-08-16 11:33                 ` Gupta, Pankaj
  1 sibling, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-15 13:04 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: Gupta, Pankaj, Sean Christopherson, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, bharata, kvm, linux-kernel,
	linux-mm, linux-kselftest, linux-api, linux-doc, qemu-devel,
	linux-fsdevel

On Fri, Aug 12, 2022 at 02:18:43PM +0530, Nikunj A. Dadhania wrote:
> 
> 
> On 12/08/22 12:48, Gupta, Pankaj wrote:
> > 
> >>>>>>
> >>>>>> However, fallocate() preallocates full guest memory before starting the guest.
> >>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
> >>>>>> prevent fallocate() from reserving full guest memory?
> >>>>>
> >>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
> >>>>
> >>>> That is correct, but the question is when does the memory allocated, as these flags are set,
> >>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
> >>>> allocated.
> >>>
> >>> I guess so if guest memory is private by default.
> >>>
> >>> Other option would be to allocate memory as shared by default and
> >>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
> >>
> >> Sorry! Don't want to hijack the other thread so replying here.
> >>
> >> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
> > 
> > But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time. 
> 
> > So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.
> 
> I am not sure how will pre-allocating memory help, even if guest would not use full memory it will be pre-allocated. Which if I understand correctly is not expected.

Actually the current version allows you to delay the allocation to a
later time (e.g. page fault time) if you don't call fallocate() on the
private fd. fallocate() was necessary in previous versions because we
treated existence in the fd as 'private', but in this version we track
the private/shared info in KVM, so we don't rely on that fact from the
memory backing stores.

Definitely the page will still be pinned once it's allocated; there is
no way to swap it out, for example, just with the current code. That kind
of support, if desirable, can be added through the MOVABLE flag and some
other callbacks to let feature-specific code get involved.
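
For completeness, a minimal userspace sketch of the optional
preallocation path (assuming a private fd created with MFD_INACCESSIBLE
as in this series; skipping the call defers allocation to fault time,
and the EINTR retry addresses the failure reported earlier in this
thread):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Illustrative only: preallocate the whole private fd up-front,
 * retrying if the call is interrupted by a signal. */
static int preallocate_private_fd(int priv_fd, off_t size)
{
        int ret;

        do {
                ret = fallocate(priv_fd, 0, 0, size);
        } while (ret < 0 && errno == EINTR);

        return ret;
}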

Chao
> 
> Regards
> Nikunj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-15 13:04               ` Chao Peng
@ 2022-08-16  4:28                 ` Nikunj A. Dadhania
  2022-08-16 11:33                 ` Gupta, Pankaj
  1 sibling, 0 replies; 398+ messages in thread
From: Nikunj A. Dadhania @ 2022-08-16  4:28 UTC (permalink / raw)
  To: Chao Peng
  Cc: Gupta, Pankaj, Sean Christopherson, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, bharata, kvm, linux-kernel,
	linux-mm, linux-kselftest, linux-api, linux-doc, qemu-devel,
	linux-fsdevel

On 15/08/22 18:34, Chao Peng wrote:
> On Fri, Aug 12, 2022 at 02:18:43PM +0530, Nikunj A. Dadhania wrote:
>>
>>
>> On 12/08/22 12:48, Gupta, Pankaj wrote:
>>>
>>>>>>>>
>>>>>>>> However, fallocate() preallocates full guest memory before starting the guest.
>>>>>>>> With this behaviour guest memory is *not* demand pinned. Is there a way to
>>>>>>>> prevent fallocate() from reserving full guest memory?
>>>>>>>
>>>>>>> Isn't the pinning being handled by the corresponding host memory backend with mmu > notifier and architecture support while doing the memory operations e.g page> migration and swapping/reclaim (not supported currently AFAIU). But yes, we need> to allocate entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE etc}.
>>>>>>
>>>>>> That is correct, but the question is when does the memory allocated, as these flags are set,
>>>>>> memory is neither moved nor reclaimed. In current scenario, if I start a 32GB guest, all 32GB is
>>>>>> allocated.
>>>>>
>>>>> I guess so if guest memory is private by default.
>>>>>
>>>>> Other option would be to allocate memory as shared by default and
>>>>> handle on demand allocation and RMPUPDATE with page state change event. But still that would be done at guest boot time, IIUC.
>>>>
>>>> Sorry! Don't want to hijack the other thread so replying here.
>>>>
>>>> I thought the question is for SEV SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as we use it or something like quota based memory allocation (just thinking).
>>>
>>> But all this would have considerable performance overhead (if by default memory is shared) and used mostly at boot time. 
>>
>>> So, preallocating memory (default memory private) seems better approach for both SEV & SEV SNP with later page management (pinning, reclaim) taken care by host memory backend & architecture together.
>>
>> I am not sure how will pre-allocating memory help, even if guest would not use full memory it will be pre-allocated. Which if I understand correctly is not expected.
> 
> Actually the current version allows you to delay the allocation to a
> later time (e.g. page fault time) if you don't call fallocate() on the
> private fd. fallocate() is necessary in previous versions because we
> treat the existense in the fd as 'private' but in this version we track
> private/shared info in KVM so we don't rely on that fact from memory
> backstores.

Thanks for confirming, Chao. In that case we can drop fallocate() from qemu 
in both cases:
* Once while creating the private memfd object
* During ram_block_convert_range() for shared->private and vice versa.
 
> Definitely the page will still be pinned once it's allocated, there is
> no way to swap it out for example just with the current code. 

Agree, at present once the page is brought in, it will remain until VM shutdown.

> That kind of support, if desirable, can be extended through MOVABLE flag and some
> other callbacks to let feature-specific code to involve.

Sure, that could be future work.

Regards
Nikunj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-15 13:04               ` Chao Peng
  2022-08-16  4:28                 ` Nikunj A. Dadhania
@ 2022-08-16 11:33                 ` Gupta, Pankaj
  2022-08-16 12:24                   ` Kirill A . Shutemov
  1 sibling, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-16 11:33 UTC (permalink / raw)
  To: Chao Peng
  Cc: Nikunj A. Dadhania, Sean Christopherson, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, bharata, kvm, linux-kernel,
	linux-mm, linux-kselftest, linux-api, linux-doc, qemu-devel,
	linux-fsdevel

Hi Chao,

> 
> Actually the current version allows you to delay the allocation to a
> later time (e.g. page fault time) if you don't call fallocate() on the
> private fd. fallocate() is necessary in previous versions because we
> treat the existense in the fd as 'private' but in this version we track
> private/shared info in KVM so we don't rely on that fact from memory
> backstores.

Does this also mean that reservation of guest physical memory with the 
secure processor (for both SEV-SNP & TDX) will happen at page fault time?

Do we plan to keep it this way?

Thanks,
Pankaj
> 
> Definitely the page will still be pinned once it's allocated, there is
> no way to swap it out for example just with the current code. That kind
> of support, if desirable, can be extended through MOVABLE flag and some
> other callbacks to let feature-specific code to involve.


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-16 11:33                 ` Gupta, Pankaj
@ 2022-08-16 12:24                   ` Kirill A . Shutemov
  2022-08-16 13:03                     ` Gupta, Pankaj
  0 siblings, 1 reply; 398+ messages in thread
From: Kirill A . Shutemov @ 2022-08-16 12:24 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Chao Peng, Nikunj A. Dadhania, Sean Christopherson,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, bharata, kvm, linux-kernel, linux-mm,
	linux-kselftest, linux-api, linux-doc, qemu-devel, linux-fsdevel

On Tue, Aug 16, 2022 at 01:33:00PM +0200, Gupta, Pankaj wrote:
> Hi Chao,
> 
> > 
> > Actually the current version allows you to delay the allocation to a
> > later time (e.g. page fault time) if you don't call fallocate() on the
> > private fd. fallocate() is necessary in previous versions because we
> > treat the existense in the fd as 'private' but in this version we track
> > private/shared info in KVM so we don't rely on that fact from memory
> > backstores.
> 
> Does this also mean reservation of guest physical memory with secure
> processor (both for SEV-SNP & TDX) will also happen at page fault time?
> 
> Do we plan to keep it this way?

If you are talking about accepting memory by the guest, it is initiated by
the guest and has nothing to do with page fault time vs fallocate()
allocation of host memory. I mean acceptance happens after host memory
allocation but they are not in lockstep, acceptance can happen much later.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-16 12:24                   ` Kirill A . Shutemov
@ 2022-08-16 13:03                     ` Gupta, Pankaj
  2022-08-16 15:38                       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-16 13:03 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Chao Peng, Nikunj A. Dadhania, Sean Christopherson,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, bharata, kvm, linux-kernel, linux-mm,
	linux-kselftest, linux-api, linux-doc, qemu-devel, linux-fsdevel


>>> Actually the current version allows you to delay the allocation to a
>>> later time (e.g. page fault time) if you don't call fallocate() on the
>>> private fd. fallocate() is necessary in previous versions because we
>>> treat the existense in the fd as 'private' but in this version we track
>>> private/shared info in KVM so we don't rely on that fact from memory
>>> backstores.
>>
>> Does this also mean reservation of guest physical memory with secure
>> processor (both for SEV-SNP & TDX) will also happen at page fault time?
>>
>> Do we plan to keep it this way?
> 
> If you are talking about accepting memory by the guest, it is initiated by
> the guest and has nothing to do with page fault time vs fallocate()
> allocation of host memory. I mean acceptance happens after host memory
> allocation but they are not in lockstep, acceptance can happen much later.

No, I meant reserving the guest physical memory range from the hypervisor, 
e.g. with RMPUpdate for SEV-SNP or the equivalent on the TDX side (PAMTs?).

Thanks,
Pankaj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-16 13:03                     ` Gupta, Pankaj
@ 2022-08-16 15:38                       ` Sean Christopherson
  2022-08-17 15:27                         ` Michael Roth
  2022-08-23 17:41                         ` Gupta, Pankaj
  0 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2022-08-16 15:38 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Kirill A . Shutemov, Chao Peng, Nikunj A. Dadhania,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, bharata, kvm, linux-kernel, linux-mm,
	linux-kselftest, linux-api, linux-doc, qemu-devel, linux-fsdevel

On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
> 
> > > > Actually the current version allows you to delay the allocation to a
> > > > later time (e.g. page fault time) if you don't call fallocate() on the
> > > > private fd. fallocate() is necessary in previous versions because we
> > > > treat the existense in the fd as 'private' but in this version we track
> > > > private/shared info in KVM so we don't rely on that fact from memory
> > > > backstores.
> > > 
> > > Does this also mean reservation of guest physical memory with secure
> > > processor (both for SEV-SNP & TDX) will also happen at page fault time?
> > > 
> > > Do we plan to keep it this way?
> > 
> > If you are talking about accepting memory by the guest, it is initiated by
> > the guest and has nothing to do with page fault time vs fallocate()
> > allocation of host memory. I mean acceptance happens after host memory
> > allocation but they are not in lockstep, acceptance can happen much later.
> 
> No, I meant reserving guest physical memory range from hypervisor e.g with
> RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).

As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
for userspace to pre-map guest memory.

I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
initializing guest private memory with a source page can be implemented via a
flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
flag could be added to say that the dest is also the source.

The TDX and SNP restrictions would then become additional restrictions on when
initializing with a source is allowed (and VMs that don't have guest private
memory wouldn't allow the flag at all).
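
As a rough illustration only, the payload for such a generic pre-mapping
ioctl() might look something like the below; the structure and flag
names are hypothetical and not taken from this series or from the
existing KVM_TDX_INIT_MEM_REGION proposal:

#include <linux/types.h>

/* Hypothetical sketch, not a proposed uAPI. */
struct kvm_pre_map_region {
        __u64 gpa;      /* guest physical address to pre-map */
        __u64 size;     /* size in bytes */
        __u64 source;   /* userspace source, if _HAS_SOURCE is set */
        __u64 flags;    /* e.g. PRE_MAP_HAS_SOURCE, PRE_MAP_IN_PLACE */
};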

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-16 15:38                       ` Sean Christopherson
@ 2022-08-17 15:27                         ` Michael Roth
  2022-08-23  1:25                           ` Isaku Yamahata
  2022-08-23 17:41                         ` Gupta, Pankaj
  1 sibling, 1 reply; 398+ messages in thread
From: Michael Roth @ 2022-08-17 15:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Gupta, Pankaj, Kirill A . Shutemov, Chao Peng,
	Nikunj A. Dadhania, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, luto, jun.nakajima, dave.hansen, ak, david, aarcange,
	ddutile, dhildenb, Quentin Perret, mhocko, Muchun Song, bharata,
	kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel

On Tue, Aug 16, 2022 at 03:38:08PM +0000, Sean Christopherson wrote:
> On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
> > 
> > > > > Actually the current version allows you to delay the allocation to a
> > > > > later time (e.g. page fault time) if you don't call fallocate() on the
> > > > > private fd. fallocate() is necessary in previous versions because we
> > > > > treat the existense in the fd as 'private' but in this version we track
> > > > > private/shared info in KVM so we don't rely on that fact from memory
> > > > > backstores.
> > > > 
> > > > Does this also mean reservation of guest physical memory with secure
> > > > processor (both for SEV-SNP & TDX) will also happen at page fault time?
> > > > 
> > > > Do we plan to keep it this way?
> > > 
> > > If you are talking about accepting memory by the guest, it is initiated by
> > > the guest and has nothing to do with page fault time vs fallocate()
> > > allocation of host memory. I mean acceptance happens after host memory
> > > allocation but they are not in lockstep, acceptance can happen much later.
> > 
> > No, I meant reserving guest physical memory range from hypervisor e.g with
> > RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
> 
> As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
> for userspace to pre-map guest memory.

Hi Sean,

Currently I have the rmpupdate hook in KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION
ioctls, so that when the pages actually get faulted in they are already
in the expected state. I have userspace set up to call
KVM_MEMORY_ENCRYPT_* in response to explicit page state changes issued by
the guest, as well as in response to MEMORY_FAULT exits for implicit page
state changes.

Initially the private backing store may or may not be pre-fallocate()'d
depending on how userspace wants to handle it. If it's not
pre-fallocate()'d, then the pages don't get faulted in until the guest
does explicit page state changes (currently SNP guests will do this for all
memory at boot time, but with unaccepted memory patches for guest/ovmf
this will happen during guest run-time, which would still allow us to
make efficient use of lazy-pinning support for shorter boot times).

If userspace wants to pre-allocate, it can issue the fallocate() for
all the ranges up-front so it doesn't incur the cost during run-time.

Is that compatible with the proposed design?
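
(For reference, the userspace side of that page-state-change path boils
down to something like the sketch below, using the existing struct
kvm_enc_region; exactly how the region is addressed under UPM is still
part of this discussion.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative only: mark a region private in response to an explicit
 * or implicit page state change reported for the guest. */
static int set_region_private(int vm_fd, uint64_t addr, uint64_t size)
{
        struct kvm_enc_region region = {
                .addr = addr,
                .size = size,
        };

        return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
}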

Of course, for the initial encrypted payload, we would need to issue
the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION up-front. I'm doing that in
conjunction with the hack to allow pwrite() to memfd to pre-populate the
private pages before the in-place encryption that occurs when
SNP_LAUNCH_UPDATE is issued...

In the past you and Vishal suggested doing the copy from within
SNP_LAUNCH_UPDATE, which seems like a workable solution and something
we've been meaning to implement...

> 
> I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
> vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
> initializing guest private memory with a source page can be implemented via a
> flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
> flag could be added to say that the dest is also the source.

So is this proposed ioctl only intended to handle the initial encrypted
payload, and the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls would
still be used for conversions post-boot?

If so, that seems reasonable, but I thought there was some consensus that
just handling it per-platform in, e.g., SNP_LAUNCH_UPDATE, was
sufficient for now until some additional need arose for a new interface.
Has something changed in that regard? Just want to understand the
motivations so we can plan accordingly.

Thanks!

-Mike

> 
> The TDX and SNP restrictions would then become addition restrictions on when
> initializing with a source is allowed (and VMs that don't have guest private
> memory wouldn't allow the flag at all).

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-05 17:55     ` Paolo Bonzini
  2022-08-05 18:06       ` David Hildenbrand
  2022-08-10  9:38       ` Chao Peng
@ 2022-08-17 23:41       ` Kirill A. Shutemov
  2022-08-18  9:09         ` Paolo Bonzini
  2022-08-23  7:36         ` David Hildenbrand
  2 siblings, 2 replies; 398+ messages in thread
From: Kirill A. Shutemov @ 2022-08-17 23:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Hildenbrand, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
> On 7/21/22 11:44, David Hildenbrand wrote:
> > 
> > Also, I*think*  you can place pages via userfaultfd into shmem. Not
> > sure if that would count "auto alloc", but it would certainly bypass
> > fallocate().
> 
> Yeah, userfaultfd_register would probably have to forbid this for
> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
> would do something like memfile_node_get_flags(vma->vm_file) and check the
> result.

I donno, memory allocation with userfaultfd looks pretty intentional to
me. Why would F_SEAL_AUTO_ALLOCATE prevent it?

Maybe we would need it in the future for post-copy migration or something?

Or do existing practices around userfaultfd touch memory randomly and
are therefore incompatible with the F_SEAL_AUTO_ALLOCATE intent?

Note, that userfaultfd is only relevant for shared memory as it requires
VMA which we don't have for MFD_INACCESSIBLE.
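
(For context, the seal under discussion is applied like any other memfd
seal; F_SEAL_AUTO_ALLOCATE comes from patch 01 of this series and is
not in mainline headers, so the snippet below is only a sketch of the
intended usage.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: shared-side memfd where only explicit fallocate() may
 * allocate; implicit allocation on write or fault would be rejected
 * once F_SEAL_AUTO_ALLOCATE (from this series) is applied. */
static int create_shared_backend(size_t size)
{
        int fd = memfd_create("guest-shared", MFD_ALLOW_SEALING);

        if (fd < 0)
                return -1;
        if (ftruncate(fd, size) ||
            fcntl(fd, F_ADD_SEALS, F_SEAL_AUTO_ALLOCATE)) {
                close(fd);
                return -1;
        }
        return fd;
}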

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (15 preceding siblings ...)
  2022-08-11 10:02 ` Nikunj A. Dadhania
@ 2022-08-18  5:40 ` Hugh Dickins
  2022-08-18 13:24   ` Kirill A . Shutemov
  2022-08-26 15:19 ` Fuad Tabba
  2022-09-09 15:35 ` Michael Roth
  18 siblings, 1 reply; 398+ messages in thread
From: Hugh Dickins @ 2022-08-18  5:40 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Wed, 6 Jul 2022, Chao Peng wrote:
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory.

Here at last are my reluctant thoughts on this patchset.

fd-based approach for supporting KVM guest private memory: fine.

Use or abuse of memfd and shmem.c: mistaken.

memfd_create() was an excellent way to put together the initial prototype.

But since then, TDX in particular has forced an effort into preventing
(by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.

Are any of the shmem.c mods useful to existing users of shmem.c? No.
Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

What use do you have for a filesystem here?  Almost none.
IIUC, what you want is an fd through which QEMU can allocate kernel
memory, selectively free that memory, and communicate fd+offset+length
to KVM.  And perhaps an interface to initialize a little of that memory
from a template (presumably copied from a real file on disk somewhere).

You don't need shmem.c or a filesystem for that!

If your memory could be swapped, that would be enough of a good reason
to make use of shmem.c: but it cannot be swapped; and although there
are some references in the mailthreads to it perhaps being swappable
in future, I get the impression that will not happen soon if ever.

If your memory could be migrated, that would be some reason to use
filesystem page cache (because page migration happens to understand
that type of memory): but it cannot be migrated.

Some of these impressions may come from earlier iterations of the
patchset (v7 looks better in several ways than v5).  I am probably
underestimating the extent to which you have taken on board other
usages beyond TDX and SEV private memory, and rightly want to serve
them all with similar interfaces: perhaps there is enough justification
for shmem there, but I don't see it.  There was mention of userfaultfd
in one link: does that provide the justification for using shmem?

I'm afraid of the special demands you may make of memory allocation
later on - surprised that huge pages are not mentioned already;
gigantic contiguous extents? secretmem removed from direct map?

Here's what I would prefer, and imagine much easier for you to maintain;
but I'm no system designer, and may be misunderstanding throughout.

QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
the fallocate syscall interface itself) to allocate and free the memory,
ioctl for initializing some of it too.  KVM in control of whether that
fd can be read or written or mmap'ed or whatever, no need to prevent it
in shmem.c, no need for flags, seals, notifications to and fro because
KVM is already in control and knows the history.  If shmem actually has
value, call into it underneath - somewhat like SysV SHM, and /dev/zero
mmap, and i915/gem make use of it underneath.  If shmem has nothing to
add, just allocate and free kernel memory directly, recorded in your
own xarray.
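
(To make that concrete, a very rough kernel-side sketch of "allocate and
free kernel memory directly, recorded in your own xarray" might look
like the below; the names are made up and nothing here comes from a
posted patch.)

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/xarray.h>

/* Hypothetical backing object: pages indexed by offset, no filesystem. */
struct kvm_private_mem {
        struct xarray pages;            /* pgoff_t -> struct page * */
};

static int kvm_private_mem_alloc(struct kvm_private_mem *m, pgoff_t index)
{
        struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
        int err;

        if (!page)
                return -ENOMEM;

        err = xa_err(xa_store(&m->pages, index, page, GFP_KERNEL_ACCOUNT));
        if (err)
                __free_page(page);
        return err;
}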

With that /dev/kvm_something subject to access controls and LSMs -
which I cannot find for memfd_create().  Full marks for including the
MFD_INACCESSIBLE manpage update, and for Cc'ing linux-api: but I'd
have expected some doubts from that direction already.

Hugh

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-17 23:41       ` Kirill A. Shutemov
@ 2022-08-18  9:09         ` Paolo Bonzini
  2022-08-23  7:36         ` David Hildenbrand
  1 sibling, 0 replies; 398+ messages in thread
From: Paolo Bonzini @ 2022-08-18  9:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David Hildenbrand, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On 8/18/22 01:41, Kirill A. Shutemov wrote:
> Note, that userfaultfd is only relevant for shared memory as it requires
> VMA which we don't have for MFD_INACCESSIBLE.

Oh, you're right!  So yeah, looks like userfaultfd is not a problem.

Paolo

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-18  5:40 ` Hugh Dickins
@ 2022-08-18 13:24   ` Kirill A . Shutemov
  2022-08-19  0:20     ` Sean Christopherson
                       ` (2 more replies)
  0 siblings, 3 replies; 398+ messages in thread
From: Kirill A . Shutemov @ 2022-08-18 13:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, Gupta, Pankaj

On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> On Wed, 6 Jul 2022, Chao Peng wrote:
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory.
> 
> Here at last are my reluctant thoughts on this patchset.
> 
> fd-based approach for supporting KVM guest private memory: fine.
> 
> Use or abuse of memfd and shmem.c: mistaken.
> 
> memfd_create() was an excellent way to put together the initial prototype.
> 
> But since then, TDX in particular has forced an effort into preventing
> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> 
> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> 
> What use do you have for a filesystem here?  Almost none.
> IIUC, what you want is an fd through which QEMU can allocate kernel
> memory, selectively free that memory, and communicate fd+offset+length
> to KVM.  And perhaps an interface to initialize a little of that memory
> from a template (presumably copied from a real file on disk somewhere).
> 
> You don't need shmem.c or a filesystem for that!
> 
> If your memory could be swapped, that would be enough of a good reason
> to make use of shmem.c: but it cannot be swapped; and although there
> are some references in the mailthreads to it perhaps being swappable
> in future, I get the impression that will not happen soon if ever.
> 
> If your memory could be migrated, that would be some reason to use
> filesystem page cache (because page migration happens to understand
> that type of memory): but it cannot be migrated.

Migration support is in the pipeline. It is part of TDX 1.5 [1]. And
swapping is theoretically possible, but I'm not aware of any plans as of now.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

> Some of these impressions may come from earlier iterations of the
> patchset (v7 looks better in several ways than v5).  I am probably
> underestimating the extent to which you have taken on board other
> usages beyond TDX and SEV private memory, and rightly want to serve
> them all with similar interfaces: perhaps there is enough justification
> for shmem there, but I don't see it.  There was mention of userfaultfd
> in one link: does that provide the justification for using shmem?
> 
> I'm afraid of the special demands you may make of memory allocation
> later on - surprised that huge pages are not mentioned already;
> gigantic contiguous extents? secretmem removed from direct map?

The design allows for extension to hugetlbfs if needed. A combination of
MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
implications for shmem. It is going to be a separate struct memfile_backing_store.
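
(i.e., roughly the following from userspace; MFD_INACCESSIBLE is
introduced by this series and is not in mainline headers, so this is
only a sketch:)

#define _GNU_SOURCE
#include <sys/mman.h>

/* Sketch: private, hugetlb-backed guest memory as proposed here. */
static int create_private_hugetlb_fd(void)
{
        return memfd_create("guest-private",
                            MFD_INACCESSIBLE | MFD_HUGETLB);
}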

I'm not sure secretmem is a fit here, as we want to extend MFD_INACCESSIBLE
to be movable if the platform supports it, and secretmem is not migratable
by design (without direct mapping fragmentation).

> Here's what I would prefer, and imagine much easier for you to maintain;
> but I'm no system designer, and may be misunderstanding throughout.
> 
> QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> the fallocate syscall interface itself) to allocate and free the memory,
> ioctl for initializing some of it too.  KVM in control of whether that
> fd can be read or written or mmap'ed or whatever, no need to prevent it
> in shmem.c, no need for flags, seals, notifications to and fro because
> KVM is already in control and knows the history.  If shmem actually has
> value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> add, just allocate and free kernel memory directly, recorded in your
> own xarray.

I guess a shim layer on top of shmem *can* work. I don't see immediately
why it would not. But I'm not sure it is the right direction. We risk
creating yet another parallel VM with its own rules/locking/accounting
that is opaque to core-mm.

Note that on machines that run TDX guests such memory would likely be the
bulk of memory use. Treating it as a fringe case may bite us one day.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-18 13:24   ` Kirill A . Shutemov
@ 2022-08-19  0:20     ` Sean Christopherson
  2022-08-19  3:38       ` Hugh Dickins
  2022-08-19  3:00     ` Hugh Dickins
  2022-09-09  4:44     ` Andy Lutomirski
  2 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-08-19  0:20 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Hugh Dickins, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, Gupta, Pankaj

On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > On Wed, 6 Jul 2022, Chao Peng wrote:
> > But since then, TDX in particular has forced an effort into preventing
> > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > 
> > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
aren't useful for _all_ existing users, but I don't think it's fair to say that
they're not useful for _any_ existing users.

> > What use do you have for a filesystem here?  Almost none.
> > IIUC, what you want is an fd through which QEMU can allocate kernel
> > memory, selectively free that memory, and communicate fd+offset+length
> > to KVM.  And perhaps an interface to initialize a little of that memory
> > from a template (presumably copied from a real file on disk somewhere).
> > 
> > You don't need shmem.c or a filesystem for that!
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. 

And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
without having to take on the overhead of confidential VMs.  At that point,
migration and probably even swap are on the table.

> And swapping theoretically possible, but I'm not aware of any plans as of
> now.

Ya, I highly doubt confidential VMs will ever bother with swap.

> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be separate struct memfile_backing_store.
> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).

But secretmem _could_ be a fit.  If a use case wants to unmap guest private memory
from both userspace and the kernel then KVM should absolutely be able to support
that, but at the same time I don't want to have to update KVM to enable secretmem
(and I definitely don't want KVM poking into the directmap itself).

MFD_INACCESSIBLE should only say "this memory can't be mapped into userspace";
any other properties should be completely separate, e.g. the inability to migrate
pages is effectively a restriction from KVM (acting on behalf of TDX/SNP), not
a fundamental property of MFD_INACCESSIBLE.
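
To make that intended semantic concrete, here is a minimal userspace sketch,
assuming the MFD_INACCESSIBLE flag proposed by this series (the 0x0008U value
is an assumption taken from the uapi patch, not an upstream constant):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <unistd.h>
        #include <stdio.h>

        #ifndef MFD_INACCESSIBLE
        #define MFD_INACCESSIBLE 0x0008U  /* assumption: value from this series */
        #endif

        int main(void)
        {
                int fd = memfd_create("guest-private", MFD_INACCESSIBLE);

                if (fd < 0) {
                        perror("memfd_create");
                        return 1;
                }
                if (ftruncate(fd, 2 << 20))
                        return 1;
                /* The fd conveys size/offset only; mapping (or read()/write()
                 * of) the contents is expected to fail for an inaccessible
                 * memfd. */
                if (mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0) == MAP_FAILED)
                        printf("mmap refused, as intended\n");
                return 0;
        }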

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-18 13:24   ` Kirill A . Shutemov
  2022-08-19  0:20     ` Sean Christopherson
@ 2022-08-19  3:00     ` Hugh Dickins
  2022-08-20  0:27       ` Kirill A. Shutemov
  2022-08-21 10:27       ` Matthew Wilcox
  2022-09-09  4:44     ` Andy Lutomirski
  2 siblings, 2 replies; 398+ messages in thread
From: Hugh Dickins @ 2022-08-19  3:00 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Hugh Dickins, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> theoretically possible, but I'm not aware of any plans as of now.
> 
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

I always forget, migration means different things to different audiences.
As an mm person, I was meaning page migration, whereas a virtualization
person thinks VM live migration (which that reference appears to be about),
a scheduler person task migration, an ornithologist bird migration, etc.

But you're an mm person too: you may have cited that reference in the
knowledge that TDX 1.5 Live Migration will entail page migration of the
kind I'm thinking of.  (Anyway, it's not important to clarify that here.)

> 
> > Some of these impressions may come from earlier iterations of the
> > patchset (v7 looks better in several ways than v5).  I am probably
> > underestimating the extent to which you have taken on board other
> > usages beyond TDX and SEV private memory, and rightly want to serve
> > them all with similar interfaces: perhaps there is enough justification
> > for shmem there, but I don't see it.  There was mention of userfaultfd
> > in one link: does that provide the justification for using shmem?
> > 
> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be separate struct memfile_backing_store.

Last year's MFD_HUGEPAGE proposal would have allowed you to do it with
memfd via tmpfs without needing to involve hugetlbfs; but you may prefer
the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.

But I've yet to see why you want to involve this or that filesystem
(with all its filesystem-icity suppressed) at all.  The backing store
is host memory, and tmpfs and hugetlbfs just impose their own
idiosyncrasies on how that memory is allocated; but I think you would
do better to choose your own idiosyncrasies in allocation directly -
you don't need a different "backing store" to choose between 4k or 2M
or 1G or whatever allocations.

tmpfs and hugetlbfs and page cache are designed around sharing memory:
TDX is designed around absolutely not sharing memory; and the further
uses which Sean foresees appear not to need it as page cache either.

Except perhaps for page migration reasons.  It's somewhat incidental,  
but of course page migration knows how to migrate page cache, so
masquerading as page cache will give a short cut to page migration,
when page migration becomes at all possible.

> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).
> 
> > Here's what I would prefer, and imagine much easier for you to maintain;
> > but I'm no system designer, and may be misunderstanding throughout.
> > 
> > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > the fallocate syscall interface itself) to allocate and free the memory,
> > ioctl for initializing some of it too.  KVM in control of whether that
> > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > in shmem.c, no need for flags, seals, notifications to and fro because
> > KVM is already in control and knows the history.  If shmem actually has
> > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > add, just allocate and free kernel memory directly, recorded in your
> > own xarray.
> 
> I guess shim layer on top of shmem *can* work. I don't see immediately why
> it would not. But I'm not sure it is right direction. We risk creating yet
> another parallel VM with own rules/locking/accounting that opaque to
> core-mm.

You are already proposing a new set of rules, foreign to how tmpfs works
for others.  You're right that KVM allocating large amounts of memory,
opaque to core-mm, carries risk: and you'd be right to say that shmem.c
provides some clues (security_vm_enough_memory checks, memcg charging,
user_shm_lock accounting) on what to remember.

But I'm not up to the job of being the one to police you there,
and you don't want to be waiting on me either.

To take a rather silly example: Ted just added chattr support to tmpfs,
and it fits in well.  But I don't now want to have to decide whether
"chattr +i" FS_IMMUTABLE_FL is or is not compatible with
MEMFILE_F_USER_INACCESSIBLE.  They are from different worlds,
and I'd prefer KVM to carry the weight of imposing INACCESSIBLE:
which seems easily done if it manages the fd, without making the
memory allocated to that fd accessible to those who hold the fd.

> 
> Note that on machines that run TDX guests such memory would likely be the
> bulk of memory use. Treating it as a fringe case may bite us one day.

Yes, I suspected that machines running TDX guests might well consume
most of the memory that way, but glad(?) to hear it confirmed.

I am not suggesting that this memory be treated as a fringe case, rather
the reverse: a different case, not something to hide away inside shmem.c.

Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint
of this usage?  Whether or not it's also included in "Shmem:", I expect
that its different characteristics will deserve its own display.

Hugh

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-19  0:20     ` Sean Christopherson
@ 2022-08-19  3:38       ` Hugh Dickins
  2022-08-19 22:53         ` Sean Christopherson
  2022-08-23  7:55         ` David Hildenbrand
  0 siblings, 2 replies; 398+ messages in thread
From: Hugh Dickins @ 2022-08-19  3:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A . Shutemov, Hugh Dickins, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Fri, 19 Aug 2022, Sean Christopherson wrote:
> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > On Wed, 6 Jul 2022, Chao Peng wrote:
> > > But since then, TDX in particular has forced an effort into preventing
> > > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > > 
> > > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> 
> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
> aren't useful for _all_ existing users, but I don't think it's fair to say that
> they're not useful for _any_ existing users.

Okay, I stand corrected: there exist some users of memfd_create()
who will also have use for "INACCESSIBLE" memory.

> 
> > > What use do you have for a filesystem here?  Almost none.
> > > IIUC, what you want is an fd through which QEMU can allocate kernel
> > > memory, selectively free that memory, and communicate fd+offset+length
> > > to KVM.  And perhaps an interface to initialize a little of that memory
> > > from a template (presumably copied from a real file on disk somewhere).
> > > 
> > > You don't need shmem.c or a filesystem for that!
> > > 
> > > If your memory could be swapped, that would be enough of a good reason
> > > to make use of shmem.c: but it cannot be swapped; and although there
> > > are some references in the mailthreads to it perhaps being swappable
> > > in future, I get the impression that will not happen soon if ever.
> > > 
> > > If your memory could be migrated, that would be some reason to use
> > > filesystem page cache (because page migration happens to understand
> > > that type of memory): but it cannot be migrated.
> > 
> > Migration support is in pipeline. It is part of TDX 1.5 [1]. 
> 
> And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
> from being able to use UPM for "regular" VMs as a way to provide defense-in-depth

UPM? That's an acronym from your side of the fence, I spy references to
it in the mail threads, but haven't tracked down a definition.  I'll
just take it to mean the fd-based memory we're discussing.

> without having to take on the overhead of confidential VMs.  At that point,
> migration and probably even swap are on the table.

Good, the more "flexible" that memory is, the better for competing users
of memory.  But an fd supplied by KVM gives you freedom to change to a
better implementation of allocation underneath, whenever it suits you.
Maybe shmem beneath is good from the start, maybe not.

Hugh

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-06  8:20 ` [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Chao Peng
  2022-07-19  8:00   ` Gupta, Pankaj
  2022-07-20 16:44   ` Sean Christopherson
@ 2022-08-19 19:37   ` Vishal Annapurve
  2022-08-24 10:37     ` Chao Peng
  2022-08-26 15:19   ` Fuad Tabba
  3 siblings, 1 reply; 398+ messages in thread
From: Vishal Annapurve @ 2022-08-19 19:37 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm list, LKML, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

> ...
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 230c8ff9659c..bb714c2a4b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_PRIVATE   0x0001
> +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
> +                                            struct kvm_enc_region *region)
> +{
> +       unsigned long start, end;
> +       void *entry;
> +       int r;
> +
> +       if (region->size == 0 || region->addr + region->size < region->addr)
> +               return -EINVAL;
> +       if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
> +               return -EINVAL;
> +
> +       start = region->addr >> PAGE_SHIFT;
> +       end = (region->addr + region->size - 1) >> PAGE_SHIFT;
> +
> +       entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
> +                               xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
> +
> +       r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
> +                                       entry, GFP_KERNEL_ACCOUNT));

xa_store_range seems to create multi-index entries by default.
A subsequent xa_store_range call then changes all of the previously
stored entries.
xa_store needs to be used here instead of xa_store_range to achieve
the intended behavior, as in the sketch below.
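
A possible shape of that substitution, purely as an illustrative sketch
(one entry per page index, error handling kept minimal):

        unsigned long i;

        /* Store one entry per gfn so that a later update of a sub-range
         * only touches that sub-range, instead of a multi-index entry. */
        for (i = start; i <= end; i++) {
                r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
                                    GFP_KERNEL_ACCOUNT));
                if (r)
                        break;
        }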

> +
> +       kvm_zap_gfn_range(kvm, start, end + 1);
> +
> +       return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
> ...

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-19  3:38       ` Hugh Dickins
@ 2022-08-19 22:53         ` Sean Christopherson
  2022-08-23  7:55         ` David Hildenbrand
  1 sibling, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2022-08-19 22:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A . Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, Gupta, Pankaj

On Thu, Aug 18, 2022, Hugh Dickins wrote:
> On Fri, 19 Aug 2022, Sean Christopherson wrote:
> > On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > > If your memory could be migrated, that would be some reason to use
> > > > filesystem page cache (because page migration happens to understand
> > > > that type of memory): but it cannot be migrated.
> > > 
> > > Migration support is in pipeline. It is part of TDX 1.5 [1]. 
> > 
> > And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
> > from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
> 
> UPM? That's an acronym from your side of the fence, I spy references to
> it in the mail threads, but haven't tracked down a definition.  I'll
> just take it to mean the fd-based memory we're discussing.

Ya, sorry, UPM is what we came up with as shorthand for "Unmapping guest Private
Memory".  Your assumption is spot on, it's just a fancy way of saying "guest is
backed with inaccessible fd-based memory".

> > without having to take on the overhead of confidential VMs.  At that point,
> > migration and probably even swap are on the table.
> 
> Good, the more "flexible" that memory is, the better for competing users
> of memory.  But an fd supplied by KVM gives you freedom to change to a
> better implementation of allocation underneath, whenever it suits you.
> Maybe shmem beneath is good from the start, maybe not.

The main flaw with KVM providing the fd is that it forces KVM to get into the
memory management business, which us KVM folks really, really do not want to do.
And based on the types of bugs KVM has had in the past related to memory management,
it's a safe bet to say the mm folks don't want us getting involved either :-)

The combination of gup()/follow_pte() and mmu_notifiers has worked very well.
KVM gets a set of (relatively) simple rules to follow and doesn't have to be taught
new things every time a new backing type comes along.  And from the other side, KVM
has very rarely had to go poke into other subsystems' code to support exposing a
new type of memory to guests.

What we're trying to do with UPM/fd-based memory is establish a similar contract
between mm and KVM, but without requiring mm to also map memory into host userspace.

The only way having KVM provide the fd works out in the long run is if KVM is the
only subsystem that ever wants to make use of memory that isn't accessible from
userspace and isn't tied to a specific backing type, _and_ if the set of backing
types that KVM ever supports is kept to an absolute minimum.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-19  3:00     ` Hugh Dickins
@ 2022-08-20  0:27       ` Kirill A. Shutemov
  2022-08-21  5:15         ` Hugh Dickins
  2022-09-09  4:48         ` Andy Lutomirski
  2022-08-21 10:27       ` Matthew Wilcox
  1 sibling, 2 replies; 398+ messages in thread
From: Kirill A. Shutemov @ 2022-08-20  0:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A . Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > 
> > > If your memory could be swapped, that would be enough of a good reason
> > > to make use of shmem.c: but it cannot be swapped; and although there
> > > are some references in the mailthreads to it perhaps being swappable
> > > in future, I get the impression that will not happen soon if ever.
> > > 
> > > If your memory could be migrated, that would be some reason to use
> > > filesystem page cache (because page migration happens to understand
> > > that type of memory): but it cannot be migrated.
> > 
> > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> > theoretically possible, but I'm not aware of any plans as of now.
> > 
> > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> 
> I always forget, migration means different things to different audiences.
> As an mm person, I was meaning page migration, whereas a virtualization
> person thinks VM live migration (which that reference appears to be about),
> a scheduler person task migration, an ornithologist bird migration, etc.
> 
> But you're an mm person too: you may have cited that reference in the
> knowledge that TDX 1.5 Live Migration will entail page migration of the
> kind I'm thinking of.  (Anyway, it's not important to clarify that here.)

TDX 1.5 brings both.

In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.

> > > Some of these impressions may come from earlier iterations of the
> > > patchset (v7 looks better in several ways than v5).  I am probably
> > > underestimating the extent to which you have taken on board other
> > > usages beyond TDX and SEV private memory, and rightly want to serve
> > > them all with similar interfaces: perhaps there is enough justification
> > > for shmem there, but I don't see it.  There was mention of userfaultfd
> > > in one link: does that provide the justification for using shmem?
> > > 
> > > I'm afraid of the special demands you may make of memory allocation
> > > later on - surprised that huge pages are not mentioned already;
> > > gigantic contiguous extents? secretmem removed from direct map?
> > 
> > The design allows for extension to hugetlbfs if needed. Combination of
> > MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> > implications for shmem. It is going to be separate struct memfile_backing_store.
> 
> Last year's MFD_HUGEPAGE proposal would have allowed you to do it with
> memfd via tmpfs without needing to involve hugetlbfs; but you may prefer
> the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.
> 
> But I've yet to see why you want to involve this or that filesystem
> (with all its filesystem-icity suppressed) at all.  The backing store
> is host memory, and tmpfs and hugetlbfs just impose their own
> idiosyncrasies on how that memory is allocated; but I think you would
> do better to choose your own idiosyncrasies in allocation directly -
> you don't need a different "backing store" to choose between 4k or 2M
> or 1G or whatever allocations.

These idiosyncrasies are well known: a user who used hugetlbfs may want a
direct replacement that taps into the same hugetlb reserves and gives the
same allocation guarantees. Admins know where to look if ENOMEM happens.

For THP, an admin may know how to tweak the allocation/defrag policy to
their liking and how to track whether the pages are actually allocated.

> tmpfs and hugetlbfs and page cache are designed around sharing memory:
> TDX is designed around absolutely not sharing memory; and the further
> uses which Sean foresees appear not to need it as page cache either.
> 
> Except perhaps for page migration reasons.  It's somewhat incidental,  
> but of course page migration knows how to migrate page cache, so
> masquerading as page cache will give a short cut to page migration,
> when page migration becomes at all possible.
> 
> > 
> > I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> > to be movable if platform supports it and secretmem is not migratable by
> > design (without direct mapping fragmentations).
> > 
> > > Here's what I would prefer, and imagine much easier for you to maintain;
> > > but I'm no system designer, and may be misunderstanding throughout.
> > > 
> > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > > the fallocate syscall interface itself) to allocate and free the memory,
> > > ioctl for initializing some of it too.  KVM in control of whether that
> > > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > > in shmem.c, no need for flags, seals, notifications to and fro because
> > > KVM is already in control and knows the history.  If shmem actually has
> > > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > > add, just allocate and free kernel memory directly, recorded in your
> > > own xarray.
> > 
> > I guess shim layer on top of shmem *can* work. I don't see immediately why
> > it would not. But I'm not sure it is right direction. We risk creating yet
> > another parallel VM with own rules/locking/accounting that opaque to
> > core-mm.
> 
> You are already proposing a new set of rules, foreign to how tmpfs works
> for others.  You're right that KVM allocating large amounts of memory,
> opaque to core-mm, carries risk: and you'd be right to say that shmem.c
> provides some clues (security_vm_enough_memory checks, memcg charging,
> user_shm_lock accounting) on what to remember.

That's a nice list of clues that would need to be re-implemented somewhere
else to get a competent solution.

> But I'm not up to the job of being the one to police you there,
> and you don't want to be waiting on me either.

> To take a rather silly example: Ted just added chattr support to tmpfs,
> and it fits in well.  But I don't now want to have to decide whether
> "chattr +i" FS_IMMUTABLE_FL is or is not compatible with
> MEMFILE_F_USER_INACCESSIBLE.  They are from different worlds,
> and I'd prefer KVM to carry the weight of imposing INACCESSIBLE:
> which seems easily done if it manages the fd, without making the
> memory allocated to that fd accessible to those who hold the fd.

From a quick look, these are orthogonal. But that is not your point.

Yes, INACCESSIBLE is an increase in complexity that you do not want to deal
with in shmem.c. I get it.

I will try next week to rework it as a shim on top of shmem. Does that work
for you?

But I think it is wrong to throw it over the fence to KVM folks and say
"it is your problem". Core MM has to manage it.

> > Note that on machines that run TDX guests such memory would likely be the
> > bulk of memory use. Treating it as a fringe case may bite us one day.
> 
> Yes, I suspected that machines running TDX guests might well consume
> most of the memory that way, but glad(?) to hear it confirmed.
> 
> I am not suggesting that this memory be treated as a fringe case, rather
> the reverse: a different case, not something to hide away inside shmem.c.
> 
> Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint
> of this usage?  Whether or not it's also included in "Shmem:", I expect
> that its different characteristics will deserve its own display.

That's the hint users know about from previous experience.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-20  0:27       ` Kirill A. Shutemov
@ 2022-08-21  5:15         ` Hugh Dickins
  2022-08-31 14:24           ` Kirill A . Shutemov
  2022-09-09  4:48         ` Andy Lutomirski
  1 sibling, 1 reply; 398+ messages in thread
From: Hugh Dickins @ 2022-08-21  5:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Kirill A . Shutemov, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Sat, 20 Aug 2022, Kirill A. Shutemov wrote:
> 
> Yes, INACCESSIBLE is increase of complexity which you do not want to deal
> with in shmem.c. It get it.

It's not so much that INACCESSIBLE increases the complexity of
memfd/shmem/tmpfs, as that it is completely foreign to it.

And all those foreign needs are better handled at the KVM end (where you
can be sure that the mem attached to the fd is INACCESSIBLE, because
you have given nobody access to it - no handshaking with a 3rd party
required).

> 
> I will try next week to rework it as shim to top of shmem. Does it work
> for you?

Yes, please do, thanks.  It's a compromise between us: the initial TDX
case has no justification to use shmem at all, but doing it that way
will help you with some of the infrastructure, and will probably be
easiest for KVM to extend to other more relaxed fd cases later.

> 
> But I think it is wrong to throw it over the fence to KVM folks and say it
> is your problem. Core MM has to manage it.

We disagree on who is throwing over the fence to whom :)

Core MM should manage the core MM parts and KVM should manage the KVM
parts.  What makes this rather different from most driver usage of MM,
is that KVM seems likely to use a great proportion of memory this way.
With great memory usage comes great responsibility: I don't think
all those flags and seals and notifiers let KVM escape from that.

Hugh

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-19  3:00     ` Hugh Dickins
  2022-08-20  0:27       ` Kirill A. Shutemov
@ 2022-08-21 10:27       ` Matthew Wilcox
  2022-08-24 10:27         ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Matthew Wilcox @ 2022-08-21 10:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A . Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> tmpfs and hugetlbfs and page cache are designed around sharing memory:
> TDX is designed around absolutely not sharing memory; and the further
> uses which Sean foresees appear not to need it as page cache either.
> 
> Except perhaps for page migration reasons.  It's somewhat incidental,  
> but of course page migration knows how to migrate page cache, so
> masquerading as page cache will give a short cut to page migration,
> when page migration becomes at all possible.

I haven't read the patch series, and I'm not taking a position one way
or the other on whether this is better implemented as a shmem addition
or a shim that asks shmem for memory.  Page migration can be done for
driver memory by using PageMovable.  I just rewrote how it works, so
the details are top of my mind at the moment if anyone wants something
explained.  Commit 68f2736a8583 is the key one to look at.
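
For readers who haven't looked at that rework yet, a rough sketch of the
hooks a non-LRU backing store registers with the post-68f2736a8583
interface (illustrative only; the upm_* names are made up, and the
locking/refcounting a real driver needs is elided):

        #include <linux/migrate.h>

        static bool upm_isolate_page(struct page *page, isolate_mode_t mode)
        {
                /* pin/track the page so it isn't freed while migration runs */
                return true;
        }

        static int upm_migrate_page(struct page *dst, struct page *src,
                                    enum migrate_mode mode)
        {
                /* copy contents, transfer private state, update own index */
                return MIGRATEPAGE_SUCCESS;
        }

        static void upm_putback_page(struct page *page)
        {
                /* undo isolation if migration was aborted */
        }

        static const struct movable_operations upm_movable_ops = {
                .isolate_page   = upm_isolate_page,
                .migrate_page   = upm_migrate_page,
                .putback_page   = upm_putback_page,
        };

        /* each page the store owns is tagged movable while it holds it:
         *      __SetPageMovable(page, &upm_movable_ops);
         */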

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-17 15:27                         ` Michael Roth
@ 2022-08-23  1:25                           ` Isaku Yamahata
  0 siblings, 0 replies; 398+ messages in thread
From: Isaku Yamahata @ 2022-08-23  1:25 UTC (permalink / raw)
  To: Michael Roth
  Cc: Sean Christopherson, Gupta, Pankaj, Kirill A . Shutemov,
	Chao Peng, Nikunj A. Dadhania, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, luto, jun.nakajima, dave.hansen, ak, david, aarcange,
	ddutile, dhildenb, Quentin Perret, mhocko, Muchun Song, bharata,
	kvm, linux-kernel, linux-mm, linux-kselftest, linux-api,
	linux-doc, qemu-devel, linux-fsdevel, isaku.yamahata

On Wed, Aug 17, 2022 at 10:27:19AM -0500,
Michael Roth <michael.roth@amd.com> wrote:

> > I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
> > vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
> > initializing guest private memory with a source page can be implemented via a
> > flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
> > flag could be added to say that the dest is also the source.
> 
> So is this proposed ioctl only intended to handle the initial encrypted
> payload, and the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls would
> still be used for conversions post-boot?

Yes.  It is called before running any vcpu.  At run time (after running vcpus),
KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION is used.
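
For illustration, the userspace side of that runtime path could look roughly
like this (vm_fd, gpa and len are placeholders; struct kvm_enc_region and the
ioctl are used the way the patch under discussion interprets them, i.e. addr
is a guest physical address):

        struct kvm_enc_region region = {
                .addr = gpa,    /* guest physical address of the range */
                .size = len,    /* bytes, page aligned */
        };

        /* after boot: flip a range to private when the guest converts it */
        if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region))
                perror("KVM_MEMORY_ENCRYPT_REG_REGION");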
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-17 23:41       ` Kirill A. Shutemov
  2022-08-18  9:09         ` Paolo Bonzini
@ 2022-08-23  7:36         ` David Hildenbrand
  2022-08-24 10:20           ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-08-23  7:36 UTC (permalink / raw)
  To: Kirill A. Shutemov, Paolo Bonzini
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song

On 18.08.22 01:41, Kirill A. Shutemov wrote:
> On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
>> On 7/21/22 11:44, David Hildenbrand wrote:
>>>
>>> Also, I*think*  you can place pages via userfaultfd into shmem. Not
>>> sure if that would count "auto alloc", but it would certainly bypass
>>> fallocate().
>>
>> Yeah, userfaultfd_register would probably have to forbid this for
>> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
>> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
>> would do something like memfile_node_get_flags(vma->vm_file) and check the
>> result.
> 
> I donno, memory allocation with userfaultfd looks pretty intentional to
> me. Why would F_SEAL_AUTO_ALLOCATE prevent it?
> 

Can't we say the same about a write()?

> Maybe we would need it in the future for post-copy migration or something?
> 
> Or existing practises around userfaultfd touch memory randomly and
> therefore incompatible with F_SEAL_AUTO_ALLOCATE intent?
> 
> Note, that userfaultfd is only relevant for shared memory as it requires
> VMA which we don't have for MFD_INACCESSIBLE.

This feature (F_SEAL_AUTO_ALLOCATE) is independent of all the lovely
encrypted VM stuff, so it doesn't matter how it relates to MFD_INACCESSIBLE.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-19  3:38       ` Hugh Dickins
  2022-08-19 22:53         ` Sean Christopherson
@ 2022-08-23  7:55         ` David Hildenbrand
  2022-08-23 16:05           ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2022-08-23  7:55 UTC (permalink / raw)
  To: Hugh Dickins, Sean Christopherson
  Cc: Kirill A . Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, Gupta, Pankaj

On 19.08.22 05:38, Hugh Dickins wrote:
> On Fri, 19 Aug 2022, Sean Christopherson wrote:
>> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>>> But since then, TDX in particular has forced an effort into preventing
>>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>>>>
>>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
>>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>
>> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
>> aren't useful for _all_ existing users, but I don't think it's fair to say that
>> they're not useful for _any_ existing users.
> 
> Okay, I stand corrected: there exist some users of memfd_create()
> who will also have use for "INACCESSIBLE" memory.

As raised in reply to the relevant patch, I'm not sure if we really have
to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
requirement of specific memfd_notifier (memfile_notifier) implementations
-- such as TDX that will convert the memory and MCE-kill the machine on
ordinary write access. We might be able to set/enforce this when
registering a notifier internally instead, and fail notifier
registration if a condition isn't met (e.g., existing mmap).

So I'd be curious, which other users of shmem/memfd would benefit from
(MMU)-"INACCESSIBLE" memory obtained via memfd_create()?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-23  7:55         ` David Hildenbrand
@ 2022-08-23 16:05           ` Sean Christopherson
  2022-08-24  9:41             ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-08-23 16:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Kirill A . Shutemov, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Tue, Aug 23, 2022, David Hildenbrand wrote:
> On 19.08.22 05:38, Hugh Dickins wrote:
> > On Fri, 19 Aug 2022, Sean Christopherson wrote:
> >> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> >>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> >>>> On Wed, 6 Jul 2022, Chao Peng wrote:
> >>>> But since then, TDX in particular has forced an effort into preventing
> >>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> >>>>
> >>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> >>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> >>
> >> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
> >> aren't useful for _all_ existing users, but I don't think it's fair to say that
> >> they're not useful for _any_ existing users.
> > 
> > Okay, I stand corrected: there exist some users of memfd_create()
> > who will also have use for "INACCESSIBLE" memory.
> 
> As raised in reply to the relevant patch, I'm not sure if we really have
> to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
> requirement of specific memfd_notifer (memfile_notifier) implementations
> -- such as TDX that will convert the memory and MCE-kill the machine on
> ordinary write access. We might be able to set/enforce this when
> registering a notifier internally instead, and fail notifier
> registration if a condition isn't met (e.g., existing mmap).
>
> So I'd be curious, which other users of shmem/memfd would benefit from
> (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?

I agree that there's no need to expose the inaccessible behavior via uAPI.  Making
it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other
flags that get added in the future).

AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide
any unique functionality.

If we go that route, we might want to have shmem/memfd require INACCESSIBLE to be
set for the initial implementation.  I.e. disallow binding without INACCESSIBLE
until there's a use case.
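
Something like the following hypothetical check at bind/registration time
would capture that (not code from this series; memfile_node and the
MEMFILE_F_* flag come from the series, the helper name is made up):

        static int memfile_notifier_bind(struct memfile_node *node,
                                         unsigned long requested_flags)
        {
                /* The consumer (KVM) states the properties it needs;
                 * initially refuse to bind unless inaccessibility is one
                 * of them, until another use case shows up. */
                if (!(requested_flags & MEMFILE_F_USER_INACCESSIBLE))
                        return -EINVAL;

                node->flags |= requested_flags;
                return 0;
        }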

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-16 15:38                       ` Sean Christopherson
  2022-08-17 15:27                         ` Michael Roth
@ 2022-08-23 17:41                         ` Gupta, Pankaj
  1 sibling, 0 replies; 398+ messages in thread
From: Gupta, Pankaj @ 2022-08-23 17:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A . Shutemov, Chao Peng, Nikunj A. Dadhania,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, bharata, kvm, linux-kernel, linux-mm,
	linux-kselftest, linux-api, linux-doc, qemu-devel, linux-fsdevel


>>>>> Actually the current version allows you to delay the allocation to a
>>>>> later time (e.g. page fault time) if you don't call fallocate() on the
>>>>> private fd. fallocate() is necessary in previous versions because we
>>>>> treat the existense in the fd as 'private' but in this version we track
>>>>> private/shared info in KVM so we don't rely on that fact from memory
>>>>> backstores.
>>>>
>>>> Does this also mean reservation of guest physical memory with secure
>>>> processor (both for SEV-SNP & TDX) will also happen at page fault time?
>>>>
>>>> Do we plan to keep it this way?
>>>
>>> If you are talking about accepting memory by the guest, it is initiated by
>>> the guest and has nothing to do with page fault time vs fallocate()
>>> allocation of host memory. I mean acceptance happens after host memory
>>> allocation but they are not in lockstep, acceptance can happen much later.
>>
>> No, I meant reserving guest physical memory range from hypervisor e.g with
>> RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
> 
> As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
> for userspace to pre-map guest memory.
> 
> I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
> vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
> initializing guest private memory with a source page can be implemented via a
> flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
> flag could be added to say that the dest is also the source.

Questions to clarify *my* understanding here:

- Do you suggest turning KVM_TDX_INIT_MEM_REGION into a generic ioctl to
   pre-map guest private memory, in addition to initializing the payload
   (in-place encryption or just copying a page into guest private memory)?

- To clarify "pre-map": are you suggesting using the ioctl to avoid the
   RMP/PAMT registration at guest page fault time, and instead pre-map
   guest private memory, i.e. allocate and do the RMP/PAMT registration
   before running the actual guest vCPUs?

Thanks,
Pankaj

> 
> The TDX and SNP restrictions would then become addition restrictions on when
> initializing with a source is allowed (and VMs that don't have guest private
> memory wouldn't allow the flag at all).
> 


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-23 16:05           ` Sean Christopherson
@ 2022-08-24  9:41             ` Chao Peng
  2022-09-09  4:55               ` Andy Lutomirski
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-08-24  9:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Hugh Dickins, Kirill A . Shutemov, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Tue, Aug 23, 2022 at 04:05:27PM +0000, Sean Christopherson wrote:
> On Tue, Aug 23, 2022, David Hildenbrand wrote:
> > On 19.08.22 05:38, Hugh Dickins wrote:
> > > On Fri, 19 Aug 2022, Sean Christopherson wrote:
> > >> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > >>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > >>>> On Wed, 6 Jul 2022, Chao Peng wrote:
> > >>>> But since then, TDX in particular has forced an effort into preventing
> > >>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > >>>>
> > >>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > >>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> > >>
> > >> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
> > >> aren't useful for _all_ existing users, but I don't think it's fair to say that
> > >> they're not useful for _any_ existing users.
> > > 
> > > Okay, I stand corrected: there exist some users of memfd_create()
> > > who will also have use for "INACCESSIBLE" memory.
> > 
> > As raised in reply to the relevant patch, I'm not sure if we really have
> > to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
> > requirement of specific memfd_notifer (memfile_notifier) implementations
> > -- such as TDX that will convert the memory and MCE-kill the machine on
> > ordinary write access. We might be able to set/enforce this when
> > registering a notifier internally instead, and fail notifier
> > registration if a condition isn't met (e.g., existing mmap).
> >
> > So I'd be curious, which other users of shmem/memfd would benefit from
> > (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
> 
> I agree that there's no need to expose the inaccessible behavior via uAPI.  Making
> it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
> would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other
> flags that get added in the future).
> 
> AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide
> any unique functionality.

That's also what I'm thinking. And I don't see a problem immediately if the
user has populated the fd at binding time. Actually that looks like an
advantage for the previously discussed guest payload pre-loading.

> 
> If we go that route, we might want to have shmem/memfd require INACCESSIBLE to be
> set for the initial implementation.  I.e. disallow binding without INACCESSIBLE
> until there's a use case.

I can do that.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-23  7:36         ` David Hildenbrand
@ 2022-08-24 10:20           ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-24 10:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Paolo Bonzini, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On Tue, Aug 23, 2022 at 09:36:57AM +0200, David Hildenbrand wrote:
> On 18.08.22 01:41, Kirill A. Shutemov wrote:
> > On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
> >> On 7/21/22 11:44, David Hildenbrand wrote:
> >>>
> >>> Also, I*think*  you can place pages via userfaultfd into shmem. Not
> >>> sure if that would count "auto alloc", but it would certainly bypass
> >>> fallocate().
> >>
> >> Yeah, userfaultfd_register would probably have to forbid this for
> >> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
> >> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
> >> would do something like memfile_node_get_flags(vma->vm_file) and check the
> >> result.
> > 
> > I donno, memory allocation with userfaultfd looks pretty intentional to
> > me. Why would F_SEAL_AUTO_ALLOCATE prevent it?
> > 
> 
> Can't we say the same about a write()?
> 
> > Maybe we would need it in the future for post-copy migration or something?
> > 
> > Or existing practises around userfaultfd touch memory randomly and
> > therefore incompatible with F_SEAL_AUTO_ALLOCATE intent?
> > 
> > Note, that userfaultfd is only relevant for shared memory as it requires
> > VMA which we don't have for MFD_INACCESSIBLE.
> 
> This feature (F_SEAL_AUTO_ALLOCATE) is independent of all the lovely
> encrypted VM stuff, so it doesn't matter how it relates to MFD_INACCESSIBLE.

Right, this patch is for a normal user-accessible fd. In KVM this flag is
expected to be set on the shared part of the memslot, while all other
patches in this series are for the private part of the memslot.

Private memory doesn't have this need because it's totally inaccessible
from userspace, so there is no chance for userspace to write to the fd and
cause allocation by accident. For shared memory, however, a malicious or
buggy guest OS may cause userspace to write to any range of the shared fd
and trigger memory allocation, even for a range where the guest OS should
see the private memory rather than the shared memory.
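
As a concrete (hypothetical) sketch of that usage on the shared fd side -
the seal value here is an assumption taken from patch 01/14, not an
upstream constant:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #ifndef F_SEAL_AUTO_ALLOCATE
        #define F_SEAL_AUTO_ALLOCATE 0x0020  /* assumption: value from patch 01/14 */
        #endif

        static int setup_shared_fd(size_t size)
        {
                int fd = memfd_create("guest-shared", MFD_ALLOW_SEALING);

                if (fd < 0)
                        return -1;
                /* From here on only explicit fallocate() backs a range; a
                 * stray write or fault into a hole fails instead of
                 * allocating, which is the double-allocation protection
                 * described above. */
                if (ftruncate(fd, size) ||
                    fcntl(fd, F_ADD_SEALS, F_SEAL_AUTO_ALLOCATE)) {
                        close(fd);
                        return -1;
                }
                return fd;
        }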

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-21 10:27       ` Matthew Wilcox
@ 2022-08-24 10:27         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-24 10:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Kirill A . Shutemov, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Sun, Aug 21, 2022 at 11:27:44AM +0100, Matthew Wilcox wrote:
> On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> > tmpfs and hugetlbfs and page cache are designed around sharing memory:
> > TDX is designed around absolutely not sharing memory; and the further
> > uses which Sean foresees appear not to need it as page cache either.
> > 
> > Except perhaps for page migration reasons.  It's somewhat incidental,  
> > but of course page migration knows how to migrate page cache, so
> > masquerading as page cache will give a short cut to page migration,
> > when page migration becomes at all possible.
> 
> I haven't read the patch series, and I'm not taking a position one way
> or the other on whether this is better implemented as a shmem addition
> or a shim that asks shmem for memory.  Page migration can be done for
> driver memory by using PageMovable.  I just rewrote how it works, so
> the details are top of my mind at the moment if anyone wants something
> explained.  Commit 68f2736a8583 is the key one to look at.

Thanks Matthew, that is helpful for understanding the current code.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-08-19 19:37   ` Vishal Annapurve
@ 2022-08-24 10:37     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-24 10:37 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm list, LKML, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, Andy Lutomirski, Jun Nakajima, Dave Hansen,
	Andi Kleen, David Hildenbrand, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On Fri, Aug 19, 2022 at 12:37:42PM -0700, Vishal Annapurve wrote:
> > ...
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 230c8ff9659c..bb714c2a4b06 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +#define KVM_MEM_ATTR_PRIVATE   0x0001
> > +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
> > +                                            struct kvm_enc_region *region)
> > +{
> > +       unsigned long start, end;
> > +       void *entry;
> > +       int r;
> > +
> > +       if (region->size == 0 || region->addr + region->size < region->addr)
> > +               return -EINVAL;
> > +       if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
> > +               return -EINVAL;
> > +
> > +       start = region->addr >> PAGE_SHIFT;
> > +       end = (region->addr + region->size - 1) >> PAGE_SHIFT;
> > +
> > +       entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
> > +                               xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
> > +
> > +       r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
> > +                                       entry, GFP_KERNEL_ACCOUNT));
> 
> xa_store_range seems to create multi-index entries by default.
> Subsequent xa_store_range call changes all the entries stored
> previously.

By using xa_store_range and storing them as multi-index entries I had
originally expected to save some memory for contiguous pages.

But it sounds like the current multi-index store behaviour isn't quite
ready for our usage.

Chao
> xa_store needs to be used here instead of xa_store_range to achieve
> the intended behavior.
> 
> > +
> > +       kvm_zap_gfn_range(kvm, start, end + 1);
> > +
> > +       return r;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> > +
> > ...

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (16 preceding siblings ...)
  2022-08-18  5:40 ` Hugh Dickins
@ 2022-08-26 15:19 ` Fuad Tabba
  2022-08-29 15:17   ` Chao Peng
  2022-09-09 15:35 ` Michael Roth
  18 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-08-26 15:19 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Marc Zyngier, Will Deacon

Hi,

On Wed, Jul 6, 2022 at 9:24 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
>
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
>
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
>
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory.
>
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> created with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
>
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets invalidated.
>   - backing store callbacks: callbacks for KVM to call into memory
>     backing store to request memory pages for guest private memory.
>
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets invalidated.
>
> The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> prevent double allocation caused by unintentional guest when we only
> have a single side of the shared/private memfds effective.
>
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so
> that both private/shared guest memory can live in one single memslot.
> A page in the memslot is either private or shared. Whether a guest page
> is private or shared is maintained through reusing existing SEV ioctls
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>

I'm on the Android pKVM team at Google, and we've been looking into
how this approach fits with what we've been doing with pkvm/arm64.
I've had a go at porting your patches, along with some fixes and
additions, so that they apply on top of our latest pkvm patch series
[1], to see how well this proposal fits with our work. You can find
the ported code at this link [2].

In general, an fd-based approach fits very well with pKVM for the
reasons you mention. It means that we don't necessarily need to map
the guest memory, and with the new extensions it allows the host
kernel to control whether to restrict migration and swapping.

For pKVM, we would also need the guest private memory not to be
GUP’able by the kernel so that userspace can’t trick the kernel into
accessing guest private memory in a context where it isn’t prepared to
handle the fault injected by the hypervisor. We’re looking at whether
we could use memfd_secret to achieve this, or maybe whether extending
your work might solve the problem.

However, during the porting effort, the main issue we've encountered
is that many of the details of this approach seem to be targeted at
TDX/SEV and don't readily align with the design of pKVM. My knowledge
of TDX is very rudimentary, so please bear with me if I get things
wrong.

One example is the idea of the memslot having two references to the
backing memory: the (new) private_fd (a file descriptor) as well as
the userspace_addr (a memory address), with the meaning changing
depending on whether the memory is private or shared. Both can
potentially be live at the same time, but only one is used by the
guest at any given moment, depending on whether the memory is shared
or private. For pKVM, the memory region is the same, and whether the
underlying physical page is shared or private is determined by the
hypervisor based on the initial configuration of the VM and in
response to hypercalls from the guest. So at least from our side,
having a private_fd isn't the best fit; what would fit better is just
having an fd instead of a userspace_addr.

Moreover, something which was discussed here before [3] is the
ability to share in place. For pKVM/arm64, the conversion between
shared and private involves only changes to the stage-2 page tables,
which are controlled by the hypervisor. Android supports this in-place
conversion already, and I think that the cost of copying for the many
use-cases that involve large amounts of data would be significant. We
will measure the relative costs in due course, but in the meantime
we're nervous about adopting a new user ABI which doesn't appear to
cater for in-place conversion; having just the fd would simplify that
somewhat.

In the memfd approach, what is the plan for being able to initialize
guest private memory from the host? In my port of this patch series,
I've added an fcntl() command that allows setting INACCESSIBLE after
the memfd has been created, so the memory can be mapped, initialized,
then unmapped. Of course there is no way to enforce that the memory is
unmapped from userspace before being used as private memory, but the
hypervisor will take care of the stage-2 mapping, so a userspace
access to the private memory would result in a SEGV regardless of the
flag.
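
To make that concrete, here is a rough sketch of the userspace flow I
have in mind. F_SET_INACCESSIBLE is only a placeholder for the fcntl()
command in my port (it is not part of this series), and error handling
is kept minimal:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SET_INACCESSIBLE
#define F_SET_INACCESSIBLE	1040	/* placeholder command number */
#endif

/* Create a memfd, fill it with the guest image, then make it inaccessible. */
static int create_private_memfd(const void *image, size_t size)
{
	int fd = memfd_create("guest-private", MFD_CLOEXEC);
	void *map;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size))
		goto err;

	/* Map and initialize the contents while the fd is still accessible. */
	map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto err;
	memcpy(map, image, size);
	munmap(map, size);

	/* Flip to inaccessible before handing the fd to KVM as private_fd. */
	if (fcntl(fd, F_SET_INACCESSIBLE))
		goto err;

	return fd;
err:
	close(fd);
	return -1;
}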

Now, moving on to implementation-specific issues in this patch series
that I have encountered:

- There are a couple of small issues in porting the patches, some of
which have been mentioned already by others. I will point out the rest
in direct replies to these patches.

- MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
this patch series. MFD_INACCESSIBLE only sets
MEMFILE_F_USER_INACCESSIBLE. Is this intentional?

- Nothing in this patch series enforces that MFD_INACCESSIBLE or that
any of the MEMFILE_F_* flags are set for the file descriptor to be
used as a private_fd. Is this also intentional?

Most of us working on pKVM will be at KVM Forum in Dublin in
September, so it would be great if we could have a chat (and/or beer!)
face to face sometime during the conference to help us figure out an
upstreamable solution for Android.

Cheers,
/fuad

[1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/
[2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem
[3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/


> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
>
> -  Regresion test on kvm/queue (this patchset)
>    Most new code are not covered. Code also in below repo:
>    https://github.com/chao-p/linux/tree/privmem-v7
>
> -  New Funational test on latest TDX code
>    The patch is rebased to latest TDX code and tested the new
>    funcationalities. See below repos:
>    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
>    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
>
> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> -machine confidential-guest-support=tdx \
> -object memory-backend-memfd-private,id=ram1,size=${mem} \
> -machine memory-backend=ram1
>
> Changelog
> ----------
> v7:
>   - Move the private/shared info from backing store to KVM.
>   - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
>   - Rework on the sync mechanism between zap/page fault paths.
>   - Addressed other comments in v6.
> v6:
>   - Re-organzied patch for both mm/KVM parts.
>   - Added flags for memfile_notifier so its consumers can state their
>     features and memory backing store can check against these flags.
>   - Put a backing store reference in the memfile_notifier and move pfn_ops
>     into backing store.
>   - Only support boot time backing store register.
>   - Overall KVM part improvement suggested by Sean and some others.
> v5:
>   - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
>     in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
>     be created by MFD_INACCESSIBLE.
>   - Introduced new APIs for backing store to register itself to
>     memfile_notifier instead of direct function call.
>   - Added the accounting and restriction for MFD_INACCESSIBLE memory.
>   - Added KVM API doc for new memslot extensions and man page for the new
>     MFD_INACCESSIBLE flag.
>   - Removed the overlap check for mapping the same file+offset into
>     multiple gfns due to perf consideration, warned in document.
>   - Addressed other comments in v4.
> v4:
>   - Decoupled the callbacks between KVM/mm from memfd and use new
>     name 'memfile_notifier'.
>   - Supported register multiple memslots to the same backing store.
>   - Added per-memslot pfn_ops instead of per-system.
>   - Reworked the invalidation part.
>   - Improved new KVM uAPIs (private memslot extension and memory
>     error) per Sean's suggestions.
>   - Addressed many other minor fixes for comments from v3.
> v3:
>   - Added locking protection when calling
>     invalidate_page_range/fallocate callbacks.
>   - Changed memslot structure to keep use useraddr for shared memory.
>   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
>   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
>   - Commit message improvement.
>   - Many small fixes for comments from the last version.
>
> Links to previous discussions
> -----------------------------
> [1] Original design proposal:
> https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> [2] Updated proposal and RFC patch v1:
> https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
>
> Chao Peng (12):
>   mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
>   selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
>   mm: Introduce memfile_notifier
>   mm/memfd: Introduce MFD_INACCESSIBLE flag
>   KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
>   KVM: Use gfn instead of hva for mmu_notifier_retry
>   KVM: Rename mmu_notifier_*
>   KVM: Extend the memslot to support fd-based private memory
>   KVM: Add KVM_EXIT_MEMORY_FAULT exit
>   KVM: Register/unregister the guest private memory regions
>   KVM: Handle page fault for private memory
>   KVM: Enable and expose KVM_MEM_PRIVATE
>
> Kirill A. Shutemov (1):
>   mm/shmem: Support memfile_notifier
>
>  Documentation/virt/kvm/api.rst             |  77 +++++-
>  arch/arm64/kvm/mmu.c                       |   8 +-
>  arch/mips/include/asm/kvm_host.h           |   2 +-
>  arch/mips/kvm/mmu.c                        |  10 +-
>  arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
>  arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
>  arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
>  arch/riscv/kvm/mmu.c                       |   4 +-
>  arch/x86/include/asm/kvm_host.h            |   3 +-
>  arch/x86/kvm/Kconfig                       |   3 +
>  arch/x86/kvm/mmu.h                         |   2 -
>  arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
>  arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
>  arch/x86/kvm/mmu/mmutrace.h                |   1 +
>  arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
>  arch/x86/kvm/x86.c                         |   2 +-
>  include/linux/kvm_host.h                   | 105 +++++---
>  include/linux/memfile_notifier.h           |  91 +++++++
>  include/linux/shmem_fs.h                   |   2 +
>  include/uapi/linux/fcntl.h                 |   1 +
>  include/uapi/linux/kvm.h                   |  37 +++
>  include/uapi/linux/memfd.h                 |   1 +
>  mm/Kconfig                                 |   4 +
>  mm/Makefile                                |   1 +
>  mm/memfd.c                                 |  18 +-
>  mm/memfile_notifier.c                      | 123 ++++++++++
>  mm/shmem.c                                 | 125 +++++++++-
>  tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
>  virt/kvm/Kconfig                           |   3 +
>  virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
>  virt/kvm/pfncache.c                        |  14 +-
>  35 files changed, 1074 insertions(+), 127 deletions(-)
>  create mode 100644 include/linux/memfile_notifier.h
>  create mode 100644 mm/memfile_notifier.c
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-07-06  8:20 ` [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Chao Peng
  2022-07-21  9:44   ` David Hildenbrand
@ 2022-08-26 15:19   ` Fuad Tabba
  2022-08-29 15:18     ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-08-26 15:19 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

Hi Chao,

On Wed, Jul 6, 2022 at 9:25 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Normally, a write to unallocated space of a file or the hole of a sparse
> file automatically causes space allocation, for memfd, this equals to
> memory allocation. This new seal prevents such automatically allocating,
> either this is from a direct write() or a write on the previously
> mmap-ed area. The seal does not prevent fallocate() so an explicit
> fallocate() can still cause allocating and can be used to reserve
> memory.
>
> This is used to prevent unintentional allocation from userspace on a
> stray or careless write and any intentional allocation should use an
> explicit fallocate(). One of the main usecases is to avoid memory double
> allocation for confidential computing usage where we use two memfds to
> back guest memory and at a single point only one memfd is alive and we
> want to prevent memory allocation for the other memfd which may have
> been mmap-ed previously. More discussion can be found at:
>
>   https://lkml.org/lkml/2022/6/14/1255
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/uapi/linux/fcntl.h |  1 +
>  mm/memfd.c                 |  3 ++-
>  mm/shmem.c                 | 16 ++++++++++++++--
>  3 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 2f86b2ad6d7e..98bdabc8e309 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -43,6 +43,7 @@
>  #define F_SEAL_GROW    0x0004  /* prevent file from growing */
>  #define F_SEAL_WRITE   0x0008  /* prevent writes */
>  #define F_SEAL_FUTURE_WRITE    0x0010  /* prevent future writes while mapped */
> +#define F_SEAL_AUTO_ALLOCATE   0x0020  /* prevent allocation for writes */

I think this should also be added to tools/include/uapi/linux/fcntl.h

Cheers,
/fuad


>  /* (1U << 31) is reserved for signed error codes */
>
>  /*
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..2afd898798e4 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
>                      F_SEAL_SHRINK | \
>                      F_SEAL_GROW | \
>                      F_SEAL_WRITE | \
> -                    F_SEAL_FUTURE_WRITE)
> +                    F_SEAL_FUTURE_WRITE | \
> +                    F_SEAL_AUTO_ALLOCATE)
>
>  static int memfd_add_seals(struct file *file, unsigned int seals)
>  {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a6f565308133..6c8aef15a17d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>         struct vm_area_struct *vma = vmf->vma;
>         struct inode *inode = file_inode(vma->vm_file);
>         gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> +       struct shmem_inode_info *info = SHMEM_I(inode);
> +       enum sgp_type sgp;
>         int err;
>         vm_fault_t ret = VM_FAULT_LOCKED;
>
> @@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>                 spin_unlock(&inode->i_lock);
>         }
>
> -       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
> +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> +               sgp = SGP_NOALLOC;
> +       else
> +               sgp = SGP_CACHE;
> +
> +       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
>                                   gfp, vma, vmf, &ret);
>         if (err)
>                 return vmf_error(err);
> @@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>         struct inode *inode = mapping->host;
>         struct shmem_inode_info *info = SHMEM_I(inode);
>         pgoff_t index = pos >> PAGE_SHIFT;
> +       enum sgp_type sgp;
>         int ret = 0;
>
>         /* i_rwsem is held by caller */
> @@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>                         return -EPERM;
>         }
>
> -       ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> +               sgp = SGP_NOALLOC;
> +       else
> +               sgp = SGP_WRITE;
> +       ret = shmem_getpage(inode, index, pagep, sgp);
>
>         if (ret)
>                 return ret;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-07-06  8:20 ` [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Chao Peng
                     ` (2 preceding siblings ...)
  2022-08-19 19:37   ` Vishal Annapurve
@ 2022-08-26 15:19   ` Fuad Tabba
  2022-08-29 15:21     ` Chao Peng
  3 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-08-26 15:19 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

Hi Chao,

On Wed, Jul 6, 2022 at 9:27 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> ioctls. The patch reuses existing SEV ioctl but differs that the
> address in the region for private memory is gpa while SEV case it's hva.
>
> The private memory region is stored as xarray in KVM for memory
> efficiency in normal usages and zapping existing memory mappings is also
> a side effect of these two ioctls.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst  | 17 +++++++---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/Kconfig            |  1 +
>  arch/x86/kvm/mmu.h              |  2 --
>  include/linux/kvm_host.h        |  8 +++++
>  virt/kvm/kvm_main.c             | 57 +++++++++++++++++++++++++++++++++
>  6 files changed, 80 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5ecfc7fbe0ee..dfb4caecab73 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/amd-memory-encryption.rst.
>  This ioctl can be used to register a guest memory region which may
>  contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register guest private
> +memory region and the addr/size of kvm_enc_region represents guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fallen into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
>  moving ciphertext of those pages will not result in plaintext being
>  swapped. So relocating (or migrating) physical backing pages for the SEV
>  guest will require some additional steps.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dae190e19fce..92120e3a224e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -37,6 +37,7 @@
>  #include <asm/hyperv-tlfs.h>
>
>  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ZAP_GFN_RANGE
>
>  #define KVM_MAX_VCPUS 1024
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 1f160801e2a7..05861b9656a4 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,7 @@ config KVM
>         select HAVE_KVM_PM_NOTIFIER if PM
>         select HAVE_KVM_PRIVATE_MEM if X86_64
>         select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
> +       select XARRAY_MULTI if HAVE_KVM_PRIVATE_MEM
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index a99acec925eb..428cd2e88cbd 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -209,8 +209,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>         return -(u32)fault & errcode;
>  }
>
> -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> -
>  int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>
>  int kvm_mmu_post_init_vm(struct kvm *kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1b203c8aa696..da33f8828456 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,6 +260,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  #endif
>
> +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> +#endif
> +
>  enum {
>         OUTSIDE_GUEST_MODE,
>         IN_GUEST_MODE,
> @@ -795,6 +799,9 @@ struct kvm {
>         struct notifier_block pm_notifier;
>  #endif
>         char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       struct xarray mem_attr_array;
> +#endif
>  };
>
>  #define kvm_err(fmt, ...) \
> @@ -1459,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_private_mem_supported(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 230c8ff9659c..bb714c2a4b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_PRIVATE   0x0001
> +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
> +                                            struct kvm_enc_region *region)
> +{
> +       unsigned long start, end;
> +       void *entry;
> +       int r;
> +
> +       if (region->size == 0 || region->addr + region->size < region->addr)
> +               return -EINVAL;
> +       if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
> +               return -EINVAL;
> +
> +       start = region->addr >> PAGE_SHIFT;
> +       end = (region->addr + region->size - 1) >> PAGE_SHIFT;
> +
> +       entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
> +                               xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
> +
> +       r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
> +                                       entry, GFP_KERNEL_ACCOUNT));
> +
> +       kvm_zap_gfn_range(kvm, start, end + 1);
> +
> +       return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>                                 unsigned long state,
> @@ -1138,6 +1167,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
>         spin_lock_init(&kvm->mn_invalidate_lock);
>         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>         xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       xa_init(&kvm->mem_attr_array);
> +#endif
>
>         INIT_LIST_HEAD(&kvm->gpc_list);
>         spin_lock_init(&kvm->gpc_lock);
> @@ -1305,6 +1337,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>         }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       xa_destroy(&kvm->mem_attr_array);
> +#endif
>         cleanup_srcu_struct(&kvm->irq_srcu);
>         cleanup_srcu_struct(&kvm->srcu);
>         kvm_arch_free_vm(kvm);
> @@ -1508,6 +1543,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
> +{
> +       return false;
> +}
> +
>  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       case KVM_MEMORY_ENCRYPT_REG_REGION:
> +       case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +               struct kvm_enc_region region;
> +
> +               if (!kvm_arch_private_mem_supported(kvm))
> +                       goto arch_vm_ioctl;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&region, argp, sizeof(region)))
> +                       goto out;
> +
> +               r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
> +               break;
> +       }
> +#endif
>         case KVM_GET_DIRTY_LOG: {
>                 struct kvm_dirty_log log;
>
> @@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_get_stats_fd(kvm);
>                 break;
>         default:
> +arch_vm_ioctl:

It might be good to make this label conditional on
CONFIG_HAVE_KVM_PRIVATE_MEM; otherwise you get a warning when
CONFIG_HAVE_KVM_PRIVATE_MEM isn't defined:

+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
 arch_vm_ioctl:
+#endif

Cheers,
/fuad





>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>         }
>  out:
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-26 15:19 ` Fuad Tabba
@ 2022-08-29 15:17   ` Chao Peng
  2022-08-31  9:12     ` Fuad Tabba
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-08-29 15:17 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Marc Zyngier, Will Deacon

On Fri, Aug 26, 2022 at 04:19:25PM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Wed, Jul 6, 2022 at 9:24 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > split_desc_cache only by default capacity
> >
> > Introduction
> > ------------
> > In general this patch series introduce fd-based memslot which provides
> > guest memory through memory file descriptor fd[offset,size] instead of
> > hva/size. The fd can be created from a supported memory filesystem
> > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > and the the memory backing store exchange callbacks when such memslot
> > gets created. At runtime KVM will call into callbacks provided by the
> > backing store to get the pfn with the fd+offset. Memory backing store
> > will also call into KVM callbacks when userspace punch hole on the fd
> > to notify KVM to unmap secondary MMU page table entries.
> >
> > Comparing to existing hva-based memslot, this new type of memslot allows
> > guest memory unmapped from host userspace like QEMU and even the kernel
> > itself, therefore reduce attack surface and prevent bugs.
> >
> > Based on this fd-based memslot, we can build guest private memory that
> > is going to be used in confidential computing environments such as Intel
> > TDX and AMD SEV. When supported, the memory backing store can provide
> > more enforcement on the fd and KVM can use a single memslot to hold both
> > the private and shared part of the guest memory.
> >
> > mm extension
> > ---------------------
> > Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> > created with these flags cannot read(), write() or mmap() etc via normal
> > MMU operations. The file content can only be used with the newly
> > introduced memfile_notifier extension.
> >
> > The memfile_notifier extension provides two sets of callbacks for KVM to
> > interact with the memory backing store:
> >   - memfile_notifier_ops: callbacks for memory backing store to notify
> >     KVM when memory gets invalidated.
> >   - backing store callbacks: callbacks for KVM to call into memory
> >     backing store to request memory pages for guest private memory.
> >
> > The memfile_notifier extension also provides APIs for memory backing
> > store to register/unregister itself and to trigger the notifier when the
> > bookmarked memory gets invalidated.
> >
> > The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> > prevent double allocation caused by unintentional guest when we only
> > have a single side of the shared/private memfds effective.
> >
> > memslot extension
> > -----------------
> > Add the private fd and the fd offset to existing 'shared' memslot so
> > that both private/shared guest memory can live in one single memslot.
> > A page in the memslot is either private or shared. Whether a guest page
> > is private or shared is maintained through reusing existing SEV ioctls
> > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> >
> 
> I'm on the Android pKVM team at Google, and we've been looking into
> how this approach fits with what we've been doing with pkvm/arm64.
> I've had a go at porting your patches, along with some fixes and
> additions so it would go on top of our latest pkvm patch series [1] to
> see how well this proposal fits with what we’re doing. You can find
> the ported code at this link [2].
> 
> In general, an fd-based approach fits very well with pKVM for the
> reasons you mention. It means that we don't necessarily need to map
> the guest memory, and with the new extensions it allows the host
> kernel to control whether to restrict migration and swapping.

Good to hear that.

> 
> For pKVM, we would also need the guest private memory not to be
> GUP’able by the kernel so that userspace can’t trick the kernel into
> accessing guest private memory in a context where it isn’t prepared to
> handle the fault injected by the hypervisor. We’re looking at whether
> we could use memfd_secret to achieve this, or maybe whether extending
> your work might solve the problem.

This is interesting and can be a valuable addition to this series.

> 
> However, during the porting effort, the main issue we've encountered
> is that many of the details of this approach seem to be targeted at
> TDX/SEV and don’t readily align with the design of pKVM. My knowledge
> on TDX is very rudimentary, so please bear with me if I get things
> wrong.

No doubt this series was initially designed for confidential computing
usages, but pKVM can definitely extend it if it finds it useful.

> 
> The idea of the memslot having two references to the backing memory,
> the (new) private_fd (a file descriptor) as well as the userspace_addr
> (a memory address), with the meaning changing depending on whether the
> memory is private or shared. Both can potentially be live at the same
> time, but only one is used by the guest depending on whether the
> memory is shared or private. For pKVM, the memory region is the same,
> and whether the underlying physical page is shared or private is
> determined by the hypervisor based on the initial configuration of the
> VM and also in response to hypercalls from the guest.

For confidential computing usages, this is actually the same: whether
memory is shared or private is determined by the initial configuration
or by guest hypercalls.

> So at least from
> our side, having a private_fd isn't the best fit, but rather just
> having an fd instead of a userspace_addr.

Let me make sure I understand this: pKVM basically wants to maintain
the shared and private memory in only one fd, and not use
userspace_addr at all, right? Is there anything blocking pKVM from
using private_fd + userspace_addr instead?

> 
> Moreover, something which was discussed here before [3], is the
> ability to share in-place. For pKVM/arm64, the conversion between
> shared and private involves only changes to the stage-2 page tables,
> which are controlled by the hypervisor. Android supports this in-place
> conversion already, and I think that the cost of copying for many
> use-cases that would involve large amounts of data would be big. We
> will measure the relative costs in due course, but in the meantime
> we’re nervous about adopting a new user ABI which doesn’t appear to
> cater for in-place conversion; having just the fd would simplify that
> somewhat

I understand it is difficult to achieve that with the current
private_fd + userspace_addr scheme (they are basically backed by two
separate fds), but is it possible for pKVM to extend this?
Brainstorming, for example: pKVM could ignore userspace_addr and use
only private_fd to cover both shared and private memory, or pKVM could
introduce a new KVM memslot flag, e.g. as sketched below.
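
Just to illustrate that second idea (I'm writing the extended memslot
struct and field names from memory, and KVM_MEM_PKVM_SINGLE_FD is a
purely hypothetical flag, so please treat this as a sketch rather than
the actual uAPI):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical flag, does not exist in this series. */
#define KVM_MEM_PKVM_SINGLE_FD	(1UL << 3)

static int set_pkvm_single_fd_memslot(int vm_fd, int memfd,
				      __u64 gpa, __u64 size)
{
	struct kvm_userspace_memory_region_ext region = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE |
					   KVM_MEM_PKVM_SINGLE_FD,
			.guest_phys_addr = gpa,
			.memory_size     = size,
			/* Ignored in this mode, no host userspace mapping. */
			.userspace_addr  = 0,
		},
		/* One fd backs both the shared and the private memory. */
		.private_fd     = memfd,
		.private_offset = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}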

> 
> In the memfd approach, what is the plan for being able to initialize
> guest private memory from the host? In my port of this patch series,
> I've added an fcntl() command that allows setting INACCESSIBLE after
> the memfd has been created. So the memory can be mapped, initialized,
> then unmapped. Of course there is no way to enforce that the memory is
> unmapped from userspace before being used as private memory, but the
> hypervisor will take care of the stage-2 mapping and so a user access
> to the private memory would result in a SEGV regardless of the flag

There is a discussion about removing MFD_INACCESSIBLE and delaying the
setting of the flag until KVM/backing store binding time
(https://lkml.kernel.org/lkml/20220824094149.GA1383966@chaop.bj.intel.com/).

Creating a new API like the fcntl() command you are playing with also
works if it turns out that MFD_INACCESSIBLE has to be set at
memfd_create() time.

> 
> Now, moving on to implementation-specific issues in this patch series
> that I have encountered:
> 
> - There are a couple of small issues in porting the patches, some of
> which have been mentioned already by others. I will point out the rest
> in direct replies to these patches.

Thanks.

> 
> - MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
> this patch series. MFD_INACCESSIBLE only sets
> MEMFILE_F_USER_INACCESSIBLE. Is this intentional?

They get set in kvm_private_mem_register() in patch 13; basically,
those flags are expected to be set by the architecture code.

> 
> - Nothing in this patch series enforces that MFD_INACCESSIBLE or that
> any of the MEMFILE_F_* flags are set for the file descriptor to be
> used as a private_fd. Is this also intentional?

With the KVM_MEM_PRIVATE memslot flag, the MEMFILE_F_* flags are
enforced by the architecture code.

> 
> Most of us working on pKVM will be at KVM forum Dublin in September,
> so it would be great if we could have a chat (and/or beer!) face to
> face sometime during the conference to help us figure out an
> upstreamable solution for Android

I would like to, but currently I have no travel plan due to COVID-19 :(
We can have more online discussions anyway.

Thanks,
Chao
> 
> Cheers,
> /fuad
> 
> [1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/
> [2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem
> [3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/
> 
> 
> > Test
> > ----
> > To test the new functionalities of this patch TDX patchset is needed.
> > Since TDX patchset has not been merged so I did two kinds of test:
> >
> > -  Regresion test on kvm/queue (this patchset)
> >    Most new code are not covered. Code also in below repo:
> >    https://github.com/chao-p/linux/tree/privmem-v7
> >
> > -  New Funational test on latest TDX code
> >    The patch is rebased to latest TDX code and tested the new
> >    funcationalities. See below repos:
> >    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
> >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> >
> > An example QEMU command line for TDX test:
> > -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> > -machine confidential-guest-support=tdx \
> > -object memory-backend-memfd-private,id=ram1,size=${mem} \
> > -machine memory-backend=ram1
> >
> > Changelog
> > ----------
> > v7:
> >   - Move the private/shared info from backing store to KVM.
> >   - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
> >   - Rework on the sync mechanism between zap/page fault paths.
> >   - Addressed other comments in v6.
> > v6:
> >   - Re-organzied patch for both mm/KVM parts.
> >   - Added flags for memfile_notifier so its consumers can state their
> >     features and memory backing store can check against these flags.
> >   - Put a backing store reference in the memfile_notifier and move pfn_ops
> >     into backing store.
> >   - Only support boot time backing store register.
> >   - Overall KVM part improvement suggested by Sean and some others.
> > v5:
> >   - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
> >     in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
> >     be created by MFD_INACCESSIBLE.
> >   - Introduced new APIs for backing store to register itself to
> >     memfile_notifier instead of direct function call.
> >   - Added the accounting and restriction for MFD_INACCESSIBLE memory.
> >   - Added KVM API doc for new memslot extensions and man page for the new
> >     MFD_INACCESSIBLE flag.
> >   - Removed the overlap check for mapping the same file+offset into
> >     multiple gfns due to perf consideration, warned in document.
> >   - Addressed other comments in v4.
> > v4:
> >   - Decoupled the callbacks between KVM/mm from memfd and use new
> >     name 'memfile_notifier'.
> >   - Supported register multiple memslots to the same backing store.
> >   - Added per-memslot pfn_ops instead of per-system.
> >   - Reworked the invalidation part.
> >   - Improved new KVM uAPIs (private memslot extension and memory
> >     error) per Sean's suggestions.
> >   - Addressed many other minor fixes for comments from v3.
> > v3:
> >   - Added locking protection when calling
> >     invalidate_page_range/fallocate callbacks.
> >   - Changed memslot structure to keep use useraddr for shared memory.
> >   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
> >   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
> >   - Commit message improvement.
> >   - Many small fixes for comments from the last version.
> >
> > Links to previous discussions
> > -----------------------------
> > [1] Original design proposal:
> > https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> > [2] Updated proposal and RFC patch v1:
> > https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> > [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
> >
> > Chao Peng (12):
> >   mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
> >   selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
> >   mm: Introduce memfile_notifier
> >   mm/memfd: Introduce MFD_INACCESSIBLE flag
> >   KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
> >   KVM: Use gfn instead of hva for mmu_notifier_retry
> >   KVM: Rename mmu_notifier_*
> >   KVM: Extend the memslot to support fd-based private memory
> >   KVM: Add KVM_EXIT_MEMORY_FAULT exit
> >   KVM: Register/unregister the guest private memory regions
> >   KVM: Handle page fault for private memory
> >   KVM: Enable and expose KVM_MEM_PRIVATE
> >
> > Kirill A. Shutemov (1):
> >   mm/shmem: Support memfile_notifier
> >
> >  Documentation/virt/kvm/api.rst             |  77 +++++-
> >  arch/arm64/kvm/mmu.c                       |   8 +-
> >  arch/mips/include/asm/kvm_host.h           |   2 +-
> >  arch/mips/kvm/mmu.c                        |  10 +-
> >  arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
> >  arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
> >  arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
> >  arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
> >  arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
> >  arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
> >  arch/riscv/kvm/mmu.c                       |   4 +-
> >  arch/x86/include/asm/kvm_host.h            |   3 +-
> >  arch/x86/kvm/Kconfig                       |   3 +
> >  arch/x86/kvm/mmu.h                         |   2 -
> >  arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
> >  arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
> >  arch/x86/kvm/mmu/mmutrace.h                |   1 +
> >  arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
> >  arch/x86/kvm/x86.c                         |   2 +-
> >  include/linux/kvm_host.h                   | 105 +++++---
> >  include/linux/memfile_notifier.h           |  91 +++++++
> >  include/linux/shmem_fs.h                   |   2 +
> >  include/uapi/linux/fcntl.h                 |   1 +
> >  include/uapi/linux/kvm.h                   |  37 +++
> >  include/uapi/linux/memfd.h                 |   1 +
> >  mm/Kconfig                                 |   4 +
> >  mm/Makefile                                |   1 +
> >  mm/memfd.c                                 |  18 +-
> >  mm/memfile_notifier.c                      | 123 ++++++++++
> >  mm/shmem.c                                 | 125 +++++++++-
> >  tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
> >  virt/kvm/Kconfig                           |   3 +
> >  virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
> >  virt/kvm/pfncache.c                        |  14 +-
> >  35 files changed, 1074 insertions(+), 127 deletions(-)
> >  create mode 100644 include/linux/memfile_notifier.h
> >  create mode 100644 mm/memfile_notifier.c
> >
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  2022-08-26 15:19   ` Fuad Tabba
@ 2022-08-29 15:18     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-29 15:18 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Fri, Aug 26, 2022 at 04:19:32PM +0100, Fuad Tabba wrote:
> Hi Chao,
> 
> On Wed, Jul 6, 2022 at 9:25 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Normally, a write to unallocated space of a file or the hole of a sparse
> > file automatically causes space allocation, for memfd, this equals to
> > memory allocation. This new seal prevents such automatically allocating,
> > either this is from a direct write() or a write on the previously
> > mmap-ed area. The seal does not prevent fallocate() so an explicit
> > fallocate() can still cause allocating and can be used to reserve
> > memory.
> >
> > This is used to prevent unintentional allocation from userspace on a
> > stray or careless write and any intentional allocation should use an
> > explicit fallocate(). One of the main usecases is to avoid memory double
> > allocation for confidential computing usage where we use two memfds to
> > back guest memory and at a single point only one memfd is alive and we
> > want to prevent memory allocation for the other memfd which may have
> > been mmap-ed previously. More discussion can be found at:
> >
> >   https://lkml.org/lkml/2022/6/14/1255
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/uapi/linux/fcntl.h |  1 +
> >  mm/memfd.c                 |  3 ++-
> >  mm/shmem.c                 | 16 ++++++++++++++--
> >  3 files changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index 2f86b2ad6d7e..98bdabc8e309 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -43,6 +43,7 @@
> >  #define F_SEAL_GROW    0x0004  /* prevent file from growing */
> >  #define F_SEAL_WRITE   0x0008  /* prevent writes */
> >  #define F_SEAL_FUTURE_WRITE    0x0010  /* prevent future writes while mapped */
> > +#define F_SEAL_AUTO_ALLOCATE   0x0020  /* prevent allocation for writes */
> 
> I think this should also be added to tools/include/uapi/linux/fcntl.h

Yes, thanks.

Chao
> 
> Cheers,
> /fuad
> 
> 
> >  /* (1U << 31) is reserved for signed error codes */
> >
> >  /*
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 08f5f8304746..2afd898798e4 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
> >                      F_SEAL_SHRINK | \
> >                      F_SEAL_GROW | \
> >                      F_SEAL_WRITE | \
> > -                    F_SEAL_FUTURE_WRITE)
> > +                    F_SEAL_FUTURE_WRITE | \
> > +                    F_SEAL_AUTO_ALLOCATE)
> >
> >  static int memfd_add_seals(struct file *file, unsigned int seals)
> >  {
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index a6f565308133..6c8aef15a17d 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct inode *inode = file_inode(vma->vm_file);
> >         gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> > +       struct shmem_inode_info *info = SHMEM_I(inode);
> > +       enum sgp_type sgp;
> >         int err;
> >         vm_fault_t ret = VM_FAULT_LOCKED;
> >
> > @@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> >                 spin_unlock(&inode->i_lock);
> >         }
> >
> > -       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
> > +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> > +               sgp = SGP_NOALLOC;
> > +       else
> > +               sgp = SGP_CACHE;
> > +
> > +       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
> >                                   gfp, vma, vmf, &ret);
> >         if (err)
> >                 return vmf_error(err);
> > @@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >         struct inode *inode = mapping->host;
> >         struct shmem_inode_info *info = SHMEM_I(inode);
> >         pgoff_t index = pos >> PAGE_SHIFT;
> > +       enum sgp_type sgp;
> >         int ret = 0;
> >
> >         /* i_rwsem is held by caller */
> > @@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >                         return -EPERM;
> >         }
> >
> > -       ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> > +               sgp = SGP_NOALLOC;
> > +       else
> > +               sgp = SGP_WRITE;
> > +       ret = shmem_getpage(inode, index, pagep, sgp);
> >
> >         if (ret)
> >                 return ret;
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions
  2022-08-26 15:19   ` Fuad Tabba
@ 2022-08-29 15:21     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-08-29 15:21 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Fri, Aug 26, 2022 at 04:19:43PM +0100, Fuad Tabba wrote:
> > +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
> > +{
> > +       return false;
> > +}
> > +
> >  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >  {
> >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >                 break;
> >         }
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +       case KVM_MEMORY_ENCRYPT_REG_REGION:
> > +       case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > +               struct kvm_enc_region region;
> > +
> > +               if (!kvm_arch_private_mem_supported(kvm))
> > +                       goto arch_vm_ioctl;
> > +
> > +               r = -EFAULT;
> > +               if (copy_from_user(&region, argp, sizeof(region)))
> > +                       goto out;
> > +
> > +               r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
> > +               break;
> > +       }
> > +#endif
> >         case KVM_GET_DIRTY_LOG: {
> >                 struct kvm_dirty_log log;
> >
> > @@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 r = kvm_vm_ioctl_get_stats_fd(kvm);
> >                 break;
> >         default:
> > +arch_vm_ioctl:
> 
> It might be good to make this label conditional on
> CONFIG_HAVE_KVM_PRIVATE_MEM, otherwise you get a warning if
> CONFIG_HAVE_KVM_PRIVATE_MEM isn't defined.
> 
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
>  arch_vm_ioctl:
> +#endif

Right, as the bot already complained.

Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> 
> 
> >                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> >         }
> >  out:
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-29 15:17   ` Chao Peng
@ 2022-08-31  9:12     ` Fuad Tabba
  2022-09-02 10:19       ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-08-31  9:12 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Marc Zyngier, Will Deacon

Hi Chao,

Thank you for your reply.

On Mon, Aug 29, 2022 at 4:23 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Fri, Aug 26, 2022 at 04:19:25PM +0100, Fuad Tabba wrote:
> > Hi,
> >
> > On Wed, Jul 6, 2022 at 9:24 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > This is the v7 of this series which tries to implement the fd-based KVM
> > > guest private memory. The patches are based on latest kvm/queue branch
> > > commit:
> > >
> > >   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> > > split_desc_cache only by default capacity
> > >
> > > Introduction
> > > ------------
> > > In general this patch series introduce fd-based memslot which provides
> > > guest memory through memory file descriptor fd[offset,size] instead of
> > > hva/size. The fd can be created from a supported memory filesystem
> > > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > > and the the memory backing store exchange callbacks when such memslot
> > > gets created. At runtime KVM will call into callbacks provided by the
> > > backing store to get the pfn with the fd+offset. Memory backing store
> > > will also call into KVM callbacks when userspace punch hole on the fd
> > > to notify KVM to unmap secondary MMU page table entries.
> > >
> > > Comparing to existing hva-based memslot, this new type of memslot allows
> > > guest memory unmapped from host userspace like QEMU and even the kernel
> > > itself, therefore reduce attack surface and prevent bugs.
> > >
> > > Based on this fd-based memslot, we can build guest private memory that
> > > is going to be used in confidential computing environments such as Intel
> > > TDX and AMD SEV. When supported, the memory backing store can provide
> > > more enforcement on the fd and KVM can use a single memslot to hold both
> > > the private and shared part of the guest memory.
> > >
> > > mm extension
> > > ---------------------
> > > Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file
> > > created with these flags cannot read(), write() or mmap() etc via normal
> > > MMU operations. The file content can only be used with the newly
> > > introduced memfile_notifier extension.
> > >
> > > The memfile_notifier extension provides two sets of callbacks for KVM to
> > > interact with the memory backing store:
> > >   - memfile_notifier_ops: callbacks for memory backing store to notify
> > >     KVM when memory gets invalidated.
> > >   - backing store callbacks: callbacks for KVM to call into memory
> > >     backing store to request memory pages for guest private memory.
> > >
> > > The memfile_notifier extension also provides APIs for memory backing
> > > store to register/unregister itself and to trigger the notifier when the
> > > bookmarked memory gets invalidated.
> > >
> > > The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to
> > > prevent double allocation caused by unintentional guest when we only
> > > have a single side of the shared/private memfds effective.
> > >
> > > memslot extension
> > > -----------------
> > > Add the private fd and the fd offset to existing 'shared' memslot so
> > > that both private/shared guest memory can live in one single memslot.
> > > A page in the memslot is either private or shared. Whether a guest page
> > > is private or shared is maintained through reusing existing SEV ioctls
> > > KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> > >
> >
> > I'm on the Android pKVM team at Google, and we've been looking into
> > how this approach fits with what we've been doing with pkvm/arm64.
> > I've had a go at porting your patches, along with some fixes and
> > additions so it would go on top of our latest pkvm patch series [1] to
> > see how well this proposal fits with what we’re doing. You can find
> > the ported code at this link [2].
> >
> > In general, an fd-based approach fits very well with pKVM for the
> > reasons you mention. It means that we don't necessarily need to map
> > the guest memory, and with the new extensions it allows the host
> > kernel to control whether to restrict migration and swapping.
>
> Good to hear that.
>
> >
> > For pKVM, we would also need the guest private memory not to be
> > GUP’able by the kernel so that userspace can’t trick the kernel into
> > accessing guest private memory in a context where it isn’t prepared to
> > handle the fault injected by the hypervisor. We’re looking at whether
> > we could use memfd_secret to achieve this, or maybe whether extending
> > your work might solve the problem.
>
> This is interesting and can be a valuable addition to this series.

I'll keep you posted as it goes. I think with the work that you've
already put in, it wouldn't require that much more.

> >
> > However, during the porting effort, the main issue we've encountered
> > is that many of the details of this approach seem to be targeted at
> > TDX/SEV and don’t readily align with the design of pKVM. My knowledge
> > on TDX is very rudimentary, so please bear with me if I get things
> > wrong.
>
> No doubt this series is initially designed for confidential computing
> usages, but pKVM can definitely extend it if it finds it useful.
>
> >
> > The idea of the memslot having two references to the backing memory,
> > the (new) private_fd (a file descriptor) as well as the userspace_addr
> > (a memory address), with the meaning changing depending on whether the
> > memory is private or shared. Both can potentially be live at the same
> > time, but only one is used by the guest depending on whether the
> > memory is shared or private. For pKVM, the memory region is the same,
> > and whether the underlying physical page is shared or private is
> > determined by the hypervisor based on the initial configuration of the
> > VM and also in response to hypercalls from the guest.
>
> For confidential computing usages, this is actually the same. Whether a
> page is shared or private is determined by the initial configuration or by
> guest hypercalls.
>
> > So at least from
> > our side, having a private_fd isn't the best fit, but rather just
> > having an fd instead of a userspace_addr.
>
> Let me understand this a bit: pKVM basically wants to maintain the
> shared and private memory in only one fd, and not use userspace_addr at
> all, right? Is there anything blocking pKVM from using private_fd +
> userspace_addr instead?
> >
> > Moreover, something which was discussed here before [3], is the
> > ability to share in-place. For pKVM/arm64, the conversion between
> > shared and private involves only changes to the stage-2 page tables,
> > which are controlled by the hypervisor. Android supports this in-place
> > conversion already, and I think that the cost of copying for many
> > use-cases that would involve large amounts of data would be big. We
> > will measure the relative costs in due course, but in the meantime
> > we’re nervous about adopting a new user ABI which doesn’t appear to
> > cater for in-place conversion; having just the fd would simplify that
> > somewhat
>
> I understand it is difficult to achieve that with the current
> private_fd + userspace_addr (they are basically two separate fds), but is
> it possible for pKVM to extend this? Brainstorming, for example: pKVM could
> ignore userspace_addr and only use private_fd to cover both shared and
> private memory, or pKVM could introduce a new KVM memslot flag?

It's not that there's anything blocking pKVM from doing that. It's
that the disconnect of using a memory address for the shared memory,
and a file descriptor for the private memory doesn't really make sense
for pKVM. I see how it makes sense for TDX and the Intel-specific
implementation. It just seems that this is baking in an
implementation-specific aspect as a part of the KVM general api, and
the worry is that this might have some unintended consequences in the
future.
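
To make the disconnect concrete, the memslot extension under discussion is
roughly of the shape below; the field names are assumptions for illustration
and may not match the exact uAPI in the series:

/*
 * Sketch only: the existing hva-based memslot extended with an fd + offset
 * for the private half.  Exact layout and naming may differ from the patches.
 */
struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region; /* userspace_addr covers the shared half */
	__u64 private_offset;	/* offset into private_fd */
	__u32 private_fd;	/* fd of an inaccessible memfd */
	__u32 pad1;
	__u64 pad2[14];
};

What pKVM is asking for above is essentially to collapse this into a single
fd that backs both halves, instead of carrying userspace_addr alongside
private_fd.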

> >
> > In the memfd approach, what is the plan for being able to initialize
> > guest private memory from the host? In my port of this patch series,
> > I've added an fcntl() command that allows setting INACCESSIBLE after
> > the memfd has been created. So the memory can be mapped, initialized,
> > then unmapped. Of course there is no way to enforce that the memory is
> > unmapped from userspace before being used as private memory, but the
> > hypervisor will take care of the stage-2 mapping and so a user access
> > to the private memory would result in a SEGV regardless of the flag
>
> There is discussion on removing MFD_INACCESSIBLE and delaying the
> alignment of the flag to the KVM/backing store binding time
> (https://lkml.kernel.org/lkml/20220824094149.GA1383966@chaop.bj.intel.com/).
>
> Creating a new API like the fcntl() you are playing with also works if it
> turns out MFD_INACCESSIBLE has to be set at memfd_create() time.

That makes sense.
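
For reference, the flow described above would look roughly like the sketch
below from userspace. F_MEMFD_SET_INACCESSIBLE is a placeholder name for the
out-of-tree fcntl() command mentioned above; it is not part of this series or
of mainline:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Placeholder for the hypothetical fcntl() command; neither the name nor
 * the value exists upstream.
 */
#ifndef F_MEMFD_SET_INACCESSIBLE
#define F_MEMFD_SET_INACCESSIBLE 0x4000
#endif

/* Error handling trimmed for brevity. */
static int make_private_memfd(const void *image, size_t image_size, size_t size)
{
	int fd = memfd_create("guest-mem", MFD_CLOEXEC);
	void *p;

	ftruncate(fd, size);

	/* Map, copy in the initial guest payload, then unmap again. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memcpy(p, image, image_size);
	munmap(p, size);

	/* Flip the fd to inaccessible before handing it to KVM (hypothetical). */
	fcntl(fd, F_MEMFD_SET_INACCESSIBLE);

	return fd;
}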

> >
> > Now, moving on to implementation-specific issues in this patch series
> > that I have encountered:
> >
> > - There are a couple of small issues in porting the patches, some of
> > which have been mentioned already by others. I will point out the rest
> > in direct replies to these patches.
>
> Thanks.
>
> >
> > - MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
> > this patch series. MFD_INACCESSIBLE only sets
> > MEMFILE_F_USER_INACCESSIBLE. Is this intentional?
>
> It gets set in kvm_private_mem_register() of patch 13; basically, those
> flags are expected to be set by architecture code.
>
> >
> > - Nothing in this patch series enforces that MFD_INACCESSIBLE or that
> > any of the MEMFILE_F_* flags are set for the file descriptor to be
> > used as a private_fd. Is this also intentional?
>
> With the KVM_MEM_PRIVATE memslot flag, the MEMFILE_F_* flags are enforced by
> the architecture code.

Right. I was expecting them to be in the mem_fd, but I see now how
they are being set and enforced in patch 13. This makes a lot of sense
now. Thanks!

> >
> > Most of us working on pKVM will be at KVM forum Dublin in September,
> > so it would be great if we could have a chat (and/or beer!) face to
> > face sometime during the conference to help us figure out an
> > upstreamable solution for Android
>
> I would like to, but currently I have no travel plan due to COVID-19 :(
> We can have more online discussions anyway.

Of course! We'll continue this online, and hopefully we will get a
chance to meet in person soon.

Cheers,
/fuad


> Thanks,
> Chao
> >
> > Cheers,
> > /fuad
> >
> > [1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/
> > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem
> > [3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/
> >
> >
> > > Test
> > > ----
> > > To test the new functionalities of this patch TDX patchset is needed.
> > > Since TDX patchset has not been merged so I did two kinds of test:
> > >
> > > -  Regression test on kvm/queue (this patchset)
> > >    Most new code is not covered. The code is also in the repo below:
> > >    https://github.com/chao-p/linux/tree/privmem-v7
> > >
> > > -  New functional test on latest TDX code
> > >    The patch is rebased to the latest TDX code and tests the new
> > >    functionality. See the repos below:
> > >    Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
> > >    QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
> > >
> > > An example QEMU command line for TDX test:
> > > -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
> > > -machine confidential-guest-support=tdx \
> > > -object memory-backend-memfd-private,id=ram1,size=${mem} \
> > > -machine memory-backend=ram1
> > >
> > > Changelog
> > > ----------
> > > v7:
> > >   - Move the private/shared info from backing store to KVM.
> > >   - Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
> > >   - Rework on the sync mechanism between zap/page fault paths.
> > >   - Addressed other comments in v6.
> > > v6:
> > >   - Re-organized patch for both mm/KVM parts.
> > >   - Added flags for memfile_notifier so its consumers can state their
> > >     features and memory backing store can check against these flags.
> > >   - Put a backing store reference in the memfile_notifier and move pfn_ops
> > >     into backing store.
> > >   - Only support boot time backing store register.
> > >   - Overall KVM part improvement suggested by Sean and some others.
> > > v5:
> > >   - Removed userspace visible F_SEAL_INACCESSIBLE, instead using an
> > >     in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only
> > >     be created by MFD_INACCESSIBLE.
> > >   - Introduced new APIs for backing store to register itself to
> > >     memfile_notifier instead of direct function call.
> > >   - Added the accounting and restriction for MFD_INACCESSIBLE memory.
> > >   - Added KVM API doc for new memslot extensions and man page for the new
> > >     MFD_INACCESSIBLE flag.
> > >   - Removed the overlap check for mapping the same file+offset into
> > >     multiple gfns due to perf consideration, warned in document.
> > >   - Addressed other comments in v4.
> > > v4:
> > >   - Decoupled the callbacks between KVM/mm from memfd and use new
> > >     name 'memfile_notifier'.
> > >   - Supported registering multiple memslots to the same backing store.
> > >   - Added per-memslot pfn_ops instead of per-system.
> > >   - Reworked the invalidation part.
> > >   - Improved new KVM uAPIs (private memslot extension and memory
> > >     error) per Sean's suggestions.
> > >   - Addressed many other minor fixes for comments from v3.
> > > v3:
> > >   - Added locking protection when calling
> > >     invalidate_page_range/fallocate callbacks.
> > >   - Changed memslot structure to keep using useraddr for shared memory.
> > >   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
> > >   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
> > >   - Commit message improvement.
> > >   - Many small fixes for comments from the last version.
> > >
> > > Links to previous discussions
> > > -----------------------------
> > > [1] Original design proposal:
> > > https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
> > > [2] Updated proposal and RFC patch v1:
> > > https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/
> > > [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
> > >
> > > Chao Peng (12):
> > >   mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
> > >   selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
> > >   mm: Introduce memfile_notifier
> > >   mm/memfd: Introduce MFD_INACCESSIBLE flag
> > >   KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
> > >   KVM: Use gfn instead of hva for mmu_notifier_retry
> > >   KVM: Rename mmu_notifier_*
> > >   KVM: Extend the memslot to support fd-based private memory
> > >   KVM: Add KVM_EXIT_MEMORY_FAULT exit
> > >   KVM: Register/unregister the guest private memory regions
> > >   KVM: Handle page fault for private memory
> > >   KVM: Enable and expose KVM_MEM_PRIVATE
> > >
> > > Kirill A. Shutemov (1):
> > >   mm/shmem: Support memfile_notifier
> > >
> > >  Documentation/virt/kvm/api.rst             |  77 +++++-
> > >  arch/arm64/kvm/mmu.c                       |   8 +-
> > >  arch/mips/include/asm/kvm_host.h           |   2 +-
> > >  arch/mips/kvm/mmu.c                        |  10 +-
> > >  arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
> > >  arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
> > >  arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
> > >  arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
> > >  arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
> > >  arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
> > >  arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
> > >  arch/riscv/kvm/mmu.c                       |   4 +-
> > >  arch/x86/include/asm/kvm_host.h            |   3 +-
> > >  arch/x86/kvm/Kconfig                       |   3 +
> > >  arch/x86/kvm/mmu.h                         |   2 -
> > >  arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
> > >  arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
> > >  arch/x86/kvm/mmu/mmutrace.h                |   1 +
> > >  arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
> > >  arch/x86/kvm/x86.c                         |   2 +-
> > >  include/linux/kvm_host.h                   | 105 +++++---
> > >  include/linux/memfile_notifier.h           |  91 +++++++
> > >  include/linux/shmem_fs.h                   |   2 +
> > >  include/uapi/linux/fcntl.h                 |   1 +
> > >  include/uapi/linux/kvm.h                   |  37 +++
> > >  include/uapi/linux/memfd.h                 |   1 +
> > >  mm/Kconfig                                 |   4 +
> > >  mm/Makefile                                |   1 +
> > >  mm/memfd.c                                 |  18 +-
> > >  mm/memfile_notifier.c                      | 123 ++++++++++
> > >  mm/shmem.c                                 | 125 +++++++++-
> > >  tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
> > >  virt/kvm/Kconfig                           |   3 +
> > >  virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
> > >  virt/kvm/pfncache.c                        |  14 +-
> > >  35 files changed, 1074 insertions(+), 127 deletions(-)
> > >  create mode 100644 include/linux/memfile_notifier.h
> > >  create mode 100644 mm/memfile_notifier.c
> > >
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-21  5:15         ` Hugh Dickins
@ 2022-08-31 14:24           ` Kirill A . Shutemov
  2022-09-02 10:27             ` Chao Peng
  2022-09-08  1:10             ` Kirill A. Shutemov
  0 siblings, 2 replies; 398+ messages in thread
From: Kirill A . Shutemov @ 2022-08-31 14:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > I will try next week to rework it as shim to top of shmem. Does it work
> > for you?
> 
> Yes, please do, thanks.  It's a compromise between us: the initial TDX
> case has no justification to use shmem at all, but doing it that way
> will help you with some of the infrastructure, and will probably be
> easiest for KVM to extend to other more relaxed fd cases later.

Okay, below is my take on the shim approach.

I don't hate how it turned out. It is easier to understand without the
callback exchange thing.

The only caveat is that I had to introduce an external lock to protect
against a race between lookup and truncate. Otherwise, it looks pretty
reasonable to me.

I did very limited testing. And it lacks integration with KVM, but the API
has not changed substantially, so it should be easy to adopt.

Any comments?

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..aec04a0f8b7b 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H
 
 #include <linux/file.h>
+#include <linux/pfn_t.h>
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+	void (*invalidate)(struct inaccessible_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+	struct list_head list;
+	const struct inaccessible_notifier_ops *ops;
+};
+
+int inaccessible_register_notifier(struct file *file,
+				   struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+		return -EINVAL;
+
+	/* TODO: add hugetlb support */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
 		*file_seals &= ~F_SEAL_SEAL;
 	}
 
+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
+
+		file = inaccessible_file;
+	}
+
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
new file mode 100644
index 000000000000..89194438af9c
--- /dev/null
+++ b/mm/memfd_inaccessible.c
@@ -0,0 +1,234 @@
+#include <linux/memfd.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+
+struct inaccessible_data {
+	struct rw_semaphore lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
+				 pgoff_t start, pgoff_t end)
+{
+	struct inaccessible_notifier *notifier;
+
+	lockdep_assert_held(&data->lock);
+	VM_BUG_ON(!rwsem_is_locked(&data->lock));
+
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate(notifier, start, end);
+	}
+}
+
+static int inaccessible_release(struct inode *inode, struct file *file)
+{
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long inaccessible_fallocate(struct file *file, int mode,
+				   loff_t offset, loff_t len)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	/* The lock prevents parallel inaccessible_get/put_pfn() */
+	down_write(&data->lock);
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	inaccessible_notifier_invalidate(data, offset, offset + len);
+out:
+	up_write(&data->lock);
+	return ret;
+}
+
+static const struct file_operations inaccessible_fops = {
+	.release = inaccessible_release,
+	.fallocate = inaccessible_fallocate,
+};
+
+static int inaccessible_getattr(struct user_namespace *mnt_userns,
+				const struct path *path, struct kstat *stat,
+				u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int inaccessible_setattr(struct user_namespace *mnt_userns,
+				struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size) {
+			ret = -EPERM;
+			goto out;
+		}
+
+		if (!PAGE_ALIGNED(attr->ia_size)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+out:
+	return ret;
+}
+
+static const struct inode_operations inaccessible_iops = {
+	.getattr = inaccessible_getattr,
+	.setattr = inaccessible_setattr,
+};
+
+static int inaccessible_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type inaccessible_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "[inaccessible]",
+	.init_fs_context = inaccessible_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *inaccessible_mnt;
+
+static __init int inaccessible_init(void)
+{
+	inaccessible_mnt = kern_mount(&inaccessible_fs);
+	if (IS_ERR(inaccessible_mnt))
+		return PTR_ERR(inaccessible_mnt);
+	return 0;
+}
+fs_initcall(inaccessible_init);
+
+struct file *memfd_mkinaccessible(struct file *memfd)
+{
+	struct inaccessible_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	init_rwsem(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &inaccessible_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, inaccessible_mnt,
+				 "[memfd:inaccessible]", O_RDWR,
+				 &inaccessible_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+	}
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+int inaccessible_register_notifier(struct file *file,
+			      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	down_write(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	up_write(&data->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	down_write(&data->lock);
+	list_del_rcu(&notifier->list);
+	up_write(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	down_read(&data->lock);
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret) {
+		up_read(&data->lock);
+		return ret;
+	}
+
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	if (WARN_ON_ONCE(!page))
+		return;
+
+	SetPageUptodate(page);
+	unlock_page(page);
+	put_page(page);
+	up_read(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
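
To show how the API above is meant to be consumed, here is a rough sketch of
a KVM-like user of it (not taken from the KVM side of the series; the demo_*
names are made up for illustration):

#include <linux/memfd.h>
#include <linux/pfn_t.h>

struct demo_consumer {
	struct inaccessible_notifier notifier;
	struct file *file;
};

/* Invoked with the shim's internal lock held when userspace punches a hole. */
static void demo_invalidate(struct inaccessible_notifier *notifier,
			    pgoff_t start, pgoff_t end)
{
	/* Zap any secondary MMU mappings covering [start, end) here. */
}

static const struct inaccessible_notifier_ops demo_ops = {
	.invalidate = demo_invalidate,
};

static int demo_bind(struct demo_consumer *c, struct file *file)
{
	c->file = file;
	c->notifier.ops = &demo_ops;
	return inaccessible_register_notifier(file, &c->notifier);
}

static int demo_map_page(struct demo_consumer *c, pgoff_t offset)
{
	pfn_t pfn;
	int order, ret;

	ret = inaccessible_get_pfn(c->file, offset, &pfn, &order);
	if (ret)
		return ret;

	/* ... install pfn into the secondary MMU page tables ... */

	/* Drops the page reference and the internal lock taken by get_pfn(). */
	inaccessible_put_pfn(c->file, pfn);
	return 0;
}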
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-31  9:12     ` Fuad Tabba
@ 2022-09-02 10:19       ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-09-02 10:19 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Marc Zyngier, Will Deacon

On Wed, Aug 31, 2022 at 10:12:12AM +0100, Fuad Tabba wrote:
> > > Moreover, something which was discussed here before [3], is the
> > > ability to share in-place. For pKVM/arm64, the conversion between
> > > shared and private involves only changes to the stage-2 page tables,
> > > which are controlled by the hypervisor. Android supports this in-place
> > > conversion already, and I think that the cost of copying for many
> > > use-cases that would involve large amounts of data would be big. We
> > > will measure the relative costs in due course, but in the meantime
> > > we’re nervous about adopting a new user ABI which doesn’t appear to
> > > cater for in-place conversion; having just the fd would simplify that
> > > somewhat
> >
> > I understand it is difficult to achieve that with the current
> > private_fd + userspace_addr (they are basically two separate fds), but is
> > it possible for pKVM to extend this? Brainstorming, for example: pKVM could
> > ignore userspace_addr and only use private_fd to cover both shared and
> > private memory, or pKVM could introduce a new KVM memslot flag?
> 
> It's not that there's anything blocking pKVM from doing that. It's
> that the disconnect of using a memory address for the shared memory,
> and a file descriptor for the private memory doesn't really make sense
> for pKVM. I see how it makes sense for TDX and the Intel-specific
> implementation. It just seems that this is baking in an
> implementation-specific aspect as a part of the KVM general api, and
> the worry is that this might have some unintended consequences in the
> future.

It's true this API originates from supporting TDX and probably other
similar confidential computing (CC) technologies. But if we ever get the
chance to make it more general so it covers more usages like pKVM, I would
also like to. The challenge here is that pKVM diverges a lot from the CC
usages: putting both shared and private memory in the same fd complicates
the CC usages. If the two things are different enough, I'm also thinking
implementation-specific may not be that bad.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-31 14:24           ` Kirill A . Shutemov
@ 2022-09-02 10:27             ` Chao Peng
  2022-09-02 12:30               ` Kirill A . Shutemov
  2022-09-08  1:10             ` Kirill A. Shutemov
  1 sibling, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-09-02 10:27 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Hugh Dickins, Kirill A. Shutemov, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > I will try next week to rework it as shim to top of shmem. Does it work
> > > for you?
> > 
> > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > case has no justification to use shmem at all, but doing it that way
> > will help you with some of the infrastructure, and will probably be
> > easiest for KVM to extend to other more relaxed fd cases later.
> 
> Okay, below is my take on the shim approach.
> 
> I don't hate how it turned out. It is easier to understand without the
> callback exchange thing.
> 
> The only caveat is that I had to introduce an external lock to protect
> against a race between lookup and truncate. Otherwise, it looks pretty
> reasonable to me.
> 
> I did very limited testing. And it lacks integration with KVM, but the API
> has not changed substantially, so it should be easy to adopt.

I have integrated this patch with the other KVM patches and verified that the
functionality works well in a TDX environment, with a minor fix below.

> 
> Any comments?
> 

...

> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..1853a90f49ff 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>  
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +		       MFD_INACCESSIBLE)
>  
>  SYSCALL_DEFINE2(memfd_create,
>  		const char __user *, uname,
> @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
>  			return -EINVAL;
>  	}
>  
> +	/* Disallow sealing when MFD_INACCESSIBLE is set. */
> +	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
> +		return -EINVAL;
> +
> +	/* TODO: add hugetlb support */
> +	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
> +		return -EINVAL;
> +
>  	/* length includes terminating zero */
>  	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>  	if (len <= 0)
> @@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
>  		*file_seals &= ~F_SEAL_SEAL;
>  	}
>  
> +	if (flags & MFD_INACCESSIBLE) {
> +		struct file *inaccessible_file;
> +
> +		inaccessible_file = memfd_mkinaccessible(file);
> +		if (IS_ERR(inaccessible_file)) {
> +			error = PTR_ERR(inaccessible_file);
> +			goto err_file;
> +		}

The new file should also be marked O_LARGEFILE, otherwise setting an
initial size greater than 2^31 on the fd will be refused by ftruncate().

+               inaccessible_file->f_flags |= O_LARGEFILE;
+

> +
> +		file = inaccessible_file;
> +	}
> +
>  	fd_install(fd, file);
>  	kfree(name);
>  	return fd;
>  
> +err_file:
> +	fput(file);
>  err_fd:
>  	put_unused_fd(fd);
>  err_name:

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-02 10:27             ` Chao Peng
@ 2022-09-02 12:30               ` Kirill A . Shutemov
  0 siblings, 0 replies; 398+ messages in thread
From: Kirill A . Shutemov @ 2022-09-02 12:30 UTC (permalink / raw)
  To: Chao Peng
  Cc: Hugh Dickins, Kirill A. Shutemov, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Fri, Sep 02, 2022 at 06:27:57PM +0800, Chao Peng wrote:
> > +	if (flags & MFD_INACCESSIBLE) {
> > +		struct file *inaccessible_file;
> > +
> > +		inaccessible_file = memfd_mkinaccessible(file);
> > +		if (IS_ERR(inaccessible_file)) {
> > +			error = PTR_ERR(inaccessible_file);
> > +			goto err_file;
> > +		}
> 
> The new file should also be marked O_LARGEFILE, otherwise setting an
> initial size greater than 2^31 on the fd will be refused by ftruncate().
> 
> +               inaccessible_file->f_flags |= O_LARGEFILE;
> +

Good catch. Thanks.

I will modify memfd_mkinaccessible() to do this.
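
i.e. roughly the following (a sketch of the intended change):

	/* In memfd_mkinaccessible(), once alloc_file_pseudo() has succeeded: */
	file->f_flags |= O_LARGEFILE;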

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2022-08-05 13:28   ` David Hildenbrand
  2022-08-10  9:37     ` Chao Peng
@ 2022-09-07 16:18     ` Kirill A. Shutemov
  1 sibling, 0 replies; 398+ messages in thread
From: Kirill A. Shutemov @ 2022-09-07 16:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song

On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > 
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this flag set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> > 
> > The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> > also impossible for a memfd created with this flag.
> 
> It's kind of weird to have it that way. Why should the user have to
> care? It's the notifier requirement to have that, no?
> 
> Why can't we handle that when registering a notifier? If anything is
> already mapped, fail registering the notifier if the notifier has these
> demands. If registering succeeds, block it internally.
> 
> Or what am I missing? We might not need the memfile set flag semantics
> eventually and would not have to expose such a flag to user space.

Well, with the new shim-based [1] implementation, the approach without a uAPI
flag does not work.

We now have two struct file instances: one is a normal, accessible memfd and
the other is a wrapper around it that hides the memfd from userspace and
filters the allowed operations. If we first created an accessible memfd that
userspace can see, it would be hard to hide it later, because by then
userspace may have multiple fds in different processes that point to the same
struct file.

[1] https://lore.kernel.org/all/20220831142439.65q2gi4g2d2z4ofh@box.shutemov.name

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-31 14:24           ` Kirill A . Shutemov
  2022-09-02 10:27             ` Chao Peng
@ 2022-09-08  1:10             ` Kirill A. Shutemov
  2022-09-13  9:44               ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: Kirill A. Shutemov @ 2022-09-08  1:10 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Hugh Dickins, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > I will try next week to rework it as shim to top of shmem. Does it work
> > > for you?
> > 
> > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > case has no justification to use shmem at all, but doing it that way
> > will help you with some of the infrastructure, and will probably be
> > easiest for KVM to extend to other more relaxed fd cases later.
> 
> Okay, below is my take on the shim approach.
> 
> I don't hate how it turned out. It is easier to understand without the
> callback exchange thing.
> 
> The only caveat is that I had to introduce an external lock to protect
> against a race between lookup and truncate. Otherwise, it looks pretty
> reasonable to me.
> 
> I did very limited testing. And it lacks integration with KVM, but the API
> has not changed substantially, so it should be easy to adopt.
> 
> Any comments?

Updated version below. Nothing major. Some simplification and cleanups.

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..334ddff08377 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H
 
 #include <linux/file.h>
+#include <linux/pfn_t.h>
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+	void (*invalidate)(struct inaccessible_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+	struct list_head list;
+	const struct inaccessible_notifier_ops *ops;
+};
+
+void inaccessible_register_notifier(struct file *file,
+				    struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+		return -EINVAL;
+
+	/* TODO: add hugetlb support */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
 		*file_seals &= ~F_SEAL_SEAL;
 	}
 
+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
+
+		file = inaccessible_file;
+	}
+
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
new file mode 100644
index 000000000000..dc79988a49d0
--- /dev/null
+++ b/mm/memfd_inaccessible.c
@@ -0,0 +1,219 @@
+#include "linux/sbitmap.h"
+#include <linux/memfd.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+
+struct inaccessible_data {
+	struct mutex lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
+				 pgoff_t start, pgoff_t end)
+{
+	struct inaccessible_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static int inaccessible_release(struct inode *inode, struct file *file)
+{
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long inaccessible_fallocate(struct file *file, int mode,
+				   loff_t offset, loff_t len)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) {
+			return -EINVAL;
+		}
+	}
+
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	inaccessible_notifier_invalidate(data, offset, offset + len);
+	return ret;
+}
+
+static const struct file_operations inaccessible_fops = {
+	.release = inaccessible_release,
+	.fallocate = inaccessible_fallocate,
+};
+
+static int inaccessible_getattr(struct user_namespace *mnt_userns,
+				const struct path *path, struct kstat *stat,
+				u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int inaccessible_setattr(struct user_namespace *mnt_userns,
+				struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size)
+			return -EPERM;
+
+		if (!PAGE_ALIGNED(attr->ia_size))
+			return -EINVAL;
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+	return ret;
+}
+
+static const struct inode_operations inaccessible_iops = {
+	.getattr = inaccessible_getattr,
+	.setattr = inaccessible_setattr,
+};
+
+static int inaccessible_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type inaccessible_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "[inaccessible]",
+	.init_fs_context = inaccessible_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *inaccessible_mnt;
+
+static __init int inaccessible_init(void)
+{
+	inaccessible_mnt = kern_mount(&inaccessible_fs);
+	if (IS_ERR(inaccessible_mnt))
+		return PTR_ERR(inaccessible_mnt);
+	return 0;
+}
+fs_initcall(inaccessible_init);
+
+struct file *memfd_mkinaccessible(struct file *memfd)
+{
+	struct inaccessible_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	mutex_init(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &inaccessible_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, inaccessible_mnt,
+				 "[memfd:inaccessible]", O_RDWR,
+				 &inaccessible_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+void inaccessible_register_notifier(struct file *file,
+				    struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_del(&notifier->list);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	SetPageUptodate(page);
+	unlock_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+
+	if (WARN_ON_ONCE(!page))
+		return;
+
+	put_page(page);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
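
From userspace, the semantics of the flag can be sketched like this (an
illustration, not one of the selftests from the series): the fd can be
created and sized, but ordinary access such as mmap() is expected to fail.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U	/* value from the uapi change above */
#endif

int main(void)
{
	int fd = memfd_create("guest-private", MFD_CLOEXEC | MFD_INACCESSIBLE);
	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}

	/* Sizing works (page-aligned; O_LARGEFILE allows > 2GiB)... */
	if (ftruncate(fd, (off_t)1 << 32))
		perror("ftruncate");

	/* ...but ordinary MMU access does not: there is no .mmap, so this fails. */
	if (mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0) == MAP_FAILED)
		perror("mmap (expected to fail)");

	/* The fd would then be handed to KVM as the private memory backing. */
	close(fd);
	return 0;
}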
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-18 13:24   ` Kirill A . Shutemov
  2022-08-19  0:20     ` Sean Christopherson
  2022-08-19  3:00     ` Hugh Dickins
@ 2022-09-09  4:44     ` Andy Lutomirski
  2 siblings, 0 replies; 398+ messages in thread
From: Andy Lutomirski @ 2022-09-09  4:44 UTC (permalink / raw)
  To: Kirill A . Shutemov, Hugh Dickins
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, Gupta, Pankaj

On 8/18/22 06:24, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>> This is the v7 of this series which tries to implement the fd-based KVM
>>> guest private memory.
>>
>> Here at last are my reluctant thoughts on this patchset.
>>
>> fd-based approach for supporting KVM guest private memory: fine.
>>
>> Use or abuse of memfd and shmem.c: mistaken.
>>
>> memfd_create() was an excellent way to put together the initial prototype.
>>
>> But since then, TDX in particular has forced an effort into preventing
>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>>
>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>
>> What use do you have for a filesystem here?  Almost none.
>> IIUC, what you want is an fd through which QEMU can allocate kernel
>> memory, selectively free that memory, and communicate fd+offset+length
>> to KVM.  And perhaps an interface to initialize a little of that memory
>> from a template (presumably copied from a real file on disk somewhere).
>>
>> You don't need shmem.c or a filesystem for that!
>>
>> If your memory could be swapped, that would be enough of a good reason
>> to make use of shmem.c: but it cannot be swapped; and although there
>> are some references in the mailthreads to it perhaps being swappable
>> in future, I get the impression that will not happen soon if ever.
>>
>> If your memory could be migrated, that would be some reason to use
>> filesystem page cache (because page migration happens to understand
>> that type of memory): but it cannot be migrated.
> 
> Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping
> is theoretically possible, but I'm not aware of any plans as of now.
> 
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> 

This thing?

https://cdrdv2.intel.com/v1/dl/getContent/733578

That looks like migration between computers, not between NUMA nodes.  Or 
am I missing something?

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-20  0:27       ` Kirill A. Shutemov
  2022-08-21  5:15         ` Hugh Dickins
@ 2022-09-09  4:48         ` Andy Lutomirski
  2022-09-09 14:32           ` Kirill A . Shutemov
  1 sibling, 1 reply; 398+ messages in thread
From: Andy Lutomirski @ 2022-09-09  4:48 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins
  Cc: Kirill A . Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On 8/19/22 17:27, Kirill A. Shutemov wrote:
> On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
>> On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>>
>>>> If your memory could be swapped, that would be enough of a good reason
>>>> to make use of shmem.c: but it cannot be swapped; and although there
>>>> are some references in the mailthreads to it perhaps being swappable
>>>> in future, I get the impression that will not happen soon if ever.
>>>>
>>>> If your memory could be migrated, that would be some reason to use
>>>> filesystem page cache (because page migration happens to understand
>>>> that type of memory): but it cannot be migrated.
>>>
>>> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
>>> theoretically possible, but I'm not aware of any plans as of now.
>>>
>>> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>>
>> I always forget, migration means different things to different audiences.
>> As an mm person, I was meaning page migration, whereas a virtualization
>> person thinks VM live migration (which that reference appears to be about),
>> a scheduler person task migration, an ornithologist bird migration, etc.
>>
>> But you're an mm person too: you may have cited that reference in the
>> knowledge that TDX 1.5 Live Migration will entail page migration of the
>> kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
> 
> TDX 1.5 brings both.
> 
> In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> 

This seems to be a pretty bad fit for the way that the core mm migrates 
pages.  The core mm unmaps the page, then moves (in software) the 
contents to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE 
doesn't fit into that workflow very well.  I'm not saying it can't be 
done, but it won't just work.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-08-24  9:41             ` Chao Peng
@ 2022-09-09  4:55               ` Andy Lutomirski
  0 siblings, 0 replies; 398+ messages in thread
From: Andy Lutomirski @ 2022-09-09  4:55 UTC (permalink / raw)
  To: Chao Peng, Sean Christopherson
  Cc: David Hildenbrand, Hugh Dickins, Kirill A . Shutemov, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On 8/24/22 02:41, Chao Peng wrote:
> On Tue, Aug 23, 2022 at 04:05:27PM +0000, Sean Christopherson wrote:
>> On Tue, Aug 23, 2022, David Hildenbrand wrote:
>>> On 19.08.22 05:38, Hugh Dickins wrote:
>>>> On Fri, 19 Aug 2022, Sean Christopherson wrote:
>>>>> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
>>>>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>>>>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>>>>>> But since then, TDX in particular has forced an effort into preventing
>>>>>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>>>>>>>
>>>>>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
>>>>>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>>>>
>>>>> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
>>>>> aren't useful for _all_ existing users, but I don't think it's fair to say that
>>>>> they're not useful for _any_ existing users.
>>>>
>>>> Okay, I stand corrected: there exist some users of memfd_create()
>>>> who will also have use for "INACCESSIBLE" memory.
>>>
>>> As raised in reply to the relevant patch, I'm not sure if we really have
>>> to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
>>> requirement of specific memfd_notifer (memfile_notifier) implementations
>>> -- such as TDX that will convert the memory and MCE-kill the machine on
>>> ordinary write access. We might be able to set/enforce this when
>>> registering a notifier internally instead, and fail notifier
>>> registration if a condition isn't met (e.g., existing mmap).
>>>
>>> So I'd be curious, which other users of shmem/memfd would benefit from
>>> (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
>>
>> I agree that there's no need to expose the inaccessible behavior via uAPI.  Making
>> it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
>> would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other
>> flags that get added in the future).
>>
>> AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide
>> any unique functionality.
> 
> That's also what I'm thinking. And I don't see problem immediately if
> user has populated the fd at the binding time. Actually that looks an
> advantage for previously discussed guest payload pre-loading.

I think this gets awkward. Trying to define sensible semantics for what 
happens if a shmem or similar fd gets used as secret guest memory and 
that fd isn't utterly and completely empty can get quite nasty.  For 
example:

If there are already mmaps, then TDX (much more so than SEV) really 
doesn't want to also use it as guest memory.

If there is already data in the fd, then maybe some technologies can use 
this for pre-population, but TDX needs explicit instructions in order to 
get the guest's hash right.


In general, it seems like it will be much more likely to actually work 
well if the user (uAPI) is required to declare to the kernel exactly 
what the fd is for (e.g. TDX secret memory, software-only secret memory, 
etc) before doing anything at all with it other than binding it to KVM.

INACCESSIBLE is a way to achieve this.  Maybe it's not the prettiest in 
the world -- I personally would rather see an explicit request for, say, 
TDX or SEV memory or maybe the memory that works for a particular KVM 
instance instead of something generic like INACCESSIBLE, but this is a 
pretty weak preference.  But I think that just starting with a plain 
memfd is a can of worms.
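
A minimal sketch of the uAPI shape discussed above, for illustration
only: MFD_INACCESSIBLE is the flag proposed in this series (its numeric
value below is an assumption taken from the series' uapi patch), and the
"explicit request" alternative is only described in a comment because no
such ioctl exists.

#define _GNU_SOURCE
#include <sys/mman.h>           /* memfd_create() */

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U        /* assumed value from the series */
#endif

/*
 * Option proposed in this series: the creator declares up front that the
 * memory must never be read/written/mmaped from userspace, and only
 * later binds the fd to a KVM memslot.
 */
int create_inaccessible_memfd(void)
{
        return memfd_create("guest-private", MFD_INACCESSIBLE);
}

/*
 * The alternative preferred above -- an explicit "give me TDX/SEV memory
 * for this KVM instance" request -- would instead be an ioctl on the VM
 * fd; no such ioctl exists in this series, so it is not sketched here.
 */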

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-09  4:48         ` Andy Lutomirski
@ 2022-09-09 14:32           ` Kirill A . Shutemov
  2022-09-09 19:11             ` Andy Lutomirski
  0 siblings, 1 reply; 398+ messages in thread
From: Kirill A . Shutemov @ 2022-09-09 14:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Hugh Dickins, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj

On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
> On 8/19/22 17:27, Kirill A. Shutemov wrote:
> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > If your memory could be swapped, that would be enough of a good reason
> > > > > to make use of shmem.c: but it cannot be swapped; and although there
> > > > > are some references in the mailthreads to it perhaps being swappable
> > > > > in future, I get the impression that will not happen soon if ever.
> > > > > 
> > > > > If your memory could be migrated, that would be some reason to use
> > > > > filesystem page cache (because page migration happens to understand
> > > > > that type of memory): but it cannot be migrated.
> > > > 
> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> > > > theoretically possible, but I'm not aware of any plans as of now.
> > > > 
> > > > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > > 
> > > I always forget, migration means different things to different audiences.
> > > As an mm person, I was meaning page migration, whereas a virtualization
> > > person thinks VM live migration (which that reference appears to be about),
> > > a scheduler person task migration, an ornithologist bird migration, etc.
> > > 
> > > But you're an mm person too: you may have cited that reference in the
> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
> > 
> > TDX 1.5 brings both.
> > 
> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> > 
> 
> This seems to be a pretty bad fit for the way that the core mm migrates
> pages.  The core mm unmaps the page, then moves (in software) the contents
> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
> that workflow very well.  I'm not saying it can't be done, but it won't just
> work.

Hm. From what I see we have all the necessary infrastructure in place.

Unmapping is a NOP for inaccessible pages as they are never mapped, and
we have the mapping->a_ops->migrate_folio() callback that allows
replacing the software copying with whatever is needed, like
TDH.MEM.PAGE.RELOCATE.

What am I missing?
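
A minimal sketch of what such a ->migrate_folio() hook could look like;
restrictedmem_migrate_folio() and tdx_relocate_page() are hypothetical
names, while folio_migrate_mapping()/folio_migrate_flags() are the
in-tree helpers the generic path already uses. Error unwinding and huge
page handling are omitted.

#include <linux/migrate.h>
#include <linux/pagemap.h>

/* Hypothetical wrapper around the TDH.MEM.PAGE.RELOCATE SEAMCALL. */
int tdx_relocate_page(struct page *dst, struct page *src);

static int restrictedmem_migrate_folio(struct address_space *mapping,
                                       struct folio *dst, struct folio *src,
                                       enum migrate_mode mode)
{
        int ret;

        /* Let the generic code move the mapping/xarray bookkeeping... */
        ret = folio_migrate_mapping(mapping, dst, src, 0);
        if (ret != MIGRATEPAGE_SUCCESS)
                return ret;

        /* ...but replace the software copy with the TDX module call. */
        ret = tdx_relocate_page(&dst->page, &src->page);
        if (ret)
                return -EAGAIN;

        folio_migrate_flags(dst, src);
        return MIGRATEPAGE_SUCCESS;
}

static const struct address_space_operations restrictedmem_aops = {
        .migrate_folio  = restrictedmem_migrate_folio,
};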

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (17 preceding siblings ...)
  2022-08-26 15:19 ` Fuad Tabba
@ 2022-09-09 15:35 ` Michael Roth
  18 siblings, 0 replies; 398+ messages in thread
From: Michael Roth @ 2022-09-09 15:35 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, mhocko,
	Muchun Song

On Wed, Jul 06, 2022 at 04:20:02PM +0800, Chao Peng wrote:
> This is the v7 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
> split_desc_cache only by default capacity
> 
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace punch hole on the fd
> to notify KVM to unmap secondary MMU page table entries.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 

Hi everyone,

Just wanted to let you all know that I reserved a slot at the LPC
Confidential Computing Microconference to discuss some topics related
to unmapped/inaccessible private memory support:

  "Unmapped Private Memory for Confidential Guests"
  Tuesday, Sep 13th, 10:00am (Dublin time)
  https://lpc.events/event/16/sessions/133/#20220913

The discussion agenda is still a bit in flux, but one topic I really
wanted to cover is how we intend to deal with the kernel direct map
for TDX/SNP, where there is a need to either remove or split mappings
so that KVM or other kernel threads writing to non-private pages
don't run into issues due to mappings overlapping with private pages.[1]

Other possible discussion topics:

  - guarding against shared->private conversions while KVM is
    attempting to access a shared page (separate PFN pools for
    shared/private seems to resolve this nicely, but may not be
    compatible with things like pKVM where the underlying PFN
    is the same for shared/private)[2]

  - extending KVM_EXIT_MEMORY_FAULT to handle batched requests to
    better handle things like explicit batched conversions initiated
    by the guest

It's a short session so not sure how much time we'll actually have
to discuss things in detail, but maybe this can at least be a good
jumping off point for other discussions.

Thanks, and hope to see you there!

[1] https://lore.kernel.org/all/YWb8WG6Ravbs1nbx@google.com/
[2] https://lore.kernel.org/lkml/CA+EHjTy6NF=BkCqK0vhXLdtKZMahp55JUMSfxN96-NT3YiMXYQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-09 14:32           ` Kirill A . Shutemov
@ 2022-09-09 19:11             ` Andy Lutomirski
  2022-09-09 23:02               ` Kirill A . Shutemov
  0 siblings, 1 reply; 398+ messages in thread
From: Andy Lutomirski @ 2022-09-09 19:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A . Shutemov, Hugh Dickins, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Nakajima, Jun,
	Dave Hansen, Andi Kleen, David Hildenbrand, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, Michal Hocko,
	Muchun Song, Gupta, Pankaj



On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
> On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
>> On 8/19/22 17:27, Kirill A. Shutemov wrote:
>> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
>> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
>> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>> > > > > 
>> > > > > If your memory could be swapped, that would be enough of a good reason
>> > > > > to make use of shmem.c: but it cannot be swapped; and although there
>> > > > > are some references in the mailthreads to it perhaps being swappable
>> > > > > in future, I get the impression that will not happen soon if ever.
>> > > > > 
>> > > > > If your memory could be migrated, that would be some reason to use
>> > > > > filesystem page cache (because page migration happens to understand
>> > > > > that type of memory): but it cannot be migrated.
>> > > > 
>> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
>> > > > theoretically possible, but I'm not aware of any plans as of now.
>> > > > 
>> > > > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>> > > 
>> > > I always forget, migration means different things to different audiences.
>> > > As an mm person, I was meaning page migration, whereas a virtualization
>> > > person thinks VM live migration (which that reference appears to be about),
>> > > a scheduler person task migration, an ornithologist bird migration, etc.
>> > > 
>> > > But you're an mm person too: you may have cited that reference in the
>> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
>> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
>> > 
>> > TDX 1.5 brings both.
>> > 
>> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
>> > 
>> 
>> This seems to be a pretty bad fit for the way that the core mm migrates
>> pages.  The core mm unmaps the page, then moves (in software) the contents
>> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
>> that workflow very well.  I'm not saying it can't be done, but it won't just
>> work.
>
> Hm. From what I see we have all necessary infrastructure in place.
>
> Unmaping is NOP for inaccessible pages as it is never mapped and we have
> mapping->a_ops->migrate_folio() callback that allows to replace software
> copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
>
> What do I miss?

Hmm, maybe this isn't as bad as I thought.

Right now, unless I've missed something, the migration workflow is to
unmap (via try_to_migrate) all mappings, then migrate the backing store
(with ->migrate_folio(), although it seems like most callers expect the
actual copy to happen outside of ->migrate_folio()), and then make new
mappings.  With the *current* (vma-based, not fd-based) model for KVM
memory, this won't work -- we can't unmap before calling
TDH.MEM.PAGE.RELOCATE.

But maybe it's actually okay with some care or maybe mild modifications
with the fd-based model.  We don't have any mmaps, per se, to unmap for
secret / INACCESSIBLE memory.  So maybe we can get all the way to
->migrate_folio() without zapping anything in the secure EPT and just
call TDH.MEM.PAGE.RELOCATE from inside migrate_folio().  And there will
be nothing to fault back in.  From the core code's perspective, it's
like migrating a memfd that doesn't happen to have any mappings at the
time.

--Andy

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-09 19:11             ` Andy Lutomirski
@ 2022-09-09 23:02               ` Kirill A . Shutemov
  0 siblings, 0 replies; 398+ messages in thread
From: Kirill A . Shutemov @ 2022-09-09 23:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Hugh Dickins, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, the arch/x86 maintainers,
	H. Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Nakajima, Jun,
	Dave Hansen, Andi Kleen, David Hildenbrand, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, Michal Hocko,
	Muchun Song, Gupta, Pankaj

On Fri, Sep 09, 2022 at 12:11:05PM -0700, Andy Lutomirski wrote:
> 
> 
> On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
> > On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
> >> On 8/19/22 17:27, Kirill A. Shutemov wrote:
> >> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
> >> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> >> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> >> > > > > 
> >> > > > > If your memory could be swapped, that would be enough of a good reason
> >> > > > > to make use of shmem.c: but it cannot be swapped; and although there
> >> > > > > are some references in the mailthreads to it perhaps being swappable
> >> > > > > in future, I get the impression that will not happen soon if ever.
> >> > > > > 
> >> > > > > If your memory could be migrated, that would be some reason to use
> >> > > > > filesystem page cache (because page migration happens to understand
> >> > > > > that type of memory): but it cannot be migrated.
> >> > > > 
> >> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> >> > > > theoretically possible, but I'm not aware of any plans as of now.
> >> > > > 
> >> > > > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> >> > > 
> >> > > I always forget, migration means different things to different audiences.
> >> > > As an mm person, I was meaning page migration, whereas a virtualization
> >> > > person thinks VM live migration (which that reference appears to be about),
> >> > > a scheduler person task migration, an ornithologist bird migration, etc.
> >> > > 
> >> > > But you're an mm person too: you may have cited that reference in the
> >> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
> >> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
> >> > 
> >> > TDX 1.5 brings both.
> >> > 
> >> > In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
> >> > 
> >> 
> >> This seems to be a pretty bad fit for the way that the core mm migrates
> >> pages.  The core mm unmaps the page, then moves (in software) the contents
> >> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
> >> that workflow very well.  I'm not saying it can't be done, but it won't just
> >> work.
> >
> > Hm. From what I see we have all necessary infrastructure in place.
> >
> > Unmaping is NOP for inaccessible pages as it is never mapped and we have
> > mapping->a_ops->migrate_folio() callback that allows to replace software
> > copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
> >
> > What do I miss?
> 
> Hmm, maybe this isn't as bad as I thought.
> 
> Right now, unless I've missed something, the migration workflow is to
> unmap (via try_to_migrate) all mappings, then migrate the backing store
> (with ->migrate_folio(), although it seems like most callers expect the
> actual copy to happen outside of ->migrate_folio(),

Most? I guess you are talking about MIGRATE_SYNC_NO_COPY, right? AFAICS,
it is an HMM thing and not a common thing.

> and then make new
> mappings.  With the *current* (vma-based, not fd-based) model for KVM
> memory, this won't work -- we can't unmap before calling
> TDH.MEM.PAGE.RELOCATE.

We don't need to unmap. The page is not mapped from the core-mm PoV.

> But maybe it's actually okay with some care or maybe mild modifications
> with the fd-based model.  We don't have any mmaps, per se, to unmap for
> secret / INACCESSIBLE memory.  So maybe we can get all the way to
> ->migrate_folio() without zapping anything in the secure EPT and just
> call TDH-MEM.PAGE.RELOCATE from inside migrate_folio().  And there will
> be nothing to fault back in.  From the core code's perspective, it's
> like migrating a memfd that doesn't happen to have my mappings at the
> time.

Modifications are needed if we want to initiate migration from userspace.
IIRC, we don't have any API that can initiate page migration for file
ranges without mapping the file.

But the kernel can do it fine for its own housekeeping: compaction, for
instance, doesn't need any VMA. And we need compaction working for the
long-term stability of the system.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-08  1:10             ` Kirill A. Shutemov
@ 2022-09-13  9:44               ` Sean Christopherson
  2022-09-13 13:28                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-09-13  9:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A . Shutemov, Hugh Dickins, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > I will try next week to rework it as shim to top of shmem. Does it work
> > > > for you?
> > > 
> > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > case has no justification to use shmem at all, but doing it that way
> > > will help you with some of the infrastructure, and will probably be
> > > easiest for KVM to extend to other more relaxed fd cases later.
> > 
> > Okay, below is my take on the shim approach.
> > 
> > I don't hate how it turned out. It is easier to understand without
> > callback exchange thing.
> > 
> > The only caveat is I had to introduce external lock to protect against
> > race between lookup and truncate.

As before, I think this lock is unnecessary.  Or at least it's unnecessary to hold
the lock across get/put.  The ->invalidate() call will ensure that the pfn is
never actually used if get() races with truncation.

Switching topics, what actually prevents mmap() on the shim?  I tried to follow,
but I don't know these areas well enough.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-13  9:44               ` Sean Christopherson
@ 2022-09-13 13:28                 ` Kirill A. Shutemov
  2022-09-13 14:53                   ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Kirill A. Shutemov @ 2022-09-13 13:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A . Shutemov, Hugh Dickins, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Tue, Sep 13, 2022 at 09:44:27AM +0000, Sean Christopherson wrote:
> On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> > On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > > I will try next week to rework it as shim to top of shmem. Does it work
> > > > > for you?
> > > > 
> > > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > > case has no justification to use shmem at all, but doing it that way
> > > > will help you with some of the infrastructure, and will probably be
> > > > easiest for KVM to extend to other more relaxed fd cases later.
> > > 
> > > Okay, below is my take on the shim approach.
> > > 
> > > I don't hate how it turned out. It is easier to understand without
> > > callback exchange thing.
> > > 
> > > The only caveat is I had to introduce external lock to protect against
> > > race between lookup and truncate.
> 
> As before, I think this lock is unnecessary.  Or at least it's unnessary to hold
> the lock across get/put.  The ->invalidate() call will ensure that the pfn is
> never actually used if get() races with truncation.

The updated version you are replying to does not use the lock to protect
against truncation anymore. The lock protects the notifier list.

> Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> but I don't know these areas well enough.

It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
(I did not read the switch statement correctly at first. Note that there
are two 'fallthrough' statements there.)

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-13 13:28                 ` Kirill A. Shutemov
@ 2022-09-13 14:53                   ` Sean Christopherson
  2022-09-13 16:00                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2022-09-13 14:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A . Shutemov, Hugh Dickins, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2022 at 09:44:27AM +0000, Sean Christopherson wrote:
> > On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> > > On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > > > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > > > I will try next week to rework it as shim to top of shmem. Does it work
> > > > > > for you?
> > > > > 
> > > > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > > > case has no justification to use shmem at all, but doing it that way
> > > > > will help you with some of the infrastructure, and will probably be
> > > > > easiest for KVM to extend to other more relaxed fd cases later.
> > > > 
> > > > Okay, below is my take on the shim approach.
> > > > 
> > > > I don't hate how it turned out. It is easier to understand without
> > > > callback exchange thing.
> > > > 
> > > > The only caveat is I had to introduce external lock to protect against
> > > > race between lookup and truncate.
> > 
> > As before, I think this lock is unnecessary.  Or at least it's unnessary to hold
> > the lock across get/put.  The ->invalidate() call will ensure that the pfn is
> > never actually used if get() races with truncation.
> 
> The updated version you replying to does not use the lock to protect
> against truncation anymore. The lock protect notifier list.

Gah, grabbed the patch when applying.

> > Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> > but I don't know these areas well enough.
> 
> It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> (I did not read the switch statement correctly at first. Note there are
> two 'fallthrough' there.)

Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
couldn't find the actual check.

Thanks much!

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-13 14:53                   ` Sean Christopherson
@ 2022-09-13 16:00                     ` Kirill A. Shutemov
  2022-09-13 16:12                       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Kirill A. Shutemov @ 2022-09-13 16:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A . Shutemov, Hugh Dickins, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Tue, Sep 13, 2022 at 02:53:25PM +0000, Sean Christopherson wrote:
> > > Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> > > but I don't know these areas well enough.
> > 
> > It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> > (I did not read the switch statement correctly at first. Note there are
> > two 'fallthrough' there.)
> 
> Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
> couldn't find the actual check.

validate_mmap_request() is in mm/nommu.c which is not relevant for real
computers.

I was talking about this check:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c#n1495
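
Abridged and paraphrased, the control flow around that check looks
roughly like the sketch below (see the link above for the authoritative
code); with no ->mmap in f_op, the MAP_PRIVATE leg returns -ENODEV, and
MAP_SHARED/MAP_SHARED_VALIDATE reach the same check via the two
fallthroughs.

#include <linux/fs.h>
#include <linux/mman.h>

static int file_mmap_checks_sketch(struct file *file, unsigned long flags)
{
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
                /* ... legacy flag filtering ... */
                fallthrough;
        case MAP_SHARED_VALIDATE:
                /* ... write/append-only permission checks ... */
                fallthrough;
        case MAP_PRIVATE:
                /* ... read/noexec checks ... */
                if (!file->f_op->mmap)
                        return -ENODEV; /* restrictedmem ends up here */
                break;
        default:
                return -EINVAL;
        }
        return 0;
}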

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-09-13 16:00                     ` Kirill A. Shutemov
@ 2022-09-13 16:12                       ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2022-09-13 16:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A . Shutemov, Hugh Dickins, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Gupta, Pankaj,
	Elena Reshetova

On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2022 at 02:53:25PM +0000, Sean Christopherson wrote:
> > > > Switching topics, what actually prevents mmapp() on the shim?  I tried to follow,
> > > > but I don't know these areas well enough.
> > > 
> > > It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> > > (I did not read the switch statement correctly at first. Note there are
> > > two 'fallthrough' there.)
> > 
> > Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
> > couldn't find the actual check.
> 
> validate_mmap_request() is in mm/nommu.c which is not relevant for real
> computers.
> 
> I was talking about this check:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c#n1495

Hence the comment about 'fallthrough'.  Thanks again!

^ permalink raw reply	[flat|nested] 398+ messages in thread

* [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
@ 2022-12-02  6:13 Chao Peng
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                   ` (11 more replies)
  0 siblings, 12 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can happen which can further
crash the running host system; this is terrible for multi-tenant
configurations. The host accesses include those from KVM userspace like
QEMU. This series addresses KVM userspace-induced crashes by introducing
new mm and KVM interfaces so KVM userspace can still manage guest memory
via a fd-based approach, but it can never access the guest memory
content.

The patch series touches both core mm and KVM code. I would appreciate it
if Andrew/Hugh and Paolo/Sean could review and pick up these patches. Any
other reviews are always welcome.
  - 01: mm change, target for mm tree
  - 02-09: KVM change, target for KVM tree

Given KVM is the only current user of the mm part, I have chatted with
Paolo and he is OK with merging the mm change through the KVM tree, but
reviewed-by/acked-by is still expected from the mm people.

The patches have been verified in an Intel TDX environment, but Vishal
has done excellent work on the selftests[4], which are dedicated to this
series and make it possible to test it without the special hardware and
fancy steps of building a TDX-capable VM environment. See the Test
section below for more info.


Introduction
============
KVM userspace being able to crash the host is horrible. Under the current
KVM architecture, all guest memory is inherently accessible from KVM
userspace and is exposed to the mentioned crash issue. The goal of this
series is to provide a solution that aligns mm and KVM on a
userspace-inaccessible approach to exposing guest memory.

Normally, KVM populates the secondary page table (e.g. EPT) by using a
host virtual address (hva) from the core mm page table (e.g. the x86
userspace page table). This requires guest memory to be mmaped into KVM
userspace, but this is also where the mentioned crash issue can happen.
In theory, apart from the 'shared' memory used for device emulation etc.,
guest memory doesn't have to be mmaped into KVM userspace.

This series introduces fd-based guest memory which will not be mmaped
into KVM userspace. KVM populates the secondary page table by using an
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
directly interact with it through a newly introduced in-kernel interface,
therefore removing KVM userspace from the path of accessing/mmaping
the guest memory.

Kirill had a patch [2] to address the same issue in a different way. It
tracks guest encrypted memory at the 'struct page' level and relies on
HWPOISON to reject userspace access. The patch has been discussed in
several online and offline threads and resulted in a design document [3],
which is also the original proposal for this series. Later this patch
series evolved as more comments were received in the community, but the
major concepts in [3] still hold true, so it is recommended reading.

The patch series may also be useful for other usages; for example, a pure
software approach may use it to harden itself against unintentional
access to guest memory. This series is designed with these usages in
mind but doesn't have code to directly support them, and extensions might
be needed.


mm change
=========
Introduces a new memfd_restricted system call which can create a memory
file that is restricted from userspace access via normal MMU operations
like read(), write() or mmap() etc. The only way to use it is to pass it
to a third kernel module like KVM and rely on that module to access the
fd through the newly added restrictedmem kernel interface. The
restrictedmem interface bridges the memory file subsystems
(tmpfs/hugetlbfs etc.) and their users (KVM in this case) and provides
bi-directional communication between them.
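
A minimal userspace sketch of this flow, under a few assumptions: the
syscall number is the x86 one wired up in patch 1, no libc wrapper
exists so syscall(2) is used directly, and allocating backing pages with
plain fallocate() is permitted as described in patch 1.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <stdio.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451       /* from syscall_64.tbl in patch 1 */
#endif

int main(void)
{
        /* No flags are defined yet in this series, so pass 0. */
        int fd = syscall(__NR_memfd_restricted, 0);

        if (fd < 0) {
                perror("memfd_restricted");
                return 1;
        }

        /* Backing pages can be allocated with fallocate()... */
        if (fallocate(fd, 0, 0, 2UL << 20))
                perror("fallocate");

        /*
         * ...but read()/write()/mmap() on this fd are expected to fail;
         * the fd is only meant to be handed to KVM, which reaches the
         * pages through the in-kernel restrictedmem interface.
         */
        printf("restricted memfd: %d\n", fd);
        return 0;
}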


KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory from
a fd. With this extension, a single memslot can maintain both private
memory through a private fd (restricted_fd/restricted_offset) and shared
(unencrypted) memory through a userspace mmaped host virtual address
(userspace_addr). For a particular guest page, the corresponding page in
the KVM memslot can only be either private or shared, and only one of the
shared/private parts of the memslot is visible to the guest. For how this
new extension is used in QEMU, please refer to kvm_set_phys_mem() in the
TDX-enabled QEMU repo below.

Introduces a new KVM_EXIT_MEMORY_FAULT exit to give userspace the chance
to make decisions on shared <-> private memory conversion. The exit can
be triggered by an implicit conversion in the KVM page fault handler or
by an explicit conversion from the guest OS.

Introduces a new KVM ioctl KVM_SET_MEMORY_ATTRIBUTES to maintain whether
a page is private or shared. This ioctl allows userspace to convert a
page between private <-> shared. The maintained data records whether a
guest page is private or shared, and this information will be used in
the KVM page fault handler to decide whether the private or the shared
part of the memslot is visible to the guest.
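
For a rough picture of the VMM-side flow these sections describe, the
sketch below uses a hypothetical, simplified struct; the real uAPI
layout and names live in the patches and may differ.

#include <linux/types.h>

/* Hypothetical, simplified view of the extended memslot described above. */
struct memslot_ext_sketch {
        __u32 slot;
        __u32 flags;            /* would carry KVM_MEM_PRIVATE */
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;   /* shared part: normal mmap()ed memory */
        __u32 restricted_fd;    /* private part: fd from memfd_restricted() */
        __u64 restricted_offset;
};

/*
 * Flow, per the cover letter:
 *  1. create the private backing with memfd_restricted()
 *  2. register one memslot carrying both userspace_addr (shared) and
 *     restricted_fd/restricted_offset (private)
 *  3. run the vCPU; on KVM_EXIT_MEMORY_FAULT decide whether the faulting
 *     range should be private or shared
 *  4. flip the range with KVM_SET_MEMORY_ATTRIBUTES so the page fault
 *     handler exposes the right part of the memslot to the guest
 */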


Test
====
Ran two kinds of tests:
  - Selftests [4] from Vishal and VM boot tests in non-TDX environment
    Code also in below repo: https://github.com/chao-p/linux/tree/privmem-v10

  - Functional tests in TDX capable environment
    Tested the new functionalities in TDX environment. Code repos:
    Linux: https://github.com/chao-p/linux/tree/privmem-v10-tdx
    QEMU: https://github.com/chao-p/qemu/tree/privmem-v10

    An example QEMU command line for TDX test:
    -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
    -machine confidential-guest-support=tdx \
    -object memory-backend-memfd-private,id=ram1,size=${mem} \
    -machine memory-backend=ram1


TODO
====
  - Page accounting and limiting for encrypted memory
  - hugetlbfs support


Changelog
=========
v10:
  - mm: hook up restricted_memfd to memory failure and route it to
    kernel users through .error() callback.
  - mm: call invalidate() notifier only for FALLOC_FL_PUNCH_HOLE, i.e.
    not for allocation.
  - KVM: introduce new ioctl KVM_SET_MEMORY_ATTRIBUTES for memory
    conversion instead of reusing KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
  - KVM: refine gfn-based mmu_notifier_retry() mechanism.
  - KVM: improve lpage_info updating code.
  - KVM: fix the bug in private memory handling that a private fault may
    fall into a non-private memslot.
  - KVM: handle memory machine check error for fd-based memory.
v9:
  - mm: move inaccessible memfd into separated syscall.
  - mm: return page instead of pfn_t for inaccessible_get_pfn and remove
    inaccessible_put_pfn.
  - KVM: rename inaccessible/private to restricted and CONFIG change to
    make the code friendly to pKVM.
  - KVM: add invalidate_begin/end pair to fix race contention and revise
    the lock protection for invalidation path.
  - KVM: optimize setting lpage_info for > 2M level by direct accessing
    lower level's result.
  - KVM: avoid load xarray in kvm_mmu_max_mapping_level() and instead let
    the caller to pass in is_private.
  - KVM: API doc improvement.
v8:
  - mm: redesign mm part by introducing a shim layer(inaccessible_memfd)
    in memfd to avoid touch the memory file systems directly.
  - mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
    cause confusion in this series, will send out separately.
  - doc: exclude the man page change, it's not kernel patch and will
    send out separately.
  - KVM: adapt to use the new mm inaccessible_memfd interface.
  - KVM: update lpage_info when setting mem_attr_array to support
    large page.
  - KVM: change from xa_store_range to xa_store for mem_attr_array due
    to xa_store_range overrides all entries which is not intended
    behavior for us.
  - KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
  - KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
    handling code suggested by Sean.
v7:
  - mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
  - KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
    private/shared info.
  - KVM: use similar sync mechanism between zap/page fault paths as
    mmu_notifier for memfile_notifier based invalidation.
v6:
  - mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
    feature consistence among all memfile_notifier users and get rid of
    internal flags like SHM_F_INACCESSIBLE.
  - mm: make pfn_ops callbacks being members of memfile_backing_store
    and then refer to it directly in memfile_notifier.
  - mm: remove backing store unregister.
  - mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
  - KVM: reorganize patch sequence for page fault handling and private
    memory enabling.
v5:
  - Add man page for MFD_INACCESSIBLE flag and improve KVM API do for
    the new memslot extensions.
  - mm: introduce memfile_{un}register_backing_store to allow memory
    backing store to register/unregister it from memfile_notifier.
  - mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
    (SHM_F_INACCESSIBLE for shmem) instead. 
  - mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
    MFD_INACCESSIBLE memory.
  - KVM: remove the overlap check for mapping the same file+offset into
    multiple gfns due to perf consideration, warned in document.
v4:
  - mm: rename memfd_ops to memfile_notifier and separate it from
    memfd.c to standalone memfile-notifier.c.
  - KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
    registering multiple memslots to the same memory backing store.
  - KVM: add a 'kvm' reference in memslot so that we can recover kvm in
    memfile_notifier handlers.
  - KVM: add 'private_' prefix for the new fields in memslot.
  - KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
  - Remove 'RFC' prefix.
  - Fix race condition between memfile_notifier handlers and kvm destroy.
  - mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
    setting F_SEAL_INACCESSIBLE when the fd is created.
  - KVM: add the shared part of the memslot back to make private/shared
    pages live in one memslot.

Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com/T/ 
[3] Original design proposal:
https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com/  
[4] Selftest:
https://lore.kernel.org/all/20221111014244.1714148-1-vannapurve@google.com/


Chao Peng (8):
  KVM: Introduce per-page memory attributes
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Unmap existing mappings when change the memory attributes
  KVM: Update lpage info when private/shared memory are mixed
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm: Introduce memfd_restricted system call to create restricted user
    memory

 Documentation/virt/kvm/api.rst         | 125 ++++++-
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/asm/kvm_host.h        |   9 +
 arch/x86/kvm/Kconfig                   |   3 +
 arch/x86/kvm/mmu/mmu.c                 | 205 ++++++++++-
 arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
 arch/x86/kvm/mmu/mmutrace.h            |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c             |   2 +-
 arch/x86/kvm/x86.c                     |  17 +-
 include/linux/kvm_host.h               | 103 +++++-
 include/linux/restrictedmem.h          |  71 ++++
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/kvm.h               |  53 +++
 include/uapi/linux/magic.h             |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Kconfig                             |   4 +
 mm/Makefile                            |   1 +
 mm/memory-failure.c                    |   3 +
 mm/restrictedmem.c                     | 318 +++++++++++++++++
 virt/kvm/Kconfig                       |   6 +
 virt/kvm/kvm_main.c                    | 469 +++++++++++++++++++++----
 23 files changed, 1323 insertions(+), 93 deletions(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c


base-commit: df0bb47baa95aad133820b149851d5b94cbc6790
-- 
2.25.1


^ permalink raw reply	[flat|nested] 398+ messages in thread

* [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-06 14:57   ` Fuad Tabba
                     ` (6 more replies)
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
                   ` (10 subsequent siblings)
  11 siblings, 7 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Introduce 'memfd_restricted' system call with the ability to create
memory areas that are restricted from userspace access through ordinary
MMU operations (e.g. read/write/mmap). The memory content is expected to
be used through the new in-kernel interface by a third kernel module.

memfd_restricted() is useful for scenarios where a file descriptor (fd)
can be used as an interface into mm but userspace's ability to operate
on the fd needs to be restricted. Initially it is designed to provide
protections for KVM encrypted guest memory.

Normally KVM uses memfd memory by mmapping the memfd into KVM userspace
(e.g. QEMU) and then using the mmaped virtual address to set up the
mapping in the KVM secondary page table (e.g. EPT). With confidential
computing technologies like Intel TDX, the memfd memory may be encrypted
with a special key for a specific software domain (e.g. a KVM guest) and
is not expected to be directly accessed by userspace. More precisely,
userspace access to such encrypted memory may lead to a host crash, so
it should be prevented.

memfd_restricted() provides the semantics required for KVM guest
encrypted memory support: a fd created with memfd_restricted() is going
to be used as the source of guest memory in a confidential computing
environment, and KVM can directly interact with core-mm without the need
to expose the memory content into KVM userspace.

KVM userspace is still in charge of the lifecycle of the fd. It should
pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
obtain the physical memory page and then uses it to populate the KVM
secondary page table entries.

The userspace restricted memfd can be fallocate-ed or hole-punched
from userspace. When hole-punched, KVM gets notified through the
invalidate_start/invalidate_end() callbacks and then gets a chance to
remove any mapped entries of the range in the secondary page tables.

A machine check can happen for memory pages in the restricted memfd;
instead of routing it directly to userspace, we call the error()
callback that KVM registered. KVM then gets a chance to handle it
correctly.
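
A minimal sketch of how a kernel consumer (a stand-in for KVM) might use
this interface: the demo_* names are hypothetical, the restrictedmem_*
calls are the ones declared in the new header below, and dropping the
page reference with put_page() after installing the mapping is an
assumption.

#include <linux/mm.h>
#include <linux/restrictedmem.h>

static void demo_invalidate_start(struct restrictedmem_notifier *notifier,
                                  pgoff_t start, pgoff_t end)
{
        /* KVM would zap secondary page table entries for [start, end) here. */
}

static void demo_invalidate_end(struct restrictedmem_notifier *notifier,
                                pgoff_t start, pgoff_t end)
{
}

static void demo_error(struct restrictedmem_notifier *notifier,
                       pgoff_t start, pgoff_t end)
{
        /* KVM would mark the range as poisoned and inform the guest. */
}

static const struct restrictedmem_notifier_ops demo_notifier_ops = {
        .invalidate_start       = demo_invalidate_start,
        .invalidate_end         = demo_invalidate_end,
        .error                  = demo_error,
};

static struct restrictedmem_notifier demo_notifier = {
        .ops = &demo_notifier_ops,
};

static int demo_map_one_page(struct file *restricted_file, pgoff_t index)
{
        struct page *page;
        int order, ret;

        restrictedmem_register_notifier(restricted_file, &demo_notifier);

        ret = restrictedmem_get_page(restricted_file, index, &page, &order);
        if (ret)
                return ret;

        /* ... install page_to_pfn(page) into the secondary page table ... */

        put_page(page);
        return 0;
}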

memfd_restricted() itself is implemented as a shim layer on top of real
memory file systems (currently tmpfs). Pages in restrictedmem are marked
as unmovable and unevictable; this is required for the current
confidential usage, but in the future this might be changed.

By default memfd_restricted() prevents userspace read, write and mmap.
By defining new bits in 'flags', it can be extended to support other
restricted semantics in the future.

The system call is currently wired up for x86 arch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/restrictedmem.h          |  71 ++++++
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/magic.h             |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Kconfig                             |   4 +
 mm/Makefile                            |   1 +
 mm/memory-failure.c                    |   3 +
 mm/restrictedmem.c                     | 318 +++++++++++++++++++++++++
 11 files changed, 408 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..dc70ba90247e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	i386	memfd_restricted	sys_memfd_restricted
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..06516abc8318 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	memfd_restricted	sys_memfd_restricted
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
new file mode 100644
index 000000000000..c2700c5daa43
--- /dev/null
+++ b/include/linux/restrictedmem.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_RESTRICTEDMEM_H
+
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/pfn_t.h>
+
+struct restrictedmem_notifier;
+
+struct restrictedmem_notifier_ops {
+	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
+				 pgoff_t start, pgoff_t end);
+	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
+			       pgoff_t start, pgoff_t end);
+	void (*error)(struct restrictedmem_notifier *notifier,
+			       pgoff_t start, pgoff_t end);
+};
+
+struct restrictedmem_notifier {
+	struct list_head list;
+	const struct restrictedmem_notifier_ops *ops;
+};
+
+#ifdef CONFIG_RESTRICTEDMEM
+
+void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier);
+void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+			   struct page **pagep, int *order);
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
+}
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping);
+
+#else
+
+static inline void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
+					 struct page **pagep, int *order)
+{
+	return -1;
+}
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return false;
+}
+
+static inline void restrictedmem_error_page(struct page *page,
+					    struct address_space *mapping)
+{
+}
+
+#endif /* CONFIG_RESTRICTEDMEM */
+
+#endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..f9e9e0c820c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_memfd_restricted(unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..e93cd35e46d0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_memfd_restricted 451
+__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..8aa38324b90a 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..7c4a32cbd2e7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* memfd_restricted */
+COND_SYSCALL(memfd_restricted);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/mm/Kconfig b/mm/Kconfig
index 57e1d8c5b505..06b0e1d6b8c1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1076,6 +1076,10 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config RESTRICTEDMEM
+	bool
+	depends on TMPFS
+
 config ANON_VMA_NAME
 	bool "Anonymous VMA name support"
 	depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..bcbb0edf9ba1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_SECRETMEM) += secretmem.o
+obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
 obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 145bb561ddb3..f91b444e471e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -62,6 +62,7 @@
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/restrictedmem.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 		goto out;
 	}
 
+	restrictedmem_error_page(p, mapping);
+
 	/*
 	 * The shmem page is kept in page cache instead of truncating
 	 * so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
new file mode 100644
index 000000000000..56953c204e5c
--- /dev/null
+++ b/mm/restrictedmem.c
@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/sbitmap.h"
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <linux/syscalls.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+#include <linux/restrictedmem.h>
+
+struct restrictedmem_data {
+	struct mutex lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
+					   pgoff_t start, pgoff_t end)
+{
+	struct restrictedmem_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate_start(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
+					 pgoff_t start, pgoff_t end)
+{
+	struct restrictedmem_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate_end(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static void restrictedmem_notifier_error(struct restrictedmem_data *data,
+					 pgoff_t start, pgoff_t end)
+{
+	struct restrictedmem_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->error(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static int restrictedmem_release(struct inode *inode, struct file *file)
+{
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
+				     loff_t offset, loff_t len)
+{
+	int ret;
+	pgoff_t start, end;
+	struct file *memfd = data->memfd;
+
+	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+		return -EINVAL;
+
+	start = offset >> PAGE_SHIFT;
+	end = (offset + len) >> PAGE_SHIFT;
+
+	restrictedmem_invalidate_start(data, start, end);
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	restrictedmem_invalidate_end(data, start, end);
+
+	return ret;
+}
+
+static long restrictedmem_fallocate(struct file *file, int mode,
+				    loff_t offset, loff_t len)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE)
+		return restrictedmem_punch_hole(data, mode, offset, len);
+
+	return memfd->f_op->fallocate(memfd, mode, offset, len);
+}
+
+static const struct file_operations restrictedmem_fops = {
+	.release = restrictedmem_release,
+	.fallocate = restrictedmem_fallocate,
+};
+
+static int restrictedmem_getattr(struct user_namespace *mnt_userns,
+				 const struct path *path, struct kstat *stat,
+				 u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int restrictedmem_setattr(struct user_namespace *mnt_userns,
+				 struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size)
+			return -EPERM;
+
+		if (!PAGE_ALIGNED(attr->ia_size))
+			return -EINVAL;
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+	return ret;
+}
+
+static const struct inode_operations restrictedmem_iops = {
+	.getattr = restrictedmem_getattr,
+	.setattr = restrictedmem_setattr,
+};
+
+static int restrictedmem_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type restrictedmem_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "memfd:restrictedmem",
+	.init_fs_context = restrictedmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *restrictedmem_mnt;
+
+static __init int restrictedmem_init(void)
+{
+	restrictedmem_mnt = kern_mount(&restrictedmem_fs);
+	if (IS_ERR(restrictedmem_mnt))
+		return PTR_ERR(restrictedmem_mnt);
+	return 0;
+}
+fs_initcall(restrictedmem_init);
+
+static struct file *restrictedmem_file_create(struct file *memfd)
+{
+	struct restrictedmem_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	mutex_init(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &restrictedmem_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, restrictedmem_mnt,
+				 "restrictedmem", O_RDWR,
+				 &restrictedmem_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+		return ERR_CAST(file);
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	/*
+	 * These pages are currently unmovable so don't place them into movable
+	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
+	 */
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
+{
+	struct file *file, *restricted_file;
+	int fd, err;
+
+	if (flags)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0)
+		return fd;
+
+	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_fd;
+	}
+	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+	file->f_flags |= O_LARGEFILE;
+
+	restricted_file = restrictedmem_file_create(file);
+	if (IS_ERR(restricted_file)) {
+		err = PTR_ERR(restricted_file);
+		fput(file);
+		goto err_fd;
+	}
+
+	fd_install(fd, restricted_file);
+	return fd;
+err_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
+
+void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_del(&notifier->list);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+			   struct page **pagep, int *order)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct folio *folio;
+	struct page *page;
+	int ret;
+
+	ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	page = folio_file_page(folio, offset);
+	*pagep = page;
+	if (order)
+		*order = thp_order(compound_head(page));
+
+	SetPageUptodate(page);
+	unlock_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping)
+{
+	struct super_block *sb = restrictedmem_mnt->mnt_sb;
+	struct inode *inode, *next;
+
+	if (!shmem_mapping(mapping))
+		return;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+		struct restrictedmem_data *data = inode->i_mapping->private_data;
+		struct file *memfd = data->memfd;
+
+		if (memfd->f_mapping == mapping) {
+			pgoff_t start, end;
+
+			spin_unlock(&sb->s_inode_list_lock);
+
+			start = page->index;
+			end = start + thp_nr_pages(page);
+			restrictedmem_notifier_error(data, start, end);
+			return;
+		}
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-06 13:34   ` Fabiano Rosas
                     ` (7 more replies)
  2022-12-02  6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
                   ` (9 subsequent siblings)
  11 siblings, 8 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes for
    a guest memory range.
  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the memory attributes
    supported by KVM.

KVM internally uses an xarray to store the per-page memory attributes.
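
For illustration, a userspace sketch (not part of this patch, assuming
kernel headers that include the uapi additions below) that marks a range
as private could look like:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Illustration only: mark [gpa, gpa+size) private on an open VM fd. */
  static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
  {
          struct kvm_memory_attributes attrs = {
                  .address    = gpa,
                  .size       = size,
                  .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
          };
          __u64 supported;

          if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &supported) < 0 ||
              !(supported & KVM_MEMORY_ATTRIBUTE_PRIVATE))
                  return -1;

          if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
                  return -1;

          /* On return, size holds how much of the range is still unset. */
          return attrs.size == 0 ? 0 : -1;
  }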

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
---
 Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
 arch/x86/kvm/Kconfig           |  1 +
 include/linux/kvm_host.h       |  3 ++
 include/uapi/linux/kvm.h       | 17 ++++++++
 virt/kvm/Kconfig               |  3 ++
 virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
 6 files changed, 163 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5617bc4f899f..bb2f709c0900 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
 The "pad" and "reserved" fields may be used for future extensions and should be
 set to 0s by userspace.
 
+4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: u64 memory attributes bitmask(out)
+:Returns: 0 on success, <0 on error
+
+Returns the supported memory attributes bitmask. Supported memory attributes
+will have their corresponding bits set in the u64 memory attributes bitmask.
+
+The following memory attributes are defined::
+
+  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
+  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
+  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
+4.139 KVM_SET_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in/out)
+:Returns: 0 on success, <0 on error
+
+Sets memory attributes for pages in a guest memory range. Parameters are
+specified via the following structure::
+
+  struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+  };
+
+The user sets the per-page memory attributes for a guest memory range indicated
+by address/size, and in return KVM adjusts address and size to reflect which
+pages of the memory range have actually been set to the attributes. If the
+call returns 0, "address" is updated to the last successful address + 1 and
+"size" is updated to the remaining size that has not been set successfully.
+The user should check the return value as well as the size to decide if the
+operation succeeded for the whole range or not. The user may want to retry the
+operation with the returned address/size if the previous range was only
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can be
+retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
 5. The kvm_run structure
 ========================
 
@@ -8270,6 +8323,16 @@ structure.
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
+8.40 KVM_CAP_MEMORY_ATTRIBUTES
+------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fbeaa9ddef59..a8e379a3afee 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,7 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_MEMORY_ATTRIBUTES
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f874a964313..a784e2b06625 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -800,6 +800,9 @@ struct kvm {
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
+#endif
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	struct xarray mem_attr_array;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
 };
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 64dfe9c07c87..5d0941acb5bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
+#define KVM_CAP_MEMORY_ATTRIBUTES 225
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
+#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
+#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
+
+struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+};
+
+#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
+#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
+#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
+#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..effdea5dd4f0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
 config HAVE_KVM_DIRTY_RING
        bool
 
+config HAVE_KVM_MEMORY_ATTRIBUTES
+       bool
+
 # Only strongly ordered architectures can select this, as it doesn't
 # put any explicit constraint on userspace ordering. They can also
 # select the _ACQ_REL version.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1782c4555d94..7f0f5e9f2406 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	xa_init(&kvm->mem_attr_array);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	xa_destroy(&kvm->mem_attr_array);
+#endif
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_free_vm(kvm);
@@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+	return 0;
+}
+
+static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
+					   struct kvm_memory_attributes *attrs)
+{
+	gfn_t start, end;
+	unsigned long i;
+	void *entry;
+	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
+
+	/* flags is currently not used. */
+	if (attrs->flags)
+		return -EINVAL;
+	if (attrs->attributes & ~supported_attrs)
+		return -EINVAL;
+	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
+		return -EINVAL;
+	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
+		return -EINVAL;
+
+	start = attrs->address >> PAGE_SHIFT;
+	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
+
+	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
+
+	mutex_lock(&kvm->lock);
+	for (i = start; i < end; i++)
+		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+				    GFP_KERNEL_ACCOUNT)))
+			break;
+	mutex_unlock(&kvm->lock);
+
+	attrs->address = i << PAGE_SHIFT;
+	attrs->size = (end - i) << PAGE_SHIFT;
+
+	return 0;
+}
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
+
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
 	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_HAVE_KVM_MSI
 	case KVM_CAP_SIGNAL_MSI:
 #endif
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	case KVM_CAP_MEMORY_ATTRIBUTES:
+#endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	case KVM_CAP_IRQFD:
 	case KVM_CAP_IRQFD_RESAMPLE:
@@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
+		u64 attrs = kvm_supported_mem_attributes(kvm);
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_MEMORY_ATTRIBUTES: {
+		struct kvm_memory_attributes attrs;
+
+		r = -EFAULT;
+		if (copy_from_user(&attrs, argp, sizeof(attrs)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
+
+		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
+			r = -EFAULT;
+		break;
+	}
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-05  9:03   ` Fuad Tabba
                     ` (3 more replies)
  2022-12-02  6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
                   ` (8 subsequent siblings)
  11 siblings, 4 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

In memory encryption usage, guest memory may be encrypted with a special
key and can be accessed only by the guest itself. We call such memory
private memory. It is useless, and can sometimes even cause problems, to
allow userspace to access guest private memory. This new KVM memslot
extension allows guest private memory to be provided through a
restrictedmem-backed file descriptor (fd), and userspace access to the
memory in the fd is restricted.

This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields restricted_fd/restricted_offset to allow
userspace to instruct KVM to provide guest memory through restricted_fd.
'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
and the size is 'memory_size'.

The extended memslot can still have the userspace_addr (hva). When used, a
single memslot can maintain both private memory through restricted_fd
and shared memory through userspace_addr. Whether the private or shared
part is visible to the guest is maintained by other KVM code.

A restrictedmem_notifier field is also added to the memslot structure to
allow the restricted_fd's backing store to notify KVM of memory changes, so
that KVM can then invalidate its page table entries or handle memory errors.

Together with this change, a new config HAVE_KVM_RESTRICTED_MEM is added;
right now it is selected on X86_64 only.

To make future maintenance easy, internally use a binary compatible
alias struct kvm_user_mem_region to handle both the normal and the
'_ext' variants.
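
As an illustration of the intended userspace flow (not part of this
patch, and note that this particular patch still rejects KVM_MEM_PRIVATE
in the ioctl until it is enabled later in the series), a memslot with
both a shared and a private backing might be registered like:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Illustration only; vm_fd, restricted_fd and shared_hva already exist. */
  static int add_private_memslot(int vm_fd, int restricted_fd,
                                 void *shared_hva, __u64 gpa, __u64 size)
  {
          struct kvm_userspace_memory_region_ext region = {
                  .region = {
                          .slot            = 0,
                          .flags           = KVM_MEM_PRIVATE,
                          .guest_phys_addr = gpa,
                          .memory_size     = size,
                          .userspace_addr  = (__u64)(unsigned long)shared_hva,
                  },
                  .restricted_offset = 0,
                  .restricted_fd     = restricted_fd,
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }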

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
---
 Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
 arch/x86/kvm/Kconfig           |  2 ++
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       |  8 ++++--
 include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
 7 files changed, 114 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bb2f709c0900..99352170c130 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The kvm_userspace_memory_region_ext struct includes all fields of the
+kvm_userspace_memory_region struct, and also adds additional fields for some
+other features. See the description of the flags field below for more details.
+It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
+  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
+
+- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
+  read-only. In this case, writes to this memory will be posted to userspace as
+  KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
+  memory backed by a file descriptor(fd) and userspace access to the fd may be
+  restricted. Userspace should use restricted_fd/restricted_offset in the
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
+  to guest. Userspace should guarantee not to map the same host physical address
+  indicated by restricted_fd/restricted_offset to different guest physical
+  addresses within multiple memslots. Failure to do so may result in undefined
+  behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a8e379a3afee..690cb21010e7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,8 @@ config KVM
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select HAVE_KVM_MEMORY_ATTRIBUTES
+	select HAVE_KVM_RESTRICTED_MEM if X86_64
+	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7f850dfb4086..9a07380f8d3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a784e2b06625..02347e386ea2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/restrictedmem.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -585,6 +586,9 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *restricted_file;
+	loff_t restricted_offset;
+	struct restrictedmem_notifier notifier;
 };
 
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
@@ -1123,9 +1127,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5d0941acb5bb..13bff963b8b0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index effdea5dd4f0..d605545d6dd1 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
        bool
+
+config HAVE_KVM_RESTRICTED_MEM
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7f0f5e9f2406..b882eb2c76a2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
+		     sizeof_field(struct kvm_userspace_memory_region, field));	\
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
+do {											\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
+		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
+		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+
+		kvm_sanity_check_user_mem_region_alias();
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if (mem.flags & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (2 preceding siblings ...)
  2022-12-02  6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-06 15:47   ` Fuad Tabba
  2023-01-13 23:13   ` Sean Christopherson
  2022-12-02  6:13 ` [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at the guest memory range
[gpa, gpa+size). The flags field includes additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates it happened due to a shared memory access.

When private memory is enabled, this new exit will be used for KVM to
exit to userspace for shared <-> private memory conversion in memory
encryption usage. In such usage, there are typically two kinds of memory
conversions:
  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared); KVM then exits to userspace
    to perform the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler when KVM
    exits to userspace for an implicit conversion because the page is in a
    different state than requested (private or shared).
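
As a rough illustration (not from this patch), a VMM could react to the
new exit by converting the faulting range via KVM_SET_MEMORY_ATTRIBUTES
and re-entering the vCPU; the exact policy is entirely up to userspace:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Illustration only: convert the faulting range to the requested state. */
  static int handle_memory_fault_exit(int vm_fd, struct kvm_run *run)
  {
          struct kvm_memory_attributes attrs = {
                  .address    = run->memory.gpa,
                  .size       = run->memory.size,
                  .attributes = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                                KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
          };

          /* After a successful conversion the caller re-enters the vCPU. */
          return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
  }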

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
---
 Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  8 ++++++++
 2 files changed, 30 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 99352170c130..d9edb14ce30b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 0)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
+encountered a memory error which is not handled by the KVM kernel module and
+which userspace may choose to handle. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+   private memory access when the bit is set. Otherwise the memory error is
+   caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range the error occurs at. The userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13bff963b8b0..c7e9d375a902 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -541,6 +542,13 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 0)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (3 preceding siblings ...)
  2022-12-02  6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-05  9:23   ` Fuad Tabba
  2022-12-02  6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

Currently, in the mmu_notifier invalidation path, the hva range is
recorded and then checked by mmu_invalidate_retry_hva() in the page
fault handling path. However, for the to-be-introduced private memory, a
page fault may not have an associated hva, so checking the gfn (gpa)
makes more sense.

For existing hva-based shared memory, gfn is expected to work as well.
The only downside is that when multiple gfns alias a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.
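
For reference, a hedged sketch of the consumer pattern (not a literal
excerpt from KVM, and assuming the x86 configuration where mmu_lock is
an rwlock):

  static bool fault_retry_needed(struct kvm *kvm, gfn_t gfn)
  {
          unsigned long mmu_seq = kvm->mmu_invalidate_seq;

          /* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */
          smp_rmb();

          /* ... resolve the pfn for @gfn here, outside mmu_lock ... */

          write_lock(&kvm->mmu_lock);
          if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
                  write_unlock(&kvm->mmu_lock);
                  return true;    /* gfn hit an in-flight invalidation, retry */
          }
          /* ... install the mapping under mmu_lock ... */
          write_unlock(&kvm->mmu_lock);
          return false;
  }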

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c   |  8 +++++---
 include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
 virt/kvm/kvm_main.c      | 32 +++++++++++++++++++++++---------
 3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..e2c70b5afa3e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+	       mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_begin(kvm);
+
+	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
 						   gfn_end - gfn_start);
 
-	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_end(kvm);
 
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02347e386ea2..3d69484d2704 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
-	unsigned long mmu_invalidate_range_start;
-	unsigned long mmu_invalidate_range_end;
+	gfn_t mmu_invalidate_range_start;
+	gfn_t mmu_invalidate_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 					   unsigned long mmu_seq,
-					   unsigned long hva)
+					   gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	 * that might be being invalidated. Note that it may include some false
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
-	if (unlikely(kvm->mmu_invalidate_in_progress) &&
-	    hva >= kvm->mmu_invalidate_range_start &&
-	    hva < kvm->mmu_invalidate_range_end)
-		return 1;
+	if (unlikely(kvm->mmu_invalidate_in_progress)) {
+		/*
+		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
+		 * but before updating the range is a KVM bug.
+		 */
+		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
+				 kvm->mmu_invalidate_range_end == INVALID_GPA))
+			return 1;
+
+		if (gfn >= kvm->mmu_invalidate_range_start &&
+		    gfn < kvm->mmu_invalidate_range_end)
+			return 1;
+	}
+
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b882eb2c76a2..ad55dfbc75d7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
-
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_hva_range {
@@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm);
+
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
@@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_invalidate_in_progress++;
+
+	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+		kvm->mmu_invalidate_range_start = INVALID_GPA;
+		kvm->mmu_invalidate_range_end = INVALID_GPA;
+	}
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
 		kvm->mmu_invalidate_range_start = start;
 		kvm->mmu_invalidate_range_end = end;
@@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	}
 }
 
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
@@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (4 preceding siblings ...)
  2022-12-02  6:13 ` [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-07  8:13   ` Yuan Yao
                     ` (3 more replies)
  2022-12-02  6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
                   ` (5 subsequent siblings)
  11 siblings, 4 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

Unmap the existing guest mappings when the memory attribute is changed
between shared and private. This is needed because shared pages and
private pages come from different backends; unmapping the existing ones
gives the page fault handler a chance to re-populate the mappings
according to the new attribute.

Only architectures with private memory support need this, and a
supporting architecture is expected to override the weak
kvm_arch_has_private_mem().

Also, during the memory attribute change and the unmapping time frame, a
page fault may happen in the same memory range and could install an
incorrect page state; invoke the kvm_mmu_invalidate_* helpers to make the
page fault handler retry during this window.
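
A hedged sketch of such an override (the vm_type field and the
KVM_X86_PROTECTED_VM constant are hypothetical placeholders, not taken
from this series):

  bool kvm_arch_has_private_mem(struct kvm *kvm)
  {
          /* vm_type and KVM_X86_PROTECTED_VM are hypothetical placeholders. */
          return kvm->arch.vm_type == KVM_X86_PROTECTED_VM;
  }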

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |   7 +-
 virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
 2 files changed, 116 insertions(+), 59 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d69484d2704..3331c0c92838 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -264,6 +263,8 @@ struct kvm_gfn_range {
 	bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -785,11 +786,12 @@ struct kvm {
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	struct mmu_notifier mmu_notifier;
+#endif
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
 	gfn_t mmu_invalidate_range_start;
 	gfn_t mmu_invalidate_range_end;
-#endif
+
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
 	struct dentry *debugfs_dentry;
@@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ad55dfbc75d7..4e1e1e113bf0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
 
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
+{
+	/*
+	 * The count increase must become visible at unlock time as no
+	 * spte can be established without taking the mmu_lock and
+	 * count is also read inside the mmu_lock critical section.
+	 */
+	kvm->mmu_invalidate_in_progress++;
+
+	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+		kvm->mmu_invalidate_range_start = INVALID_GPA;
+		kvm->mmu_invalidate_range_end = INVALID_GPA;
+	}
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
+	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+		kvm->mmu_invalidate_range_start = start;
+		kvm->mmu_invalidate_range_end = end;
+	} else {
+		/*
+		 * Fully tracking multiple concurrent ranges has diminishing
+		 * returns. Keep things simple and just find the minimal range
+		 * which includes the current and new ranges. As there won't be
+		 * enough information to subtract a range after its invalidate
+		 * completes, any ranges invalidated concurrently will
+		 * accumulate and persist until all outstanding invalidates
+		 * complete.
+		 */
+		kvm->mmu_invalidate_range_start =
+			min(kvm->mmu_invalidate_range_start, start);
+		kvm->mmu_invalidate_range_end =
+			max(kvm->mmu_invalidate_range_end, end);
+	}
+}
+
+void kvm_mmu_invalidate_end(struct kvm *kvm)
+{
+	/*
+	 * This sequence increase will notify the kvm page fault that
+	 * the page that is going to be mapped in the spte could have
+	 * been freed.
+	 */
+	kvm->mmu_invalidate_seq++;
+	smp_wmb();
+	/*
+	 * The above sequence increase must be visible before the
+	 * below count decrease, which is ensured by the smp_wmb above
+	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
+	 */
+	kvm->mmu_invalidate_in_progress--;
+}
+
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
@@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm)
-{
-	/*
-	 * The count increase must become visible at unlock time as no
-	 * spte can be established without taking the mmu_lock and
-	 * count is also read inside the mmu_lock critical section.
-	 */
-	kvm->mmu_invalidate_in_progress++;
-
-	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
-		kvm->mmu_invalidate_range_start = INVALID_GPA;
-		kvm->mmu_invalidate_range_end = INVALID_GPA;
-	}
-}
-
-void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
-{
-	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
-
-	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
-		kvm->mmu_invalidate_range_start = start;
-		kvm->mmu_invalidate_range_end = end;
-	} else {
-		/*
-		 * Fully tracking multiple concurrent ranges has diminishing
-		 * returns. Keep things simple and just find the minimal range
-		 * which includes the current and new ranges. As there won't be
-		 * enough information to subtract a range after its invalidate
-		 * completes, any ranges invalidated concurrently will
-		 * accumulate and persist until all outstanding invalidates
-		 * complete.
-		 */
-		kvm->mmu_invalidate_range_start =
-			min(kvm->mmu_invalidate_range_start, start);
-		kvm->mmu_invalidate_range_end =
-			max(kvm->mmu_invalidate_range_end, end);
-	}
-}
-
 static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
@@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm)
-{
-	/*
-	 * This sequence increase will notify the kvm page fault that
-	 * the page that is going to be mapped in the spte could have
-	 * been freed.
-	 */
-	kvm->mmu_invalidate_seq++;
-	smp_wmb();
-	/*
-	 * The above sequence increase must be visible before the
-	 * below count decrease, which is ensured by the smp_wmb above
-	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
-	 */
-	kvm->mmu_invalidate_in_progress--;
-}
-
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
 	return 0;
 }
 
+bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
+{
+	return false;
+}
+
 static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 {
 	struct kvm *kvm = kvm_arch_alloc_vm();
@@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 	return 0;
 }
 
+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_gfn_range gfn_range;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	int i;
+	int r = 0;
+
+	gfn_range.pte = __pte(0);
+	gfn_range.may_block = true;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+			slot = iter.slot;
+			gfn_range.start = max(start, slot->base_gfn);
+			gfn_range.end = min(end, slot->base_gfn + slot->npages);
+			if (gfn_range.start >= gfn_range.end)
+				continue;
+			gfn_range.slot = slot;
+
+			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+		}
+	}
+
+	if (r)
+		kvm_flush_remote_tlbs(kvm);
+}
+
 static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 					   struct kvm_memory_attributes *attrs)
 {
 	gfn_t start, end;
 	unsigned long i;
 	void *entry;
+	int idx;
 	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
 
-	/* flags is currently not used. */
+	/* 'flags' is currently not used. */
 	if (attrs->flags)
 		return -EINVAL;
 	if (attrs->attributes & ~supported_attrs)
@@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 
 	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
 
+	if (kvm_arch_has_private_mem(kvm)) {
+		KVM_MMU_LOCK(kvm);
+		kvm_mmu_invalidate_begin(kvm);
+		kvm_mmu_invalidate_range_add(kvm, start, end);
+		KVM_MMU_UNLOCK(kvm);
+	}
+
 	mutex_lock(&kvm->lock);
 	for (i = start; i < end; i++)
 		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 			break;
 	mutex_unlock(&kvm->lock);
 
+	if (kvm_arch_has_private_mem(kvm)) {
+		idx = srcu_read_lock(&kvm->srcu);
+		KVM_MMU_LOCK(kvm);
+		if (i > start)
+			kvm_unmap_mem_range(kvm, start, i);
+		kvm_mmu_invalidate_end(kvm);
+		KVM_MMU_UNLOCK(kvm);
+		srcu_read_unlock(&kvm->srcu, idx);
+	}
+
 	attrs->address = i << PAGE_SHIFT;
 	attrs->size = (end - i) << PAGE_SHIFT;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (5 preceding siblings ...)
  2022-12-02  6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-05 22:49   ` Isaku Yamahata
                     ` (2 more replies)
  2022-12-02  6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
                   ` (4 subsequent siblings)
  11 siblings, 3 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

A large page with mixed private/shared subpages can't be mapped as a
large page since its private and shared subpages come from different
memory backends and may also be treated differently by the architecture.
When private and shared memory are mixed within a large page, the current
lpage_info is not sufficient to decide whether the page can be mapped as
a large page, so additional private/shared 'mixed' information is needed.

Tracking this 'mixed' information with the current count-style
disallow_lpage is a bit of a challenge, so reserve a bit in
'disallow_lpage' to indicate that a large page has mixed private/shared
subpages, and update this 'mixed' bit whenever the memory attributes
change between private and shared.
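
In other words, the low 31 bits of disallow_lpage remain a plain
reference count while bit 31 carries the new 'mixed' state. A minimal
illustration of how the two coexist (not part of the diff):

  int  count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
  bool mixed = linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;

  /* Existing users keep testing the whole field: either reason disallows. */
  bool lpage_disallowed = count || mixed;   /* same as disallow_lpage != 0 */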

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |   8 ++
 arch/x86/kvm/mmu/mmu.c          | 134 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   2 +
 include/linux/kvm_host.h        |  19 +++++
 virt/kvm/kvm_main.c             |   9 ++-
 5 files changed, 169 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 283cbb83d6ae..7772ab37ac89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
 #include <asm/hyperv-tlfs.h>
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
 
 #define KVM_MAX_VCPUS 1024
 
@@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
 #endif
 };
 
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits are used as a reference count.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
+#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
+
 struct kvm_lpage_info {
 	int disallow_lpage;
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2c70b5afa3e..2190fd8c95c0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 {
 	struct kvm_lpage_info *linfo;
 	int i;
+	int disallow_count;
 
 	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 		linfo = lpage_info_slot(gfn, slot, i);
+
+		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+		WARN_ON(disallow_count + count < 0 ||
+			disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
 		linfo->disallow_lpage += count;
-		WARN_ON(linfo->disallow_lpage < 0);
 	}
 }
 
@@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_huge_page_recovery_thread)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
+{
+	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
+			    int level, bool mixed)
+{
+	struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
+
+	if (mixed)
+		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+	else
+		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
+{
+	bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
+		if (!expect_private)
+			return false;
+	} else if (expect_private)
+		return false;
+
+	return true;
+}
+
+static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
+			       gfn_t start, gfn_t end)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	gfn_t gfn = start;
+	void *entry;
+	bool mixed = false;
+
+	rcu_read_lock();
+	entry = xas_load(&xas);
+	while (gfn < end) {
+		if (xas_retry(&xas, entry))
+			continue;
+
+		KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+		if (!is_expected_attr_entry(entry, attrs)) {
+			mixed = true;
+			break;
+		}
+
+		entry = xas_next(&xas);
+		gfn++;
+	}
+
+	rcu_read_unlock();
+	return mixed;
+}
+
+static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    int level, unsigned long attrs,
+			    gfn_t start, gfn_t end)
+{
+	unsigned long gfn;
+
+	if (level == PG_LEVEL_2M)
+		return mem_attrs_mixed_2m(kvm, attrs, start, end);
+
+	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
+		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
+		    !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
+					    attrs))
+			return true;
+	return false;
+}
+
+static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
+						  struct kvm_memory_slot *slot,
+						  unsigned long attrs,
+						  gfn_t start, gfn_t end)
+{
+	unsigned long pages, mask;
+	gfn_t gfn, gfn_end, first, last;
+	int level;
+	bool mixed;
+
+	/*
+	 * The sequence matters here: we set the higher level basing on the
+	 * lower level's scanning result.
+	 */
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		pages = KVM_PAGES_PER_HPAGE(level);
+		mask = ~(pages - 1);
+		first = start & mask;
+		last = (end - 1) & mask;
+
+		/*
+		 * We only need to scan the head and tail page, for middle pages
+		 * we know they will not be mixed.
+		 */
+		gfn = max(first, slot->base_gfn);
+		gfn_end = min(first + pages, slot->base_gfn + slot->npages);
+		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
+		linfo_set_mixed(gfn, slot, level, mixed);
+
+		if (first == last)
+			return;
+
+		for (gfn = first + pages; gfn < last; gfn += pages)
+			linfo_set_mixed(gfn, slot, level, false);
+
+		gfn = last;
+		gfn_end = min(last + pages, slot->base_gfn + slot->npages);
+		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
+		linfo_set_mixed(gfn, slot, level, mixed);
+	}
+}
+
+void kvm_arch_set_memory_attributes(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    unsigned long attrs,
+				    gfn_t start, gfn_t end)
+{
+	if (kvm_slot_can_be_private(slot))
+		kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
+						      start, end);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a07380f8d3c..5aefcff614d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
 			linfo[lpages - 1].disallow_lpage = 1;
 		ugfn = slot->userspace_addr >> PAGE_SHIFT;
+		if (kvm_slot_can_be_private(slot))
+			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
 		/*
 		 * If the gfn and userspace address are not aligned wrt each
 		 * other, disable large page support for this slot.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3331c0c92838..25099c94e770 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -592,6 +592,11 @@ struct kvm_memory_slot {
 	struct restrictedmem_notifier notifier;
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
+void kvm_arch_set_memory_attributes(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    unsigned long attrs,
+				    gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
+						  struct kvm_memory_slot *slot,
+						  unsigned long attrs,
+						  gfn_t start, gfn_t end)
+{
+}
+#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4e1e1e113bf0..e107afea32f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 	return 0;
 }
 
-static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
+				unsigned long attrs)
 {
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
@@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
 			gfn_range.slot = slot;
 
 			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+
+			kvm_arch_set_memory_attributes(kvm, slot, attrs,
+						       gfn_range.start,
+						       gfn_range.end);
 		}
 	}
 
@@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 		idx = srcu_read_lock(&kvm->srcu);
 		KVM_MMU_LOCK(kvm);
 		if (i > start)
-			kvm_unmap_mem_range(kvm, start, i);
+			kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
 		kvm_mmu_invalidate_end(kvm);
 		KVM_MMU_UNLOCK(kvm);
 		srcu_read_unlock(&kvm->srcu, idx);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 8/9] KVM: Handle page fault for private memory
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (6 preceding siblings ...)
  2022-12-02  6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-08  2:29   ` Yuan Yao
                     ` (2 more replies)
  2022-12-02  6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
                   ` (3 subsequent siblings)
  11 siblings, 3 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
hva-based shared memory. Architecture code (like TDX code) can tell
whether the ongoing fault is private or not. This patch adds an
'is_private' field to kvm_page_fault to indicate this, and architecture
code is expected to set it.

To handle a page fault for such a memslot, the handling logic differs
depending on whether the fault is private or shared. KVM checks whether
'is_private' matches the host's view of the page (maintained in
mem_attr_array).
  - On a match, the private pfn is obtained with
    restrictedmem_get_page() and the shared pfn is obtained with the
    existing get_user_pages().
  - On a mismatch, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
    userspace. Userspace can then convert the memory between
    private/shared in the host's view and retry the fault, as sketched
    below.
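
Roughly, the userspace side of the mismatch case could look like the
fragment below (an illustrative sketch; the ioctl and field names follow
the uapi introduced earlier in this series, vm_fd is a placeholder for
the VM file descriptor, and error handling is omitted):

  if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
          struct kvm_memory_attributes attrs = {
                  .address    = run->memory.gpa,
                  .size       = run->memory.size,
                  .attributes = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                                KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
          };

          /* Flip the host's view to match the guest's access, then re-enter. */
          ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
  }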

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
 arch/x86/kvm/mmu/mmutrace.h     |  1 +
 arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
 include/linux/kvm_host.h        | 30 ++++++++++++++++
 5 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2190fd8c95c0..b1953ebc012e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level)
+			      int max_level, bool is_private)
 {
 	struct kvm_lpage_info *linfo;
 	int host_level;
@@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			break;
 	}
 
+	if (is_private)
+		return max_level;
+
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
@@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * level, which will be used to do precise, accurate accounting.
 	 */
 	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-						     fault->gfn, fault->max_level);
+						     fault->gfn, fault->max_level,
+						     fault->is_private);
 	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
 		return;
 
@@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
+static inline u8 order_to_level(int order)
+{
+	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+		return PG_LEVEL_1G;
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+		return PG_LEVEL_2M;
+
+	return PG_LEVEL_4K;
+}
+
+static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
+				    struct kvm_page_fault *fault)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	if (fault->is_private)
+		vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+	else
+		vcpu->run->memory.flags = 0;
+	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+	vcpu->run->memory.size = PAGE_SIZE;
+	return RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				   struct kvm_page_fault *fault)
+{
+	int order;
+	struct kvm_memory_slot *slot = fault->slot;
+
+	if (!kvm_slot_can_be_private(slot))
+		return kvm_do_memory_fault_exit(vcpu, fault);
+
+	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+		return RET_PF_RETRY;
+
+	fault->max_level = min(order_to_level(order), fault->max_level);
+	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+	return RET_PF_CONTINUE;
+}
+
 static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
@@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			return RET_PF_EMULATE;
 	}
 
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
+		return kvm_do_memory_fault_exit(vcpu, fault);
+
+	if (fault->is_private)
+		return kvm_faultin_pfn_private(vcpu, fault);
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
@@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 			return -EIO;
 	}
 
+	if (r == RET_PF_USER)
+		return 0;
+
 	if (r < 0)
 		return r;
 	if (r != RET_PF_EMULATE)
@@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 */
 		if (sp->role.direct &&
 		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
-							       PG_LEVEL_NUM)) {
+							       PG_LEVEL_NUM,
+							       false)) {
 			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index dbaf6755c5a7..5ccf08183b00 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -189,6 +189,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
+	const bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
@@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
@@ -253,6 +255,7 @@ enum {
 	RET_PF_RETRY,
 	RET_PF_EMULATE,
 	RET_PF_INVALID,
+	RET_PF_USER,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
 };
@@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level);
+			      int max_level, bool is_private);
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
 
@@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+					gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
 TRACE_DEFINE_ENUM(RET_PF_RETRY);
 TRACE_DEFINE_ENUM(RET_PF_EMULATE);
 TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
 TRACE_DEFINE_ENUM(RET_PF_FIXED);
 TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 771210ce5181..8ba1a4afc546 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-							      iter.gfn, PG_LEVEL_NUM);
+						iter.gfn, PG_LEVEL_NUM, false);
 		if (max_mapping_level < iter.level)
 			continue;
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 25099c94e770..153842bb33df 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
 }
 #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
 
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
+	       KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+#else
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
+
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
+
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+					gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	int ret;
+	struct page *page;
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->restricted_offset >> PAGE_SHIFT);
+
+	ret = restrictedmem_get_page(slot->restricted_file, index,
+				     &page, order);
+	*pfn = page_to_pfn(page);
+	return ret;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (7 preceding siblings ...)
  2022-12-02  6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
@ 2022-12-02  6:13 ` Chao Peng
  2022-12-09  9:11   ` Fuad Tabba
                     ` (3 more replies)
  2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
                   ` (2 subsequent siblings)
  11 siblings, 4 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-02  6:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

Register/unregister the private memslot with the fd-based memory backing
store restrictedmem and implement the callbacks for restrictedmem_notifier:
  - invalidate_start()/invalidate_end() to zap the existing memory
    mappings in the KVM page table.
  - error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
    with KVM_EXIT_SHUTDOWN.

Expose KVM_MEM_PRIVATE for memslots and KVM_MEMORY_ATTRIBUTE_PRIVATE via
KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace; both are gated by
kvm_arch_has_private_mem(), which is expected to be overridden by
architecture code. A sketch of the expected userspace setup follows.
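
Roughly (hedged: memfd_restricted() is the restrictedmem syscall this
series builds on, and shared_hva/mem_size/vm_fd are placeholders):

  int restricted_fd = memfd_restricted(0);

  struct kvm_userspace_memory_region_ext region_ext = {
          .region = {
                  .slot            = 0,
                  .flags           = KVM_MEM_PRIVATE,
                  .guest_phys_addr = 0x100000000ULL,
                  .memory_size     = mem_size,
                  .userspace_addr  = (__u64)shared_hva,  /* shared part */
          },
          .restricted_fd     = restricted_fd,            /* private part */
          .restricted_offset = 0,
  };

  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);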

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/x86.c              |  13 +++
 include/linux/kvm_host.h        |   3 +
 virt/kvm/kvm_main.c             | 179 +++++++++++++++++++++++++++++++-
 4 files changed, 191 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7772ab37ac89..27ef31133352 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -114,6 +114,7 @@
 	KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_TLB_FLUSH \
 	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_MEMORY_MCE		KVM_ARCH_REQ(33)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5aefcff614d2..c67e22f3e2ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6587,6 +6587,13 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 }
 #endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
 
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+void kvm_arch_memory_mce(struct kvm *kvm)
+{
+	kvm_make_all_cpus_request(kvm, KVM_REQ_MEMORY_MCE);
+}
+#endif
+
 static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_clock_data data = { 0 };
@@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+		if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+			r = 0;
+			goto out;
+		}
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 153842bb33df..f032d878e034 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,7 @@ struct kvm_memory_slot {
 	struct file *restricted_file;
 	loff_t restricted_offset;
 	struct restrictedmem_notifier notifier;
+	struct kvm *kvm;
 };
 
 static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
@@ -2363,6 +2364,8 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
 	*pfn = page_to_pfn(page);
 	return ret;
 }
+
+void kvm_arch_memory_mce(struct kvm *kvm);
 #endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e107afea32f0..ac835fc77273 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -936,6 +936,121 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
+					 pgoff_t start, pgoff_t end,
+					 gfn_t *gfn_start, gfn_t *gfn_end)
+{
+	unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+
+	if (start > base_pgoff)
+		*gfn_start = slot->base_gfn + start - base_pgoff;
+	else
+		*gfn_start = slot->base_gfn;
+
+	if (end < base_pgoff + slot->npages)
+		*gfn_end = slot->base_gfn + end - base_pgoff;
+	else
+		*gfn_end = slot->base_gfn + slot->npages;
+
+	if (*gfn_start >= *gfn_end)
+		return false;
+
+	return true;
+}
+
+static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *notifier,
+					       pgoff_t start, pgoff_t end)
+{
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	struct kvm *kvm = slot->kvm;
+	gfn_t gfn_start, gfn_end;
+	struct kvm_gfn_range gfn_range;
+	int idx;
+
+	if (!restrictedmem_range_is_valid(slot, start, end,
+					  &gfn_start, &gfn_end))
+		return;
+
+	gfn_range.start = gfn_start;
+	gfn_range.end = gfn_end;
+	gfn_range.slot = slot;
+	gfn_range.pte = __pte(0);
+	gfn_range.may_block = true;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	KVM_MMU_LOCK(kvm);
+
+	kvm_mmu_invalidate_begin(kvm);
+	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
+	if (kvm_unmap_gfn_range(kvm, &gfn_range))
+		kvm_flush_remote_tlbs(kvm);
+
+	KVM_MMU_UNLOCK(kvm);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier *notifier,
+					     pgoff_t start, pgoff_t end)
+{
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	struct kvm *kvm = slot->kvm;
+	gfn_t gfn_start, gfn_end;
+
+	if (!restrictedmem_range_is_valid(slot, start, end,
+					  &gfn_start, &gfn_end))
+		return;
+
+	KVM_MMU_LOCK(kvm);
+	kvm_mmu_invalidate_end(kvm);
+	KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_restrictedmem_error(struct restrictedmem_notifier *notifier,
+				    pgoff_t start, pgoff_t end)
+{
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	kvm_arch_memory_mce(slot->kvm);
+}
+
+static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
+	.invalidate_start = kvm_restrictedmem_invalidate_begin,
+	.invalidate_end = kvm_restrictedmem_invalidate_end,
+	.error = kvm_restrictedmem_error,
+};
+
+static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
+{
+	slot->notifier.ops = &kvm_restrictedmem_notifier_ops;
+	restrictedmem_register_notifier(slot->restricted_file, &slot->notifier);
+}
+
+static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
+{
+	restrictedmem_unregister_notifier(slot->restricted_file,
+					  &slot->notifier);
+}
+
+#else /* !CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
+static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
+static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -980,6 +1095,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE) {
+		kvm_restrictedmem_unregister(slot);
+		fput(slot->restricted_file);
+	}
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1551,10 +1671,14 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_has_private_mem(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1630,6 +1754,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 {
 	int r;
 
+	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+		kvm_restrictedmem_register(new);
+
 	/*
 	 * If dirty logging is disabled, nullify the bitmap; the old bitmap
 	 * will be freed on "commit".  If logging is enabled in both old and
@@ -1658,6 +1785,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 	if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
 		kvm_destroy_dirty_bitmap(new);
 
+	if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+		kvm_restrictedmem_unregister(new);
+
 	return r;
 }
 
@@ -1963,7 +2093,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_PRIVATE &&
+		(mem->restricted_offset & (PAGE_SIZE - 1) ||
+		 mem->restricted_offset > U64_MAX - mem->memory_size))
+		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		new->restricted_file = fget(mem->restricted_fd);
+		if (!new->restricted_file ||
+		    !file_is_restrictedmem(new->restricted_file)) {
+			r = -EINVAL;
+			goto out;
+		}
+		new->restricted_offset = mem->restricted_offset;
+	}
+
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out;
+
+	return 0;
+
+out:
+	if (new->restricted_file)
+		fput(new->restricted_file);
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -2351,6 +2506,8 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 #ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
+	if (kvm_arch_has_private_mem(kvm))
+		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
 	return 0;
 }
 
@@ -4822,16 +4979,28 @@ static long kvm_vm_ioctl(struct file *filp,
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
 		struct kvm_user_mem_region mem;
-		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+		unsigned int flags_offset = offsetof(typeof(mem), flags);
+		unsigned long size;
+		u32 flags;
 
 		kvm_sanity_check_user_mem_region_alias();
 
+		memset(&mem, 0, sizeof(mem));
+
 		r = -EFAULT;
+		if (get_user(flags, (u32 __user *)(argp + flags_offset)))
+			goto out;
+
+		if (flags & KVM_MEM_PRIVATE)
+			size = sizeof(struct kvm_userspace_memory_region_ext);
+		else
+			size = sizeof(struct kvm_userspace_memory_region);
+
 		if (copy_from_user(&mem, argp, size))
 			goto out;
 
 		r = -EINVAL;
-		if (mem.flags & KVM_MEM_PRIVATE)
+		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
 			goto out;
 
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-02  6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-12-05  9:03   ` Fuad Tabba
  2022-12-06 11:53     ` Chao Peng
  2022-12-08  8:37   ` Xiaoyao Li
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-05  9:03 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi Chao,

On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
>
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries or handle memory errors.
>
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>

V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
patch series anymore. Any reason you removed it, or is it just an
omission?

[*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/

Thanks,
/fuad

> ---
>  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
>  arch/x86/kvm/Kconfig           |  2 ++
>  arch/x86/kvm/x86.c             |  2 +-
>  include/linux/kvm_host.h       |  8 ++++--
>  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
>  virt/kvm/Kconfig               |  3 +++
>  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
>  7 files changed, 114 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index bb2f709c0900..99352170c130 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>
>  ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>
> +  struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +  };
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
>    #define KVM_MEM_READONLY     (1UL << 1)
> +  #define KVM_MEM_PRIVATE              (1UL << 2)
>
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +kvm_userspace_memory_region_ext struct includes all fields of
> +kvm_userspace_memory_region struct, while also adds additional fields for some
> +other features. See below description of flags field for more information.
> +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> +
> +The flags field supports following flags:
> +
> +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> +
> +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> +  read-only. In this case, writes to this memory will be posted to userspace as
> +  KVM_EXIT_MMIO exits.
> +
> +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> +  memory backed by a file descriptor(fd) and userspace access to the fd may be
> +  restricted. Userspace should use restricted_fd/restricted_offset in the
> +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> +  to guest. Userspace should guarantee not to map the same host physical address
> +  indicated by restricted_fd/restricted_offset to different guest physical
> +  addresses within multiple memslots. Failed to do this may result undefined
> +  behavior.
>
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
>         select HAVE_KVM_MEMORY_ATTRIBUTES
> +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7f850dfb4086..9a07380f8d3c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>         }
>
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -               struct kvm_userspace_memory_region m;
> +               struct kvm_user_mem_region m;
>
>                 m.slot = id | (i << 16);
>                 m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a784e2b06625..02347e386ea2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -44,6 +44,7 @@
>
>  #include <asm/kvm_host.h>
>  #include <linux/kvm_dirty_ring.h>
> +#include <linux/restrictedmem.h>
>
>  #ifndef KVM_MAX_VCPU_IDS
>  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> @@ -585,6 +586,9 @@ struct kvm_memory_slot {
>         u32 flags;
>         short id;
>         u16 as_id;
> +       struct file *restricted_file;
> +       loff_t restricted_offset;
> +       struct restrictedmem_notifier notifier;
>  };
>
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
>  };
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem);
> +                         const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem);
> +                           const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5d0941acb5bb..13bff963b8b0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>
> +struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/*
> + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> + * all fields from the top-level "extended" region.
> + */
> +struct kvm_user_mem_region {
> +       __u32 slot;
> +       __u32 flags;
> +       __u64 guest_phys_addr;
> +       __u64 memory_size;
> +       __u64 userspace_addr;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +#endif
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
>  #define KVM_MEM_READONLY       (1UL << 1)
> +#define KVM_MEM_PRIVATE                (1UL << 2)
>
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index effdea5dd4f0..d605545d6dd1 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
>
>  config HAVE_KVM_PM_NOTIFIER
>         bool
> +
> +config HAVE_KVM_RESTRICTED_MEM
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7f0f5e9f2406..b882eb2c76a2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   * Must be called holding kvm->slots_lock for write.
>   */
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem)
> +                           const struct kvm_user_mem_region *mem)
>  {
>         struct kvm_memory_slot *old, *new;
>         struct kvm_memslots *slots;
> @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem)
> +                         const struct kvm_user_mem_region *mem)
>  {
>         int r;
>
> @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
>  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -                                         struct kvm_userspace_memory_region *mem)
> +                                         struct kvm_user_mem_region *mem)
>  {
>         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
> @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region, field));      \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> +do {                                                                                   \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
> -               struct kvm_userspace_memory_region kvm_userspace_mem;
> +               struct kvm_user_mem_region mem;
> +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> +               kvm_sanity_check_user_mem_region_alias();
>
>                 r = -EFAULT;
> -               if (copy_from_user(&kvm_userspace_mem, argp,
> -                                               sizeof(kvm_userspace_mem)))
> +               if (copy_from_user(&mem, argp, size))
> +                       goto out;
> +
> +               r = -EINVAL;
> +               if (mem.flags & KVM_MEM_PRIVATE)
>                         goto out;
>
> -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
>         case KVM_GET_DIRTY_LOG: {
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-12-02  6:13 ` [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-12-05  9:23   ` Fuad Tabba
  2022-12-06 11:56     ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-05  9:23 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi Chao,

On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Currently in mmu_notifier invalidate path, hva range is recorded and
> then checked against by mmu_notifier_retry_hva() in the page fault
> handling path. However, for the to be introduced private memory, a page
> fault may not have a hva associated, checking gfn(gpa) makes more sense.
>
> For existing hva based shared memory, gfn is expected to also work. The
> only downside is when aliasing multiple gfns to a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected small.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   |  8 +++++---
>  include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
>  virt/kvm/kvm_main.c      | 32 +++++++++++++++++++++++---------
>  3 files changed, 49 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4736d7849c60..e2c70b5afa3e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>                 return true;
>
>         return fault->slot &&
> -              mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +              mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
>  }
>
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
>         write_lock(&kvm->mmu_lock);
>
> -       kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> +       kvm_mmu_invalidate_begin(kvm);
> +
> +       kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>
>         flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>
> @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>                 kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
>                                                    gfn_end - gfn_start);
>
> -       kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> +       kvm_mmu_invalidate_end(kvm);
>
>         write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 02347e386ea2..3d69484d2704 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -787,8 +787,8 @@ struct kvm {
>         struct mmu_notifier mmu_notifier;
>         unsigned long mmu_invalidate_seq;
>         long mmu_invalidate_in_progress;
> -       unsigned long mmu_invalidate_range_start;
> -       unsigned long mmu_invalidate_range_end;
> +       gfn_t mmu_invalidate_range_start;
> +       gfn_t mmu_invalidate_range_end;
>  #endif
>         struct list_head devices;
>         u64 manual_dirty_log_protect;
> @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -                             unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -                           unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm);
>
>  long kvm_arch_dev_ioctl(struct file *filp,
>                         unsigned int ioctl, unsigned long arg);
> @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>         return 0;
>  }
>
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>                                            unsigned long mmu_seq,
> -                                          unsigned long hva)
> +                                          gfn_t gfn)
>  {
>         lockdep_assert_held(&kvm->mmu_lock);
>         /*
> @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>          * that might be being invalidated. Note that it may include some false

nit: "might be" (or) "is being"

>          * positives, due to shortcuts when handing concurrent invalidations.

nit: handling

>          */
> -       if (unlikely(kvm->mmu_invalidate_in_progress) &&
> -           hva >= kvm->mmu_invalidate_range_start &&
> -           hva < kvm->mmu_invalidate_range_end)
> -               return 1;
> +       if (unlikely(kvm->mmu_invalidate_in_progress)) {
> +               /*
> +                * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> +                * but before updating the range is a KVM bug.
> +                */
> +               if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> +                                kvm->mmu_invalidate_range_end == INVALID_GPA))

INVALID_GPA is an x86-specific define in
arch/x86/include/asm/kvm_host.h, so this doesn't build on other
architectures. The obvious fix is to move it to
include/linux/kvm_host.h.

Cheers,
/fuad

> +                       return 1;
> +
> +               if (gfn >= kvm->mmu_invalidate_range_start &&
> +                   gfn < kvm->mmu_invalidate_range_end)
> +                       return 1;
> +       }
> +
>         if (kvm->mmu_invalidate_seq != mmu_seq)
>                 return 1;
>         return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b882eb2c76a2..ad55dfbc75d7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>
>  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -                            unsigned long end);
> -
> +typedef void (*on_lock_fn_t)(struct kvm *kvm);
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
>  struct kvm_hva_range {
> @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>                                 locked = true;
>                                 KVM_MMU_LOCK(kvm);
>                                 if (!IS_KVM_NULL_FN(range->on_lock))
> -                                       range->on_lock(kvm, range->start, range->end);
> +                                       range->on_lock(kvm);
> +
>                                 if (IS_KVM_NULL_FN(range->handler))
>                                         break;
>                         }
> @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -                             unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
>  {
>         /*
>          * The count increase must become visible at unlock time as no
> @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>          * count is also read inside the mmu_lock critical section.
>          */
>         kvm->mmu_invalidate_in_progress++;
> +
> +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +               kvm->mmu_invalidate_range_start = INVALID_GPA;
> +               kvm->mmu_invalidate_range_end = INVALID_GPA;
> +       }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
>         if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>                 kvm->mmu_invalidate_range_start = start;
>                 kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>         }
>  }
>
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +       kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> +       return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
> @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 .start          = range->start,
>                 .end            = range->end,
>                 .pte            = __pte(0),
> -               .handler        = kvm_unmap_gfn_range,
> +               .handler        = kvm_mmu_unmap_gfn_range,
>                 .on_lock        = kvm_mmu_invalidate_begin,
>                 .on_unlock      = kvm_arch_guest_memory_reclaimed,
>                 .flush_on_ret   = true,
> @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>         return 0;
>  }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -                           unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
>  {
>         /*
>          * This sequence increase will notify the kvm page fault that
> --
> 2.25.1
>
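For context, the consumer of mmu_invalidate_retry_gfn() keeps the familiar KVM
retry pattern, just keyed by gfn instead of hva. A rough sketch of that
pattern (resolve_pfn() is only a stand-in for the hva or restrictedmem pfn
lookup, not a real function):

  /*
   * Snapshot the invalidation sequence count before resolving the pfn
   * outside mmu_lock, then re-check under mmu_lock and retry the fault
   * if an invalidation of this gfn raced with the lookup.
   */
  mmu_seq = kvm->mmu_invalidate_seq;
  smp_rmb();                                /* order the seq read before the pfn lookup */

  pfn = resolve_pfn(fault);                 /* may sleep, runs without mmu_lock */

  write_lock(&kvm->mmu_lock);
  if (mmu_invalidate_retry_gfn(kvm, mmu_seq, fault->gfn)) {
          write_unlock(&kvm->mmu_lock);
          goto retry;                        /* redo the page fault */
  }
  /* ... install the new mapping ... */
  write_unlock(&kvm->mmu_lock);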


* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2022-12-02  6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-12-05 22:49   ` Isaku Yamahata
  2022-12-06 12:02     ` Chao Peng
  2023-01-13 23:12   ` Sean Christopherson
  2023-01-13 23:16   ` Sean Christopherson
  2 siblings, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2022-12-05 22:49 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang, isaku.yamahata

On Fri, Dec 02, 2022 at 02:13:45PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> A large page with mixed private/shared subpages can't be mapped as a large
> page since its private/shared subpages come from different memory
> backends and may also be treated differently by the architecture. When
> private/shared memory is mixed within a large page, the current lpage_info
> is not sufficient to decide whether the page can be mapped as a large page,
> and additional private/shared 'mixed' information is needed.
> 
> Tracking this 'mixed' information with the current 'count'-like
> disallow_lpage is a bit challenging, so reserve a bit in 'disallow_lpage'
> to indicate that a large page has mixed private/shared subpages, and update
> this 'mixed' bit whenever the memory attribute changes between
> private and shared.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   8 ++
>  arch/x86/kvm/mmu/mmu.c          | 134 +++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   2 +
>  include/linux/kvm_host.h        |  19 +++++
>  virt/kvm/kvm_main.c             |   9 ++-
>  5 files changed, 169 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 283cbb83d6ae..7772ab37ac89 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -38,6 +38,7 @@
>  #include <asm/hyperv-tlfs.h>
>  
>  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
>  
>  #define KVM_MAX_VCPUS 1024
>  
> @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
>  #endif
>  };
>  
> +/*
> + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> + * level. The remaining bits are used as a reference count.
> + */
> +#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
> +#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
> +
>  struct kvm_lpage_info {
>  	int disallow_lpage;
>  };
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2c70b5afa3e..2190fd8c95c0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
>  {
>  	struct kvm_lpage_info *linfo;
>  	int i;
> +	int disallow_count;
>  
>  	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
>  		linfo = lpage_info_slot(gfn, slot, i);
> +
> +		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> +		WARN_ON(disallow_count + count < 0 ||
> +			disallow_count > KVM_LPAGE_COUNT_MAX - count);
> +
>  		linfo->disallow_lpage += count;
> -		WARN_ON(linfo->disallow_lpage < 0);
>  	}
>  }
>  
> @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  	if (kvm->arch.nx_huge_page_recovery_thread)
>  		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
> +
> +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> +{
> +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> +			    int level, bool mixed)
> +{
> +	struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> +
> +	if (mixed)
> +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +	else
> +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> +{
> +	bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +
> +	if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> +		if (!expect_private)
> +			return false;
> +	} else if (expect_private)
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> +			       gfn_t start, gfn_t end)
> +{
> +	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	gfn_t gfn = start;
> +	void *entry;
> +	bool mixed = false;
> +
> +	rcu_read_lock();
> +	entry = xas_load(&xas);
> +	while (gfn < end) {
> +		if (xas_retry(&xas, entry))
> +			continue;
> +
> +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> +
> +		if (!is_expected_attr_entry(entry, attrs)) {
> +			mixed = true;
> +			break;
> +		}
> +
> +		entry = xas_next(&xas);
> +		gfn++;
> +	}
> +
> +	rcu_read_unlock();
> +	return mixed;
> +}
> +
> +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			    int level, unsigned long attrs,
> +			    gfn_t start, gfn_t end)
> +{
> +	unsigned long gfn;
> +
> +	if (level == PG_LEVEL_2M)
> +		return mem_attrs_mixed_2m(kvm, attrs, start, end);
> +
> +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> +		    !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> +					    attrs))
> +			return true;
> +	return false;
> +}
> +
> +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> +						  struct kvm_memory_slot *slot,
> +						  unsigned long attrs,
> +						  gfn_t start, gfn_t end)
> +{
> +	unsigned long pages, mask;
> +	gfn_t gfn, gfn_end, first, last;
> +	int level;
> +	bool mixed;
> +
> +	/*
> +	 * The sequence matters here: we set the higher level based on the
> +	 * lower level's scanning result.
> +	 */
> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +		pages = KVM_PAGES_PER_HPAGE(level);
> +		mask = ~(pages - 1);
> +		first = start & mask;
> +		last = (end - 1) & mask;
> +
> +		/*
> +		 * We only need to scan the head and tail page, for middle pages
> +		 * we know they will not be mixed.
> +		 */
> +		gfn = max(first, slot->base_gfn);
> +		gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> +		linfo_set_mixed(gfn, slot, level, mixed);
> +
> +		if (first == last)
> +			return;


continue.

> +
> +		for (gfn = first + pages; gfn < last; gfn += pages)
> +			linfo_set_mixed(gfn, slot, level, false);
> +
> +		gfn = last;
> +		gfn_end = min(last + pages, slot->base_gfn + slot->npages);

if (gfn == gfn_end) continue.


> +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> +		linfo_set_mixed(gfn, slot, level, mixed);
> +	}
> +}
> +
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> +				    struct kvm_memory_slot *slot,
> +				    unsigned long attrs,
> +				    gfn_t start, gfn_t end)
> +{
> +	if (kvm_slot_can_be_private(slot))
> +		kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
> +						      start, end);
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a07380f8d3c..5aefcff614d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
>  			linfo[lpages - 1].disallow_lpage = 1;
>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> +		if (kvm_slot_can_be_private(slot))
> +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;

Is there any alignment restriction? If not, it should be +=.
In practice, alignment will hold, though.

Thanks,

>  		/*
>  		 * If the gfn and userspace address are not aligned wrt each
>  		 * other, disable large page support for this slot.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3331c0c92838..25099c94e770 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -592,6 +592,11 @@ struct kvm_memory_slot {
>  	struct restrictedmem_notifier notifier;
>  };
>  
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> +	return slot && (slot->flags & KVM_MEM_PRIVATE);
> +}
> +
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
>  {
>  	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> +				    struct kvm_memory_slot *slot,
> +				    unsigned long attrs,
> +				    gfn_t start, gfn_t end);
> +#else
> +static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> +						  struct kvm_memory_slot *slot,
> +						  unsigned long attrs,
> +						  gfn_t start, gfn_t end)
> +{
> +}
> +#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> +
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4e1e1e113bf0..e107afea32f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  	return 0;
>  }
>  
> -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> +				unsigned long attrs)
>  {
>  	struct kvm_gfn_range gfn_range;
>  	struct kvm_memory_slot *slot;
> @@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
>  			gfn_range.slot = slot;
>  
>  			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +
> +			kvm_arch_set_memory_attributes(kvm, slot, attrs,
> +						       gfn_range.start,
> +						       gfn_range.end);
>  		}
>  	}
>  
> @@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  		idx = srcu_read_lock(&kvm->srcu);
>  		KVM_MMU_LOCK(kvm);
>  		if (i > start)
> -			kvm_unmap_mem_range(kvm, start, i);
> +			kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
>  		kvm_mmu_invalidate_end(kvm);
>  		KVM_MMU_UNLOCK(kvm);
>  		srcu_read_unlock(&kvm->srcu, idx);
> -- 
> 2.25.1
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>
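
A small aside on the disallow_lpage encoding used in the patch above: bit 31
carries the private/shared 'mixed' flag while the low 31 bits stay a plain
reference count, so the two can be updated and queried independently, e.g.:

  int disallow_lpage = 0;

  disallow_lpage += 2;                                /* two ordinary "disallow" references  */
  disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;   /* mark the range private/shared mixed */

  int count  = disallow_lpage & KVM_LPAGE_COUNT_MAX;             /* == 2    */
  bool mixed = disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;  /* == true */

  /* A large page mapping is allowed only when the whole word is zero,
   * i.e. no references held and the mixed bit clear. */
  bool can_map_lpage = !disallow_lpage;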


* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-05  9:03   ` Fuad Tabba
@ 2022-12-06 11:53     ` Chao Peng
  2022-12-06 12:39       ` Fuad Tabba
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-06 11:53 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> Hi Chao,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > In memory encryption usage, guest memory may be encrypted with a special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. Allowing userspace to access guest private memory has
> > little value and can sometimes cause problems. This new KVM memslot
> > extension allows guest private memory to be provided through a
> > restrictedmem-backed file descriptor (fd), and userspace is restricted
> > from accessing the bookmarked memory in the fd.
> >
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr (hva). When used, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to the guest is maintained by other KVM code.
> >
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM of memory changes;
> > KVM can then invalidate its page table entries or handle memory errors.
> >
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Reviewed-by: Fuad Tabba <tabba@google.com>
> > Tested-by: Fuad Tabba <tabba@google.com>
> 
> V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> patch series anymore. Any reason you removed it, or is it just an
> omission?

We had some discussion in v9 [1] about adding generic memory attributes
ioctls, and KVM_CAP_PRIVATE_MEM can be implemented as a new
KVM_MEMORY_ATTRIBUTE_PRIVATE flag reported via the
KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl [2]. The API doc has been updated:

+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …


[1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
[2]
https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/

Thanks,
Chao
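
For completeness, flipping a range to private from userspace under this scheme
would look roughly like the sketch below. The struct layout (address/size/
attributes) and ioctl names are taken from the memory-attributes patch earlier
in this series, so treat them as assumptions tied to that patch rather than a
settled uapi:

  __u64 supported = 0;

  /* Ask KVM which memory attributes it supports for this VM. */
  ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &supported);

  if (supported & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
          struct kvm_memory_attributes attrs = {
                  .address    = 0x100000,                    /* example GPA       */
                  .size       = 0x200000,                    /* example 2MB range */
                  .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
          };

          /* Convert the range to private; KVM unmaps it and updates lpage_info. */
          ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
  }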
> 
> [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> 
> Thanks,
> /fuad
> 
> > ---
> >  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> >  arch/x86/kvm/Kconfig           |  2 ++
> >  arch/x86/kvm/x86.c             |  2 +-
> >  include/linux/kvm_host.h       |  8 ++++--
> >  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
> >  virt/kvm/Kconfig               |  3 +++
> >  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
> >  7 files changed, 114 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index bb2f709c0900..99352170c130 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >
> >  ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> >         __u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >
> > +  struct kvm_userspace_memory_region_ext {
> > +       struct kvm_userspace_memory_region region;
> > +       __u64 restricted_offset;
> > +       __u32 restricted_fd;
> > +       __u32 pad1;
> > +       __u64 pad2[14];
> > +  };
> > +
> >    /* for kvm_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> >    #define KVM_MEM_READONLY     (1UL << 1)
> > +  #define KVM_MEM_PRIVATE              (1UL << 2)
> >
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> >  be identical.  This allows large pages in the guest to be backed by large
> >  pages in the host.
> >
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only.  In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +The kvm_userspace_memory_region_ext struct includes all fields of the
> > +kvm_userspace_memory_region struct, while also adding fields for some
> > +other features. See the description of the flags field below for more information.
> > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > +
> > +The flags field supports following flags:
> > +
> > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > +
> > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > +  read-only. In this case, writes to this memory will be posted to userspace as
> > +  KVM_EXIT_MMIO exits.
> > +
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > +  memory backed by a file descriptor (fd) and userspace access to the fd may be
> > +  restricted. Userspace should use restricted_fd/restricted_offset in the
> > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > +  to the guest. Userspace should guarantee not to map the same host physical
> > +  address indicated by restricted_fd/restricted_offset to different guest
> > +  physical addresses within multiple memslots. Failing to do so may result in
> > +  undefined behavior.
> >
> >  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >  the memory region are automatically reflected into the guest.  For example, an
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> >         select INTERVAL_TREE
> >         select HAVE_KVM_PM_NOTIFIER if PM
> >         select HAVE_KVM_MEMORY_ATTRIBUTES
> > +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> > +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> >         help
> >           Support hosting fully virtualized guest machines using hardware
> >           virtualization extensions.  You will need a fairly recent
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7f850dfb4086..9a07380f8d3c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> >         }
> >
> >         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > -               struct kvm_userspace_memory_region m;
> > +               struct kvm_user_mem_region m;
> >
> >                 m.slot = id | (i << 16);
> >                 m.flags = 0;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a784e2b06625..02347e386ea2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -44,6 +44,7 @@
> >
> >  #include <asm/kvm_host.h>
> >  #include <linux/kvm_dirty_ring.h>
> > +#include <linux/restrictedmem.h>
> >
> >  #ifndef KVM_MAX_VCPU_IDS
> >  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> >         u32 flags;
> >         short id;
> >         u16 as_id;
> > +       struct file *restricted_file;
> > +       loff_t restricted_offset;
> > +       struct restrictedmem_notifier notifier;
> >  };
> >
> >  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> >  };
> >
> >  int kvm_set_memory_region(struct kvm *kvm,
> > -                         const struct kvm_userspace_memory_region *mem);
> > +                         const struct kvm_user_mem_region *mem);
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -                           const struct kvm_userspace_memory_region *mem);
> > +                           const struct kvm_user_mem_region *mem);
> >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 5d0941acb5bb..13bff963b8b0 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> >         __u64 userspace_addr; /* start of the userspace allocated memory */
> >  };
> >
> > +struct kvm_userspace_memory_region_ext {
> > +       struct kvm_userspace_memory_region region;
> > +       __u64 restricted_offset;
> > +       __u32 restricted_fd;
> > +       __u32 pad1;
> > +       __u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/*
> > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > + * all fields from the top-level "extended" region.
> > + */
> > +struct kvm_user_mem_region {
> > +       __u32 slot;
> > +       __u32 flags;
> > +       __u64 guest_phys_addr;
> > +       __u64 memory_size;
> > +       __u64 userspace_addr;
> > +       __u64 restricted_offset;
> > +       __u32 restricted_fd;
> > +       __u32 pad1;
> > +       __u64 pad2[14];
> > +};
> > +#endif
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> >  #define KVM_MEM_READONLY       (1UL << 1)
> > +#define KVM_MEM_PRIVATE                (1UL << 2)
> >
> >  /* for KVM_IRQ_LINE */
> >  struct kvm_irq_level {
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index effdea5dd4f0..d605545d6dd1 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> >
> >  config HAVE_KVM_PM_NOTIFIER
> >         bool
> > +
> > +config HAVE_KVM_RESTRICTED_MEM
> > +       bool
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 7f0f5e9f2406..b882eb2c76a2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> >         }
> >  }
> >
> > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >  {
> >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >
> > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> >   * Must be called holding kvm->slots_lock for write.
> >   */
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -                           const struct kvm_userspace_memory_region *mem)
> > +                           const struct kvm_user_mem_region *mem)
> >  {
> >         struct kvm_memory_slot *old, *new;
> >         struct kvm_memslots *slots;
> > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> >
> >  int kvm_set_memory_region(struct kvm *kvm,
> > -                         const struct kvm_userspace_memory_region *mem)
> > +                         const struct kvm_user_mem_region *mem)
> >  {
> >         int r;
> >
> > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> >  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> >
> >  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > -                                         struct kvm_userspace_memory_region *mem)
> > +                                         struct kvm_user_mem_region *mem)
> >  {
> >         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> >                 return -EINVAL;
> > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> >         return fd;
> >  }
> >
> > +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> > +do {                                                                           \
> > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> > +                    offsetof(struct kvm_userspace_memory_region, field));      \
> > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> > +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> > +} while (0)
> > +
> > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> > +do {                                                                                   \
> > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> > +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> > +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> > +} while (0)
> > +
> > +static void kvm_sanity_check_user_mem_region_alias(void)
> > +{
> > +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> > +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> > +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > +}
> > +
> >  static long kvm_vm_ioctl(struct file *filp,
> >                            unsigned int ioctl, unsigned long arg)
> >  {
> > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 break;
> >         }
> >         case KVM_SET_USER_MEMORY_REGION: {
> > -               struct kvm_userspace_memory_region kvm_userspace_mem;
> > +               struct kvm_user_mem_region mem;
> > +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +               kvm_sanity_check_user_mem_region_alias();
> >
> >                 r = -EFAULT;
> > -               if (copy_from_user(&kvm_userspace_mem, argp,
> > -                                               sizeof(kvm_userspace_mem)))
> > +               if (copy_from_user(&mem, argp, size))
> > +                       goto out;
> > +
> > +               r = -EINVAL;
> > +               if (mem.flags & KVM_MEM_PRIVATE)
> >                         goto out;
> >
> > -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >                 break;
> >         }
> >         case KVM_GET_DIRTY_LOG: {
> > --
> > 2.25.1
> >
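Putting the uapi pieces above together, registering a slot that can hold
private memory would look roughly like this from userspace. This is a sketch
only: it assumes headers from this series, a restricted_fd obtained from the
restrictedmem backend introduced earlier in the series, and note that at this
point in the series KVM_SET_USER_MEMORY_REGION still rejects KVM_MEM_PRIVATE
(a later patch lifts that):

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /*
   * Register slot 0 with a shared part (userspace_addr) and a private part
   * (restricted_fd + restricted_offset), as described in the api.rst text above.
   */
  static int add_private_slot(int vm_fd, int restricted_fd, void *shared_hva,
                              __u64 gpa, __u64 size)
  {
          struct kvm_userspace_memory_region_ext ext;

          memset(&ext, 0, sizeof(ext));
          ext.region.slot            = 0;
          ext.region.flags           = KVM_MEM_PRIVATE;
          ext.region.guest_phys_addr = gpa;
          ext.region.memory_size     = size;
          ext.region.userspace_addr  = (unsigned long)shared_hva;  /* shared part  */
          ext.restricted_fd          = restricted_fd;              /* private part */
          ext.restricted_offset      = 0;

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
  }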


* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-12-05  9:23   ` Fuad Tabba
@ 2022-12-06 11:56     ` Chao Peng
  2022-12-06 15:48       ` Fuad Tabba
  2022-12-07  6:34       ` Isaku Yamahata
  0 siblings, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-06 11:56 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Mon, Dec 05, 2022 at 09:23:49AM +0000, Fuad Tabba wrote:
> Hi Chao,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Currently in the mmu_notifier invalidate path, the hva range is recorded
> > and then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the soon-to-be-introduced private memory, a page
> > fault may not have an associated hva, so checking the gfn (gpa) makes more sense.
> >
> > For existing hva-based shared memory, gfn is expected to work as well. The
> > only downside is that when multiple gfns alias a single hva, the current
> > algorithm of checking multiple ranges could result in a much larger range
> > being rejected. Such aliasing should be uncommon, so the impact is expected
> > to be small.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c   |  8 +++++---
> >  include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> >  virt/kvm/kvm_main.c      | 32 +++++++++++++++++++++++---------
> >  3 files changed, 49 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4736d7849c60..e2c70b5afa3e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> >                 return true;
> >
> >         return fault->slot &&
> > -              mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > +              mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> >  }
> >
> >  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> >         write_lock(&kvm->mmu_lock);
> >
> > -       kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> > +       kvm_mmu_invalidate_begin(kvm);
> > +
> > +       kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> >
> >         flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >
> > @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >                 kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> >                                                    gfn_end - gfn_start);
> >
> > -       kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> > +       kvm_mmu_invalidate_end(kvm);
> >
> >         write_unlock(&kvm->mmu_lock);
> >  }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 02347e386ea2..3d69484d2704 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -787,8 +787,8 @@ struct kvm {
> >         struct mmu_notifier mmu_notifier;
> >         unsigned long mmu_invalidate_seq;
> >         long mmu_invalidate_in_progress;
> > -       unsigned long mmu_invalidate_range_start;
> > -       unsigned long mmu_invalidate_range_end;
> > +       gfn_t mmu_invalidate_range_start;
> > +       gfn_t mmu_invalidate_range_end;
> >  #endif
> >         struct list_head devices;
> >         u64 manual_dirty_log_protect;
> > @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> >  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  #endif
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -                             unsigned long end);
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -                           unsigned long end);
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_mmu_invalidate_end(struct kvm *kvm);
> >
> >  long kvm_arch_dev_ioctl(struct file *filp,
> >                         unsigned int ioctl, unsigned long arg);
> > @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> >         return 0;
> >  }
> >
> > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> >                                            unsigned long mmu_seq,
> > -                                          unsigned long hva)
> > +                                          gfn_t gfn)
> >  {
> >         lockdep_assert_held(&kvm->mmu_lock);
> >         /*
> > @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> >          * that might be being invalidated. Note that it may include some false
> 
> nit: "might be" (or) "is being"
> 
> >          * positives, due to shortcuts when handing concurrent invalidations.
> 
> nit: handling

Both are in existing code, but I can fix them here as well.

> 
> >          */
> > -       if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > -           hva >= kvm->mmu_invalidate_range_start &&
> > -           hva < kvm->mmu_invalidate_range_end)
> > -               return 1;
> > +       if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > +               /*
> > +                * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > +                * but before updating the range is a KVM bug.
> > +                */
> > +               if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > +                                kvm->mmu_invalidate_range_end == INVALID_GPA))
> 
> INVALID_GPA is an x86-specific define in
> arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> architectures. The obvious fix is to move it to
> include/linux/kvm_host.h.

Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this is
the correct choice for other architectures, but after searching, it has not
been used by other architectures, so it should be safe to make it common.

Thanks,
Chao
> 
> Cheers,
> /fuad
> 
> > +                       return 1;
> > +
> > +               if (gfn >= kvm->mmu_invalidate_range_start &&
> > +                   gfn < kvm->mmu_invalidate_range_end)
> > +                       return 1;
> > +       }
> > +
> >         if (kvm->mmu_invalidate_seq != mmu_seq)
> >                 return 1;
> >         return 0;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b882eb2c76a2..ad55dfbc75d7 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> >
> >  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> >
> > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > -                            unsigned long end);
> > -
> > +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> >  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> >
> >  struct kvm_hva_range {
> > @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> >                                 locked = true;
> >                                 KVM_MMU_LOCK(kvm);
> >                                 if (!IS_KVM_NULL_FN(range->on_lock))
> > -                                       range->on_lock(kvm, range->start, range->end);
> > +                                       range->on_lock(kvm);
> > +
> >                                 if (IS_KVM_NULL_FN(range->handler))
> >                                         break;
> >                         }
> > @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -                             unsigned long end)
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> >  {
> >         /*
> >          * The count increase must become visible at unlock time as no
> > @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> >          * count is also read inside the mmu_lock critical section.
> >          */
> >         kvm->mmu_invalidate_in_progress++;
> > +
> > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > +               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > +       }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> >         if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> >                 kvm->mmu_invalidate_range_start = start;
> >                 kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> >         }
> >  }
> >
> > +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > +       kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > +       return kvm_unmap_gfn_range(kvm, range);
> > +}
> > +
> >  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >                                         const struct mmu_notifier_range *range)
> >  {
> > @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >                 .start          = range->start,
> >                 .end            = range->end,
> >                 .pte            = __pte(0),
> > -               .handler        = kvm_unmap_gfn_range,
> > +               .handler        = kvm_mmu_unmap_gfn_range,
> >                 .on_lock        = kvm_mmu_invalidate_begin,
> >                 .on_unlock      = kvm_arch_guest_memory_reclaimed,
> >                 .flush_on_ret   = true,
> > @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >         return 0;
> >  }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -                           unsigned long end)
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> >  {
> >         /*
> >          * This sequence increase will notify the kvm page fault that
> > --
> > 2.25.1
> >
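To summarize the refactor, a caller that used to pass the range directly to
kvm_mmu_invalidate_begin() now records it separately; based on the
kvm_zap_gfn_range() hunk above, the pattern becomes roughly:

  write_lock(&kvm->mmu_lock);

  kvm_mmu_invalidate_begin(kvm);              /* in_progress++, range reset to INVALID_GPA */
  kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);  /* record the range under mmu_lock */

  /* ... unmap/zap [gfn_start, gfn_end) and flush TLBs ... */

  kvm_mmu_invalidate_end(kvm);                /* seq++, in_progress-- */
  write_unlock(&kvm->mmu_lock);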


* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2022-12-05 22:49   ` Isaku Yamahata
@ 2022-12-06 12:02     ` Chao Peng
  2022-12-07  6:42       ` Isaku Yamahata
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-06 12:02 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > A large page with mixed private/shared subpages can't be mapped as a large
> > page since its private/shared subpages come from different memory
> > backends and may also be treated differently by the architecture. When
> > private/shared memory is mixed within a large page, the current lpage_info
> > is not sufficient to decide whether the page can be mapped as a large page,
> > and additional private/shared 'mixed' information is needed.
> > 
> > Tracking this 'mixed' information with the current 'count'-like
> > disallow_lpage is a bit challenging, so reserve a bit in 'disallow_lpage'
> > to indicate that a large page has mixed private/shared subpages, and update
> > this 'mixed' bit whenever the memory attribute changes between
> > private and shared.
> > 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |   8 ++
> >  arch/x86/kvm/mmu/mmu.c          | 134 +++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/x86.c              |   2 +
> >  include/linux/kvm_host.h        |  19 +++++
> >  virt/kvm/kvm_main.c             |   9 ++-
> >  5 files changed, 169 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 283cbb83d6ae..7772ab37ac89 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -38,6 +38,7 @@
> >  #include <asm/hyperv-tlfs.h>
> >  
> >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> >  
> >  #define KVM_MAX_VCPUS 1024
> >  
> > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> >  #endif
> >  };
> >  
> > +/*
> > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> > + * level. The remaining bits are used as a reference count.
> > + */
> > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
> > +#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
> > +
> >  struct kvm_lpage_info {
> >  	int disallow_lpage;
> >  };
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e2c70b5afa3e..2190fd8c95c0 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> >  {
> >  	struct kvm_lpage_info *linfo;
> >  	int i;
> > +	int disallow_count;
> >  
> >  	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> >  		linfo = lpage_info_slot(gfn, slot, i);
> > +
> > +		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > +		WARN_ON(disallow_count + count < 0 ||
> > +			disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > +
> >  		linfo->disallow_lpage += count;
> > -		WARN_ON(linfo->disallow_lpage < 0);
> >  	}
> >  }
> >  
> > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> >  	if (kvm->arch.nx_huge_page_recovery_thread)
> >  		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> >  }
> > +
> > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > +{
> > +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> > +			    int level, bool mixed)
> > +{
> > +	struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> > +
> > +	if (mixed)
> > +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +	else
> > +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> > +{
> > +	bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +
> > +	if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> > +		if (!expect_private)
> > +			return false;
> > +	} else if (expect_private)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> > +			       gfn_t start, gfn_t end)
> > +{
> > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > +	gfn_t gfn = start;
> > +	void *entry;
> > +	bool mixed = false;
> > +
> > +	rcu_read_lock();
> > +	entry = xas_load(&xas);
> > +	while (gfn < end) {
> > +		if (xas_retry(&xas, entry))
> > +			continue;
> > +
> > +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > +		if (!is_expected_attr_entry(entry, attrs)) {
> > +			mixed = true;
> > +			break;
> > +		}
> > +
> > +		entry = xas_next(&xas);
> > +		gfn++;
> > +	}
> > +
> > +	rcu_read_unlock();
> > +	return mixed;
> > +}
> > +
> > +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			    int level, unsigned long attrs,
> > +			    gfn_t start, gfn_t end)
> > +{
> > +	unsigned long gfn;
> > +
> > +	if (level == PG_LEVEL_2M)
> > +		return mem_attrs_mixed_2m(kvm, attrs, start, end);
> > +
> > +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> > +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> > +		    !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> > +					    attrs))
> > +			return true;
> > +	return false;
> > +}
> > +
> > +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> > +						  struct kvm_memory_slot *slot,
> > +						  unsigned long attrs,
> > +						  gfn_t start, gfn_t end)
> > +{
> > +	unsigned long pages, mask;
> > +	gfn_t gfn, gfn_end, first, last;
> > +	int level;
> > +	bool mixed;
> > +
> > +	/*
> > +	 * The sequence matters here: we set the higher level based on the
> > +	 * lower level's scanning result.
> > +	 */
> > +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +		pages = KVM_PAGES_PER_HPAGE(level);
> > +		mask = ~(pages - 1);
> > +		first = start & mask;
> > +		last = (end - 1) & mask;
> > +
> > +		/*
> > +		 * We only need to scan the head and tail page, for middle pages
> > +		 * we know they will not be mixed.
> > +		 */
> > +		gfn = max(first, slot->base_gfn);
> > +		gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> > +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > +		linfo_set_mixed(gfn, slot, level, mixed);
> > +
> > +		if (first == last)
> > +			return;
> 
> 
> continue.

Ya!

> 
> > +
> > +		for (gfn = first + pages; gfn < last; gfn += pages)
> > +			linfo_set_mixed(gfn, slot, level, false);
> > +
> > +		gfn = last;
> > +		gfn_end = min(last + pages, slot->base_gfn + slot->npages);
> 
> if (gfn == gfn_end) continue.

Do you see a case where gfn can be equal to gfn_end? Though it does not
hurt to add a check.

> 
> 
> > +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > +		linfo_set_mixed(gfn, slot, level, mixed);
> > +	}
> > +}
> > +
> > +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > +				    struct kvm_memory_slot *slot,
> > +				    unsigned long attrs,
> > +				    gfn_t start, gfn_t end)
> > +{
> > +	if (kvm_slot_can_be_private(slot))
> > +		kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
> > +						      start, end);
> > +}
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9a07380f8d3c..5aefcff614d2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> >  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> >  			linfo[lpages - 1].disallow_lpage = 1;
> >  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > +		if (kvm_slot_can_be_private(slot))
> > +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> 
> Is there any alignment restriction? If not, it should be +=.
> In practice, alignment will hold, though.

All we need here is to check whether both userspace_addr and
restricted_offset are aligned to HPAGE_SIZE. '+=' can actually yield the
wrong value in cases where userspace_addr + restricted_offset is aligned
to HPAGE_SIZE but the two values individually are not.
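
A rough worked example, assuming 2MB (512-page) large pages and hypothetical
values where the sum is aligned but the parts are not:

  unsigned long uaddr_pages = 0x100000 >> PAGE_SHIFT;  /* userspace_addr    = 1MB -> 256 pages */
  unsigned long roff_pages  = 0x100000 >> PAGE_SHIFT;  /* restricted_offset = 1MB -> 256 pages */

  unsigned long with_or  = uaddr_pages | roff_pages;   /* 256: low 9 bits set   */
  unsigned long with_add = uaddr_pages + roff_pages;   /* 512: low 9 bits clear */

  /* With '|', (ugfn & 511) != 0, so the misalignment is caught and large pages
   * get disabled for the slot; with '+', the check would wrongly pass even
   * though neither value is 2MB-aligned on its own. */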

Thanks,
Chao
> 
> Thanks,
> 
> >  		/*
> >  		 * If the gfn and userspace address are not aligned wrt each
> >  		 * other, disable large page support for this slot.
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3331c0c92838..25099c94e770 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -592,6 +592,11 @@ struct kvm_memory_slot {
> >  	struct restrictedmem_notifier notifier;
> >  };
> >  
> > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> > +{
> > +	return slot && (slot->flags & KVM_MEM_PRIVATE);
> > +}
> > +
> >  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> >  {
> >  	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >  
> > +#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> > +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > +				    struct kvm_memory_slot *slot,
> > +				    unsigned long attrs,
> > +				    gfn_t start, gfn_t end);
> > +#else
> > +static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > +						  struct kvm_memory_slot *slot,
> > +						  unsigned long attrs,
> > +						  gfn_t start, gfn_t end)
> > +{
> > +}
> > +#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> > +
> >  #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 4e1e1e113bf0..e107afea32f0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> >  	return 0;
> >  }
> >  
> > -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> > +				unsigned long attrs)
> >  {
> >  	struct kvm_gfn_range gfn_range;
> >  	struct kvm_memory_slot *slot;
> > @@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> >  			gfn_range.slot = slot;
> >  
> >  			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > +
> > +			kvm_arch_set_memory_attributes(kvm, slot, attrs,
> > +						       gfn_range.start,
> > +						       gfn_range.end);
> >  		}
> >  	}
> >  
> > @@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >  		idx = srcu_read_lock(&kvm->srcu);
> >  		KVM_MMU_LOCK(kvm);
> >  		if (i > start)
> > -			kvm_unmap_mem_range(kvm, start, i);
> > +			kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
> >  		kvm_mmu_invalidate_end(kvm);
> >  		KVM_MMU_UNLOCK(kvm);
> >  		srcu_read_unlock(&kvm->srcu, idx);
> > -- 
> > 2.25.1
> > 
> 
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-06 11:53     ` Chao Peng
@ 2022-12-06 12:39       ` Fuad Tabba
  2022-12-07 15:10         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-06 12:39 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi Chao,

On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > Hi Chao,
> >
> > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > In memory encryption usage, guest memory may be encrypted with a special
> > > key and can be accessed only by the guest itself. We call such memory
> > > private memory. Allowing userspace to access guest private memory has
> > > little value and can sometimes cause problems. This new KVM memslot
> > > extension allows guest private memory to be provided through a
> > > restrictedmem-backed file descriptor (fd), with userspace access to the
> > > memory in the fd being restricted.
> > >
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > and the size is 'memory_size'.
> > >
> > > The extended memslot can still have the userspace_addr (hva). When used, a
> > > single memslot can maintain both private memory through restricted_fd
> > > and shared memory through userspace_addr. Whether the private or shared
> > > part is visible to the guest is maintained by other KVM code.
> > >
> > > A restrictedmem_notifier field is also added to the memslot structure to
> > > allow the restricted_fd's backing store to notify KVM of memory changes;
> > > KVM can then invalidate its page table entries or handle memory errors.
> > >
> > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > and right now it is selected on X86_64 only.
> > >
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> > >
> > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > Tested-by: Fuad Tabba <tabba@google.com>
> >
> > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > patch series anymore. Any reason you removed it, or is it just an
> > omission?
>
> We had some discussion in v9 [1] to add generic memory attributes ioctls
> and KVM_CAP_PRIVATE_MEM can be implemented as a new
> KVM_MEMORY_ATTRIBUTE_PRIVATE flag via KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> ioctl [2]. The api doc has been updated:
>
> +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
>
>
> [1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
> [2]
> https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/
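>
> With that, a userspace probe for private memory support becomes a
> capability-style check via the new ioctl; a rough sketch (assuming the
> uapi definitions from patch 2 of this series, with vm_fd an existing
> VM fd):
>
> 	__u64 attrs = 0;
> 	bool have_private;
>
> 	have_private = !ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &attrs) &&
> 		       (attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
> 	/* have_private == true means KVM_MEM_PRIVATE memslots can be used */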

I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
and my Reviewed/Tested-by still apply.

Cheers,
/fuad

>
> Thanks,
> Chao
> >
> > [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> >
> > Thanks,
> > /fuad
> >
> > > ---
> > >  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > >  arch/x86/kvm/Kconfig           |  2 ++
> > >  arch/x86/kvm/x86.c             |  2 +-
> > >  include/linux/kvm_host.h       |  8 ++++--
> > >  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
> > >  virt/kvm/Kconfig               |  3 +++
> > >  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
> > >  7 files changed, 114 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index bb2f709c0900..99352170c130 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > >  :Capability: KVM_CAP_USER_MEMORY
> > >  :Architectures: all
> > >  :Type: vm ioctl
> > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > >  :Returns: 0 on success, -1 on error
> > >
> > >  ::
> > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > >    };
> > >
> > > +  struct kvm_userspace_memory_region_ext {
> > > +       struct kvm_userspace_memory_region region;
> > > +       __u64 restricted_offset;
> > > +       __u32 restricted_fd;
> > > +       __u32 pad1;
> > > +       __u64 pad2[14];
> > > +  };
> > > +
> > >    /* for kvm_memory_region::flags */
> > >    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> > >    #define KVM_MEM_READONLY     (1UL << 1)
> > > +  #define KVM_MEM_PRIVATE              (1UL << 2)
> > >
> > >  This ioctl allows the user to create, modify or delete a guest physical
> > >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > >  be identical.  This allows large pages in the guest to be backed by large
> > >  pages in the host.
> > >
> > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > -to make a new slot read-only.  In this case, writes to this memory will be
> > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > +kvm_userspace_memory_region_ext struct includes all fields of
> > > +kvm_userspace_memory_region struct, while also adding fields for some other
> > > +features. See the description of the flags field below for more information.
> > > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > > +
> > > +The flags field supports following flags:
> > > +
> > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > +
> > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > +  read-only. In this case, writes to this memory will be posted to userspace as
> > > +  KVM_EXIT_MMIO exits.
> > > +
> > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > > +  memory backed by a file descriptor(fd) and userspace access to the fd may be
> > > +  restricted. Userspace should use restricted_fd/restricted_offset in the
> > > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > > +  to guest. Userspace should guarantee not to map the same host physical address
> > > +  indicated by restricted_fd/restricted_offset to different guest physical
> > > +  addresses within multiple memslots. Failure to do this may result in
> > > +  undefined behavior.
> > >
> > >  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > >  the memory region are automatically reflected into the guest.  For example, an
> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > index a8e379a3afee..690cb21010e7 100644
> > > --- a/arch/x86/kvm/Kconfig
> > > +++ b/arch/x86/kvm/Kconfig
> > > @@ -50,6 +50,8 @@ config KVM
> > >         select INTERVAL_TREE
> > >         select HAVE_KVM_PM_NOTIFIER if PM
> > >         select HAVE_KVM_MEMORY_ATTRIBUTES
> > > +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > >         help
> > >           Support hosting fully virtualized guest machines using hardware
> > >           virtualization extensions.  You will need a fairly recent
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 7f850dfb4086..9a07380f8d3c 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > >         }
> > >
> > >         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > -               struct kvm_userspace_memory_region m;
> > > +               struct kvm_user_mem_region m;
> > >
> > >                 m.slot = id | (i << 16);
> > >                 m.flags = 0;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a784e2b06625..02347e386ea2 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -44,6 +44,7 @@
> > >
> > >  #include <asm/kvm_host.h>
> > >  #include <linux/kvm_dirty_ring.h>
> > > +#include <linux/restrictedmem.h>
> > >
> > >  #ifndef KVM_MAX_VCPU_IDS
> > >  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > >         u32 flags;
> > >         short id;
> > >         u16 as_id;
> > > +       struct file *restricted_file;
> > > +       loff_t restricted_offset;
> > > +       struct restrictedmem_notifier notifier;
> > >  };
> > >
> > >  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > >  };
> > >
> > >  int kvm_set_memory_region(struct kvm *kvm,
> > > -                         const struct kvm_userspace_memory_region *mem);
> > > +                         const struct kvm_user_mem_region *mem);
> > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > -                           const struct kvm_userspace_memory_region *mem);
> > > +                           const struct kvm_user_mem_region *mem);
> > >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 5d0941acb5bb..13bff963b8b0 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > >  };
> > >
> > > +struct kvm_userspace_memory_region_ext {
> > > +       struct kvm_userspace_memory_region region;
> > > +       __u64 restricted_offset;
> > > +       __u32 restricted_fd;
> > > +       __u32 pad1;
> > > +       __u64 pad2[14];
> > > +};
> > > +
> > > +#ifdef __KERNEL__
> > > +/*
> > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > + * all fields from the top-level "extended" region.
> > > + */
> > > +struct kvm_user_mem_region {
> > > +       __u32 slot;
> > > +       __u32 flags;
> > > +       __u64 guest_phys_addr;
> > > +       __u64 memory_size;
> > > +       __u64 userspace_addr;
> > > +       __u64 restricted_offset;
> > > +       __u32 restricted_fd;
> > > +       __u32 pad1;
> > > +       __u64 pad2[14];
> > > +};
> > > +#endif
> > > +
> > >  /*
> > >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > >   * other bits are reserved for kvm internal use which are defined in
> > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > >   */
> > >  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> > >  #define KVM_MEM_READONLY       (1UL << 1)
> > > +#define KVM_MEM_PRIVATE                (1UL << 2)
> > >
> > >  /* for KVM_IRQ_LINE */
> > >  struct kvm_irq_level {
> > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > index effdea5dd4f0..d605545d6dd1 100644
> > > --- a/virt/kvm/Kconfig
> > > +++ b/virt/kvm/Kconfig
> > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > >
> > >  config HAVE_KVM_PM_NOTIFIER
> > >         bool
> > > +
> > > +config HAVE_KVM_RESTRICTED_MEM
> > > +       bool
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > >         }
> > >  }
> > >
> > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > >  {
> > >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > >
> > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > >   * Must be called holding kvm->slots_lock for write.
> > >   */
> > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > -                           const struct kvm_userspace_memory_region *mem)
> > > +                           const struct kvm_user_mem_region *mem)
> > >  {
> > >         struct kvm_memory_slot *old, *new;
> > >         struct kvm_memslots *slots;
> > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > >
> > >  int kvm_set_memory_region(struct kvm *kvm,
> > > -                         const struct kvm_userspace_memory_region *mem)
> > > +                         const struct kvm_user_mem_region *mem)
> > >  {
> > >         int r;
> > >
> > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > >  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > >
> > >  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > -                                         struct kvm_userspace_memory_region *mem)
> > > +                                         struct kvm_user_mem_region *mem)
> > >  {
> > >         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > >                 return -EINVAL;
> > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > >         return fd;
> > >  }
> > >
> > > +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> > > +do {                                                                           \
> > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> > > +                    offsetof(struct kvm_userspace_memory_region, field));      \
> > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> > > +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> > > +} while (0)
> > > +
> > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> > > +do {                                                                                   \
> > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> > > +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> > > +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> > > +} while (0)
> > > +
> > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > +{
> > > +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > +}
> > > +
> > >  static long kvm_vm_ioctl(struct file *filp,
> > >                            unsigned int ioctl, unsigned long arg)
> > >  {
> > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > >                 break;
> > >         }
> > >         case KVM_SET_USER_MEMORY_REGION: {
> > > -               struct kvm_userspace_memory_region kvm_userspace_mem;
> > > +               struct kvm_user_mem_region mem;
> > > +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > +
> > > +               kvm_sanity_check_user_mem_region_alias();
> > >
> > >                 r = -EFAULT;
> > > -               if (copy_from_user(&kvm_userspace_mem, argp,
> > > -                                               sizeof(kvm_userspace_mem)))
> > > +               if (copy_from_user(&mem, argp, size))
> > > +                       goto out;
> > > +
> > > +               r = -EINVAL;
> > > +               if (mem.flags & KVM_MEM_PRIVATE)
> > >                         goto out;
> > >
> > > -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > >                 break;
> > >         }
> > >         case KVM_GET_DIRTY_LOG: {
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
@ 2022-12-06 13:34   ` Fabiano Rosas
  2022-12-07 14:31     ` Chao Peng
  2022-12-06 15:07   ` Fuad Tabba
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 398+ messages in thread
From: Fabiano Rosas @ 2022-12-06 13:34 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

Chao Peng <chao.p.peng@linux.intel.com> writes:

> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>     a guest memory range.
>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>     memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
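>
> A hypothetical userspace caller converting a page-aligned range to
> private, and retrying any tail that was not yet set, might look like
> this (vm_fd, gpa and len are placeholders for the VMM's own values):
>
> 	struct kvm_memory_attributes attrs = {
> 		.address    = gpa,
> 		.size       = len,
> 		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
> 	};
> 	int ret;
>
> 	do {
> 		ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
> 	} while (!ret && attrs.size);	/* retry the not-yet-converted tail */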
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>  arch/x86/kvm/Kconfig           |  1 +
>  include/linux/kvm_host.h       |  3 ++
>  include/uapi/linux/kvm.h       | 17 ++++++++
>  virt/kvm/Kconfig               |  3 ++
>  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>  6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.

This wording could cause some confusion; what about a simpler wording:

"reflect the range of pages that had its attributes successfully set"

> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully.

"address + 1 page" or "subsequent page" perhaps.

In fact, wouldn't this all become simpler if size were the number of pages instead?

> The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +

...

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;

Here PAGE_SIZE and -1 cancel out.

Consider using gpa_to_gfn as well.
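
Something like this, for instance (both values are already checked to be
page aligned above, and gpa_to_gfn() is just the PAGE_SHIFT shift):

	start = gpa_to_gfn(attrs->address);
	end = gpa_to_gfn(attrs->address + attrs->size);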

> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #ifdef CONFIG_HAVE_KVM_MSI
>  	case KVM_CAP_SIGNAL_MSI:
>  #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
>  #ifdef CONFIG_HAVE_KVM_IRQFD
>  	case KVM_CAP_IRQFD:
>  	case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> +		u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +			goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_SET_MEMORY_ATTRIBUTES: {
> +		struct kvm_memory_attributes attrs;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> +		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> +			r = -EFAULT;
> +		break;
> +	}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
>  	case KVM_CREATE_DEVICE: {
>  		struct kvm_create_device cd;

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
@ 2022-12-06 14:57   ` Fuad Tabba
  2022-12-07 13:50     ` Chao Peng
  2022-12-13 23:49   ` Huang, Kai
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-06 14:57 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through the new in-kernel interface by a third kernel module.
>
> memfd_restricted() is useful for scenarios where a file descriptor (fd)
> can be used as an interface into mm but we want to restrict userspace's
> ability to operate on the fd. Initially it is designed to provide
> protections for KVM encrypted guest memory.
>
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with a special key for a specific software domain (e.g. a KVM guest) and is
> not expected to be directly accessed by userspace. More precisely, userspace
> access to such encrypted memory may lead to a host crash, so it should be
> prevented.
>
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memoy content into KVM userspace.

nit: memory

>
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
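>
> A minimal userspace sketch of that flow (hypothetical, not part of this
> patch; there is no libc wrapper yet so the raw syscall number is used,
> region is a struct kvm_userspace_memory_region_ext set up as usual, and
> the KVM_MEM_PRIVATE memslot bits only become accepted with the later
> patches in this series):
>
> 	int fd = syscall(__NR_memfd_restricted, 0);
>
> 	ftruncate(fd, mem_size);		/* page-aligned backing size */
> 	region.restricted_fd = fd;
> 	region.restricted_offset = 0;
> 	region.flags |= KVM_MEM_PRIVATE;
> 	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);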
>
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When hole-punched, KVM can get notified through the
> invalidate_start/invalidate_end() callbacks, and KVM then gets a chance to
> remove any mapped entries of the range from the secondary page tables.
>
> A machine check can happen for memory pages in the restricted memfd;
> instead of routing this directly to userspace, we call the error()
> callback that KVM registered. KVM then gets a chance to handle it
> correctly.
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable; this is required for current confidential
> usage, but in the future this might be changed.
>
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
>
> The system call is currently wired up for x86 arch.

Reviewed-by: Fuad Tabba <tabba@google.com>
After wiring the system call for arm64 (on qemu/arm64):
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad



>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  71 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/memory-failure.c                    |   3 +
>  mm/restrictedmem.c                     | 318 +++++++++++++++++++++++++
>  11 files changed, 408 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448    i386    process_mrelease        sys_process_mrelease
>  449    i386    futex_waitv             sys_futex_waitv
>  450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> +451    i386    memfd_restricted        sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
>  450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
> +451    common  memfd_restricted        sys_memfd_restricted
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..c2700c5daa43
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +       void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +                                pgoff_t start, pgoff_t end);
> +       void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +                              pgoff_t start, pgoff_t end);
> +       void (*error)(struct restrictedmem_notifier *notifier,
> +                              pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +       struct list_head list;
> +       const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                                        struct page **pagep, int *order)
> +{
> +       return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return false;
> +}
> +
> +static inline void restrictedmem_error_page(struct page *page,
> +                                           struct address_space *mapping)
> +{
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>                                             unsigned long home_node,
>                                             unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
>  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
>  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> +#define RESTRICTEDMEM_MAGIC    0x5245534d      /* "RESM" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 57e1d8c5b505..06b0e1d6b8c1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>         def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config RESTRICTEDMEM
> +       bool
> +       depends on TMPFS
> +
>  config ANON_VMA_NAME
>         bool "Anonymous VMA name support"
>         depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 8e105e5b3e29..bcbb0edf9ba1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 145bb561ddb3..f91b444e471e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -62,6 +62,7 @@
>  #include <linux/page-isolation.h>
>  #include <linux/pagewalk.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/restrictedmem.h>
>  #include "swap.h"
>  #include "internal.h"
>  #include "ras/ras_event.h"
> @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
>                 goto out;
>         }
>
> +       restrictedmem_error_page(p, mapping);
> +
>         /*
>          * The shmem page is kept in page cache instead of truncating
>          * so is expected to have an extra refcount after error-handling.
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..56953c204e5c
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,318 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> +                                          pgoff_t start, pgoff_t end)
> +{
> +       struct restrictedmem_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               notifier->ops->invalidate_start(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
> +                                        pgoff_t start, pgoff_t end)
> +{
> +       struct restrictedmem_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               notifier->ops->invalidate_end(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static void restrictedmem_notifier_error(struct restrictedmem_data *data,
> +                                        pgoff_t start, pgoff_t end)
> +{
> +       struct restrictedmem_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               notifier->ops->error(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> +                                    loff_t offset, loff_t len)
> +{
> +       int ret;
> +       pgoff_t start, end;
> +       struct file *memfd = data->memfd;
> +
> +       if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +               return -EINVAL;
> +
> +       start = offset >> PAGE_SHIFT;
> +       end = (offset + len) >> PAGE_SHIFT;
> +
> +       restrictedmem_invalidate_start(data, start, end);
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +       restrictedmem_invalidate_end(data, start, end);
> +
> +       return ret;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +                                   loff_t offset, loff_t len)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE)
> +               return restrictedmem_punch_hole(data, mode, offset, len);
> +
> +       return memfd->f_op->fallocate(memfd, mode, offset, len);
> +}
> +
> +static const struct file_operations restrictedmem_fops = {
> +       .release = restrictedmem_release,
> +       .fallocate = restrictedmem_fallocate,
> +};
> +
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> +                                const struct path *path, struct kstat *stat,
> +                                u32 request_mask, unsigned int query_flags)
> +{
> +       struct inode *inode = d_inode(path->dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +                                            request_mask, query_flags);
> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> +                                struct dentry *dentry, struct iattr *attr)
> +{
> +       struct inode *inode = d_inode(dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (attr->ia_valid & ATTR_SIZE) {
> +               if (memfd->f_inode->i_size)
> +                       return -EPERM;
> +
> +               if (!PAGE_ALIGNED(attr->ia_size))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +                                           file_dentry(memfd), attr);
> +       return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> +       .getattr = restrictedmem_getattr,
> +       .setattr = restrictedmem_setattr,
> +};
> +
> +static int restrictedmem_init_fs_context(struct fs_context *fc)
> +{
> +       if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> +               return -ENOMEM;
> +
> +       fc->s_iflags |= SB_I_NOEXEC;
> +       return 0;
> +}
> +
> +static struct file_system_type restrictedmem_fs = {
> +       .owner          = THIS_MODULE,
> +       .name           = "memfd:restrictedmem",
> +       .init_fs_context = restrictedmem_init_fs_context,
> +       .kill_sb        = kill_anon_super,
> +};
> +
> +static struct vfsmount *restrictedmem_mnt;
> +
> +static __init int restrictedmem_init(void)
> +{
> +       restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> +       if (IS_ERR(restrictedmem_mnt))
> +               return PTR_ERR(restrictedmem_mnt);
> +       return 0;
> +}
> +fs_initcall(restrictedmem_init);
> +
> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +       struct restrictedmem_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &restrictedmem_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +                                "restrictedmem", O_RDWR,
> +                                &restrictedmem_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);
> +               return ERR_CAST(file);
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       /*
> +        * These pages are currently unmovable so don't place them into movable
> +        * pageblocks (e.g. CMA and ZONE_MOVABLE).
> +        */
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}
> +
> +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +{
> +       struct file *file, *restricted_file;
> +       int fd, err;
> +
> +       if (flags)
> +               return -EINVAL;
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0)
> +               return fd;
> +
> +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +       if (IS_ERR(file)) {
> +               err = PTR_ERR(file);
> +               goto err_fd;
> +       }
> +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +       file->f_flags |= O_LARGEFILE;
> +
> +       restricted_file = restrictedmem_file_create(file);
> +       if (IS_ERR(restricted_file)) {
> +               err = PTR_ERR(restricted_file);
> +               fput(file);
> +               goto err_fd;
> +       }
> +
> +       fd_install(fd, restricted_file);
> +       return fd;
> +err_fd:
> +       put_unused_fd(fd);
> +       return err;
> +}
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> +
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct folio *folio;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       page = folio_file_page(folio, offset);
> +       *pagep = page;
> +       if (order)
> +               *order = thp_order(compound_head(page));
> +
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> +
> +void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> +{
> +       struct super_block *sb = restrictedmem_mnt->mnt_sb;
> +       struct inode *inode, *next;
> +
> +       if (!shmem_mapping(mapping))
> +               return;
> +
> +       spin_lock(&sb->s_inode_list_lock);
> +       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> +               struct restrictedmem_data *data = inode->i_mapping->private_data;
> +               struct file *memfd = data->memfd;
> +
> +               if (memfd->f_mapping == mapping) {
> +                       pgoff_t start, end;
> +
> +                       spin_unlock(&sb->s_inode_list_lock);
> +
> +                       start = page->index;
> +                       end = start + thp_nr_pages(page);
> +                       restrictedmem_notifier_error(data, start, end);
> +                       return;
> +               }
> +       }
> +       spin_unlock(&sb->s_inode_list_lock);
> +}
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
  2022-12-06 13:34   ` Fabiano Rosas
@ 2022-12-06 15:07   ` Fuad Tabba
  2022-12-07 14:51     ` Chao Peng
  2022-12-16 15:09   ` Borislav Petkov
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-06 15:07 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>     a guest memory range.
>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>     memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>  arch/x86/kvm/Kconfig           |  1 +
>  include/linux/kvm_host.h       |  3 ++
>  include/uapi/linux/kvm.h       | 17 ++++++++
>  virt/kvm/Kconfig               |  3 ++
>  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>  6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +       __u64 address;
> +       __u64 size;
> +       __u64 attributes;
> +       __u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
>  5. The kvm_run structure
>  ========================
>
> @@ -8270,6 +8323,16 @@ structure.
>  When getting the Modified Change Topology Report value, the attr->addr
>  must point to a byte where the value will be stored or retrieved from.
>
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates that KVM supports per-page memory attributes and that
> +the KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES ioctls are
> +available.
> +
>  9. Known KVM API problems
>  =========================
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>         select SRCU
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
> +       select HAVE_KVM_MEMORY_ATTRIBUTES
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>         struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       struct xarray mem_attr_array;
>  #endif
>         char stats_id[KVM_STATS_NAME_SIZE];
>  };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_CPU_TOPOLOGY 222
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> +       __u64 address;
> +       __u64 size;
> +       __u64 attributes;
> +       __u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)

nit: how about using the BIT() macro for these?
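
For reference, BIT() isn't available in UAPI headers; the exported equivalent
is _BITULL() from <linux/const.h>, so a sketch of that suggestion (not part of
this patch) would be:

  #include <linux/const.h>

  #define KVM_MEMORY_ATTRIBUTE_READ              _BITULL(0)
  #define KVM_MEMORY_ATTRIBUTE_WRITE             _BITULL(1)
  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           _BITULL(2)
  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           _BITULL(3)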

> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
>  config HAVE_KVM_DIRTY_RING
>         bool
>
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>  # Only strongly ordered architectures can select this, as it doesn't
>  # put any explicit constraint on userspace ordering. They can also
>  # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>         spin_lock_init(&kvm->mn_invalidate_lock);
>         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>         xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       xa_init(&kvm->mem_attr_array);
> +#endif
>
>         INIT_LIST_HEAD(&kvm->gpc_list);
>         spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>         }
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       xa_destroy(&kvm->mem_attr_array);
> +#endif
>         cleanup_srcu_struct(&kvm->irq_srcu);
>         cleanup_srcu_struct(&kvm->srcu);
>         kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +       return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +                                          struct kvm_memory_attributes *attrs)
> +{
> +       gfn_t start, end;
> +       unsigned long i;
> +       void *entry;
> +       u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +       /* flags is currently not used. */

nit: "is reserved"? I think it makes it a bit clearer what its purpose is.

> +       if (attrs->flags)
> +               return -EINVAL;
> +       if (attrs->attributes & ~supported_attrs)
> +               return -EINVAL;
> +       if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +               return -EINVAL;
> +       if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +               return -EINVAL;
> +
> +       start = attrs->address >> PAGE_SHIFT;
> +       end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;

Would using existing helpers be better for getting the frame numbers?
Also, the code checks that the address and size are page aligned, so
the end rounding up seems redundant, and might even be wrong if the
address+size-1 is close to the gfn_t limit (which this code tries to
avoid in an earlier check).
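
To make that concrete, one possible shape using the existing gpa_to_gfn()
helper (and relying on the PAGE_ALIGNED checks above, so no rounding is
needed) could be:

  start = gpa_to_gfn(attrs->address);
  end = start + (attrs->size >> PAGE_SHIFT);

and gfn_to_gpa() could similarly be used when writing address/size back below.
Just a sketch of the idea, of course.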

> +       entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +       mutex_lock(&kvm->lock);
> +       for (i = start; i < end; i++)
> +               if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +                                   GFP_KERNEL_ACCOUNT)))
> +                       break;
> +       mutex_unlock(&kvm->lock);
> +
> +       attrs->address = i << PAGE_SHIFT;
> +       attrs->size = (end - i) << PAGE_SHIFT;

nit: helpers for these too?

With the end calculation fixed,

Reviewed-by: Fuad Tabba <tabba@google.com>
After adding the necessary configs for arm64 (on qemu/arm64):
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> +
> +       return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>         return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #ifdef CONFIG_HAVE_KVM_MSI
>         case KVM_CAP_SIGNAL_MSI:
>  #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
>  #ifdef CONFIG_HAVE_KVM_IRQFD
>         case KVM_CAP_IRQFD:
>         case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> +               u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +                       goto out;
> +               r = 0;
> +               break;
> +       }
> +       case KVM_SET_MEMORY_ATTRIBUTES: {
> +               struct kvm_memory_attributes attrs;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +                       goto out;
> +
> +               r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> +               if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> +                       r = -EFAULT;
> +               break;
> +       }
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
>         case KVM_CREATE_DEVICE: {
>                 struct kvm_create_device cd;
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-12-02  6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-12-06 15:47   ` Fuad Tabba
  2022-12-07 15:11     ` Chao Peng
  2023-01-13 23:13   ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-06 15:47 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates that an error happened in KVM at guest memory range [gpa, gpa+size).
> The 'flags' field includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory': '1' indicates the error
> happened due to a private memory access and '0' indicates it happened due to a
> shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, there are typically two kinds of memory
> conversions:
>   - explicit conversion: happens when the guest explicitly calls into KVM
>     to map a range (as private or shared); KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in the KVM page fault handler, where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  8 ++++++++
>  2 files changed, 30 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 99352170c130..d9edb14ce30b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> +                       __u64 flags;

I see you've removed the padding and increased the flag size.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad




> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
> +
> +If the exit reason is KVM_EXIT_MEMORY_FAULT, it indicates that the VCPU has
> +encountered a memory error which is not handled by the KVM kernel module and
> +which userspace may choose to handle. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - when the bit is set, the memory error was
> +   caused by a private memory access; when the bit is clear, it was caused
> +   by a shared memory access.
> +
> +'gpa' and 'size' indicate the memory range at which the error occurred.
> +Userspace may handle the error and return to KVM to retry the previous memory
> +access.
> +
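
As an illustration of the documented flow, a userspace vcpu run loop might
handle this exit roughly as follows (run is the mmap-ed struct kvm_run;
handle_conversion() is a hypothetical helper that performs the shared<->private
conversion):

  if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
          bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

          /* Convert [gpa, gpa + size) to the requested state ... */
          handle_conversion(run->memory.gpa, run->memory.size, to_private);
          /* ... then re-enter the guest so the access is retried. */
  }
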
>  ::
>
>      /* KVM_EXIT_NOTIFY */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13bff963b8b0..c7e9d375a902 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -541,6 +542,13 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
>                         __u32 flags;
>                 } notify;
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1ULL << 0)
> +                       __u64 flags;
> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
>                 /* Fix the size of the union. */
>                 char padding[256];
>         };
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-12-06 11:56     ` Chao Peng
@ 2022-12-06 15:48       ` Fuad Tabba
  2022-12-09  6:24         ` Chao Peng
  2022-12-07  6:34       ` Isaku Yamahata
  1 sibling, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-06 15:48 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Tue, Dec 6, 2022 at 12:01 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Mon, Dec 05, 2022 at 09:23:49AM +0000, Fuad Tabba wrote:
> > Hi Chao,
> >
> > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > Currently in the mmu_notifier invalidate path, the hva range is recorded
> > > and then checked against by mmu_notifier_retry_hva() in the page fault
> > > handling path. However, for the to-be-introduced private memory, a page
> > > fault may not have an hva associated with it, so checking the gfn (gpa)
> > > makes more sense.
> > >
> > > For existing hva-based shared memory, gfn is expected to also work. The
> > > only downside is that when multiple gfns alias a single hva, the current
> > > algorithm of checking multiple ranges could result in a much larger range
> > > being rejected. Such aliasing should be uncommon, so the impact is
> > > expected to be small.
> > >
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c   |  8 +++++---
> > >  include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> > >  virt/kvm/kvm_main.c      | 32 +++++++++++++++++++++++---------
> > >  3 files changed, 49 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 4736d7849c60..e2c70b5afa3e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> > >                 return true;
> > >
> > >         return fault->slot &&
> > > -              mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > > +              mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> > >  }
> > >
> > >  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > >
> > >         write_lock(&kvm->mmu_lock);
> > >
> > > -       kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> > > +       kvm_mmu_invalidate_begin(kvm);
> > > +
> > > +       kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> > >
> > >         flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> > >
> > > @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > >                 kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> > >                                                    gfn_end - gfn_start);
> > >
> > > -       kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> > > +       kvm_mmu_invalidate_end(kvm);
> > >
> > >         write_unlock(&kvm->mmu_lock);
> > >  }
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 02347e386ea2..3d69484d2704 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -787,8 +787,8 @@ struct kvm {
> > >         struct mmu_notifier mmu_notifier;
> > >         unsigned long mmu_invalidate_seq;
> > >         long mmu_invalidate_in_progress;
> > > -       unsigned long mmu_invalidate_range_start;
> > > -       unsigned long mmu_invalidate_range_end;
> > > +       gfn_t mmu_invalidate_range_start;
> > > +       gfn_t mmu_invalidate_range_end;
> > >  #endif
> > >         struct list_head devices;
> > >         u64 manual_dirty_log_protect;
> > > @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > >  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > >  #endif
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > -                             unsigned long end);
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > > -                           unsigned long end);
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm);
> > >
> > >  long kvm_arch_dev_ioctl(struct file *filp,
> > >                         unsigned int ioctl, unsigned long arg);
> > > @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> > >         return 0;
> > >  }
> > >
> > > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> > >                                            unsigned long mmu_seq,
> > > -                                          unsigned long hva)
> > > +                                          gfn_t gfn)
> > >  {
> > >         lockdep_assert_held(&kvm->mmu_lock);
> > >         /*
> > > @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > >          * that might be being invalidated. Note that it may include some false
> >
> > nit: "might be" (or) "is being"
> >
> > >          * positives, due to shortcuts when handing concurrent invalidations.
> >
> > nit: handling
>
> Both are existing code, but I can fix them either way.

That was just a nit, please feel free to ignore it, especially if it
might cause headaches in the future with merges.
>
> >
> > >          */
> > > -       if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > -           hva >= kvm->mmu_invalidate_range_start &&
> > > -           hva < kvm->mmu_invalidate_range_end)
> > > -               return 1;
> > > +       if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > +               /*
> > > +                * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > +                * but before updating the range is a KVM bug.
> > > +                */
> > > +               if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > +                                kvm->mmu_invalidate_range_end == INVALID_GPA))
> >
> > INVALID_GPA is an x86-specific define in
> > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > architectures. The obvious fix is to move it to
> > include/linux/kvm_host.h.
>
> Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this is
> the correct choice for other architectures, but after a search it has not
> been used by other architectures, so it should be safe to make it common.

With this fixed,

Reviewed-by: Fuad Tabba <tabba@google.com>
And the necessary work to port to arm64 (on qemu/arm64):
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


>
> Thanks,
> Chao
> >
> > Cheers,
> > /fuad
> >
> > > +                       return 1;
> > > +
> > > +               if (gfn >= kvm->mmu_invalidate_range_start &&
> > > +                   gfn < kvm->mmu_invalidate_range_end)
> > > +                       return 1;
> > > +       }
> > > +
> > >         if (kvm->mmu_invalidate_seq != mmu_seq)
> > >                 return 1;
> > >         return 0;
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index b882eb2c76a2..ad55dfbc75d7 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> > >
> > >  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> > >
> > > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > > -                            unsigned long end);
> > > -
> > > +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> > >  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> > >
> > >  struct kvm_hva_range {
> > > @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > >                                 locked = true;
> > >                                 KVM_MMU_LOCK(kvm);
> > >                                 if (!IS_KVM_NULL_FN(range->on_lock))
> > > -                                       range->on_lock(kvm, range->start, range->end);
> > > +                                       range->on_lock(kvm);
> > > +
> > >                                 if (IS_KVM_NULL_FN(range->handler))
> > >                                         break;
> > >                         }
> > > @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > >         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > >  }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > -                             unsigned long end)
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > >  {
> > >         /*
> > >          * The count increase must become visible at unlock time as no
> > > @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > >          * count is also read inside the mmu_lock critical section.
> > >          */
> > >         kvm->mmu_invalidate_in_progress++;
> > > +
> > > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > +               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > +       }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > >         if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > >                 kvm->mmu_invalidate_range_start = start;
> > >                 kvm->mmu_invalidate_range_end = end;
> > > @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > >         }
> > >  }
> > >
> > > +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > +       kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > +       return kvm_unmap_gfn_range(kvm, range);
> > > +}
> > > +
> > >  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >                                         const struct mmu_notifier_range *range)
> > >  {
> > > @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >                 .start          = range->start,
> > >                 .end            = range->end,
> > >                 .pte            = __pte(0),
> > > -               .handler        = kvm_unmap_gfn_range,
> > > +               .handler        = kvm_mmu_unmap_gfn_range,
> > >                 .on_lock        = kvm_mmu_invalidate_begin,
> > >                 .on_unlock      = kvm_arch_guest_memory_reclaimed,
> > >                 .flush_on_ret   = true,
> > > @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >         return 0;
> > >  }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > > -                           unsigned long end)
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > >  {
> > >         /*
> > >          * This sequence increase will notify the kvm page fault that
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-12-06 11:56     ` Chao Peng
  2022-12-06 15:48       ` Fuad Tabba
@ 2022-12-07  6:34       ` Isaku Yamahata
  2022-12-07 15:14         ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2022-12-07  6:34 UTC (permalink / raw)
  To: Chao Peng
  Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
	Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang,
	isaku.yamahata

On Tue, Dec 06, 2022 at 07:56:23PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> > > -       if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > -           hva >= kvm->mmu_invalidate_range_start &&
> > > -           hva < kvm->mmu_invalidate_range_end)
> > > -               return 1;
> > > +       if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > +               /*
> > > +                * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > +                * but before updating the range is a KVM bug.
> > > +                */
> > > +               if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > +                                kvm->mmu_invalidate_range_end == INVALID_GPA))
> > 
> > INVALID_GPA is an x86-specific define in
> > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > architectures. The obvious fix is to move it to
> > include/linux/kvm_host.h.
> 
> Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this is
> the correct choice for other architectures, but after a search it has not
> been used by other architectures, so it should be safe to make it common.

INVALID_GPA is defined as all bits set to 1. Please note the "~" (tilde).

#define INVALID_GPA (~(gpa_t)0)
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2022-12-06 12:02     ` Chao Peng
@ 2022-12-07  6:42       ` Isaku Yamahata
  2022-12-08 11:17         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2022-12-07  6:42 UTC (permalink / raw)
  To: Chao Peng
  Cc: Isaku Yamahata, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
	Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Dec 06, 2022 at 08:02:24PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> > On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > 
> > > A large page with mixed private/shared subpages can't be mapped as a large
> > > page since its private/shared subpages come from different memory backends
> > > and may also be treated differently by the architecture. When private and
> > > shared memory are mixed in a large page, the current lpage_info is not
> > > sufficient to decide whether the page can be mapped as a large page or
> > > not, and additional private/shared mixed information is needed.
> > > 
> > > Tracking this 'mixed' information with the current count-like
> > > disallow_lpage is a bit challenging, so reserve a bit in 'disallow_lpage'
> > > to indicate that a large page has mixed private/shared subpages, and
> > > update this 'mixed' bit whenever the memory attribute is changed between
> > > private and shared.
> > > 
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |   8 ++
> > >  arch/x86/kvm/mmu/mmu.c          | 134 +++++++++++++++++++++++++++++++-
> > >  arch/x86/kvm/x86.c              |   2 +
> > >  include/linux/kvm_host.h        |  19 +++++
> > >  virt/kvm/kvm_main.c             |   9 ++-
> > >  5 files changed, 169 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 283cbb83d6ae..7772ab37ac89 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -38,6 +38,7 @@
> > >  #include <asm/hyperv-tlfs.h>
> > >  
> > >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> > >  
> > >  #define KVM_MAX_VCPUS 1024
> > >  
> > > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> > >  #endif
> > >  };
> > >  
> > > +/*
> > > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> > > + * level. The remaining bits are used as a reference count.
> > > + */
> > > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
> > > +#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
> > > +
> > >  struct kvm_lpage_info {
> > >  	int disallow_lpage;
> > >  };
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e2c70b5afa3e..2190fd8c95c0 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > >  {
> > >  	struct kvm_lpage_info *linfo;
> > >  	int i;
> > > +	int disallow_count;
> > >  
> > >  	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > >  		linfo = lpage_info_slot(gfn, slot, i);
> > > +
> > > +		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > > +		WARN_ON(disallow_count + count < 0 ||
> > > +			disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > > +
> > >  		linfo->disallow_lpage += count;
> > > -		WARN_ON(linfo->disallow_lpage < 0);
> > >  	}
> > >  }
> > >  
> > > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > >  	if (kvm->arch.nx_huge_page_recovery_thread)
> > >  		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> > >  }
> > > +
> > > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > > +{
> > > +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > +}
> > > +
> > > +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> > > +			    int level, bool mixed)
> > > +{
> > > +	struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> > > +
> > > +	if (mixed)
> > > +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > +	else
> > > +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > +}
> > > +
> > > +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> > > +{
> > > +	bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > > +
> > > +	if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> > > +		if (!expect_private)
> > > +			return false;
> > > +	} else if (expect_private)
> > > +		return false;
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> > > +			       gfn_t start, gfn_t end)
> > > +{
> > > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > > +	gfn_t gfn = start;
> > > +	void *entry;
> > > +	bool mixed = false;
> > > +
> > > +	rcu_read_lock();
> > > +	entry = xas_load(&xas);
> > > +	while (gfn < end) {
> > > +		if (xas_retry(&xas, entry))
> > > +			continue;
> > > +
> > > +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > > +
> > > +		if (!is_expected_attr_entry(entry, attrs)) {
> > > +			mixed = true;
> > > +			break;
> > > +		}
> > > +
> > > +		entry = xas_next(&xas);
> > > +		gfn++;
> > > +	}
> > > +
> > > +	rcu_read_unlock();
> > > +	return mixed;
> > > +}
> > > +
> > > +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +			    int level, unsigned long attrs,
> > > +			    gfn_t start, gfn_t end)
> > > +{
> > > +	unsigned long gfn;
> > > +
> > > +	if (level == PG_LEVEL_2M)
> > > +		return mem_attrs_mixed_2m(kvm, attrs, start, end);
> > > +
> > > +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> > > +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> > > +		    !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> > > +					    attrs))
> > > +			return true;
> > > +	return false;
> > > +}
> > > +
> > > +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> > > +						  struct kvm_memory_slot *slot,
> > > +						  unsigned long attrs,
> > > +						  gfn_t start, gfn_t end)
> > > +{
> > > +	unsigned long pages, mask;
> > > +	gfn_t gfn, gfn_end, first, last;
> > > +	int level;
> > > +	bool mixed;
> > > +
> > > +	/*
> > > +	 * The sequence matters here: we set the higher level basing on the
> > > +	 * lower level's scanning result.
> > > +	 */
> > > +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > > +		pages = KVM_PAGES_PER_HPAGE(level);
> > > +		mask = ~(pages - 1);
> > > +		first = start & mask;
> > > +		last = (end - 1) & mask;
> > > +
> > > +		/*
> > > +		 * We only need to scan the head and tail page, for middle pages
> > > +		 * we know they will not be mixed.
> > > +		 */
> > > +		gfn = max(first, slot->base_gfn);
> > > +		gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> > > +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > > +		linfo_set_mixed(gfn, slot, level, mixed);
> > > +
> > > +		if (first == last)
> > > +			return;
> > 
> > 
> > continue.
> 
> Ya!
> 
> > 
> > > +
> > > +		for (gfn = first + pages; gfn < last; gfn += pages)
> > > +			linfo_set_mixed(gfn, slot, level, false);
> > > +
> > > +		gfn = last;
> > > +		gfn_end = min(last + pages, slot->base_gfn + slot->npages);
> > 
> > if (gfn == gfn_end) continue.
> 
> Do you see a case where gfn can be equal to gfn_end? Though it does not
> hurt to add a check.

If last == base_gfn + npages, gfn == gfn_end can occur.


> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 9a07380f8d3c..5aefcff614d2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> > >  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> > >  			linfo[lpages - 1].disallow_lpage = 1;
> > >  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > > +		if (kvm_slot_can_be_private(slot))
> > > +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> > 
> > Is there any alignment restriction? If not, it should be +=.
> > In practice, alignment will hold though.
> 
> All we need here is to check whether both userspace_addr and
> restricted_offset are aligned to HPAGE_SIZE or not. '+=' can actually
> yield a wrong value in cases where userspace_addr + restricted_offset is
> aligned to HPAGE_SIZE but individually they are not aligned to HPAGE_SIZE.
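
(A concrete instance of the case Chao describes, with made-up numbers: for 2MB
huge pages and a 2MB-aligned base_gfn, userspace_addr = 1MB and
restricted_offset = 1MB sum to a 2MB-aligned value, so with '+=' the
gfn/userspace-address alignment check below would leave large pages enabled
even though neither offset is 2MB-aligned on its own; OR-ing keeps the
misaligned low bits visible.)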

Ah, got it. The below comment explains it.

> Thanks,
> Chao
> > 
> > Thanks,
> > 
> > >  		/*
> > >  		 * If the gfn and userspace address are not aligned wrt each
> > >  		 * other, disable large page support for this slot.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-02  6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
@ 2022-12-07  8:13   ` Yuan Yao
  2022-12-08 11:20     ` Chao Peng
  2022-12-07 17:16   ` Fuad Tabba
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 398+ messages in thread
From: Yuan Yao @ 2022-12-07  8:13 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> Unmap the existing guest mappings when the memory attribute is changed
> between shared and private. This is needed because shared pages and private
> pages come from different backends; unmapping the existing ones gives the
> page fault handler a chance to re-populate the mappings according to the new
> attribute.
>
> Only architectures that have private memory support need this, and such
> architectures are expected to override the weak kvm_arch_has_private_mem().
>
> Also, during the time frame between changing the memory attribute and
> unmapping, a page fault may happen in the same memory range and could set up
> an incorrect page state. Invoke the kvm_mmu_invalidate_* helpers so the page
> fault handler retries during this window.
>
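
To make the window described above concrete, the attribute update in this
patch effectively brackets the xarray change like this (condensed from the
hunk further down):

	KVM_MMU_LOCK(kvm);
	kvm_mmu_invalidate_begin(kvm);
	kvm_mmu_invalidate_range_add(kvm, start, end);
	KVM_MMU_UNLOCK(kvm);

	/* update kvm->mem_attr_array for [start, end) */

	KVM_MMU_LOCK(kvm);
	kvm_unmap_mem_range(kvm, start, end);
	kvm_mmu_invalidate_end(kvm);
	KVM_MMU_UNLOCK(kvm);

so a page fault racing with the update sees mmu_invalidate_in_progress (or the
bumped mmu_invalidate_seq) and retries.
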
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |   7 +-
>  virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
>  2 files changed, 116 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d69484d2704..3331c0c92838 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  struct kvm_gfn_range {
>  	struct kvm_memory_slot *slot;
>  	gfn_t start;
> @@ -264,6 +263,8 @@ struct kvm_gfn_range {
>  	bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -785,11 +786,12 @@ struct kvm {
>
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  	struct mmu_notifier mmu_notifier;
> +#endif
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
>  	gfn_t mmu_invalidate_range_start;
>  	gfn_t mmu_invalidate_range_end;
> -#endif
> +
>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
>  	struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad55dfbc75d7..4e1e1e113bf0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> +{
> +	/*
> +	 * The count increase must become visible at unlock time as no
> +	 * spte can be established without taking the mmu_lock and
> +	 * count is also read inside the mmu_lock critical section.
> +	 */
> +	kvm->mmu_invalidate_in_progress++;
> +
> +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> +		kvm->mmu_invalidate_range_end = INVALID_GPA;
> +	}
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +		kvm->mmu_invalidate_range_start = start;
> +		kvm->mmu_invalidate_range_end = end;
> +	} else {
> +		/*
> +		 * Fully tracking multiple concurrent ranges has diminishing
> +		 * returns. Keep things simple and just find the minimal range
> +		 * which includes the current and new ranges. As there won't be
> +		 * enough information to subtract a range after its invalidate
> +		 * completes, any ranges invalidated concurrently will
> +		 * accumulate and persist until all outstanding invalidates
> +		 * complete.
> +		 */
> +		kvm->mmu_invalidate_range_start =
> +			min(kvm->mmu_invalidate_range_start, start);
> +		kvm->mmu_invalidate_range_end =
> +			max(kvm->mmu_invalidate_range_end, end);
> +	}
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> +{
> +	/*
> +	 * This sequence increase will notify the kvm page fault that
> +	 * the page that is going to be mapped in the spte could have
> +	 * been freed.
> +	 */
> +	kvm->mmu_invalidate_seq++;
> +	smp_wmb();
> +	/*
> +	 * The above sequence increase must be visible before the
> +	 * below count decrease, which is ensured by the smp_wmb above
> +	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> +	 */
> +	kvm->mmu_invalidate_in_progress--;
> +}
> +
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
> @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> -{
> -	/*
> -	 * The count increase must become visible at unlock time as no
> -	 * spte can be established without taking the mmu_lock and
> -	 * count is also read inside the mmu_lock critical section.
> -	 */
> -	kvm->mmu_invalidate_in_progress++;
> -
> -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> -		kvm->mmu_invalidate_range_start = INVALID_GPA;
> -		kvm->mmu_invalidate_range_end = INVALID_GPA;
> -	}
> -}
> -
> -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> -
> -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> -		kvm->mmu_invalidate_range_start = start;
> -		kvm->mmu_invalidate_range_end = end;
> -	} else {
> -		/*
> -		 * Fully tracking multiple concurrent ranges has diminishing
> -		 * returns. Keep things simple and just find the minimal range
> -		 * which includes the current and new ranges. As there won't be
> -		 * enough information to subtract a range after its invalidate
> -		 * completes, any ranges invalidated concurrently will
> -		 * accumulate and persist until all outstanding invalidates
> -		 * complete.
> -		 */
> -		kvm->mmu_invalidate_range_start =
> -			min(kvm->mmu_invalidate_range_start, start);
> -		kvm->mmu_invalidate_range_end =
> -			max(kvm->mmu_invalidate_range_end, end);
> -	}
> -}
> -
>  static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	return 0;
>  }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm)
> -{
> -	/*
> -	 * This sequence increase will notify the kvm page fault that
> -	 * the page that is going to be mapped in the spte could have
> -	 * been freed.
> -	 */
> -	kvm->mmu_invalidate_seq++;
> -	smp_wmb();
> -	/*
> -	 * The above sequence increase must be visible before the
> -	 * below count decrease, which is ensured by the smp_wmb above
> -	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> -	 */
> -	kvm->mmu_invalidate_in_progress--;
> -}
> -
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
> @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
>  	return 0;
>  }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +	return false;
> +}
> +
>  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  {
>  	struct kvm *kvm = kvm_arch_alloc_vm();
> @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  	return 0;
>  }
>
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_gfn_range gfn_range;
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	struct kvm_memslot_iter iter;
> +	int i;
> +	int r = 0;
> +
> +	gfn_range.pte = __pte(0);
> +	gfn_range.may_block = true;
> +
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		slots = __kvm_memslots(kvm, i);
> +
> +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> +			slot = iter.slot;
> +			gfn_range.start = max(start, slot->base_gfn);
> +			gfn_range.end = min(end, slot->base_gfn + slot->npages);
> +			if (gfn_range.start >= gfn_range.end)
> +				continue;
> +			gfn_range.slot = slot;
> +
> +			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +		}
> +	}
> +
> +	if (r)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> +
>  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  					   struct kvm_memory_attributes *attrs)
>  {
>  	gfn_t start, end;
>  	unsigned long i;
>  	void *entry;
> +	int idx;
>  	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> -	/* flags is currently not used. */
> +	/* 'flags' is currently not used. */
>  	if (attrs->flags)
>  		return -EINVAL;
>  	if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
>  	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> +	if (kvm_arch_has_private_mem(kvm)) {
> +		KVM_MMU_LOCK(kvm);
> +		kvm_mmu_invalidate_begin(kvm);
> +		kvm_mmu_invalidate_range_add(kvm, start, end);

Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but the invalidation
should also be necessary for a change of these attributes:

KVM_MEMORY_ATTRIBUTE_READ
KVM_MEMORY_ATTRIBUTE_WRITE
KVM_MEMORY_ATTRIBUTE_EXECUTE

> +		KVM_MMU_UNLOCK(kvm);
> +	}
> +
>  	mutex_lock(&kvm->lock);
>  	for (i = start; i < end; i++)
>  		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  			break;
>  	mutex_unlock(&kvm->lock);
>
> +	if (kvm_arch_has_private_mem(kvm)) {
> +		idx = srcu_read_lock(&kvm->srcu);
> +		KVM_MMU_LOCK(kvm);
> +		if (i > start)
> +			kvm_unmap_mem_range(kvm, start, i);
> +		kvm_mmu_invalidate_end(kvm);

Ditto.

> +		KVM_MMU_UNLOCK(kvm);
> +		srcu_read_unlock(&kvm->srcu, idx);
> +	}
> +
>  	attrs->address = i << PAGE_SHIFT;
>  	attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-06 14:57   ` Fuad Tabba
@ 2022-12-07 13:50     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-07 13:50 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Tue, Dec 06, 2022 at 02:57:04PM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through the new in-kernel interface by a third kernel module.
> >
> > memfd_restricted() is useful for scenarios where a file descriptor (fd)
> > can be used as an interface into mm but we want to restrict userspace's
> > ability to access the fd. Initially it is designed to provide protections
> > for KVM encrypted guest memory.
> >
> > Normally KVM uses memfd memory by mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmapped virtual address to set up the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with a special key for a special software domain (e.g. a KVM guest) and is
> > not expected to be directly accessed by userspace. More precisely,
> > userspace access to such encrypted memory may lead to a host crash, so it
> > should be prevented.
> >
> > memfd_restricted() provides the semantics required for KVM guest encrypted
> > memory support: an fd created with memfd_restricted() is going to be used
> > as the source of guest memory in a confidential computing environment, and
> > KVM can directly interact with core-mm without the need to expose the
> > memoy content into KVM userspace.
> 
> nit: memory

Ya!

> 
> >
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> >
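
(For illustration only: there is no libc wrapper, so userspace in this series
would create the fd via the raw syscall and hand it to KVM through the
restricted/private memslot fields, roughly:

  int fd = syscall(__NR_memfd_restricted, 0);
  /* ... pass fd plus an offset to KVM when setting up the memslot ... */

with __NR_memfd_restricted being 451 on x86 as wired up below.)
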
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When hole-punched, KVM is notified through the
> > invalidate_start/invalidate_end() callbacks and then gets a chance to
> > remove any mapped entries of the range in the secondary page tables.
> >
> > A machine check can happen for memory pages in the restricted memfd;
> > instead of routing this directly to userspace, we call the error()
> > callback that KVM registered. KVM then gets a chance to handle it
> > correctly.
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable; this is required for the current confidential
> > usage, but in the future this might change.
> >
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining a new bit in 'flags', it can be extended to support other
> > restricted semantics in the future.
> >
> > The system call is currently wired up for x86 arch.
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> After wiring the system call for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@google.com>

Thanks.
Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h          |  71 ++++++
> >  include/linux/syscalls.h               |   1 +
> >  include/uapi/asm-generic/unistd.h      |   5 +-
> >  include/uapi/linux/magic.h             |   1 +
> >  kernel/sys_ni.c                        |   3 +
> >  mm/Kconfig                             |   4 +
> >  mm/Makefile                            |   1 +
> >  mm/memory-failure.c                    |   3 +
> >  mm/restrictedmem.c                     | 318 +++++++++++++++++++++++++
> >  11 files changed, 408 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448    i386    process_mrelease        sys_process_mrelease
> >  449    i386    futex_waitv             sys_futex_waitv
> >  450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> > +451    i386    memfd_restricted        sys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448    common  process_mrelease        sys_process_mrelease
> >  449    common  futex_waitv             sys_futex_waitv
> >  450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451    common  memfd_restricted        sys_memfd_restricted
> >
> >  #
> >  # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..c2700c5daa43
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> > +
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> > +
> > +struct restrictedmem_notifier;
> > +
> > +struct restrictedmem_notifier_ops {
> > +       void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> > +                                pgoff_t start, pgoff_t end);
> > +       void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> > +                              pgoff_t start, pgoff_t end);
> > +       void (*error)(struct restrictedmem_notifier *notifier,
> > +                              pgoff_t start, pgoff_t end);
> > +};
> > +
> > +struct restrictedmem_notifier {
> > +       struct list_head list;
> > +       const struct restrictedmem_notifier_ops *ops;
> > +};
> > +
> > +#ifdef CONFIG_RESTRICTEDMEM
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier);
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                          struct page **pagep, int *order);
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +       return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> > +}
> > +
> > +void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> > +
> > +#else
> > +
> > +static inline void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                                        struct page **pagep, int *order)
> > +{
> > +       return -1;
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +       return false;
> > +}
> > +
> > +static inline void restrictedmem_error_page(struct page *page,
> > +                                           struct address_space *mapping)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index a34b0f9a9972..f9e9e0c820c5 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> >  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> >                                             unsigned long home_node,
> >                                             unsigned long flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags);
> >
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 45fa180cc56a..e93cd35e46d0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> >  #define __NR_set_mempolicy_home_node 450
> >  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> >
> > +#define __NR_memfd_restricted 451
> > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> > +
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 451
> > +#define __NR_syscalls 452
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..8aa38324b90a 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
> >  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
> >  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> > +#define RESTRICTEDMEM_MAGIC    0x5245534d      /* "RESM" */
> >
> >  #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 860b2dcf3ac4..7c4a32cbd2e7 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> >  /* memfd_secret */
> >  COND_SYSCALL(memfd_secret);
> >
> > +/* memfd_restricted */
> > +COND_SYSCALL(memfd_restricted);
> > +
> >  /*
> >   * Architecture specific weak syscall entries.
> >   */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 57e1d8c5b505..06b0e1d6b8c1 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1076,6 +1076,10 @@ config IO_MAPPING
> >  config SECRETMEM
> >         def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> >
> > +config RESTRICTEDMEM
> > +       bool
> > +       depends on TMPFS
> > +
> >  config ANON_VMA_NAME
> >         bool "Anonymous VMA name support"
> >         depends on PROC_FS && ADVISE_SYSCALLS && MMU
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 8e105e5b3e29..bcbb0edf9ba1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> >  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >  obj-$(CONFIG_SECRETMEM) += secretmem.o
> > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> >  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> >  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 145bb561ddb3..f91b444e471e 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -62,6 +62,7 @@
> >  #include <linux/page-isolation.h>
> >  #include <linux/pagewalk.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/restrictedmem.h>
> >  #include "swap.h"
> >  #include "internal.h"
> >  #include "ras/ras_event.h"
> > @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >                 goto out;
> >         }
> >
> > +       restrictedmem_error_page(p, mapping);
> > +
> >         /*
> >          * The shmem page is kept in page cache instead of truncating
> >          * so is expected to have an extra refcount after error-handling.
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..56953c204e5c
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,318 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> > +       struct mutex lock;
> > +       struct file *memfd;
> > +       struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> > +                                          pgoff_t start, pgoff_t end)
> > +{
> > +       struct restrictedmem_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->invalidate_start(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
> > +                                        pgoff_t start, pgoff_t end)
> > +{
> > +       struct restrictedmem_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->invalidate_end(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static void restrictedmem_notifier_error(struct restrictedmem_data *data,
> > +                                        pgoff_t start, pgoff_t end)
> > +{
> > +       struct restrictedmem_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->error(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > +{
> > +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +
> > +       fput(data->memfd);
> > +       kfree(data);
> > +       return 0;
> > +}
> > +
> > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> > +                                    loff_t offset, loff_t len)
> > +{
> > +       int ret;
> > +       pgoff_t start, end;
> > +       struct file *memfd = data->memfd;
> > +
> > +       if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +               return -EINVAL;
> > +
> > +       start = offset >> PAGE_SHIFT;
> > +       end = (offset + len) >> PAGE_SHIFT;
> > +
> > +       restrictedmem_invalidate_start(data, start, end);
> > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +       restrictedmem_invalidate_end(data, start, end);
> > +
> > +       return ret;
> > +}
> > +
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +                                   loff_t offset, loff_t len)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +
> > +       if (mode & FALLOC_FL_PUNCH_HOLE)
> > +               return restrictedmem_punch_hole(data, mode, offset, len);
> > +
> > +       return memfd->f_op->fallocate(memfd, mode, offset, len);
> > +}
> > +
> > +static const struct file_operations restrictedmem_fops = {
> > +       .release = restrictedmem_release,
> > +       .fallocate = restrictedmem_fallocate,
> > +};
> > +
> > +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> > +                                const struct path *path, struct kstat *stat,
> > +                                u32 request_mask, unsigned int query_flags)
> > +{
> > +       struct inode *inode = d_inode(path->dentry);
> > +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +
> > +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> > +                                            request_mask, query_flags);
> > +}
> > +
> > +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> > +                                struct dentry *dentry, struct iattr *attr)
> > +{
> > +       struct inode *inode = d_inode(dentry);
> > +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       int ret;
> > +
> > +       if (attr->ia_valid & ATTR_SIZE) {
> > +               if (memfd->f_inode->i_size)
> > +                       return -EPERM;
> > +
> > +               if (!PAGE_ALIGNED(attr->ia_size))
> > +                       return -EINVAL;
> > +       }
> > +
> > +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> > +                                           file_dentry(memfd), attr);
> > +       return ret;
> > +}
> > +
> > +static const struct inode_operations restrictedmem_iops = {
> > +       .getattr = restrictedmem_getattr,
> > +       .setattr = restrictedmem_setattr,
> > +};
> > +
> > +static int restrictedmem_init_fs_context(struct fs_context *fc)
> > +{
> > +       if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> > +               return -ENOMEM;
> > +
> > +       fc->s_iflags |= SB_I_NOEXEC;
> > +       return 0;
> > +}
> > +
> > +static struct file_system_type restrictedmem_fs = {
> > +       .owner          = THIS_MODULE,
> > +       .name           = "memfd:restrictedmem",
> > +       .init_fs_context = restrictedmem_init_fs_context,
> > +       .kill_sb        = kill_anon_super,
> > +};
> > +
> > +static struct vfsmount *restrictedmem_mnt;
> > +
> > +static __init int restrictedmem_init(void)
> > +{
> > +       restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> > +       if (IS_ERR(restrictedmem_mnt))
> > +               return PTR_ERR(restrictedmem_mnt);
> > +       return 0;
> > +}
> > +fs_initcall(restrictedmem_init);
> > +
> > +static struct file *restrictedmem_file_create(struct file *memfd)
> > +{
> > +       struct restrictedmem_data *data;
> > +       struct address_space *mapping;
> > +       struct inode *inode;
> > +       struct file *file;
> > +
> > +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +       if (!data)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       data->memfd = memfd;
> > +       mutex_init(&data->lock);
> > +       INIT_LIST_HEAD(&data->notifiers);
> > +
> > +       inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > +       if (IS_ERR(inode)) {
> > +               kfree(data);
> > +               return ERR_CAST(inode);
> > +       }
> > +
> > +       inode->i_mode |= S_IFREG;
> > +       inode->i_op = &restrictedmem_iops;
> > +       inode->i_mapping->private_data = data;
> > +
> > +       file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > +                                "restrictedmem", O_RDWR,
> > +                                &restrictedmem_fops);
> > +       if (IS_ERR(file)) {
> > +               iput(inode);
> > +               kfree(data);
> > +               return ERR_CAST(file);
> > +       }
> > +
> > +       file->f_flags |= O_LARGEFILE;
> > +
> > +       /*
> > +        * These pages are currently unmovable so don't place them into movable
> > +        * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > +        */
> > +       mapping = memfd->f_mapping;
> > +       mapping_set_unevictable(mapping);
> > +       mapping_set_gfp_mask(mapping,
> > +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > +
> > +       return file;
> > +}
> > +
> > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > +{
> > +       struct file *file, *restricted_file;
> > +       int fd, err;
> > +
> > +       if (flags)
> > +               return -EINVAL;
> > +
> > +       fd = get_unused_fd_flags(0);
> > +       if (fd < 0)
> > +               return fd;
> > +
> > +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > +       if (IS_ERR(file)) {
> > +               err = PTR_ERR(file);
> > +               goto err_fd;
> > +       }
> > +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> > +       file->f_flags |= O_LARGEFILE;
> > +
> > +       restricted_file = restrictedmem_file_create(file);
> > +       if (IS_ERR(restricted_file)) {
> > +               err = PTR_ERR(restricted_file);
> > +               fput(file);
> > +               goto err_fd;
> > +       }
> > +
> > +       fd_install(fd, restricted_file);
> > +       return fd;
> > +err_fd:
> > +       put_unused_fd(fd);
> > +       return err;
> > +}
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_add(&notifier->list, &data->notifiers);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> > +
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_del(&notifier->list);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                          struct page **pagep, int *order)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       struct folio *folio;
> > +       struct page *page;
> > +       int ret;
> > +
> > +       ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
> > +       if (ret)
> > +               return ret;
> > +
> > +       page = folio_file_page(folio, offset);
> > +       *pagep = page;
> > +       if (order)
> > +               *order = thp_order(compound_head(page));
> > +
> > +       SetPageUptodate(page);
> > +       unlock_page(page);
> > +
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > +
> > +void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> > +{
> > +       struct super_block *sb = restrictedmem_mnt->mnt_sb;
> > +       struct inode *inode, *next;
> > +
> > +       if (!shmem_mapping(mapping))
> > +               return;
> > +
> > +       spin_lock(&sb->s_inode_list_lock);
> > +       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > +               struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +               struct file *memfd = data->memfd;
> > +
> > +               if (memfd->f_mapping == mapping) {
> > +                       pgoff_t start, end;
> > +
> > +                       spin_unlock(&sb->s_inode_list_lock);
> > +
> > +                       start = page->index;
> > +                       end = start + thp_nr_pages(page);
> > +                       restrictedmem_notifier_error(data, start, end);
> > +                       return;
> > +               }
> > +       }
> > +       spin_unlock(&sb->s_inode_list_lock);
> > +}
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-06 13:34   ` Fabiano Rosas
@ 2022-12-07 14:31     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-07 14:31 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Dec 06, 2022 at 10:34:32AM -0300, Fabiano Rosas wrote:
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >     a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >     memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> >  arch/x86/kvm/Kconfig           |  1 +
> >  include/linux/kvm_host.h       |  3 ++
> >  include/uapi/linux/kvm.h       | 17 ++++++++
> >  virt/kvm/Kconfig               |  3 ++
> >  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and should be
> >  set to 0s by userspace.
> >  
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +	__u64 address;
> > +	__u64 size;
> > +	__u64 attributes;
> > +	__u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> 
> This wording could cause some confusion, what about a simpler:
> 
> "reflect the range of pages that had its attributes successfully set"

Thanks, this is much better.

> 
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully.
> 
> "address + 1 page" or "subsequent page" perhaps.
> 
> In fact, wouldn't this all become simpler if size were number of pages instead?

It would indeed be simpler if the size were a number of pages and the
address a gfn, but I don't think we want to imply a 4K page size to
userspace, so the interface keeps byte-based address/size.
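
For illustration, a minimal untested sketch of the retry loop the doc
describes, keeping address/size in bytes (assumes <sys/ioctl.h> and
<linux/kvm.h>; vm_fd, gpa and len are placeholders):

    struct kvm_memory_attributes attrs = {
        .address    = gpa,    /* bytes, page aligned */
        .size       = len,    /* bytes, page aligned */
        .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
    };
    int ret;

    do {
        ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
        /*
         * A return of 0 with attrs.size != 0 means the range was only
         * partially converted; attrs.address/attrs.size already describe
         * the remainder, so the loop simply retries it.
         */
    } while (!ret && attrs.size);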

> 
> > The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> > +
> 
> ...
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> > +
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> 
> Here PAGE_SIZE and -1 cancel out.

Correct!
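
For concreteness: since address and size are both checked to be page
aligned, address + size is already a multiple of PAGE_SIZE, so

    (address + size - 1 + PAGE_SIZE) >> PAGE_SHIFT
        == (address + size + (PAGE_SIZE - 1)) >> PAGE_SHIFT
        == (address + size) >> PAGE_SHIFT

e.g. address = 0x1000, size = 0x2000, PAGE_SIZE = 0x1000 gives
0x3fff >> 12 == 0x3000 >> 12 == 3.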

> 
> Consider using gpa_to_gfn as well.

Yes, using gpa_to_gfn() is appropriate.
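i.e. roughly (a sketch, not necessarily the exact final code):

    start = gpa_to_gfn(attrs->address);
    end = gpa_to_gfn(attrs->address + attrs->size);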

Thanks,
Chao
> 
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> >  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  #ifdef CONFIG_HAVE_KVM_MSI
> >  	case KVM_CAP_SIGNAL_MSI:
> >  #endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +	case KVM_CAP_MEMORY_ATTRIBUTES:
> > +#endif
> >  #ifdef CONFIG_HAVE_KVM_IRQFD
> >  	case KVM_CAP_IRQFD:
> >  	case KVM_CAP_IRQFD_RESAMPLE:
> > @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		break;
> >  	}
> >  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> > +		u64 attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +		r = -EFAULT;
> > +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > +			goto out;
> > +		r = 0;
> > +		break;
> > +	}
> > +	case KVM_SET_MEMORY_ATTRIBUTES: {
> > +		struct kvm_memory_attributes attrs;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&attrs, argp, sizeof(attrs)))
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> > +
> > +		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> > +			r = -EFAULT;
> > +		break;
> > +	}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> >  	case KVM_CREATE_DEVICE: {
> >  		struct kvm_create_device cd;

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-06 15:07   ` Fuad Tabba
@ 2022-12-07 14:51     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-07 14:51 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Tue, Dec 06, 2022 at 03:07:27PM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >     a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >     memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> >  arch/x86/kvm/Kconfig           |  1 +
> >  include/linux/kvm_host.h       |  3 ++
> >  include/uapi/linux/kvm.h       | 17 ++++++++
> >  virt/kvm/Kconfig               |  3 ++
> >  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and should be
> >  set to 0s by userspace.
> >
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +       __u64 address;
> > +       __u64 size;
> > +       __u64 attributes;
> > +       __u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> > +
> >  5. The kvm_run structure
> >  ========================
> >
> > @@ -8270,6 +8323,16 @@ structure.
> >  When getting the Modified Change Topology Report value, the attr->addr
> >  must point to a byte where the value will be stored or retrieved from.
> >
> > +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> > +------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm
> > +
> > +This capability indicates KVM supports per-page memory attributes and ioctls
> > +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> > +
> >  9. Known KVM API problems
> >  =========================
> >
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index fbeaa9ddef59..a8e379a3afee 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -49,6 +49,7 @@ config KVM
> >         select SRCU
> >         select INTERVAL_TREE
> >         select HAVE_KVM_PM_NOTIFIER if PM
> > +       select HAVE_KVM_MEMORY_ATTRIBUTES
> >         help
> >           Support hosting fully virtualized guest machines using hardware
> >           virtualization extensions.  You will need a fairly recent
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 8f874a964313..a784e2b06625 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -800,6 +800,9 @@ struct kvm {
> >
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> >         struct notifier_block pm_notifier;
> > +#endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       struct xarray mem_attr_array;
> >  #endif
> >         char stats_id[KVM_STATS_NAME_SIZE];
> >  };
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 64dfe9c07c87..5d0941acb5bb 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_S390_CPU_TOPOLOGY 222
> >  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> >  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> > +#define KVM_CAP_MEMORY_ATTRIBUTES 225
> >
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
> >  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> >  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
> >
> > +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> > +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> > +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> > +
> > +struct kvm_memory_attributes {
> > +       __u64 address;
> > +       __u64 size;
> > +       __u64 attributes;
> > +       __u64 flags;
> > +};
> > +
> > +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> > +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> > +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> 
> nit: how about using the BIT() macro for these?

It would probably need to be _BITULL() from include/uapi/linux/const.h,
since these definitions are also used by userspace.
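
Something along these lines (an illustrative sketch, not what the posted
patch does):

  #include <linux/const.h>

  #define KVM_MEMORY_ATTRIBUTE_READ      _BITULL(0)
  #define KVM_MEMORY_ATTRIBUTE_WRITE     _BITULL(1)
  #define KVM_MEMORY_ATTRIBUTE_EXECUTE   _BITULL(2)
  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   _BITULL(3)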

> 
> > +
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 800f9470e36b..effdea5dd4f0 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
> >  config HAVE_KVM_DIRTY_RING
> >         bool
> >
> > +config HAVE_KVM_MEMORY_ATTRIBUTES
> > +       bool
> > +
> >  # Only strongly ordered architectures can select this, as it doesn't
> >  # put any explicit constraint on userspace ordering. They can also
> >  # select the _ACQ_REL version.
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..7f0f5e9f2406 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >         spin_lock_init(&kvm->mn_invalidate_lock);
> >         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >         xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       xa_init(&kvm->mem_attr_array);
> > +#endif
> >
> >         INIT_LIST_HEAD(&kvm->gpc_list);
> >         spin_lock_init(&kvm->gpc_lock);
> > @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> >                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> >         }
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       xa_destroy(&kvm->mem_attr_array);
> > +#endif
> >         cleanup_srcu_struct(&kvm->irq_srcu);
> >         cleanup_srcu_struct(&kvm->srcu);
> >         kvm_arch_free_vm(kvm);
> > @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> >  }
> >  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > +{
> > +       return 0;
> > +}
> > +
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +                                          struct kvm_memory_attributes *attrs)
> > +{
> > +       gfn_t start, end;
> > +       unsigned long i;
> > +       void *entry;
> > +       u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +       /* flags is currently not used. */
> 
> nit: "is reserved"? I think it makes it a bit clearer what its purpose is.
OK, then:
  flags is reserved for future extension and currently is not used.

> 
> > +       if (attrs->flags)
> > +               return -EINVAL;
> > +       if (attrs->attributes & ~supported_attrs)
> > +               return -EINVAL;
> > +       if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +               return -EINVAL;
> > +       if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +               return -EINVAL;
> > +
> > +       start = attrs->address >> PAGE_SHIFT;
> > +       end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> 
> Would using existing helpers be better for getting the frame numbers?

Yes, gpa_to_gfn() can be used.

> Also, the code checks that the address and size are page aligned, so
> the end rounding up seems redundant, and might even be wrong if the
> address+size-1 is close to the gfn_t limit (which this code tries to
> avoid in an earlier check).

That's right.

> 
> > +       entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +       mutex_lock(&kvm->lock);
> > +       for (i = start; i < end; i++)
> > +               if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +                                   GFP_KERNEL_ACCOUNT)))
> > +                       break;
> > +       mutex_unlock(&kvm->lock);
> > +
> > +       attrs->address = i << PAGE_SHIFT;
> > +       attrs->size = (end - i) << PAGE_SHIFT;
> 
> nit: helpers for these too?

Similarly, gfn_to_gpa() will be used.
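i.e. something like (sketch only):

    attrs->address = gfn_to_gpa(i);
    attrs->size = gfn_to_gpa(end) - attrs->address;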

> 
> With the end calculation fixed,
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> After adding the necessary configs for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@google.com>

Thanks.
Chao
> 
> Cheers,
> /fuad
> 
> > +
> > +       return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> >         return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  #ifdef CONFIG_HAVE_KVM_MSI
> >         case KVM_CAP_SIGNAL_MSI:
> >  #endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       case KVM_CAP_MEMORY_ATTRIBUTES:
> > +#endif
> >  #ifdef CONFIG_HAVE_KVM_IRQFD
> >         case KVM_CAP_IRQFD:
> >         case KVM_CAP_IRQFD_RESAMPLE:
> > @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 break;
> >         }
> >  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> > +               u64 attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +               r = -EFAULT;
> > +               if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > +                       goto out;
> > +               r = 0;
> > +               break;
> > +       }
> > +       case KVM_SET_MEMORY_ATTRIBUTES: {
> > +               struct kvm_memory_attributes attrs;
> > +
> > +               r = -EFAULT;
> > +               if (copy_from_user(&attrs, argp, sizeof(attrs)))
> > +                       goto out;
> > +
> > +               r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> > +
> > +               if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> > +                       r = -EFAULT;
> > +               break;
> > +       }
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> >         case KVM_CREATE_DEVICE: {
> >                 struct kvm_create_device cd;
> >
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-06 12:39       ` Fuad Tabba
@ 2022-12-07 15:10         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-07 15:10 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Tue, Dec 06, 2022 at 12:39:18PM +0000, Fuad Tabba wrote:
> Hi Chao,
> 
> On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > > Hi Chao,
> > >
> > > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > In memory encryption usage, guest memory may be encrypted with a special
> > > > key and can be accessed only by the guest itself. We call such memory
> > > > private memory. Allowing userspace to access guest private memory has
> > > > little value and can sometimes cause problems. This new KVM memslot
> > > > extension allows guest private memory to be provided through a
> > > > restrictedmem-backed file descriptor (fd), and userspace is restricted
> > > > from accessing the bookmarked memory in the fd.
> > > >
> > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > > and the size is 'memory_size'.
> > > >
> > > > The extended memslot can still have the userspace_addr (hva). When used,
> > > > a single memslot can maintain both private memory through restricted_fd
> > > > and shared memory through userspace_addr. Whether the private or shared
> > > > part is visible to the guest is maintained by other KVM code.
> > > >
> > > > A restrictedmem_notifier field is also added to the memslot structure to
> > > > allow the restricted_fd's backing store to notify KVM of memory changes,
> > > > so that KVM can invalidate its page table entries or handle memory errors.
> > > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > >
> > > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > > Tested-by: Fuad Tabba <tabba@google.com>
> > >
> > > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > > patch series anymore. Any reason you removed it, or is it just an
> > > omission?
> >
> > There was some discussion in v9 [1] about adding generic memory attribute
> > ioctls; KVM_CAP_PRIVATE_MEM can then be expressed as a new
> > KVM_MEMORY_ATTRIBUTE_PRIVATE flag reported by the
> > KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl [2]. The API doc has been updated:
> >
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
> >
> >
> > [1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
> > [2]
> > https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/
> 
> I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
> and my Reviewed/Tested-by still apply.

Thanks for the info.
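
For reference, a minimal untested userspace sketch of probing for private
memory support once KVM reports the attribute (vm_fd is a placeholder):

    __u64 attrs = 0;

    if (!ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &attrs) &&
        (attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
        /*
         * KVM_MEM_PRIVATE memslots (restricted_fd/restricted_offset)
         * can be used on this VM.
         */
    }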

Chao
> 
> Cheers,
> /fuad
> 
> >
> > Thanks,
> > Chao
> > >
> > > [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> > >
> > > Thanks,
> > > /fuad
> > >
> > > > ---
> > > >  Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > > >  arch/x86/kvm/Kconfig           |  2 ++
> > > >  arch/x86/kvm/x86.c             |  2 +-
> > > >  include/linux/kvm_host.h       |  8 ++++--
> > > >  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
> > > >  virt/kvm/Kconfig               |  3 +++
> > > >  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
> > > >  7 files changed, 114 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index bb2f709c0900..99352170c130 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > > >  :Capability: KVM_CAP_USER_MEMORY
> > > >  :Architectures: all
> > > >  :Type: vm ioctl
> > > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > > >  :Returns: 0 on success, -1 on error
> > > >
> > > >  ::
> > > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > > >    };
> > > >
> > > > +  struct kvm_userspace_memory_region_ext {
> > > > +       struct kvm_userspace_memory_region region;
> > > > +       __u64 restricted_offset;
> > > > +       __u32 restricted_fd;
> > > > +       __u32 pad1;
> > > > +       __u64 pad2[14];
> > > > +  };
> > > > +
> > > >    /* for kvm_memory_region::flags */
> > > >    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> > > >    #define KVM_MEM_READONLY     (1UL << 1)
> > > > +  #define KVM_MEM_PRIVATE              (1UL << 2)
> > > >
> > > >  This ioctl allows the user to create, modify or delete a guest physical
> > > >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > > >  be identical.  This allows large pages in the guest to be backed by large
> > > >  pages in the host.
> > > >
> > > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > > > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > > > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > > -to make a new slot read-only.  In this case, writes to this memory will be
> > > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > > +kvm_userspace_memory_region_ext struct includes all fields of
> > > > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > > > +other features. See below description of flags field for more information.
> > > > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > > > +
> > > > +The flags field supports following flags:
> > > > +
> > > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > > +  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > > +
> > > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > > +  read-only. In this case, writes to this memory will be posted to userspace as
> > > > +  KVM_EXIT_MMIO exits.
> > > > +
> > > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > > > +  memory backed by a file descriptor(fd) and userspace access to the fd may be
> > > > +  restricted. Userspace should use restricted_fd/restricted_offset in the
> > > > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > > > +  to guest. Userspace should guarantee not to map the same host physical address
> > > > +  indicated by restricted_fd/restricted_offset to different guest physical
> > > > +  addresses within multiple memslots. Failed to do this may result undefined
> > > > +  behavior.
> > > >
> > > >  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > > >  the memory region are automatically reflected into the guest.  For example, an
> > > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > > index a8e379a3afee..690cb21010e7 100644
> > > > --- a/arch/x86/kvm/Kconfig
> > > > +++ b/arch/x86/kvm/Kconfig
> > > > @@ -50,6 +50,8 @@ config KVM
> > > >         select INTERVAL_TREE
> > > >         select HAVE_KVM_PM_NOTIFIER if PM
> > > >         select HAVE_KVM_MEMORY_ATTRIBUTES
> > > > +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > > +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > >         help
> > > >           Support hosting fully virtualized guest machines using hardware
> > > >           virtualization extensions.  You will need a fairly recent
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 7f850dfb4086..9a07380f8d3c 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > > >         }
> > > >
> > > >         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > > -               struct kvm_userspace_memory_region m;
> > > > +               struct kvm_user_mem_region m;
> > > >
> > > >                 m.slot = id | (i << 16);
> > > >                 m.flags = 0;
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index a784e2b06625..02347e386ea2 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -44,6 +44,7 @@
> > > >
> > > >  #include <asm/kvm_host.h>
> > > >  #include <linux/kvm_dirty_ring.h>
> > > > +#include <linux/restrictedmem.h>
> > > >
> > > >  #ifndef KVM_MAX_VCPU_IDS
> > > >  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > > >         u32 flags;
> > > >         short id;
> > > >         u16 as_id;
> > > > +       struct file *restricted_file;
> > > > +       loff_t restricted_offset;
> > > > +       struct restrictedmem_notifier notifier;
> > > >  };
> > > >
> > > >  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > > >  };
> > > >
> > > >  int kvm_set_memory_region(struct kvm *kvm,
> > > > -                         const struct kvm_userspace_memory_region *mem);
> > > > +                         const struct kvm_user_mem_region *mem);
> > > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > > -                           const struct kvm_userspace_memory_region *mem);
> > > > +                           const struct kvm_user_mem_region *mem);
> > > >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > > >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > > >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > index 5d0941acb5bb..13bff963b8b0 100644
> > > > --- a/include/uapi/linux/kvm.h
> > > > +++ b/include/uapi/linux/kvm.h
> > > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > > >         __u64 userspace_addr; /* start of the userspace allocated memory */
> > > >  };
> > > >
> > > > +struct kvm_userspace_memory_region_ext {
> > > > +       struct kvm_userspace_memory_region region;
> > > > +       __u64 restricted_offset;
> > > > +       __u32 restricted_fd;
> > > > +       __u32 pad1;
> > > > +       __u64 pad2[14];
> > > > +};
> > > > +
> > > > +#ifdef __KERNEL__
> > > > +/*
> > > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > > + * all fields from the top-level "extended" region.
> > > > + */
> > > > +struct kvm_user_mem_region {
> > > > +       __u32 slot;
> > > > +       __u32 flags;
> > > > +       __u64 guest_phys_addr;
> > > > +       __u64 memory_size;
> > > > +       __u64 userspace_addr;
> > > > +       __u64 restricted_offset;
> > > > +       __u32 restricted_fd;
> > > > +       __u32 pad1;
> > > > +       __u64 pad2[14];
> > > > +};
> > > > +#endif
> > > > +
> > > >  /*
> > > >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > > >   * other bits are reserved for kvm internal use which are defined in
> > > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > > >   */
> > > >  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> > > >  #define KVM_MEM_READONLY       (1UL << 1)
> > > > +#define KVM_MEM_PRIVATE                (1UL << 2)
> > > >
> > > >  /* for KVM_IRQ_LINE */
> > > >  struct kvm_irq_level {
> > > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > > index effdea5dd4f0..d605545d6dd1 100644
> > > > --- a/virt/kvm/Kconfig
> > > > +++ b/virt/kvm/Kconfig
> > > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > > >
> > > >  config HAVE_KVM_PM_NOTIFIER
> > > >         bool
> > > > +
> > > > +config HAVE_KVM_RESTRICTED_MEM
> > > > +       bool
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > >         }
> > > >  }
> > > >
> > > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > >  {
> > > >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > >
> > > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > >   * Must be called holding kvm->slots_lock for write.
> > > >   */
> > > >  int __kvm_set_memory_region(struct kvm *kvm,
> > > > -                           const struct kvm_userspace_memory_region *mem)
> > > > +                           const struct kvm_user_mem_region *mem)
> > > >  {
> > > >         struct kvm_memory_slot *old, *new;
> > > >         struct kvm_memslots *slots;
> > > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > > >
> > > >  int kvm_set_memory_region(struct kvm *kvm,
> > > > -                         const struct kvm_userspace_memory_region *mem)
> > > > +                         const struct kvm_user_mem_region *mem)
> > > >  {
> > > >         int r;
> > > >
> > > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > > >  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > > >
> > > >  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > > -                                         struct kvm_userspace_memory_region *mem)
> > > > +                                         struct kvm_user_mem_region *mem)
> > > >  {
> > > >         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > > >                 return -EINVAL;
> > > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > > >         return fd;
> > > >  }
> > > >
> > > > +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> > > > +do {                                                                           \
> > > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> > > > +                    offsetof(struct kvm_userspace_memory_region, field));      \
> > > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> > > > +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> > > > +} while (0)
> > > > +
> > > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> > > > +do {                                                                                   \
> > > > +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> > > > +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> > > > +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> > > > +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> > > > +} while (0)
> > > > +
> > > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > > +{
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > > +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > > +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > > +}
> > > > +
> > > >  static long kvm_vm_ioctl(struct file *filp,
> > > >                            unsigned int ioctl, unsigned long arg)
> > > >  {
> > > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > > >                 break;
> > > >         }
> > > >         case KVM_SET_USER_MEMORY_REGION: {
> > > > -               struct kvm_userspace_memory_region kvm_userspace_mem;
> > > > +               struct kvm_user_mem_region mem;
> > > > +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > > +
> > > > +               kvm_sanity_check_user_mem_region_alias();
> > > >
> > > >                 r = -EFAULT;
> > > > -               if (copy_from_user(&kvm_userspace_mem, argp,
> > > > -                                               sizeof(kvm_userspace_mem)))
> > > > +               if (copy_from_user(&mem, argp, size))
> > > > +                       goto out;
> > > > +
> > > > +               r = -EINVAL;
> > > > +               if (mem.flags & KVM_MEM_PRIVATE)
> > > >                         goto out;
> > > >
> > > > -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > > +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > >                 break;
> > > >         }
> > > >         case KVM_GET_DIRTY_LOG: {
> > > > --
> > > > 2.25.1
> > > >
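
For context on the hunk above: the SANITY_CHECK_* macros only work because
kvm_user_mem_region is meant to alias the userspace layouts byte for byte, so
a single copy_from_user() can be read through either struct. A minimal,
self-contained sketch of the same compile-time pattern (the struct and field
names here are invented for illustration, not the series' definitions):

  #include <stddef.h>

  struct layout_a { unsigned int slot; unsigned long long size; };
  struct layout_b { unsigned int slot; unsigned long long size; };

  /* Fail the build if 'field' moves or changes size in either layout. */
  #define CHECK_ALIAS(field)                                              \
          _Static_assert(offsetof(struct layout_a, field) ==              \
                         offsetof(struct layout_b, field) &&              \
                         sizeof(((struct layout_a *)0)->field) ==         \
                         sizeof(((struct layout_b *)0)->field),           \
                         "layouts diverged for " #field)

  CHECK_ALIAS(slot);
  CHECK_ALIAS(size);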

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-12-06 15:47   ` Fuad Tabba
@ 2022-12-07 15:11     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-07 15:11 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Tue, Dec 06, 2022 at 03:47:20PM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates that an error happened in KVM at guest memory range
> > [gpa, gpa+size). The 'flags' field includes additional information for
> > userspace to handle the error. Currently bit 0 is defined as 'private
> > memory', where '1' indicates the error happened due to a private memory
> > access and '0' indicates it happened due to a shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, there are typically two kinds of memory
> > conversions:
> >   - explicit conversion: happens when the guest explicitly calls into
> >     KVM to map a range (as private or shared); KVM then exits to
> >     userspace to perform the map/unmap operations.
> >   - implicit conversion: happens in the KVM page fault handler, where
> >     KVM exits to userspace for an implicit conversion when the page is
> >     in a different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Reviewed-by: Fuad Tabba <tabba@google.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  8 ++++++++
> >  2 files changed, 30 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 99352170c130..d9edb14ce30b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> > +                       __u64 flags;
> 
> I see you've removed the padding and increased the flag size.

Yes, Sean suggested this and it also looks good to me.

Chao
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>
> 
> Cheers,
> /fuad
> 
> 
> 
> 
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >
> >      /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 13bff963b8b0..c7e9d375a902 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI        35
> >  #define KVM_EXIT_RISCV_CSR        36
> >  #define KVM_EXIT_NOTIFY           37
> > +#define KVM_EXIT_MEMORY_FAULT     38
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -541,6 +542,13 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
> >                         __u32 flags;
> >                 } notify;
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1ULL << 0)
> > +                       __u64 flags;
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> >                 /* Fix the size of the union. */
> >                 char padding[256];
> >         };
> > --
> > 2.25.1
> >
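
To show how the exit documented above would be consumed, here is a rough
userspace sketch of a vCPU run loop reacting to KVM_EXIT_MEMORY_FAULT,
assuming the uapi additions from the quoted patch are present in
<linux/kvm.h>. The convert_range() and handle_other_exit() helpers are
hypothetical placeholders; what the conversion actually does (for example a
KVM_SET_MEMORY_ATTRIBUTES call) depends on the rest of the series:

  #include <err.h>
  #include <errno.h>
  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Hypothetical VMM helpers, not part of KVM's uapi. */
  void convert_range(__u64 gpa, __u64 size, bool to_private);
  void handle_other_exit(struct kvm_run *run);

  static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
  {
          for (;;) {
                  if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno != EINTR)
                          err(1, "KVM_RUN");

                  if (run->exit_reason != KVM_EXIT_MEMORY_FAULT) {
                          handle_other_exit(run);
                          continue;
                  }

                  /*
                   * Bit 0 of 'flags' says whether the failed access was to
                   * private memory; flip [gpa, gpa + size) to that state in
                   * the host's view and re-enter the vCPU to retry.
                   */
                  convert_range(run->memory.gpa, run->memory.size,
                                run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE);
          }
  }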

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-12-07  6:34       ` Isaku Yamahata
@ 2022-12-07 15:14         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-07 15:14 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
	Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Tue, Dec 06, 2022 at 10:34:11PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 07:56:23PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > > > -       if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > -           hva >= kvm->mmu_invalidate_range_start &&
> > > > -           hva < kvm->mmu_invalidate_range_end)
> > > > -               return 1;
> > > > +       if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > +               /*
> > > > +                * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > > +                * but before updating the range is a KVM bug.
> > > > +                */
> > > > +               if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > > +                                kvm->mmu_invalidate_range_end == INVALID_GPA))
> > > 
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> > 
> > Hmm, INVALID_GPA is defined as ZERO for x86; I'm not 100% confident this
> > is the correct choice for other architectures, but after searching, it
> > has not been used by other architectures, so it should be safe to make
> > it common.
> 
> INVALID_GPA is defined as all bits 1.  Please note the "~" (tilde).
> 
> #define INVALID_GPA (~(gpa_t)0)

Thanks for pointing that out. It still looks right to move it to include/linux/kvm_host.h.
Chao
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>
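
(For readers following along: the define under discussion is the all-ones
value, so hoisting it into generic code would amount to roughly the
following in include/linux/kvm_host.h; the exact placement is an assumption,
and x86's private copy would then be removed so the gfn-based retry check
builds on every architecture.)

  /* Generic "no such guest physical address" marker; all bits set, not zero. */
  #define INVALID_GPA	(~(gpa_t)0)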

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-02  6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
  2022-12-07  8:13   ` Yuan Yao
@ 2022-12-07 17:16   ` Fuad Tabba
  2022-12-08 11:13     ` Chao Peng
  2022-12-13 23:51   ` Huang, Kai
  2023-01-13 22:50   ` Sean Christopherson
  3 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-07 17:16 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Unmap the existing guest mappings when the memory attribute is changed
> between shared and private. This is needed because shared pages and
> private pages come from different backends; unmapping the existing ones
> gives the page fault handler a chance to re-populate the mappings
> according to the new attribute.
>
> Only architectures with private memory support need this, and such
> architectures are expected to override the weak
> kvm_arch_has_private_mem().

This kind of ties into the discussion of being able to share memory in
place. For pKVM for example, shared and private memory would have the
same backend, and the unmapping wouldn't be needed.

So, instead of kvm_arch_has_private_mem(), could the check be done
differently, e.g., with a different function, say
kvm_arch_private_notify_attribute_change() (but maybe with a friendlier
name than what I suggested :) )?

Thanks,
/fuad

>
> Also, during the memory attribute change and the unmapping time frame,
> page faults may happen in the same memory range and could install
> incorrect page state, so invoke the kvm_mmu_invalidate_* helpers to make
> the page fault handler retry during this time frame.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |   7 +-
>  virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
>  2 files changed, 116 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d69484d2704..3331c0c92838 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  struct kvm_gfn_range {
>         struct kvm_memory_slot *slot;
>         gfn_t start;
> @@ -264,6 +263,8 @@ struct kvm_gfn_range {
>         bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -785,11 +786,12 @@ struct kvm {
>
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>         struct mmu_notifier mmu_notifier;
> +#endif
>         unsigned long mmu_invalidate_seq;
>         long mmu_invalidate_in_progress;
>         gfn_t mmu_invalidate_range_start;
>         gfn_t mmu_invalidate_range_end;
> -#endif
> +
>         struct list_head devices;
>         u64 manual_dirty_log_protect;
>         struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad55dfbc75d7..4e1e1e113bf0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> +{
> +       /*
> +        * The count increase must become visible at unlock time as no
> +        * spte can be established without taking the mmu_lock and
> +        * count is also read inside the mmu_lock critical section.
> +        */
> +       kvm->mmu_invalidate_in_progress++;
> +
> +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +               kvm->mmu_invalidate_range_start = INVALID_GPA;
> +               kvm->mmu_invalidate_range_end = INVALID_GPA;
> +       }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +               kvm->mmu_invalidate_range_start = start;
> +               kvm->mmu_invalidate_range_end = end;
> +       } else {
> +               /*
> +                * Fully tracking multiple concurrent ranges has diminishing
> +                * returns. Keep things simple and just find the minimal range
> +                * which includes the current and new ranges. As there won't be
> +                * enough information to subtract a range after its invalidate
> +                * completes, any ranges invalidated concurrently will
> +                * accumulate and persist until all outstanding invalidates
> +                * complete.
> +                */
> +               kvm->mmu_invalidate_range_start =
> +                       min(kvm->mmu_invalidate_range_start, start);
> +               kvm->mmu_invalidate_range_end =
> +                       max(kvm->mmu_invalidate_range_end, end);
> +       }
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> +{
> +       /*
> +        * This sequence increase will notify the kvm page fault that
> +        * the page that is going to be mapped in the spte could have
> +        * been freed.
> +        */
> +       kvm->mmu_invalidate_seq++;
> +       smp_wmb();
> +       /*
> +        * The above sequence increase must be visible before the
> +        * below count decrease, which is ensured by the smp_wmb above
> +        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> +        */
> +       kvm->mmu_invalidate_in_progress--;
> +}
> +
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
> @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> -{
> -       /*
> -        * The count increase must become visible at unlock time as no
> -        * spte can be established without taking the mmu_lock and
> -        * count is also read inside the mmu_lock critical section.
> -        */
> -       kvm->mmu_invalidate_in_progress++;
> -
> -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> -               kvm->mmu_invalidate_range_start = INVALID_GPA;
> -               kvm->mmu_invalidate_range_end = INVALID_GPA;
> -       }
> -}
> -
> -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> -
> -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> -               kvm->mmu_invalidate_range_start = start;
> -               kvm->mmu_invalidate_range_end = end;
> -       } else {
> -               /*
> -                * Fully tracking multiple concurrent ranges has diminishing
> -                * returns. Keep things simple and just find the minimal range
> -                * which includes the current and new ranges. As there won't be
> -                * enough information to subtract a range after its invalidate
> -                * completes, any ranges invalidated concurrently will
> -                * accumulate and persist until all outstanding invalidates
> -                * complete.
> -                */
> -               kvm->mmu_invalidate_range_start =
> -                       min(kvm->mmu_invalidate_range_start, start);
> -               kvm->mmu_invalidate_range_end =
> -                       max(kvm->mmu_invalidate_range_end, end);
> -       }
> -}
> -
>  static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>         return 0;
>  }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm)
> -{
> -       /*
> -        * This sequence increase will notify the kvm page fault that
> -        * the page that is going to be mapped in the spte could have
> -        * been freed.
> -        */
> -       kvm->mmu_invalidate_seq++;
> -       smp_wmb();
> -       /*
> -        * The above sequence increase must be visible before the
> -        * below count decrease, which is ensured by the smp_wmb above
> -        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> -        */
> -       kvm->mmu_invalidate_in_progress--;
> -}
> -
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
> @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
>         return 0;
>  }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +       return false;
> +}
> +
>  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  {
>         struct kvm *kvm = kvm_arch_alloc_vm();
> @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>         return 0;
>  }
>
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       struct kvm_gfn_range gfn_range;
> +       struct kvm_memory_slot *slot;
> +       struct kvm_memslots *slots;
> +       struct kvm_memslot_iter iter;
> +       int i;
> +       int r = 0;
> +
> +       gfn_range.pte = __pte(0);
> +       gfn_range.may_block = true;
> +
> +       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +               slots = __kvm_memslots(kvm, i);
> +
> +               kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> +                       slot = iter.slot;
> +                       gfn_range.start = max(start, slot->base_gfn);
> +                       gfn_range.end = min(end, slot->base_gfn + slot->npages);
> +                       if (gfn_range.start >= gfn_range.end)
> +                               continue;
> +                       gfn_range.slot = slot;
> +
> +                       r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +               }
> +       }
> +
> +       if (r)
> +               kvm_flush_remote_tlbs(kvm);
> +}
> +
>  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>                                            struct kvm_memory_attributes *attrs)
>  {
>         gfn_t start, end;
>         unsigned long i;
>         void *entry;
> +       int idx;
>         u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> -       /* flags is currently not used. */
> +       /* 'flags' is currently not used. */
>         if (attrs->flags)
>                 return -EINVAL;
>         if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
>         entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> +       if (kvm_arch_has_private_mem(kvm)) {
> +               KVM_MMU_LOCK(kvm);
> +               kvm_mmu_invalidate_begin(kvm);
> +               kvm_mmu_invalidate_range_add(kvm, start, end);
> +               KVM_MMU_UNLOCK(kvm);
> +       }
> +
>         mutex_lock(&kvm->lock);
>         for (i = start; i < end; i++)
>                 if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>                         break;
>         mutex_unlock(&kvm->lock);
>
> +       if (kvm_arch_has_private_mem(kvm)) {
> +               idx = srcu_read_lock(&kvm->srcu);
> +               KVM_MMU_LOCK(kvm);
> +               if (i > start)
> +                       kvm_unmap_mem_range(kvm, start, i);
> +               kvm_mmu_invalidate_end(kvm);
> +               KVM_MMU_UNLOCK(kvm);
> +               srcu_read_unlock(&kvm->srcu, idx);
> +       }
> +
>         attrs->address = i << PAGE_SHIFT;
>         attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
  2022-12-02  6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
@ 2022-12-08  2:29   ` Yuan Yao
  2022-12-08 11:23     ` Chao Peng
  2022-12-09  9:01   ` Fuad Tabba
  2023-01-13 23:29   ` Sean Christopherson
  2 siblings, 1 reply; 398+ messages in thread
From: Yuan Yao @ 2022-12-08  2:29 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> hva-based shared memory. Architecture code (like TDX code) can tell
> whether the ongoing fault is private or not. This patch adds an
> 'is_private' field to kvm_page_fault to indicate this, and architecture
> code is expected to set it.
>
> To handle page faults for such a memslot, the handling logic differs
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
>   - For a successful match, the private pfn is obtained with
>     restrictedmem_get_page() and the shared pfn is obtained with the
>     existing get_user_pages().
>   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
>     userspace. Userspace can then convert the memory between
>     private/shared in the host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
>  arch/x86/kvm/mmu/mmutrace.h     |  1 +
>  arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
>  include/linux/kvm_host.h        | 30 ++++++++++++++++
>  5 files changed, 105 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2190fd8c95c0..b1953ebc012e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> -			      int max_level)
> +			      int max_level, bool is_private)
>  {
>  	struct kvm_lpage_info *linfo;
>  	int host_level;
> @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			break;
>  	}
>
> +	if (is_private)
> +		return max_level;

The lpage mixed information is already saved, so is it possible
to query info->disallow_lpage without caring about 'is_private'?

> +
>  	if (max_level == PG_LEVEL_4K)
>  		return PG_LEVEL_4K;
>
> @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	 * level, which will be used to do precise, accurate accounting.
>  	 */
>  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -						     fault->gfn, fault->max_level);
> +						     fault->gfn, fault->max_level,
> +						     fault->is_private);
>  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>  		return;
>
> @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
>  }
>
> +static inline u8 order_to_level(int order)
> +{
> +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> +		return PG_LEVEL_1G;
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> +		return PG_LEVEL_2M;
> +
> +	return PG_LEVEL_4K;
> +}
> +
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> +				    struct kvm_page_fault *fault)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	if (fault->is_private)
> +		vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +	else
> +		vcpu->run->memory.flags = 0;
> +	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> +	vcpu->run->memory.size = PAGE_SIZE;
> +	return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +				   struct kvm_page_fault *fault)
> +{
> +	int order;
> +	struct kvm_memory_slot *slot = fault->slot;
> +
> +	if (!kvm_slot_can_be_private(slot))
> +		return kvm_do_memory_fault_exit(vcpu, fault);
> +
> +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> +		return RET_PF_RETRY;
> +
> +	fault->max_level = min(order_to_level(order), fault->max_level);
> +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> +	return RET_PF_CONTINUE;
> +}
> +
>  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			return RET_PF_EMULATE;
>  	}
>
> +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> +		return kvm_do_memory_fault_exit(vcpu, fault);
> +
> +	if (fault->is_private)
> +		return kvm_faultin_pfn_private(vcpu, fault);
> +
>  	async = false;
>  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
>  					  fault->write, &fault->map_writable,
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  			return -EIO;
>  	}
>
> +	if (r == RET_PF_USER)
> +		return 0;
> +
>  	if (r < 0)
>  		return r;
>  	if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 */
>  		if (sp->role.direct &&
>  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> -							       PG_LEVEL_NUM)) {
> +							       PG_LEVEL_NUM,
> +							       false)) {
>  			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
>  			if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index dbaf6755c5a7..5ccf08183b00 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -189,6 +189,7 @@ struct kvm_page_fault {
>
>  	/* Derived from mmu and global state.  */
>  	const bool is_tdp;
> +	const bool is_private;
>  	const bool nx_huge_page_workaround_enabled;
>
>  	/*
> @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * RET_PF_RETRY: let CPU fault again on the address.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
>   * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
> @@ -253,6 +255,7 @@ enum {
>  	RET_PF_RETRY,
>  	RET_PF_EMULATE,
>  	RET_PF_INVALID,
> +	RET_PF_USER,
>  	RET_PF_FIXED,
>  	RET_PF_SPURIOUS,
>  };
> @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> -			      int max_level);
> +			      int max_level, bool is_private);
>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>
> @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
>  TRACE_DEFINE_ENUM(RET_PF_RETRY);
>  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
>  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
>  TRACE_DEFINE_ENUM(RET_PF_FIXED);
>  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 771210ce5181..8ba1a4afc546 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  			continue;
>
>  		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> -							      iter.gfn, PG_LEVEL_NUM);
> +						iter.gfn, PG_LEVEL_NUM, false);
>  		if (max_mapping_level < iter.level)
>  			continue;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
>  }
>  #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> +	       KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> +	int ret;
> +	struct page *page;
> +	pgoff_t index = gfn - slot->base_gfn +
> +			(slot->restricted_offset >> PAGE_SHIFT);
> +
> +	ret = restrictedmem_get_page(slot->restricted_file, index,
> +				     &page, order);
> +	*pfn = page_to_pfn(page);
> +	return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #endif
> --
> 2.25.1
>
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-02  6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-12-05  9:03   ` Fuad Tabba
@ 2022-12-08  8:37   ` Xiaoyao Li
  2022-12-08 11:30     ` Chao Peng
  2022-12-19 14:36   ` Borislav Petkov
  2023-01-05 11:23   ` Jarkko Sakkinen
  3 siblings, 1 reply; 398+ messages in thread
From: Xiaoyao Li @ 2022-12-08  8:37 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On 12/2/2022 2:13 PM, Chao Peng wrote:

..

> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
> 

From the patch implementation, I have no idea why
HAVE_KVM_RESTRICTED_MEM is needed.


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-07 17:16   ` Fuad Tabba
@ 2022-12-08 11:13     ` Chao Peng
  2022-12-09  8:57       ` Fuad Tabba
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-08 11:13 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Unmap the existing guest mappings when the memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages come from different backends; unmapping the existing ones
> > gives the page fault handler a chance to re-populate the mappings
> > according to the new attribute.
> >
> > Only architectures with private memory support need this, and such
> > architectures are expected to override the weak
> > kvm_arch_has_private_mem().
> 
> This kind of ties into the discussion of being able to share memory in
> place. For pKVM for example, shared and private memory would have the
> same backend, and the unmapping wouldn't be needed.
> 
> So, instead of kvm_arch_has_private_mem(), could the check be done
> differently, e.g., with a different function, say
> kvm_arch_private_notify_attribute_change() (but maybe with a friendlier
> name than what I suggested :) )?

Besides controlling the unmapping here, kvm_arch_has_private_mem() is
also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
unmapping is confirmed unnecessary for pKVM, but how about
KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
If the answer is the latter, then yes we should use a different check
which only works for confidential usages here.
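
For illustration, one shape such a check could take (all names here are
hypothetical, not from the series): a per-arch hook that says whether
attribute changes require zapping existing mappings, defaulting to the
confidential-VM behaviour while letting pKVM opt out:

/*
 * Hypothetical: does this architecture need existing mappings zapped when
 * a range flips between shared and private?  Architectures like pKVM that
 * convert in place could override this to return false while still using
 * KVM_MEM_PRIVATE.
 */
bool __weak kvm_arch_unmap_on_attr_change(struct kvm *kvm)
{
	return kvm_arch_has_private_mem(kvm);
}

kvm_vm_ioctl_set_mem_attributes() would then test this hook instead of
kvm_arch_has_private_mem() around the invalidate/unmap sequence.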

Thanks,
Chao
> 
> Thanks,
> /fuad
> 
> >
> > Also, during the memory attribute change and the unmapping time frame,
> > page faults may happen in the same memory range and could install
> > incorrect page state, so invoke the kvm_mmu_invalidate_* helpers to make
> > the page fault handler retry during this time frame.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |   7 +-
> >  virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
> >  2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  struct kvm_gfn_range {
> >         struct kvm_memory_slot *slot;
> >         gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> >         bool may_block;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> >         struct mmu_notifier mmu_notifier;
> > +#endif
> >         unsigned long mmu_invalidate_seq;
> >         long mmu_invalidate_in_progress;
> >         gfn_t mmu_invalidate_range_start;
> >         gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> >         struct list_head devices;
> >         u64 manual_dirty_log_protect;
> >         struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> >  int kvm_arch_post_init_vm(struct kvm *kvm);
> >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> >  /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > +       /*
> > +        * The count increase must become visible at unlock time as no
> > +        * spte can be established without taking the mmu_lock and
> > +        * count is also read inside the mmu_lock critical section.
> > +        */
> > +       kvm->mmu_invalidate_in_progress++;
> > +
> > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > +               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > +       }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +               kvm->mmu_invalidate_range_start = start;
> > +               kvm->mmu_invalidate_range_end = end;
> > +       } else {
> > +               /*
> > +                * Fully tracking multiple concurrent ranges has diminishing
> > +                * returns. Keep things simple and just find the minimal range
> > +                * which includes the current and new ranges. As there won't be
> > +                * enough information to subtract a range after its invalidate
> > +                * completes, any ranges invalidated concurrently will
> > +                * accumulate and persist until all outstanding invalidates
> > +                * complete.
> > +                */
> > +               kvm->mmu_invalidate_range_start =
> > +                       min(kvm->mmu_invalidate_range_start, start);
> > +               kvm->mmu_invalidate_range_end =
> > +                       max(kvm->mmu_invalidate_range_end, end);
> > +       }
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > +{
> > +       /*
> > +        * This sequence increase will notify the kvm page fault that
> > +        * the page that is going to be mapped in the spte could have
> > +        * been freed.
> > +        */
> > +       kvm->mmu_invalidate_seq++;
> > +       smp_wmb();
> > +       /*
> > +        * The above sequence increase must be visible before the
> > +        * below count decrease, which is ensured by the smp_wmb above
> > +        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > +        */
> > +       kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> >  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >  {
> > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > -{
> > -       /*
> > -        * The count increase must become visible at unlock time as no
> > -        * spte can be established without taking the mmu_lock and
> > -        * count is also read inside the mmu_lock critical section.
> > -        */
> > -       kvm->mmu_invalidate_in_progress++;
> > -
> > -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > -               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > -               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > -       }
> > -}
> > -
> > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > -       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > -
> > -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > -               kvm->mmu_invalidate_range_start = start;
> > -               kvm->mmu_invalidate_range_end = end;
> > -       } else {
> > -               /*
> > -                * Fully tracking multiple concurrent ranges has diminishing
> > -                * returns. Keep things simple and just find the minimal range
> > -                * which includes the current and new ranges. As there won't be
> > -                * enough information to subtract a range after its invalidate
> > -                * completes, any ranges invalidated concurrently will
> > -                * accumulate and persist until all outstanding invalidates
> > -                * complete.
> > -                */
> > -               kvm->mmu_invalidate_range_start =
> > -                       min(kvm->mmu_invalidate_range_start, start);
> > -               kvm->mmu_invalidate_range_end =
> > -                       max(kvm->mmu_invalidate_range_end, end);
> > -       }
> > -}
> > -
> >  static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >         return 0;
> >  }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > -{
> > -       /*
> > -        * This sequence increase will notify the kvm page fault that
> > -        * the page that is going to be mapped in the spte could have
> > -        * been freed.
> > -        */
> > -       kvm->mmu_invalidate_seq++;
> > -       smp_wmb();
> > -       /*
> > -        * The above sequence increase must be visible before the
> > -        * below count decrease, which is ensured by the smp_wmb above
> > -        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > -        */
> > -       kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >                                         const struct mmu_notifier_range *range)
> >  {
> > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> >         return 0;
> >  }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > +       return false;
> > +}
> > +
> >  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >  {
> >         struct kvm *kvm = kvm_arch_alloc_vm();
> > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> >         return 0;
> >  }
> >
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +       struct kvm_gfn_range gfn_range;
> > +       struct kvm_memory_slot *slot;
> > +       struct kvm_memslots *slots;
> > +       struct kvm_memslot_iter iter;
> > +       int i;
> > +       int r = 0;
> > +
> > +       gfn_range.pte = __pte(0);
> > +       gfn_range.may_block = true;
> > +
> > +       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +               slots = __kvm_memslots(kvm, i);
> > +
> > +               kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > +                       slot = iter.slot;
> > +                       gfn_range.start = max(start, slot->base_gfn);
> > +                       gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > +                       if (gfn_range.start >= gfn_range.end)
> > +                               continue;
> > +                       gfn_range.slot = slot;
> > +
> > +                       r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > +               }
> > +       }
> > +
> > +       if (r)
> > +               kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> >  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >                                            struct kvm_memory_attributes *attrs)
> >  {
> >         gfn_t start, end;
> >         unsigned long i;
> >         void *entry;
> > +       int idx;
> >         u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> >
> > -       /* flags is currently not used. */
> > +       /* 'flags' is currently not used. */
> >         if (attrs->flags)
> >                 return -EINVAL;
> >         if (attrs->attributes & ~supported_attrs)
> > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >
> >         entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> >
> > +       if (kvm_arch_has_private_mem(kvm)) {
> > +               KVM_MMU_LOCK(kvm);
> > +               kvm_mmu_invalidate_begin(kvm);
> > +               kvm_mmu_invalidate_range_add(kvm, start, end);
> > +               KVM_MMU_UNLOCK(kvm);
> > +       }
> > +
> >         mutex_lock(&kvm->lock);
> >         for (i = start; i < end; i++)
> >                 if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >                         break;
> >         mutex_unlock(&kvm->lock);
> >
> > +       if (kvm_arch_has_private_mem(kvm)) {
> > +               idx = srcu_read_lock(&kvm->srcu);
> > +               KVM_MMU_LOCK(kvm);
> > +               if (i > start)
> > +                       kvm_unmap_mem_range(kvm, start, i);
> > +               kvm_mmu_invalidate_end(kvm);
> > +               KVM_MMU_UNLOCK(kvm);
> > +               srcu_read_unlock(&kvm->srcu, idx);
> > +       }
> > +
> >         attrs->address = i << PAGE_SHIFT;
> >         attrs->size = (end - i) << PAGE_SHIFT;
> >
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2022-12-07  6:42       ` Isaku Yamahata
@ 2022-12-08 11:17         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-08 11:17 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Dec 06, 2022 at 10:42:24PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 08:02:24PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> > > On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > 
> > > > A large page with mixed private/shared subpages can't be mapped as a
> > > > large page since its private/shared subpages come from different memory
> > > > backends and may also be treated differently by the architecture. When
> > > > private/shared memory are mixed in a large page, the current lpage_info
> > > > is not sufficient to decide whether the page can be mapped as a large
> > > > page or not, and additional private/shared mixed information is needed.
> > > > 
> > > > Tracking this 'mixed' information with the current 'count'-like
> > > > disallow_lpage is a bit challenging, so reserve a bit in 'disallow_lpage'
> > > > to indicate that a large page has mixed private/shared subpages and
> > > > update this 'mixed' bit whenever the memory attribute is changed between
> > > > private and shared.
> > > > 
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |   8 ++
> > > >  arch/x86/kvm/mmu/mmu.c          | 134 +++++++++++++++++++++++++++++++-
> > > >  arch/x86/kvm/x86.c              |   2 +
> > > >  include/linux/kvm_host.h        |  19 +++++
> > > >  virt/kvm/kvm_main.c             |   9 ++-
> > > >  5 files changed, 169 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 283cbb83d6ae..7772ab37ac89 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -38,6 +38,7 @@
> > > >  #include <asm/hyperv-tlfs.h>
> > > >  
> > > >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > > > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> > > >  
> > > >  #define KVM_MAX_VCPUS 1024
> > > >  
> > > > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> > > >  #endif
> > > >  };
> > > >  
> > > > +/*
> > > > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> > > > + * level. The remaining bits are used as a reference count.
> > > > + */
> > > > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
> > > > +#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
> > > > +
> > > >  struct kvm_lpage_info {
> > > >  	int disallow_lpage;
> > > >  };
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e2c70b5afa3e..2190fd8c95c0 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > > >  {
> > > >  	struct kvm_lpage_info *linfo;
> > > >  	int i;
> > > > +	int disallow_count;
> > > >  
> > > >  	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > > >  		linfo = lpage_info_slot(gfn, slot, i);
> > > > +
> > > > +		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > > > +		WARN_ON(disallow_count + count < 0 ||
> > > > +			disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > > > +
> > > >  		linfo->disallow_lpage += count;
> > > > -		WARN_ON(linfo->disallow_lpage < 0);
> > > >  	}
> > > >  }
> > > >  
> > > > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > > >  	if (kvm->arch.nx_huge_page_recovery_thread)
> > > >  		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> > > >  }
> > > > +
> > > > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > > > +{
> > > > +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > > +}
> > > > +
> > > > +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> > > > +			    int level, bool mixed)
> > > > +{
> > > > +	struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> > > > +
> > > > +	if (mixed)
> > > > +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > > +	else
> > > > +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > > +}
> > > > +
> > > > +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> > > > +{
> > > > +	bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > > > +
> > > > +	if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> > > > +		if (!expect_private)
> > > > +			return false;
> > > > +	} else if (expect_private)
> > > > +		return false;
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> > > > +			       gfn_t start, gfn_t end)
> > > > +{
> > > > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > > > +	gfn_t gfn = start;
> > > > +	void *entry;
> > > > +	bool mixed = false;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	entry = xas_load(&xas);
> > > > +	while (gfn < end) {
> > > > +		if (xas_retry(&xas, entry))
> > > > +			continue;
> > > > +
> > > > +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > > > +
> > > > +		if (!is_expected_attr_entry(entry, attrs)) {
> > > > +			mixed = true;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		entry = xas_next(&xas);
> > > > +		gfn++;
> > > > +	}
> > > > +
> > > > +	rcu_read_unlock();
> > > > +	return mixed;
> > > > +}
> > > > +
> > > > +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > +			    int level, unsigned long attrs,
> > > > +			    gfn_t start, gfn_t end)
> > > > +{
> > > > +	unsigned long gfn;
> > > > +
> > > > +	if (level == PG_LEVEL_2M)
> > > > +		return mem_attrs_mixed_2m(kvm, attrs, start, end);
> > > > +
> > > > +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> > > > +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> > > > +		    !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> > > > +					    attrs))
> > > > +			return true;
> > > > +	return false;
> > > > +}
> > > > +
> > > > +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> > > > +						  struct kvm_memory_slot *slot,
> > > > +						  unsigned long attrs,
> > > > +						  gfn_t start, gfn_t end)
> > > > +{
> > > > +	unsigned long pages, mask;
> > > > +	gfn_t gfn, gfn_end, first, last;
> > > > +	int level;
> > > > +	bool mixed;
> > > > +
> > > > +	/*
> > > > +	 * The sequence matters here: we set the higher level basing on the
> > > > +	 * lower level's scanning result.
> > > > +	 */
> > > > +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > > > +		pages = KVM_PAGES_PER_HPAGE(level);
> > > > +		mask = ~(pages - 1);
> > > > +		first = start & mask;
> > > > +		last = (end - 1) & mask;
> > > > +
> > > > +		/*
> > > > +		 * We only need to scan the head and tail page, for middle pages
> > > > +		 * we know they will not be mixed.
> > > > +		 */
> > > > +		gfn = max(first, slot->base_gfn);
> > > > +		gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> > > > +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > > > +		linfo_set_mixed(gfn, slot, level, mixed);
> > > > +
> > > > +		if (first == last)
> > > > +			return;
> > > 
> > > 
> > > continue.
> > 
> > Ya!
> > 
> > > 
> > > > +
> > > > +		for (gfn = first + pages; gfn < last; gfn += pages)
> > > > +			linfo_set_mixed(gfn, slot, level, false);
> > > > +
> > > > +		gfn = last;
> > > > +		gfn_end = min(last + pages, slot->base_gfn + slot->npages);
> > > 
> > > if (gfn == gfn_end) continue.
> > 
> > Do you see a case where gfn can equal to gfn_end? Though it does not
> > hurt to add a check.
> 
> If last == base_gfn + npages, gfn == gfn_end can occur.

'end' is guaranteed <=  base_gfn + npages in kvm_unmap_mem_range():
	gfn_range.end = min(end, slot->base_gfn + slot->npages);

And 'last' is defined as:
	last = (end - 1) & mask;

Then the math is:
	last = (end - 1) & mask
	     <= end - 1
	     <= base_gfn + npages - 1
	     <  base_gfn + npages
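
For example (made-up numbers, at the 2M level so pages = 512): a slot with
base_gfn = 0 and npages = 1024, with a range ending at end = 1024, gives

	last    = (1024 - 1) & ~511 = 512
	gfn_end = min(512 + 512, 1024) = 1024

so even in that extreme case the tail scan covers [512, 1024) and gfn is
still strictly less than gfn_end.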

Thanks,
Chao
> 
> 
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 9a07380f8d3c..5aefcff614d2 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> > > >  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> > > >  			linfo[lpages - 1].disallow_lpage = 1;
> > > >  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > > > +		if (kvm_slot_can_be_private(slot))
> > > > +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> > > 
> > > Is there any alignment restriction? If not, it should be +=.
> > > In practice, alignment will hold though.
> > 
> > All we need here is to check whether both userspace_addr and
> > restricted_offset are aligned to HPAGE_SIZE. '+=' can actually yield a
> > wrong value when userspace_addr + restricted_offset is aligned to
> > HPAGE_SIZE but the two are not individually aligned.
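
To put concrete numbers on it (illustrative values, not from the patch),
assuming 2M huge pages, i.e. KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) == 512:

	userspace_addr >> PAGE_SHIFT     = 256	/* not 2M aligned */
	restricted_offset >> PAGE_SHIFT  = 256	/* not 2M aligned */

	'+=': 256 + 256 = 512, which looks 2M aligned, so the misalignment
	      would be missed;
	'|=': 256 | 256 = 256, still misaligned, so the "aligned wrt each
	      other" check below correctly disables large pages.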
> 
> Ah, got it. The comment below explains it.
> 
> > Thanks,
> > Chao
> > > 
> > > Thanks,
> > > 
> > > >  		/*
> > > >  		 * If the gfn and userspace address are not aligned wrt each
> > > >  		 * other, disable large page support for this slot.
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-07  8:13   ` Yuan Yao
@ 2022-12-08 11:20     ` Chao Peng
  2022-12-09  5:43       ` Yuan Yao
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-08 11:20 UTC (permalink / raw)
  To: Yuan Yao
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Wed, Dec 07, 2022 at 04:13:14PM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> > Unmap the existing guest mappings when the memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages come from different backends; unmapping the existing ones
> > gives the page fault handler a chance to re-populate the mappings
> > according to the new attribute.
> >
> > Only architectures that have private memory support need this, and a
> > supporting architecture is expected to override the weak
> > kvm_arch_has_private_mem().
> >
> > Also, during the memory attribute change and the unmapping time frame,
> > page faults may happen in the same memory range and can cause an
> > incorrect page state, so invoke the kvm_mmu_invalidate_* helpers to make
> > the page fault handler retry during this time frame.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |   7 +-
> >  virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
> >  2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  struct kvm_gfn_range {
> >  	struct kvm_memory_slot *slot;
> >  	gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> >  	bool may_block;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> >  	struct mmu_notifier mmu_notifier;
> > +#endif
> >  	unsigned long mmu_invalidate_seq;
> >  	long mmu_invalidate_in_progress;
> >  	gfn_t mmu_invalidate_range_start;
> >  	gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> >  	struct list_head devices;
> >  	u64 manual_dirty_log_protect;
> >  	struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> >  int kvm_arch_post_init_vm(struct kvm *kvm);
> >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> >  /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > +	/*
> > +	 * The count increase must become visible at unlock time as no
> > +	 * spte can be established without taking the mmu_lock and
> > +	 * count is also read inside the mmu_lock critical section.
> > +	 */
> > +	kvm->mmu_invalidate_in_progress++;
> > +
> > +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > +		kvm->mmu_invalidate_range_end = INVALID_GPA;
> > +	}
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +		kvm->mmu_invalidate_range_start = start;
> > +		kvm->mmu_invalidate_range_end = end;
> > +	} else {
> > +		/*
> > +		 * Fully tracking multiple concurrent ranges has diminishing
> > +		 * returns. Keep things simple and just find the minimal range
> > +		 * which includes the current and new ranges. As there won't be
> > +		 * enough information to subtract a range after its invalidate
> > +		 * completes, any ranges invalidated concurrently will
> > +		 * accumulate and persist until all outstanding invalidates
> > +		 * complete.
> > +		 */
> > +		kvm->mmu_invalidate_range_start =
> > +			min(kvm->mmu_invalidate_range_start, start);
> > +		kvm->mmu_invalidate_range_end =
> > +			max(kvm->mmu_invalidate_range_end, end);
> > +	}
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > +{
> > +	/*
> > +	 * This sequence increase will notify the kvm page fault that
> > +	 * the page that is going to be mapped in the spte could have
> > +	 * been freed.
> > +	 */
> > +	kvm->mmu_invalidate_seq++;
> > +	smp_wmb();
> > +	/*
> > +	 * The above sequence increase must be visible before the
> > +	 * below count decrease, which is ensured by the smp_wmb above
> > +	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > +	 */
> > +	kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> >  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >  {
> > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > -{
> > -	/*
> > -	 * The count increase must become visible at unlock time as no
> > -	 * spte can be established without taking the mmu_lock and
> > -	 * count is also read inside the mmu_lock critical section.
> > -	 */
> > -	kvm->mmu_invalidate_in_progress++;
> > -
> > -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > -		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > -		kvm->mmu_invalidate_range_end = INVALID_GPA;
> > -	}
> > -}
> > -
> > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > -	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > -
> > -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > -		kvm->mmu_invalidate_range_start = start;
> > -		kvm->mmu_invalidate_range_end = end;
> > -	} else {
> > -		/*
> > -		 * Fully tracking multiple concurrent ranges has diminishing
> > -		 * returns. Keep things simple and just find the minimal range
> > -		 * which includes the current and new ranges. As there won't be
> > -		 * enough information to subtract a range after its invalidate
> > -		 * completes, any ranges invalidated concurrently will
> > -		 * accumulate and persist until all outstanding invalidates
> > -		 * complete.
> > -		 */
> > -		kvm->mmu_invalidate_range_start =
> > -			min(kvm->mmu_invalidate_range_start, start);
> > -		kvm->mmu_invalidate_range_end =
> > -			max(kvm->mmu_invalidate_range_end, end);
> > -	}
> > -}
> > -
> >  static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >  	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	return 0;
> >  }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > -{
> > -	/*
> > -	 * This sequence increase will notify the kvm page fault that
> > -	 * the page that is going to be mapped in the spte could have
> > -	 * been freed.
> > -	 */
> > -	kvm->mmu_invalidate_seq++;
> > -	smp_wmb();
> > -	/*
> > -	 * The above sequence increase must be visible before the
> > -	 * below count decrease, which is ensured by the smp_wmb above
> > -	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > -	 */
> > -	kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  					const struct mmu_notifier_range *range)
> >  {
> > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> >  	return 0;
> >  }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > +	return false;
> > +}
> > +
> >  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >  {
> >  	struct kvm *kvm = kvm_arch_alloc_vm();
> > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> >  	return 0;
> >  }
> >
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	struct kvm_gfn_range gfn_range;
> > +	struct kvm_memory_slot *slot;
> > +	struct kvm_memslots *slots;
> > +	struct kvm_memslot_iter iter;
> > +	int i;
> > +	int r = 0;
> > +
> > +	gfn_range.pte = __pte(0);
> > +	gfn_range.may_block = true;
> > +
> > +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +		slots = __kvm_memslots(kvm, i);
> > +
> > +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > +			slot = iter.slot;
> > +			gfn_range.start = max(start, slot->base_gfn);
> > +			gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > +			if (gfn_range.start >= gfn_range.end)
> > +				continue;
> > +			gfn_range.slot = slot;
> > +
> > +			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > +		}
> > +	}
> > +
> > +	if (r)
> > +		kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> >  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >  					   struct kvm_memory_attributes *attrs)
> >  {
> >  	gfn_t start, end;
> >  	unsigned long i;
> >  	void *entry;
> > +	int idx;
> >  	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> >
> > -	/* flags is currently not used. */
> > +	/* 'flags' is currently not used. */
> >  	if (attrs->flags)
> >  		return -EINVAL;
> >  	if (attrs->attributes & ~supported_attrs)
> > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >
> >  	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> >
> > +	if (kvm_arch_has_private_mem(kvm)) {
> > +		KVM_MMU_LOCK(kvm);
> > +		kvm_mmu_invalidate_begin(kvm);
> > +		kvm_mmu_invalidate_range_add(kvm, start, end);
> 
> Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but
> the invalidation should be necessary yet for attribute change of:
> 
> KVM_MEMORY_ATTRIBUTE_READ
> KVM_MEMORY_ATTRIBUTE_WRITE
> KVM_MEMORY_ATTRIBUTE_EXECUTE

The unmapping is only needed for confidential usages, which use
KVM_MEMORY_ATTRIBUTE_PRIVATE only; the other flags are defined here for
other usages like pKVM. As Fuad commented in a different reply, pKVM
supports in-place remapping, so unmapping is unnecessary there.

Thanks,
Chao
> 
> > +		KVM_MMU_UNLOCK(kvm);
> > +	}
> > +
> >  	mutex_lock(&kvm->lock);
> >  	for (i = start; i < end; i++)
> >  		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >  			break;
> >  	mutex_unlock(&kvm->lock);
> >
> > +	if (kvm_arch_has_private_mem(kvm)) {
> > +		idx = srcu_read_lock(&kvm->srcu);
> > +		KVM_MMU_LOCK(kvm);
> > +		if (i > start)
> > +			kvm_unmap_mem_range(kvm, start, i);
> > +		kvm_mmu_invalidate_end(kvm);
> 
> Ditto.
> 
> > +		KVM_MMU_UNLOCK(kvm);
> > +		srcu_read_unlock(&kvm->srcu, idx);
> > +	}
> > +
> >  	attrs->address = i << PAGE_SHIFT;
> >  	attrs->size = (end - i) << PAGE_SHIFT;
> >
> > --
> > 2.25.1
> >
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
  2022-12-08  2:29   ` Yuan Yao
@ 2022-12-08 11:23     ` Chao Peng
  2022-12-09  5:45       ` Yuan Yao
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-08 11:23 UTC (permalink / raw)
  To: Yuan Yao
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Dec 08, 2022 at 10:29:18AM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the ongoing fault is private or not. This patch adds an
> > 'is_private' field to kvm_page_fault to indicate this and architecture
> > code is expected to set it.
> >
> > To handle page faults for such a memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> >   - For a successful match, private pfn is obtained with
> >     restrictedmem_get_page() and shared pfn is obtained with existing
> >     get_user_pages().
> >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> >     userspace. Userspace then can convert memory between private/shared
> >     in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
> >  include/linux/kvm_host.h        | 30 ++++++++++++++++
> >  5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> >
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > -			      int max_level)
> > +			      int max_level, bool is_private)
> >  {
> >  	struct kvm_lpage_info *linfo;
> >  	int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			break;
> >  	}
> >
> > +	if (is_private)
> > +		return max_level;
> 
> The lpage mixed information is already saved, so is it possible
> to query info->disallow_lpage without caring about 'is_private'?

Actually we already queried info->disallow_lpage just before this
statement. The check is needed because later in the function we call
host_pfn_mapping_level(), which is shared-memory specific.
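
A rough sketch of the two paths, for reference (the private side uses the
names from this series; the shared side paraphrases the existing helper
rather than quoting it):

	/*
	 * shared:  level = host_pfn_mapping_level(kvm, gfn, slot);
	 *          // walks the host userspace page tables at
	 *          // __gfn_to_hva_memslot(slot, gfn); a restrictedmem page
	 *          // has no such host mapping, so this cannot be used for
	 *          // the private case.
	 *
	 * private: restrictedmem_get_page(file, index, &page, &order);
	 *          fault->max_level = min(order_to_level(order),
	 *                                 fault->max_level);
	 */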

Thanks,
Chao
> 
> > +
> >  	if (max_level == PG_LEVEL_4K)
> >  		return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	 * level, which will be used to do precise, accurate accounting.
> >  	 */
> >  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > -						     fault->gfn, fault->max_level);
> > +						     fault->gfn, fault->max_level,
> > +						     fault->is_private);
> >  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> >  		return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +		return PG_LEVEL_1G;
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +		return PG_LEVEL_2M;
> > +
> > +	return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +				    struct kvm_page_fault *fault)
> > +{
> > +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +	if (fault->is_private)
> > +		vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > +	else
> > +		vcpu->run->memory.flags = 0;
> > +	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > +	vcpu->run->memory.size = PAGE_SIZE;
> > +	return RET_PF_USER;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +				   struct kvm_page_fault *fault)
> > +{
> > +	int order;
> > +	struct kvm_memory_slot *slot = fault->slot;
> > +
> > +	if (!kvm_slot_can_be_private(slot))
> > +		return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > +		return RET_PF_RETRY;
> > +
> > +	fault->max_level = min(order_to_level(order), fault->max_level);
> > +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > +	return RET_PF_CONTINUE;
> > +}
> > +
> >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >  	struct kvm_memory_slot *slot = fault->slot;
> > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  			return RET_PF_EMULATE;
> >  	}
> >
> > +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > +		return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > +	if (fault->is_private)
> > +		return kvm_faultin_pfn_private(vcpu, fault);
> > +
> >  	async = false;
> >  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> >  					  fault->write, &fault->map_writable,
> > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> >  			return -EIO;
> >  	}
> >
> > +	if (r == RET_PF_USER)
> > +		return 0;
> > +
> >  	if (r < 0)
> >  		return r;
> >  	if (r != RET_PF_EMULATE)
> > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >  		 */
> >  		if (sp->role.direct &&
> >  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > -							       PG_LEVEL_NUM)) {
> > +							       PG_LEVEL_NUM,
> > +							       false)) {
> >  			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> >
> >  			if (kvm_available_flush_tlb_with_range())
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index dbaf6755c5a7..5ccf08183b00 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> >
> >  	/* Derived from mmu and global state.  */
> >  	const bool is_tdp;
> > +	const bool is_private;
> >  	const bool nx_huge_page_workaround_enabled;
> >
> >  	/*
> > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >   * RET_PF_RETRY: let CPU fault again on the address.
> >   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> >   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > + * RET_PF_USER: need to exit to userspace to handle this fault.
> >   * RET_PF_FIXED: The faulting entry has been fixed.
> >   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> >   *
> > @@ -253,6 +255,7 @@ enum {
> >  	RET_PF_RETRY,
> >  	RET_PF_EMULATE,
> >  	RET_PF_INVALID,
> > +	RET_PF_USER,
> >  	RET_PF_FIXED,
> >  	RET_PF_SPURIOUS,
> >  };
> > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > -			      int max_level);
> > +			      int max_level, bool is_private);
> >  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> >
> > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > +	WARN_ON_ONCE(1);
> > +	return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index ae86820cef69..2d7555381955 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> >  TRACE_DEFINE_ENUM(RET_PF_RETRY);
> >  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> >  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > +TRACE_DEFINE_ENUM(RET_PF_USER);
> >  TRACE_DEFINE_ENUM(RET_PF_FIXED);
> >  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 771210ce5181..8ba1a4afc546 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >  			continue;
> >
> >  		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > -							      iter.gfn, PG_LEVEL_NUM);
> > +						iter.gfn, PG_LEVEL_NUM, false);
> >  		if (max_mapping_level < iter.level)
> >  			continue;
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 25099c94e770..153842bb33df 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> >  }
> >  #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > +	       KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return false;
> > +}
> > +
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > +	int ret;
> > +	struct page *page;
> > +	pgoff_t index = gfn - slot->base_gfn +
> > +			(slot->restricted_offset >> PAGE_SHIFT);
> > +
> > +	ret = restrictedmem_get_page(slot->restricted_file, index,
> > +				     &page, order);
> > +	*pfn = page_to_pfn(page);
> > +	return ret;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> >  #endif
> > --
> > 2.25.1
> >
> >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-08  8:37   ` Xiaoyao Li
@ 2022-12-08 11:30     ` Chao Peng
  2022-12-13 12:04       ` Xiaoyao Li
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-08 11:30 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> On 12/2/2022 2:13 PM, Chao Peng wrote:
> 
> ..
> 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> > 
> 
> From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> needed.

The reason is that we want KVM to further control the feature enabling. An
opt-in CONFIG_RESTRICTEDMEM could cause problems if a user sets it for
unsupported architectures.

Here is the original discussion:
https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/

Thanks,
Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-08 11:20     ` Chao Peng
@ 2022-12-09  5:43       ` Yuan Yao
  0 siblings, 0 replies; 398+ messages in thread
From: Yuan Yao @ 2022-12-09  5:43 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Dec 08, 2022 at 07:20:43PM +0800, Chao Peng wrote:
> On Wed, Dec 07, 2022 at 04:13:14PM +0800, Yuan Yao wrote:
> > On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> > > Unmap the existing guest mappings when the memory attribute is changed
> > > between shared and private. This is needed because shared pages and
> > > private pages come from different backends; unmapping the existing ones
> > > gives the page fault handler a chance to re-populate the mappings
> > > according to the new attribute.
> > >
> > > Only architectures that have private memory support need this, and a
> > > supporting architecture is expected to override the weak
> > > kvm_arch_has_private_mem().
> > >
> > > Also, during the memory attribute change and the unmapping time frame,
> > > page faults may happen in the same memory range and can cause an
> > > incorrect page state, so invoke the kvm_mmu_invalidate_* helpers to make
> > > the page fault handler retry during this time frame.
> > >
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >  include/linux/kvm_host.h |   7 +-
> > >  virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
> > >  2 files changed, 116 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3d69484d2704..3331c0c92838 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > >  #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > >  struct kvm_gfn_range {
> > >  	struct kvm_memory_slot *slot;
> > >  	gfn_t start;
> > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > >  	bool may_block;
> > >  };
> > >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -785,11 +786,12 @@ struct kvm {
> > >
> > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > >  	struct mmu_notifier mmu_notifier;
> > > +#endif
> > >  	unsigned long mmu_invalidate_seq;
> > >  	long mmu_invalidate_in_progress;
> > >  	gfn_t mmu_invalidate_range_start;
> > >  	gfn_t mmu_invalidate_range_end;
> > > -#endif
> > > +
> > >  	struct list_head devices;
> > >  	u64 manual_dirty_log_protect;
> > >  	struct dentry *debugfs_dentry;
> > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > >  int kvm_arch_post_init_vm(struct kvm *kvm);
> > >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > >  /*
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > +{
> > > +	/*
> > > +	 * The count increase must become visible at unlock time as no
> > > +	 * spte can be established without taking the mmu_lock and
> > > +	 * count is also read inside the mmu_lock critical section.
> > > +	 */
> > > +	kvm->mmu_invalidate_in_progress++;
> > > +
> > > +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > +		kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > +	}
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +		kvm->mmu_invalidate_range_start = start;
> > > +		kvm->mmu_invalidate_range_end = end;
> > > +	} else {
> > > +		/*
> > > +		 * Fully tracking multiple concurrent ranges has diminishing
> > > +		 * returns. Keep things simple and just find the minimal range
> > > +		 * which includes the current and new ranges. As there won't be
> > > +		 * enough information to subtract a range after its invalidate
> > > +		 * completes, any ranges invalidated concurrently will
> > > +		 * accumulate and persist until all outstanding invalidates
> > > +		 * complete.
> > > +		 */
> > > +		kvm->mmu_invalidate_range_start =
> > > +			min(kvm->mmu_invalidate_range_start, start);
> > > +		kvm->mmu_invalidate_range_end =
> > > +			max(kvm->mmu_invalidate_range_end, end);
> > > +	}
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > +{
> > > +	/*
> > > +	 * This sequence increase will notify the kvm page fault that
> > > +	 * the page that is going to be mapped in the spte could have
> > > +	 * been freed.
> > > +	 */
> > > +	kvm->mmu_invalidate_seq++;
> > > +	smp_wmb();
> > > +	/*
> > > +	 * The above sequence increase must be visible before the
> > > +	 * below count decrease, which is ensured by the smp_wmb above
> > > +	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > +	 */
> > > +	kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > >  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > >  {
> > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > >  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > >  }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > -{
> > > -	/*
> > > -	 * The count increase must become visible at unlock time as no
> > > -	 * spte can be established without taking the mmu_lock and
> > > -	 * count is also read inside the mmu_lock critical section.
> > > -	 */
> > > -	kvm->mmu_invalidate_in_progress++;
> > > -
> > > -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > -		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > -		kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > -	}
> > > -}
> > > -
> > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > -	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > -
> > > -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > -		kvm->mmu_invalidate_range_start = start;
> > > -		kvm->mmu_invalidate_range_end = end;
> > > -	} else {
> > > -		/*
> > > -		 * Fully tracking multiple concurrent ranges has diminishing
> > > -		 * returns. Keep things simple and just find the minimal range
> > > -		 * which includes the current and new ranges. As there won't be
> > > -		 * enough information to subtract a range after its invalidate
> > > -		 * completes, any ranges invalidated concurrently will
> > > -		 * accumulate and persist until all outstanding invalidates
> > > -		 * complete.
> > > -		 */
> > > -		kvm->mmu_invalidate_range_start =
> > > -			min(kvm->mmu_invalidate_range_start, start);
> > > -		kvm->mmu_invalidate_range_end =
> > > -			max(kvm->mmu_invalidate_range_end, end);
> > > -	}
> > > -}
> > > -
> > >  static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > >  {
> > >  	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >  	return 0;
> > >  }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > -{
> > > -	/*
> > > -	 * This sequence increase will notify the kvm page fault that
> > > -	 * the page that is going to be mapped in the spte could have
> > > -	 * been freed.
> > > -	 */
> > > -	kvm->mmu_invalidate_seq++;
> > > -	smp_wmb();
> > > -	/*
> > > -	 * The above sequence increase must be visible before the
> > > -	 * below count decrease, which is ensured by the smp_wmb above
> > > -	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > -	 */
> > > -	kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > >  					const struct mmu_notifier_range *range)
> > >  {
> > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > >  	return 0;
> > >  }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > +	return false;
> > > +}
> > > +
> > >  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > >  {
> > >  	struct kvm *kvm = kvm_arch_alloc_vm();
> > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > >  	return 0;
> > >  }
> > >
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +	struct kvm_gfn_range gfn_range;
> > > +	struct kvm_memory_slot *slot;
> > > +	struct kvm_memslots *slots;
> > > +	struct kvm_memslot_iter iter;
> > > +	int i;
> > > +	int r = 0;
> > > +
> > > +	gfn_range.pte = __pte(0);
> > > +	gfn_range.may_block = true;
> > > +
> > > +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > +		slots = __kvm_memslots(kvm, i);
> > > +
> > > +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > +			slot = iter.slot;
> > > +			gfn_range.start = max(start, slot->base_gfn);
> > > +			gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > +			if (gfn_range.start >= gfn_range.end)
> > > +				continue;
> > > +			gfn_range.slot = slot;
> > > +
> > > +			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > +		}
> > > +	}
> > > +
> > > +	if (r)
> > > +		kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > >  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >  					   struct kvm_memory_attributes *attrs)
> > >  {
> > >  	gfn_t start, end;
> > >  	unsigned long i;
> > >  	void *entry;
> > > +	int idx;
> > >  	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > >
> > > -	/* flags is currently not used. */
> > > +	/* 'flags' is currently not used. */
> > >  	if (attrs->flags)
> > >  		return -EINVAL;
> > >  	if (attrs->attributes & ~supported_attrs)
> > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >
> > >  	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > >
> > > +	if (kvm_arch_has_private_mem(kvm)) {
> > > +		KVM_MMU_LOCK(kvm);
> > > +		kvm_mmu_invalidate_begin(kvm);
> > > +		kvm_mmu_invalidate_range_add(kvm, start, end);
> >
> > Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but
> > the invalidation should be necessary yet for attribute change of:
> >
> > KVM_MEMORY_ATTRIBUTE_READ
> > KVM_MEMORY_ATTRIBUTE_WRITE
> > KVM_MEMORY_ATTRIBUTE_EXECUTE
>
> The unmapping is only needed for confidential usages, which use
> KVM_MEMORY_ATTRIBUTE_PRIVATE only; the other flags are defined here for
> other usages like pKVM. As Fuad commented in a different reply, pKVM
> supports in-place remapping, so unmapping is unnecessary there.

Ah, I see. It's fine by me, thanks.

>
> Thanks,
> Chao
> >
> > > +		KVM_MMU_UNLOCK(kvm);
> > > +	}
> > > +
> > >  	mutex_lock(&kvm->lock);
> > >  	for (i = start; i < end; i++)
> > >  		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >  			break;
> > >  	mutex_unlock(&kvm->lock);
> > >
> > > +	if (kvm_arch_has_private_mem(kvm)) {
> > > +		idx = srcu_read_lock(&kvm->srcu);
> > > +		KVM_MMU_LOCK(kvm);
> > > +		if (i > start)
> > > +			kvm_unmap_mem_range(kvm, start, i);
> > > +		kvm_mmu_invalidate_end(kvm);
> >
> > Ditto.
> >
> > > +		KVM_MMU_UNLOCK(kvm);
> > > +		srcu_read_unlock(&kvm->srcu, idx);
> > > +	}
> > > +
> > >  	attrs->address = i << PAGE_SHIFT;
> > >  	attrs->size = (end - i) << PAGE_SHIFT;
> > >
> > > --
> > > 2.25.1
> > >
> > >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
  2022-12-08 11:23     ` Chao Peng
@ 2022-12-09  5:45       ` Yuan Yao
  0 siblings, 0 replies; 398+ messages in thread
From: Yuan Yao @ 2022-12-09  5:45 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Dec 08, 2022 at 07:23:46PM +0800, Chao Peng wrote:
> On Thu, Dec 08, 2022 at 10:29:18AM +0800, Yuan Yao wrote:
> > On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> > > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > > hva-based shared memory. Architecture code (like TDX code) can tell
> > > whether the ongoing fault is private or not. This patch adds an
> > > 'is_private' field to kvm_page_fault to indicate this and architecture
> > > code is expected to set it.
> > >
> > > To handle page faults for such a memslot, the handling logic is different
> > > depending on whether the fault is private or shared. KVM checks if
> > > 'is_private' matches the host's view of the page (maintained in
> > > mem_attr_array).
> > >   - For a successful match, private pfn is obtained with
> > >     restrictedmem_get_page() and shared pfn is obtained with existing
> > >     get_user_pages().
> > >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > >     userspace. Userspace then can convert memory between private/shared
> > >     in host's view and retry the fault.
> > >
> > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++++++++++++++--
> > >  arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> > >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> > >  arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
> > >  include/linux/kvm_host.h        | 30 ++++++++++++++++
> > >  5 files changed, 105 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2190fd8c95c0..b1953ebc012e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> > >
> > >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > > -			      int max_level)
> > > +			      int max_level, bool is_private)
> > >  {
> > >  	struct kvm_lpage_info *linfo;
> > >  	int host_level;
> > > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  			break;
> > >  	}
> > >
> > > +	if (is_private)
> > > +		return max_level;
> >
> > The lpage mixed information is already saved, so is it possible
> > to query info->disallow_lpage without caring about 'is_private'?
>
> Actually we already queried info->disallow_lpage just before this
> statement. The check is needed because later in the function we call
> host_pfn_mapping_level(), which is shared-memory specific.

You're right. We can't get the mapping level info for a private page from
host_pfn_mapping_level().

>
> Thanks,
> Chao
> >
> > > +
> > >  	if (max_level == PG_LEVEL_4K)
> > >  		return PG_LEVEL_4K;
> > >
> > > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >  	 * level, which will be used to do precise, accurate accounting.
> > >  	 */
> > >  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > > -						     fault->gfn, fault->max_level);
> > > +						     fault->gfn, fault->max_level,
> > > +						     fault->is_private);
> > >  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > >  		return;
> > >
> > > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > >  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > >  }
> > >
> > > +static inline u8 order_to_level(int order)
> > > +{
> > > +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > > +
> > > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > > +		return PG_LEVEL_1G;
> > > +
> > > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > > +		return PG_LEVEL_2M;
> > > +
> > > +	return PG_LEVEL_4K;
> > > +}
> > > +
> > > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > > +				    struct kvm_page_fault *fault)
> > > +{
> > > +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > +	if (fault->is_private)
> > > +		vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > > +	else
> > > +		vcpu->run->memory.flags = 0;
> > > +	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > > +	vcpu->run->memory.size = PAGE_SIZE;
> > > +	return RET_PF_USER;
> > > +}
> > > +
> > > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > +				   struct kvm_page_fault *fault)
> > > +{
> > > +	int order;
> > > +	struct kvm_memory_slot *slot = fault->slot;
> > > +
> > > +	if (!kvm_slot_can_be_private(slot))
> > > +		return kvm_do_memory_fault_exit(vcpu, fault);
> > > +
> > > +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > > +		return RET_PF_RETRY;
> > > +
> > > +	fault->max_level = min(order_to_level(order), fault->max_level);
> > > +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > > +	return RET_PF_CONTINUE;
> > > +}
> > > +
> > >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  {
> > >  	struct kvm_memory_slot *slot = fault->slot;
> > > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  			return RET_PF_EMULATE;
> > >  	}
> > >
> > > +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > > +		return kvm_do_memory_fault_exit(vcpu, fault);
> > > +
> > > +	if (fault->is_private)
> > > +		return kvm_faultin_pfn_private(vcpu, fault);
> > > +
> > >  	async = false;
> > >  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > >  					  fault->write, &fault->map_writable,
> > > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > >  			return -EIO;
> > >  	}
> > >
> > > +	if (r == RET_PF_USER)
> > > +		return 0;
> > > +
> > >  	if (r < 0)
> > >  		return r;
> > >  	if (r != RET_PF_EMULATE)
> > > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > >  		 */
> > >  		if (sp->role.direct &&
> > >  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > > -							       PG_LEVEL_NUM)) {
> > > +							       PG_LEVEL_NUM,
> > > +							       false)) {
> > >  			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> > >
> > >  			if (kvm_available_flush_tlb_with_range())
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index dbaf6755c5a7..5ccf08183b00 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> > >
> > >  	/* Derived from mmu and global state.  */
> > >  	const bool is_tdp;
> > > +	const bool is_private;
> > >  	const bool nx_huge_page_workaround_enabled;
> > >
> > >  	/*
> > > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > >   * RET_PF_RETRY: let CPU fault again on the address.
> > >   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > >   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > >   * RET_PF_FIXED: The faulting entry has been fixed.
> > >   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > >   *
> > > @@ -253,6 +255,7 @@ enum {
> > >  	RET_PF_RETRY,
> > >  	RET_PF_EMULATE,
> > >  	RET_PF_INVALID,
> > > +	RET_PF_USER,
> > >  	RET_PF_FIXED,
> > >  	RET_PF_SPURIOUS,
> > >  };
> > > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >
> > >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > > -			      int max_level);
> > > +			      int max_level, bool is_private);
> > >  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > >  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> > >
> > > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > >  void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >
> > > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > +{
> > > +	WARN_ON_ONCE(1);
> > > +	return -EOPNOTSUPP;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > +
> > >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > > index ae86820cef69..2d7555381955 100644
> > > --- a/arch/x86/kvm/mmu/mmutrace.h
> > > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > >  TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > >  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > >  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > >  TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > >  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 771210ce5181..8ba1a4afc546 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > >  			continue;
> > >
> > >  		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > > -							      iter.gfn, PG_LEVEL_NUM);
> > > +						iter.gfn, PG_LEVEL_NUM, false);
> > >  		if (max_mapping_level < iter.level)
> > >  			continue;
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 25099c94e770..153842bb33df 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > >  }
> > >  #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> > >
> > > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > > +	       KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > > +}
> > > +#else
> > > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > +	return false;
> > > +}
> > > +
> > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > +
> > > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > +{
> > > +	int ret;
> > > +	struct page *page;
> > > +	pgoff_t index = gfn - slot->base_gfn +
> > > +			(slot->restricted_offset >> PAGE_SHIFT);
> > > +
> > > +	ret = restrictedmem_get_page(slot->restricted_file, index,
> > > +				     &page, order);
> > > +	*pfn = page_to_pfn(page);
> > > +	return ret;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > +
> > >  #endif
> > > --
> > > 2.25.1
> > >
> > >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-12-06 15:48       ` Fuad Tabba
@ 2022-12-09  6:24         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-09  6:24 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Tue, Dec 06, 2022 at 03:48:50PM +0000, Fuad Tabba wrote:
...
 > >
> > > >          */
> > > > -       if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > -           hva >= kvm->mmu_invalidate_range_start &&
> > > > -           hva < kvm->mmu_invalidate_range_end)
> > > > -               return 1;
> > > > +       if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > +               /*
> > > > +                * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > > +                * but before updating the range is a KVM bug.
> > > > +                */
> > > > +               if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > > +                                kvm->mmu_invalidate_range_end == INVALID_GPA))
> > >
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> >
> > Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident that
> > is the correct choice for other architectures, but after searching, it has
> > not been used by other architectures, so it should be safe to make it
> > common.

As Yu posted a patch:
https://lore.kernel.org/all/20221209023622.274715-1-yu.c.zhang@linux.intel.com/

There is a GPA_INVALID in include/linux/kvm_types.h and I see ARM is already
using it, so it sounds like that is exactly what I need.
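
I.e. the check quoted above would become something like (a sketch only,
assuming GPA_INVALID is the name we settle on):

	if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == GPA_INVALID ||
			 kvm->mmu_invalidate_range_end == GPA_INVALID))
		return 1;

with kvm_mmu_invalidate_begin() initializing the range to GPA_INVALID to
match.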

Chao
> 
> With this fixed,
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> And the necessary work to port to arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@google.com>
> 
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-08 11:13     ` Chao Peng
@ 2022-12-09  8:57       ` Fuad Tabba
  2022-12-12  7:22         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-09  8:57 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Thu, Dec 8, 2022 at 11:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> > Hi,
> >
> > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > Unmap the existing guest mappings when the memory attribute is changed
> > > between shared and private. This is needed because shared pages and
> > > private pages come from different backends; unmapping the existing ones
> > > gives the page fault handler a chance to re-populate the mappings
> > > according to the new attribute.
> > >
> > > Only architectures that have private memory support need this, and a
> > > supporting architecture is expected to override the weak
> > > kvm_arch_has_private_mem().
> >
> > This kind of ties into the discussion of being able to share memory in
> > place. For pKVM for example, shared and private memory would have the
> > same backend, and the unmapping wouldn't be needed.
> >
> > So I guess that, instead of kvm_arch_has_private_mem(), can the check
> > be done differently, e.g., with a different function, say
> > kvm_arch_private_notify_attribute_change() (but maybe with a more
> > friendly name than what I suggested :) )?
>
> Besides controlling the unmapping here, kvm_arch_has_private_mem() is
> also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
> unmapping is confirmed unnecessary for pKVM, but how about
> KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
> If the answer is the latter, then yes we should use a different check
> which only works for confidential usages here.

I think it makes sense for pKVM to use the same flag (KVM_MEM_PRIVATE)
and not to add another one.
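
For the unmapping itself, the zap could then be gated on a separate weak
hook along the lines of what I suggested, while KVM_MEM_PRIVATE stays
gated on kvm_arch_has_private_mem(). A minimal sketch, with a purely
hypothetical name:

bool __weak kvm_arch_zap_on_attr_change(struct kvm *kvm)
{
        /*
         * Default to zapping whenever the VM has private memory; pKVM
         * could override this to return false since its shared and
         * private memory use the same backend.
         */
        return kvm_arch_has_private_mem(kvm);
}

kvm_vm_ioctl_set_mem_attributes() would then check this instead of
kvm_arch_has_private_mem() when deciding whether to unmap.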

Thank you,
/fuad



>
> Thanks,
> Chao
> >
> > Thanks,
> > /fuad
> >
> > >
> > > Also, during the memory attribute change and the unmapping time frame,
> > > page faults may happen in the same memory range and could leave the
> > > page state incorrect, so invoke the kvm_mmu_invalidate_* helpers to
> > > make the page fault handler retry during this time frame.
> > >
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >  include/linux/kvm_host.h |   7 +-
> > >  virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
> > >  2 files changed, 116 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3d69484d2704..3331c0c92838 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > >  #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > >  struct kvm_gfn_range {
> > >         struct kvm_memory_slot *slot;
> > >         gfn_t start;
> > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > >         bool may_block;
> > >  };
> > >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -785,11 +786,12 @@ struct kvm {
> > >
> > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > >         struct mmu_notifier mmu_notifier;
> > > +#endif
> > >         unsigned long mmu_invalidate_seq;
> > >         long mmu_invalidate_in_progress;
> > >         gfn_t mmu_invalidate_range_start;
> > >         gfn_t mmu_invalidate_range_end;
> > > -#endif
> > > +
> > >         struct list_head devices;
> > >         u64 manual_dirty_log_protect;
> > >         struct dentry *debugfs_dentry;
> > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > >  int kvm_arch_post_init_vm(struct kvm *kvm);
> > >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > >  /*
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > +{
> > > +       /*
> > > +        * The count increase must become visible at unlock time as no
> > > +        * spte can be established without taking the mmu_lock and
> > > +        * count is also read inside the mmu_lock critical section.
> > > +        */
> > > +       kvm->mmu_invalidate_in_progress++;
> > > +
> > > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > +               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > +       }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +               kvm->mmu_invalidate_range_start = start;
> > > +               kvm->mmu_invalidate_range_end = end;
> > > +       } else {
> > > +               /*
> > > +                * Fully tracking multiple concurrent ranges has diminishing
> > > +                * returns. Keep things simple and just find the minimal range
> > > +                * which includes the current and new ranges. As there won't be
> > > +                * enough information to subtract a range after its invalidate
> > > +                * completes, any ranges invalidated concurrently will
> > > +                * accumulate and persist until all outstanding invalidates
> > > +                * complete.
> > > +                */
> > > +               kvm->mmu_invalidate_range_start =
> > > +                       min(kvm->mmu_invalidate_range_start, start);
> > > +               kvm->mmu_invalidate_range_end =
> > > +                       max(kvm->mmu_invalidate_range_end, end);
> > > +       }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > +{
> > > +       /*
> > > +        * This sequence increase will notify the kvm page fault that
> > > +        * the page that is going to be mapped in the spte could have
> > > +        * been freed.
> > > +        */
> > > +       kvm->mmu_invalidate_seq++;
> > > +       smp_wmb();
> > > +       /*
> > > +        * The above sequence increase must be visible before the
> > > +        * below count decrease, which is ensured by the smp_wmb above
> > > +        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > +        */
> > > +       kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > >  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > >  {
> > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > >         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > >  }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > -{
> > > -       /*
> > > -        * The count increase must become visible at unlock time as no
> > > -        * spte can be established without taking the mmu_lock and
> > > -        * count is also read inside the mmu_lock critical section.
> > > -        */
> > > -       kvm->mmu_invalidate_in_progress++;
> > > -
> > > -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > -               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > -               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > -       }
> > > -}
> > > -
> > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > -       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > -
> > > -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > -               kvm->mmu_invalidate_range_start = start;
> > > -               kvm->mmu_invalidate_range_end = end;
> > > -       } else {
> > > -               /*
> > > -                * Fully tracking multiple concurrent ranges has diminishing
> > > -                * returns. Keep things simple and just find the minimal range
> > > -                * which includes the current and new ranges. As there won't be
> > > -                * enough information to subtract a range after its invalidate
> > > -                * completes, any ranges invalidated concurrently will
> > > -                * accumulate and persist until all outstanding invalidates
> > > -                * complete.
> > > -                */
> > > -               kvm->mmu_invalidate_range_start =
> > > -                       min(kvm->mmu_invalidate_range_start, start);
> > > -               kvm->mmu_invalidate_range_end =
> > > -                       max(kvm->mmu_invalidate_range_end, end);
> > > -       }
> > > -}
> > > -
> > >  static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > >  {
> > >         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >         return 0;
> > >  }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > -{
> > > -       /*
> > > -        * This sequence increase will notify the kvm page fault that
> > > -        * the page that is going to be mapped in the spte could have
> > > -        * been freed.
> > > -        */
> > > -       kvm->mmu_invalidate_seq++;
> > > -       smp_wmb();
> > > -       /*
> > > -        * The above sequence increase must be visible before the
> > > -        * below count decrease, which is ensured by the smp_wmb above
> > > -        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > -        */
> > > -       kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > >                                         const struct mmu_notifier_range *range)
> > >  {
> > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > >         return 0;
> > >  }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > >  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > >  {
> > >         struct kvm *kvm = kvm_arch_alloc_vm();
> > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > >         return 0;
> > >  }
> > >
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +       struct kvm_gfn_range gfn_range;
> > > +       struct kvm_memory_slot *slot;
> > > +       struct kvm_memslots *slots;
> > > +       struct kvm_memslot_iter iter;
> > > +       int i;
> > > +       int r = 0;
> > > +
> > > +       gfn_range.pte = __pte(0);
> > > +       gfn_range.may_block = true;
> > > +
> > > +       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > +               slots = __kvm_memslots(kvm, i);
> > > +
> > > +               kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > +                       slot = iter.slot;
> > > +                       gfn_range.start = max(start, slot->base_gfn);
> > > +                       gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > +                       if (gfn_range.start >= gfn_range.end)
> > > +                               continue;
> > > +                       gfn_range.slot = slot;
> > > +
> > > +                       r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > +               }
> > > +       }
> > > +
> > > +       if (r)
> > > +               kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > >  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >                                            struct kvm_memory_attributes *attrs)
> > >  {
> > >         gfn_t start, end;
> > >         unsigned long i;
> > >         void *entry;
> > > +       int idx;
> > >         u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > >
> > > -       /* flags is currently not used. */
> > > +       /* 'flags' is currently not used. */
> > >         if (attrs->flags)
> > >                 return -EINVAL;
> > >         if (attrs->attributes & ~supported_attrs)
> > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >
> > >         entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > >
> > > +       if (kvm_arch_has_private_mem(kvm)) {
> > > +               KVM_MMU_LOCK(kvm);
> > > +               kvm_mmu_invalidate_begin(kvm);
> > > +               kvm_mmu_invalidate_range_add(kvm, start, end);
> > > +               KVM_MMU_UNLOCK(kvm);
> > > +       }
> > > +
> > >         mutex_lock(&kvm->lock);
> > >         for (i = start; i < end; i++)
> > >                 if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >                         break;
> > >         mutex_unlock(&kvm->lock);
> > >
> > > +       if (kvm_arch_has_private_mem(kvm)) {
> > > +               idx = srcu_read_lock(&kvm->srcu);
> > > +               KVM_MMU_LOCK(kvm);
> > > +               if (i > start)
> > > +                       kvm_unmap_mem_range(kvm, start, i);
> > > +               kvm_mmu_invalidate_end(kvm);
> > > +               KVM_MMU_UNLOCK(kvm);
> > > +               srcu_read_unlock(&kvm->srcu, idx);
> > > +       }
> > > +
> > >         attrs->address = i << PAGE_SHIFT;
> > >         attrs->size = (end - i) << PAGE_SHIFT;
> > >
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
  2022-12-02  6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
  2022-12-08  2:29   ` Yuan Yao
@ 2022-12-09  9:01   ` Fuad Tabba
  2022-12-12  7:23     ` Chao Peng
  2023-01-13 23:29   ` Sean Christopherson
  2 siblings, 1 reply; 398+ messages in thread
From: Fuad Tabba @ 2022-12-09  9:01 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> hva-based shared memory. Architecture code (like TDX code) can tell
> whether the ongoing fault is private or not. This patch adds an
> 'is_private' field to kvm_page_fault to indicate this, and architecture
> code is expected to set it.
>
> To handle a page fault for such a memslot, the handling logic differs
> depending on whether the fault is private or shared. KVM checks whether
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
>   - For a successful match, the private pfn is obtained with
>     restrictedmem_get_page() and the shared pfn with the existing
>     get_user_pages().
>   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
>     userspace. Userspace can then convert the memory between
>     private/shared in the host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
>  arch/x86/kvm/mmu/mmutrace.h     |  1 +
>  arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
>  include/linux/kvm_host.h        | 30 ++++++++++++++++
>  5 files changed, 105 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2190fd8c95c0..b1953ebc012e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>                               const struct kvm_memory_slot *slot, gfn_t gfn,
> -                             int max_level)
> +                             int max_level, bool is_private)
>  {
>         struct kvm_lpage_info *linfo;
>         int host_level;
> @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>                         break;
>         }
>
> +       if (is_private)
> +               return max_level;
> +
>         if (max_level == PG_LEVEL_4K)
>                 return PG_LEVEL_4K;
>
> @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>          * level, which will be used to do precise, accurate accounting.
>          */
>         fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -                                                    fault->gfn, fault->max_level);
> +                                                    fault->gfn, fault->max_level,
> +                                                    fault->is_private);
>         if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>                 return;
>
> @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>         kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
>  }
>
> +static inline u8 order_to_level(int order)
> +{
> +       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> +       if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> +               return PG_LEVEL_1G;
> +
> +       if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> +               return PG_LEVEL_2M;
> +
> +       return PG_LEVEL_4K;
> +}
> +
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> +                                   struct kvm_page_fault *fault)
> +{
> +       vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +       if (fault->is_private)
> +               vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +       else
> +               vcpu->run->memory.flags = 0;
> +       vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;

nit: As in previous patches, use helpers (for this and other similar
shifts in this patch)?
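
For example (just a sketch; gfn_to_gpa() is the kind of helper I have in
mind here):

        vcpu->run->memory.gpa = gfn_to_gpa(fault->gfn);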

> +       vcpu->run->memory.size = PAGE_SIZE;
> +       return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +                                  struct kvm_page_fault *fault)
> +{
> +       int order;
> +       struct kvm_memory_slot *slot = fault->slot;
> +
> +       if (!kvm_slot_can_be_private(slot))
> +               return kvm_do_memory_fault_exit(vcpu, fault);
> +
> +       if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> +               return RET_PF_RETRY;
> +
> +       fault->max_level = min(order_to_level(order), fault->max_level);
> +       fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> +       return RET_PF_CONTINUE;
> +}
> +
>  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>         struct kvm_memory_slot *slot = fault->slot;
> @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                         return RET_PF_EMULATE;
>         }
>
> +       if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> +               return kvm_do_memory_fault_exit(vcpu, fault);
> +
> +       if (fault->is_private)
> +               return kvm_faultin_pfn_private(vcpu, fault);
> +
>         async = false;
>         fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
>                                           fault->write, &fault->map_writable,
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>                         return -EIO;
>         }
>
> +       if (r == RET_PF_USER)
> +               return 0;
> +
>         if (r < 0)
>                 return r;
>         if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>                  */
>                 if (sp->role.direct &&
>                     sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> -                                                              PG_LEVEL_NUM)) {
> +                                                              PG_LEVEL_NUM,
> +                                                              false)) {
>                         kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
>                         if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index dbaf6755c5a7..5ccf08183b00 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -189,6 +189,7 @@ struct kvm_page_fault {
>
>         /* Derived from mmu and global state.  */
>         const bool is_tdp;
> +       const bool is_private;
>         const bool nx_huge_page_workaround_enabled;
>
>         /*
> @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * RET_PF_RETRY: let CPU fault again on the address.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
>   * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
> @@ -253,6 +255,7 @@ enum {
>         RET_PF_RETRY,
>         RET_PF_EMULATE,
>         RET_PF_INVALID,
> +       RET_PF_USER,
>         RET_PF_FIXED,
>         RET_PF_SPURIOUS,
>  };
> @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>                               const struct kvm_memory_slot *slot, gfn_t gfn,
> -                             int max_level);
> +                             int max_level, bool is_private);
>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>
> @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> +                                       gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> +       WARN_ON_ONCE(1);
> +       return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
>  TRACE_DEFINE_ENUM(RET_PF_RETRY);
>  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
>  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
>  TRACE_DEFINE_ENUM(RET_PF_FIXED);
>  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 771210ce5181..8ba1a4afc546 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>                         continue;
>
>                 max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> -                                                             iter.gfn, PG_LEVEL_NUM);
> +                                               iter.gfn, PG_LEVEL_NUM, false);
>                 if (max_mapping_level < iter.level)
>                         continue;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
>  }
>  #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +       return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> +              KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> +                                       gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> +       int ret;
> +       struct page *page;
> +       pgoff_t index = gfn - slot->base_gfn +
> +                       (slot->restricted_offset >> PAGE_SHIFT);
> +
> +       ret = restrictedmem_get_page(slot->restricted_file, index,
> +                                    &page, order);
> +       *pfn = page_to_pfn(page);
> +       return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #endif
> --
> 2.25.1
>

With my limited understanding of x86 code:
Reviewed-by: Fuad Tabba <tabba@google.com>

The common code in kvm_host.h was used in the port to arm64, and the
x86 fault handling code was used as a guide to how it should be done
in pKVM (with similar code added there). So with these caveats in
mind:
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-12-02  6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-12-09  9:11   ` Fuad Tabba
  2023-01-05 20:38   ` Vishal Annapurve
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 398+ messages in thread
From: Fuad Tabba @ 2022-12-09  9:11 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi,

On Fri, Dec 2, 2022 at 6:20 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Register/unregister a private memslot with the fd-based memory backing
> store restrictedmem and implement the callbacks for restrictedmem_notifier:
>   - invalidate_start()/invalidate_end() to zap the existing memory
>     mappings in the KVM page table.
>   - error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
>     with KVM_EXIT_SHUTDOWN.
>
> Expose KVM_MEM_PRIVATE for memslots and KVM_MEMORY_ATTRIBUTE_PRIVATE for
> KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace; both are controlled by
> kvm_arch_has_private_mem(), which should be overridden by architecture
> code.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>

With the code to port it to pKVM/arm64:
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


> ---
>  arch/x86/include/asm/kvm_host.h |   1 +
>  arch/x86/kvm/x86.c              |  13 +++
>  include/linux/kvm_host.h        |   3 +
>  virt/kvm/kvm_main.c             | 179 +++++++++++++++++++++++++++++++-
>  4 files changed, 191 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7772ab37ac89..27ef31133352 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -114,6 +114,7 @@
>         KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_HV_TLB_FLUSH \
>         KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_MEMORY_MCE             KVM_ARCH_REQ(33)
>
>  #define CR0_RESERVED_BITS                                               \
>         (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5aefcff614d2..c67e22f3e2ee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6587,6 +6587,13 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
>  }
>  #endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +void kvm_arch_memory_mce(struct kvm *kvm)
> +{
> +       kvm_make_all_cpus_request(kvm, KVM_REQ_MEMORY_MCE);
> +}
> +#endif
> +
>  static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
>  {
>         struct kvm_clock_data data = { 0 };
> @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
>                 if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
>                         static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +
> +               if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> +                       vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> +                       r = 0;
> +                       goto out;
> +               }
>         }
>
>         if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 153842bb33df..f032d878e034 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -590,6 +590,7 @@ struct kvm_memory_slot {
>         struct file *restricted_file;
>         loff_t restricted_offset;
>         struct restrictedmem_notifier notifier;
> +       struct kvm *kvm;
>  };
>
>  static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> @@ -2363,6 +2364,8 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
>         *pfn = page_to_pfn(page);
>         return ret;
>  }
> +
> +void kvm_arch_memory_mce(struct kvm *kvm);
>  #endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
>
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e107afea32f0..ac835fc77273 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -936,6 +936,121 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> +                                        pgoff_t start, pgoff_t end,
> +                                        gfn_t *gfn_start, gfn_t *gfn_end)
> +{
> +       unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +
> +       if (start > base_pgoff)
> +               *gfn_start = slot->base_gfn + start - base_pgoff;
> +       else
> +               *gfn_start = slot->base_gfn;
> +
> +       if (end < base_pgoff + slot->npages)
> +               *gfn_end = slot->base_gfn + end - base_pgoff;
> +       else
> +               *gfn_end = slot->base_gfn + slot->npages;
> +
> +       if (*gfn_start >= *gfn_end)
> +               return false;
> +
> +       return true;
> +}
> +
> +static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *notifier,
> +                                              pgoff_t start, pgoff_t end)
> +{
> +       struct kvm_memory_slot *slot = container_of(notifier,
> +                                                   struct kvm_memory_slot,
> +                                                   notifier);
> +       struct kvm *kvm = slot->kvm;
> +       gfn_t gfn_start, gfn_end;
> +       struct kvm_gfn_range gfn_range;
> +       int idx;
> +
> +       if (!restrictedmem_range_is_valid(slot, start, end,
> +                                         &gfn_start, &gfn_end))
> +               return;
> +
> +       gfn_range.start = gfn_start;
> +       gfn_range.end = gfn_end;
> +       gfn_range.slot = slot;
> +       gfn_range.pte = __pte(0);
> +       gfn_range.may_block = true;
> +
> +       idx = srcu_read_lock(&kvm->srcu);
> +       KVM_MMU_LOCK(kvm);
> +
> +       kvm_mmu_invalidate_begin(kvm);
> +       kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> +       if (kvm_unmap_gfn_range(kvm, &gfn_range))
> +               kvm_flush_remote_tlbs(kvm);
> +
> +       KVM_MMU_UNLOCK(kvm);
> +       srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier *notifier,
> +                                            pgoff_t start, pgoff_t end)
> +{
> +       struct kvm_memory_slot *slot = container_of(notifier,
> +                                                   struct kvm_memory_slot,
> +                                                   notifier);
> +       struct kvm *kvm = slot->kvm;
> +       gfn_t gfn_start, gfn_end;
> +
> +       if (!restrictedmem_range_is_valid(slot, start, end,
> +                                         &gfn_start, &gfn_end))
> +               return;
> +
> +       KVM_MMU_LOCK(kvm);
> +       kvm_mmu_invalidate_end(kvm);
> +       KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_restrictedmem_error(struct restrictedmem_notifier *notifier,
> +                                   pgoff_t start, pgoff_t end)
> +{
> +       struct kvm_memory_slot *slot = container_of(notifier,
> +                                                   struct kvm_memory_slot,
> +                                                   notifier);
> +       kvm_arch_memory_mce(slot->kvm);
> +}
> +
> +static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
> +       .invalidate_start = kvm_restrictedmem_invalidate_begin,
> +       .invalidate_end = kvm_restrictedmem_invalidate_end,
> +       .error = kvm_restrictedmem_error,
> +};
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
> +{
> +       slot->notifier.ops = &kvm_restrictedmem_notifier_ops;
> +       restrictedmem_register_notifier(slot->restricted_file, &slot->notifier);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
> +{
> +       restrictedmem_unregister_notifier(slot->restricted_file,
> +                                         &slot->notifier);
> +}
> +
> +#else /* !CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
> +{
> +       WARN_ON_ONCE(1);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
> +{
> +       WARN_ON_ONCE(1);
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>                                 unsigned long state,
> @@ -980,6 +1095,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>  /* This does not remove the slot from struct kvm_memslots data structures */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
> +       if (slot->flags & KVM_MEM_PRIVATE) {
> +               kvm_restrictedmem_unregister(slot);
> +               fput(slot->restricted_file);
> +       }
> +
>         kvm_destroy_dirty_bitmap(slot);
>
>         kvm_arch_free_memslot(kvm, slot);
> @@ -1551,10 +1671,14 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> +                                    const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> +       if (kvm_arch_has_private_mem(kvm))
> +               valid_flags |= KVM_MEM_PRIVATE;
> +
>  #ifdef __KVM_HAVE_READONLY_MEM
>         valid_flags |= KVM_MEM_READONLY;
>  #endif
> @@ -1630,6 +1754,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>  {
>         int r;
>
> +       if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +               kvm_restrictedmem_register(new);
> +
>         /*
>          * If dirty logging is disabled, nullify the bitmap; the old bitmap
>          * will be freed on "commit".  If logging is enabled in both old and
> @@ -1658,6 +1785,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>         if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
>                 kvm_destroy_dirty_bitmap(new);
>
> +       if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +               kvm_restrictedmem_unregister(new);
> +
>         return r;
>  }
>
> @@ -1963,7 +2093,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>         int as_id, id;
>         int r;
>
> -       r = check_memory_region_flags(mem);
> +       r = check_memory_region_flags(kvm, mem);
>         if (r)
>                 return r;
>
> @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>              !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>                         mem->memory_size))
>                 return -EINVAL;
> +       if (mem->flags & KVM_MEM_PRIVATE &&
> +               (mem->restricted_offset & (PAGE_SIZE - 1) ||
> +                mem->restricted_offset > U64_MAX - mem->memory_size))
> +               return -EINVAL;
>         if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
>                 return -EINVAL;
>         if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>                 if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>                         return -EINVAL;
>         } else { /* Modify an existing slot. */
> +               /* Private memslots are immutable, they can only be deleted. */
> +               if (mem->flags & KVM_MEM_PRIVATE)
> +                       return -EINVAL;
>                 if ((mem->userspace_addr != old->userspace_addr) ||
>                     (npages != old->npages) ||
>                     ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
>         new->npages = npages;
>         new->flags = mem->flags;
>         new->userspace_addr = mem->userspace_addr;
> +       if (mem->flags & KVM_MEM_PRIVATE) {
> +               new->restricted_file = fget(mem->restricted_fd);
> +               if (!new->restricted_file ||
> +                   !file_is_restrictedmem(new->restricted_file)) {
> +                       r = -EINVAL;
> +                       goto out;
> +               }
> +               new->restricted_offset = mem->restricted_offset;
> +       }
> +
> +       new->kvm = kvm;
>
>         r = kvm_set_memslot(kvm, old, new, change);
>         if (r)
> -               kfree(new);
> +               goto out;
> +
> +       return 0;
> +
> +out:
> +       if (new->restricted_file)
> +               fput(new->restricted_file);
> +       kfree(new);
>         return r;
>  }
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> @@ -2351,6 +2506,8 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  #ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
>  static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  {
> +       if (kvm_arch_has_private_mem(kvm))
> +               return KVM_MEMORY_ATTRIBUTE_PRIVATE;
>         return 0;
>  }
>
> @@ -4822,16 +4979,28 @@ static long kvm_vm_ioctl(struct file *filp,
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
>                 struct kvm_user_mem_region mem;
> -               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +               unsigned int flags_offset = offsetof(typeof(mem), flags);
> +               unsigned long size;
> +               u32 flags;
>
>                 kvm_sanity_check_user_mem_region_alias();
>
> +               memset(&mem, 0, sizeof(mem));
> +
>                 r = -EFAULT;
> +               if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> +                       goto out;
> +
> +               if (flags & KVM_MEM_PRIVATE)
> +                       size = sizeof(struct kvm_userspace_memory_region_ext);
> +               else
> +                       size = sizeof(struct kvm_userspace_memory_region);
> +
>                 if (copy_from_user(&mem, argp, size))
>                         goto out;
>
>                 r = -EINVAL;
> -               if (mem.flags & KVM_MEM_PRIVATE)
> +               if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
>                         goto out;
>
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-09  8:57       ` Fuad Tabba
@ 2022-12-12  7:22         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-12  7:22 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Fri, Dec 09, 2022 at 08:57:31AM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Thu, Dec 8, 2022 at 11:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > Unmap the existing guest mappings when the memory attribute is changed
> > > > between shared and private. This is needed because shared pages and
> > > > private pages come from different backends; unmapping the existing ones
> > > > gives the page fault handler a chance to re-populate the mappings
> > > > according to the new attribute.
> > > >
> > > > Only architectures with private memory support need this, and such an
> > > > architecture is expected to override the weak
> > > > kvm_arch_has_private_mem().
> > >
> > > This kind of ties into the discussion of being able to share memory in
> > > place. For pKVM for example, shared and private memory would have the
> > > same backend, and the unmapping wouldn't be needed.
> > >
> > > So I guess that, instead of kvm_arch_has_private_mem(), can the check
> > > be done differently, e.g., with a different function, say
> > > kvm_arch_private_notify_attribute_change() (but maybe with a more
> > > friendly name than what I suggested :) )?
> >
> > Besides controlling the unmapping here, kvm_arch_has_private_mem() is
> > also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
> > unmapping is confirmed unnecessary for pKVM, but how about
> > KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
> > If the answer is the latter, then yes we should use a different check
> > which only works for confidential usages here.
> 
> I think it makes sense for pKVM to use the same flag (KVM_MEM_PRIVATE)
> and not to add another one.

Thanks for the reply.
Chao
> 
> Thank you,
> /fuad
> 
> 
> 
> >
> > Thanks,
> > Chao
> > >
> > > Thanks,
> > > /fuad
> > >
> > > >
> > > > Also, during the memory attribute change and the unmapping time frame,
> > > > page faults may happen in the same memory range and could leave the
> > > > page state incorrect, so invoke the kvm_mmu_invalidate_* helpers to
> > > > make the page fault handler retry during this time frame.
> > > >
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > ---
> > > >  include/linux/kvm_host.h |   7 +-
> > > >  virt/kvm/kvm_main.c      | 168 ++++++++++++++++++++++++++-------------
> > > >  2 files changed, 116 insertions(+), 59 deletions(-)
> > > >
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 3d69484d2704..3331c0c92838 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > >  #endif
> > > >
> > > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > >  struct kvm_gfn_range {
> > > >         struct kvm_memory_slot *slot;
> > > >         gfn_t start;
> > > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > >         bool may_block;
> > > >  };
> > > >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > +
> > > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > @@ -785,11 +786,12 @@ struct kvm {
> > > >
> > > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > >         struct mmu_notifier mmu_notifier;
> > > > +#endif
> > > >         unsigned long mmu_invalidate_seq;
> > > >         long mmu_invalidate_in_progress;
> > > >         gfn_t mmu_invalidate_range_start;
> > > >         gfn_t mmu_invalidate_range_end;
> > > > -#endif
> > > > +
> > > >         struct list_head devices;
> > > >         u64 manual_dirty_log_protect;
> > > >         struct dentry *debugfs_dentry;
> > > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > >  int kvm_arch_post_init_vm(struct kvm *kvm);
> > > >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > > >
> > > >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > >  /*
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > > >
> > > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > > +{
> > > > +       /*
> > > > +        * The count increase must become visible at unlock time as no
> > > > +        * spte can be established without taking the mmu_lock and
> > > > +        * count is also read inside the mmu_lock critical section.
> > > > +        */
> > > > +       kvm->mmu_invalidate_in_progress++;
> > > > +
> > > > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > +               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > > +               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > > +       }
> > > > +}
> > > > +
> > > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > +{
> > > > +       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > > +
> > > > +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > +               kvm->mmu_invalidate_range_start = start;
> > > > +               kvm->mmu_invalidate_range_end = end;
> > > > +       } else {
> > > > +               /*
> > > > +                * Fully tracking multiple concurrent ranges has diminishing
> > > > +                * returns. Keep things simple and just find the minimal range
> > > > +                * which includes the current and new ranges. As there won't be
> > > > +                * enough information to subtract a range after its invalidate
> > > > +                * completes, any ranges invalidated concurrently will
> > > > +                * accumulate and persist until all outstanding invalidates
> > > > +                * complete.
> > > > +                */
> > > > +               kvm->mmu_invalidate_range_start =
> > > > +                       min(kvm->mmu_invalidate_range_start, start);
> > > > +               kvm->mmu_invalidate_range_end =
> > > > +                       max(kvm->mmu_invalidate_range_end, end);
> > > > +       }
> > > > +}
> > > > +
> > > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > > +{
> > > > +       /*
> > > > +        * This sequence increase will notify the kvm page fault that
> > > > +        * the page that is going to be mapped in the spte could have
> > > > +        * been freed.
> > > > +        */
> > > > +       kvm->mmu_invalidate_seq++;
> > > > +       smp_wmb();
> > > > +       /*
> > > > +        * The above sequence increase must be visible before the
> > > > +        * below count decrease, which is ensured by the smp_wmb above
> > > > +        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > > +        */
> > > > +       kvm->mmu_invalidate_in_progress--;
> > > > +}
> > > > +
> > > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > >  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > >  {
> > > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > >         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > >  }
> > > >
> > > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > > -{
> > > > -       /*
> > > > -        * The count increase must become visible at unlock time as no
> > > > -        * spte can be established without taking the mmu_lock and
> > > > -        * count is also read inside the mmu_lock critical section.
> > > > -        */
> > > > -       kvm->mmu_invalidate_in_progress++;
> > > > -
> > > > -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > -               kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > > -               kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > > -       }
> > > > -}
> > > > -
> > > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > -{
> > > > -       WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > > -
> > > > -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > -               kvm->mmu_invalidate_range_start = start;
> > > > -               kvm->mmu_invalidate_range_end = end;
> > > > -       } else {
> > > > -               /*
> > > > -                * Fully tracking multiple concurrent ranges has diminishing
> > > > -                * returns. Keep things simple and just find the minimal range
> > > > -                * which includes the current and new ranges. As there won't be
> > > > -                * enough information to subtract a range after its invalidate
> > > > -                * completes, any ranges invalidated concurrently will
> > > > -                * accumulate and persist until all outstanding invalidates
> > > > -                * complete.
> > > > -                */
> > > > -               kvm->mmu_invalidate_range_start =
> > > > -                       min(kvm->mmu_invalidate_range_start, start);
> > > > -               kvm->mmu_invalidate_range_end =
> > > > -                       max(kvm->mmu_invalidate_range_end, end);
> > > > -       }
> > > > -}
> > > > -
> > > >  static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > >  {
> > > >         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > >         return 0;
> > > >  }
> > > >
> > > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > > -{
> > > > -       /*
> > > > -        * This sequence increase will notify the kvm page fault that
> > > > -        * the page that is going to be mapped in the spte could have
> > > > -        * been freed.
> > > > -        */
> > > > -       kvm->mmu_invalidate_seq++;
> > > > -       smp_wmb();
> > > > -       /*
> > > > -        * The above sequence increase must be visible before the
> > > > -        * below count decrease, which is ensured by the smp_wmb above
> > > > -        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > > -        */
> > > > -       kvm->mmu_invalidate_in_progress--;
> > > > -}
> > > > -
> > > >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > >                                         const struct mmu_notifier_range *range)
> > > >  {
> > > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > > >         return 0;
> > > >  }
> > > >
> > > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > > +{
> > > > +       return false;
> > > > +}
> > > > +
> > > >  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > >  {
> > > >         struct kvm *kvm = kvm_arch_alloc_vm();
> > > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > > >         return 0;
> > > >  }
> > > >
> > > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > +{
> > > > +       struct kvm_gfn_range gfn_range;
> > > > +       struct kvm_memory_slot *slot;
> > > > +       struct kvm_memslots *slots;
> > > > +       struct kvm_memslot_iter iter;
> > > > +       int i;
> > > > +       int r = 0;
> > > > +
> > > > +       gfn_range.pte = __pte(0);
> > > > +       gfn_range.may_block = true;
> > > > +
> > > > +       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > > +               slots = __kvm_memslots(kvm, i);
> > > > +
> > > > +               kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > > +                       slot = iter.slot;
> > > > +                       gfn_range.start = max(start, slot->base_gfn);
> > > > +                       gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > > +                       if (gfn_range.start >= gfn_range.end)
> > > > +                               continue;
> > > > +                       gfn_range.slot = slot;
> > > > +
> > > > +                       r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > > +               }
> > > > +       }
> > > > +
> > > > +       if (r)
> > > > +               kvm_flush_remote_tlbs(kvm);
> > > > +}
> > > > +
> > > >  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > >                                            struct kvm_memory_attributes *attrs)
> > > >  {
> > > >         gfn_t start, end;
> > > >         unsigned long i;
> > > >         void *entry;
> > > > +       int idx;
> > > >         u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > >
> > > > -       /* flags is currently not used. */
> > > > +       /* 'flags' is currently not used. */
> > > >         if (attrs->flags)
> > > >                 return -EINVAL;
> > > >         if (attrs->attributes & ~supported_attrs)
> > > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > >
> > > >         entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > >
> > > > +       if (kvm_arch_has_private_mem(kvm)) {
> > > > +               KVM_MMU_LOCK(kvm);
> > > > +               kvm_mmu_invalidate_begin(kvm);
> > > > +               kvm_mmu_invalidate_range_add(kvm, start, end);
> > > > +               KVM_MMU_UNLOCK(kvm);
> > > > +       }
> > > > +
> > > >         mutex_lock(&kvm->lock);
> > > >         for (i = start; i < end; i++)
> > > >                 if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > >                         break;
> > > >         mutex_unlock(&kvm->lock);
> > > >
> > > > +       if (kvm_arch_has_private_mem(kvm)) {
> > > > +               idx = srcu_read_lock(&kvm->srcu);
> > > > +               KVM_MMU_LOCK(kvm);
> > > > +               if (i > start)
> > > > +                       kvm_unmap_mem_range(kvm, start, i);
> > > > +               kvm_mmu_invalidate_end(kvm);
> > > > +               KVM_MMU_UNLOCK(kvm);
> > > > +               srcu_read_unlock(&kvm->srcu, idx);
> > > > +       }
> > > > +
> > > >         attrs->address = i << PAGE_SHIFT;
> > > >         attrs->size = (end - i) << PAGE_SHIFT;
> > > >
> > > > --
> > > > 2.25.1
> > > >

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
  2022-12-09  9:01   ` Fuad Tabba
@ 2022-12-12  7:23     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-12  7:23 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

On Fri, Dec 09, 2022 at 09:01:04AM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the on-going fault is private or not. This patch adds a
> > 'is_private' field to kvm_page_fault to indicate this and architecture
> > code is expected to set it.
> >
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> >   - For a successful match, private pfn is obtained with
> >     restrictedmem_get_page() and shared pfn is obtained with existing
> >     get_user_pages().
> >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> >     userspace. Userspace then can convert memory between private/shared
> >     in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 63 +++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
> >  include/linux/kvm_host.h        | 30 ++++++++++++++++
> >  5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> >
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >                               const struct kvm_memory_slot *slot, gfn_t gfn,
> > -                             int max_level)
> > +                             int max_level, bool is_private)
> >  {
> >         struct kvm_lpage_info *linfo;
> >         int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >                         break;
> >         }
> >
> > +       if (is_private)
> > +               return max_level;
> > +
> >         if (max_level == PG_LEVEL_4K)
> >                 return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >          * level, which will be used to do precise, accurate accounting.
> >          */
> >         fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > -                                                    fault->gfn, fault->max_level);
> > +                                                    fault->gfn, fault->max_level,
> > +                                                    fault->is_private);
> >         if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> >                 return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >         kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > +       BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +       if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +               return PG_LEVEL_1G;
> > +
> > +       if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +               return PG_LEVEL_2M;
> > +
> > +       return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +                                   struct kvm_page_fault *fault)
> > +{
> > +       vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +       if (fault->is_private)
> > +               vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > +       else
> > +               vcpu->run->memory.flags = 0;
> > +       vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> 
> nit: As in previous patches, use helpers (for this and other similar
> shifts in this patch)?

Agreed.

> 
> > +       vcpu->run->memory.size = PAGE_SIZE;
> > +       return RET_PF_USER;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +                                  struct kvm_page_fault *fault)
> > +{
> > +       int order;
> > +       struct kvm_memory_slot *slot = fault->slot;
> > +
> > +       if (!kvm_slot_can_be_private(slot))
> > +               return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > +       if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > +               return RET_PF_RETRY;
> > +
> > +       fault->max_level = min(order_to_level(order), fault->max_level);
> > +       fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > +       return RET_PF_CONTINUE;
> > +}
> > +
> >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >         struct kvm_memory_slot *slot = fault->slot;
> > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                         return RET_PF_EMULATE;
> >         }
> >
> > +       if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > +               return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > +       if (fault->is_private)
> > +               return kvm_faultin_pfn_private(vcpu, fault);
> > +
> >         async = false;
> >         fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> >                                           fault->write, &fault->map_writable,
> > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> >                         return -EIO;
> >         }
> >
> > +       if (r == RET_PF_USER)
> > +               return 0;
> > +
> >         if (r < 0)
> >                 return r;
> >         if (r != RET_PF_EMULATE)
> > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >                  */
> >                 if (sp->role.direct &&
> >                     sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > -                                                              PG_LEVEL_NUM)) {
> > +                                                              PG_LEVEL_NUM,
> > +                                                              false)) {
> >                         kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> >
> >                         if (kvm_available_flush_tlb_with_range())
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index dbaf6755c5a7..5ccf08183b00 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> >
> >         /* Derived from mmu and global state.  */
> >         const bool is_tdp;
> > +       const bool is_private;
> >         const bool nx_huge_page_workaround_enabled;
> >
> >         /*
> > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >   * RET_PF_RETRY: let CPU fault again on the address.
> >   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> >   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > + * RET_PF_USER: need to exit to userspace to handle this fault.
> >   * RET_PF_FIXED: The faulting entry has been fixed.
> >   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> >   *
> > @@ -253,6 +255,7 @@ enum {
> >         RET_PF_RETRY,
> >         RET_PF_EMULATE,
> >         RET_PF_INVALID,
> > +       RET_PF_USER,
> >         RET_PF_FIXED,
> >         RET_PF_SPURIOUS,
> >  };
> > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >                               const struct kvm_memory_slot *slot, gfn_t gfn,
> > -                             int max_level);
> > +                             int max_level, bool is_private);
> >  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> >
> > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > +                                       gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > +       WARN_ON_ONCE(1);
> > +       return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index ae86820cef69..2d7555381955 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> >  TRACE_DEFINE_ENUM(RET_PF_RETRY);
> >  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> >  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > +TRACE_DEFINE_ENUM(RET_PF_USER);
> >  TRACE_DEFINE_ENUM(RET_PF_FIXED);
> >  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 771210ce5181..8ba1a4afc546 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >                         continue;
> >
> >                 max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > -                                                             iter.gfn, PG_LEVEL_NUM);
> > +                                               iter.gfn, PG_LEVEL_NUM, false);
> >                 if (max_mapping_level < iter.level)
> >                         continue;
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 25099c94e770..153842bb33df 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> >  }
> >  #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +       return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > +              KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +       return false;
> > +}
> > +
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > +                                       gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > +       int ret;
> > +       struct page *page;
> > +       pgoff_t index = gfn - slot->base_gfn +
> > +                       (slot->restricted_offset >> PAGE_SHIFT);
> > +
> > +       ret = restrictedmem_get_page(slot->restricted_file, index,
> > +                                    &page, order);
> > +       *pfn = page_to_pfn(page);
> > +       return ret;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> >  #endif
> > --
> > 2.25.1
> >
> 
> With my limited understanding of x86 code:
> Reviewed-by: Fuad Tabba <tabba@google.com>
> 
> The common code in kvm_host.h was used in the port to arm64, and the
> x86 fault handling code was used as a guide to how it should be done
> in pKVM (with similar code added there). So with these caveats in
> mind:
> Tested-by: Fuad Tabba <tabba@google.com>
> 
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-08 11:30     ` Chao Peng
@ 2022-12-13 12:04       ` Xiaoyao Li
  2022-12-19  7:50         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Xiaoyao Li @ 2022-12-13 12:04 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On 12/8/2022 7:30 PM, Chao Peng wrote:
> On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
>> On 12/2/2022 2:13 PM, Chao Peng wrote:
>>
>> ..
>>
>>> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
>>> and right now it is selected on X86_64 only.
>>>
>>
>>  From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
>> needed.
> 
> The reason is we want KVM further controls the feature enabling. An
> opt-in CONFIG_RESTRICTEDMEM can cause problem if user sets that for
> unsupported architectures.

HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to 
introduce it in the patch that actually uses it.

> Here is the original discussion:
> https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
> 
> Thanks,
> Chao


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
  2022-12-06 14:57   ` Fuad Tabba
@ 2022-12-13 23:49   ` Huang, Kai
  2022-12-19  7:53     ` Chao Peng
  2023-01-13 21:54   ` Sean Christopherson
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 398+ messages in thread
From: Huang, Kai @ 2022-12-13 23:49 UTC (permalink / raw)
  To: linux-api, linux-mm, chao.p.peng, qemu-devel, linux-kernel,
	linux-arch, linux-doc, kvm, linux-fsdevel
  Cc: tglx, jmattson, Lutomirski, Andy, pbonzini, ak, kirill.shutemov,
	david, tabba, Hocko, Michal, michael.roth, corbet, bfields,
	dhildenb, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
	arnd, mail, qperret, Christopherson,,
	Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
	hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
	Hansen, Dave, akpm, linmiaohe, Wang, Wei W

> 
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> 
I didn't dig into the full history, but I interpret this as: we don't support
page migration and swapping for restricted memfd for now.  IMHO "page marked as
unmovable" can be confused with PageMovable(), which is a different thing from
this series.  It's better to just say something like "those pages cannot be
migrated and swapped".

[...]

> +
> +	/*
> +	 * These pages are currently unmovable so don't place them into movable
> +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> +	 */
> +	mapping = memfd->f_mapping;
> +	mapping_set_unevictable(mapping);
> +	mapping_set_gfp_mask(mapping,
> +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);

But, IIUC, removing the __GFP_MOVABLE flag here only makes page allocation come
from non-movable zones; it doesn't necessarily prevent the page from being
migrated.  My first glance is that you need to implement either
a_ops->migrate_folio() or just get_page() after faulting in the page to prevent
that.

So I think the comment also needs improvement -- IMHO we can just call out
currently those pages cannot be migrated and swapped, which is clearer (and the
latter justifies mapping_set_unevictable() clearly).
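
For the first option, an illustrative sketch (not from this series, just to
show the idea of refusing migration through the aops callback):

        static int restrictedmem_migrate_folio(struct address_space *mapping,
                                               struct folio *dst,
                                               struct folio *src,
                                               enum migrate_mode mode)
        {
                /* Refuse to migrate restrictedmem folios for now. */
                return -EBUSY;
        }

and then set .migrate_folio in the file's address_space_operations.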



^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-02  6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
  2022-12-07  8:13   ` Yuan Yao
  2022-12-07 17:16   ` Fuad Tabba
@ 2022-12-13 23:51   ` Huang, Kai
  2022-12-19  7:54     ` Chao Peng
  2023-01-13 22:50   ` Sean Christopherson
  3 siblings, 1 reply; 398+ messages in thread
From: Huang, Kai @ 2022-12-13 23:51 UTC (permalink / raw)
  To: linux-api, linux-mm, chao.p.peng, qemu-devel, linux-kernel,
	linux-arch, linux-doc, kvm, linux-fsdevel
  Cc: tglx, jmattson, Lutomirski, Andy, pbonzini, ak, kirill.shutemov,
	david, tabba, Hocko, Michal, michael.roth, corbet, bfields,
	dhildenb, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
	arnd, mail, qperret, Christopherson,,
	Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
	hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
	Hansen, Dave, akpm, linmiaohe, Wang, Wei W

On Fri, 2022-12-02 at 14:13 +0800, Chao Peng wrote:
>  
> -	/* flags is currently not used. */
> +	/* 'flags' is currently not used. */
>  	if (attrs->flags)
>  		return -EINVAL;

Unintended code change.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
  2022-12-06 13:34   ` Fabiano Rosas
  2022-12-06 15:07   ` Fuad Tabba
@ 2022-12-16 15:09   ` Borislav Petkov
  2022-12-19  8:15     ` Chao Peng
  2022-12-28  8:28   ` Chenyi Qiang
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 398+ messages in thread
From: Borislav Petkov @ 2022-12-16 15:09 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_init(&kvm->mem_attr_array);
> +#endif

	if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
		...

would at least remove the ugly ifdeffery.

Or you could create wrapper functions for that xa_init() and
xa_destroy() and put the ifdeffery in there.
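
Something along these lines, perhaps (the wrapper names below are only
illustrative, they are not taken from the patch):

        #ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
        static inline void kvm_init_mem_attr_array(struct kvm *kvm)
        {
                xa_init(&kvm->mem_attr_array);
        }

        static inline void kvm_destroy_mem_attr_array(struct kvm *kvm)
        {
                xa_destroy(&kvm->mem_attr_array);
        }
        #else
        static inline void kvm_init_mem_attr_array(struct kvm *kvm) { }
        static inline void kvm_destroy_mem_attr_array(struct kvm *kvm) { }
        #endif

so that kvm_create_vm()/kvm_destroy_vm() can call them unconditionally.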

> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)

I guess that function should have a verb in the name:

kvm_get_supported_mem_attributes()

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;

Dunno, shouldn't those issue some sort of an error message so that the
caller knows where it failed? Or at least return different retvals which
signal what the problem is?

> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-13 12:04       ` Xiaoyao Li
@ 2022-12-19  7:50         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-19  7:50 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Dec 13, 2022 at 08:04:14PM +0800, Xiaoyao Li wrote:
> On 12/8/2022 7:30 PM, Chao Peng wrote:
> > On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> > > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > > 
> > > ..
> > > 
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > > 
> > > 
> > >  From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> > > needed.
> > 
> > The reason is we want KVM further controls the feature enabling. An
> > opt-in CONFIG_RESTRICTEDMEM can cause problem if user sets that for
> > unsupported architectures.
> 
> HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to introduce
> it in the patch that actually uses it.

It's being 'used' in this patch by reverse selecting RESTRICTEDMEM in
arch/x86/kvm/Kconfig; this gives people a sense of where
restrictedmem_notifier comes from. Introducing the config together with
the other private/restricted memslot stuff can also help architectures
that add support in the future better identify what they need to do. But
those are trivial points, and moving it to patch 08 also sounds good to
me.

Thanks,
Chao
> 
> > Here is the original discussion:
> > https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
> > 
> > Thanks,
> > Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-13 23:49   ` Huang, Kai
@ 2022-12-19  7:53     ` Chao Peng
  2022-12-19  8:48       ` Huang, Kai
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-19  7:53 UTC (permalink / raw)
  To: Huang, Kai
  Cc: linux-api, linux-mm, qemu-devel, linux-kernel, linux-arch,
	linux-doc, kvm, linux-fsdevel, tglx, jmattson, Lutomirski, Andy,
	pbonzini, ak, kirill.shutemov, david, tabba, Hocko, Michal,
	michael.roth, corbet, bfields, dhildenb, x86, bp, vannapurve,
	rppt, shuah, vkuznets, vbabka, arnd, mail, qperret,
	Christopherson,,
	Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
	hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
	Hansen, Dave, akpm, linmiaohe, Wang, Wei W

On Tue, Dec 13, 2022 at 11:49:13PM +0000, Huang, Kai wrote:
> > 
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> > 
> > 
> I didn't dig full histroy, but I interpret this as we don't support page
> migration and swapping for restricted memfd for now.  IMHO "page marked as
> unmovable" can be confused with PageMovable(), which is a different thing from
> this series.  It's better to just say something like "those pages cannot be
> migrated and swapped".

Yes, if that helps clarify things.

> 
> [...]
> 
> > +
> > +	/*
> > +	 * These pages are currently unmovable so don't place them into movable
> > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > +	 */
> > +	mapping = memfd->f_mapping;
> > +	mapping_set_unevictable(mapping);
> > +	mapping_set_gfp_mask(mapping,
> > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> 
> But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from non-
> movable zones, but doesn't necessarily prevent page from being migrated.  My
> first glance is you need to implement either a_ops->migrate_folio() or just
> get_page() after faulting in the page to prevent.

The current API restrictedmem_get_page() already does this: after the caller
calls it, it holds a reference to the page. The caller then decides when to
call put_page() appropriately.
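
I.e., roughly (a usage sketch only, not code from the series):

        struct page *page;
        int order;

        if (!restrictedmem_get_page(file, index, &page, &order)) {
                /*
                 * The caller now holds a reference on the page, which is
                 * dropped with put_page() once the page is no longer needed.
                 */
                put_page(page);
        }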

> 
> So I think the comment also needs improvement -- IMHO we can just call out
> currently those pages cannot be migrated and swapped, which is clearer (and the
> latter justifies mapping_set_unevictable() clearly).

Good to me.

Thanks,
Chao
> 
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-13 23:51   ` Huang, Kai
@ 2022-12-19  7:54     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-19  7:54 UTC (permalink / raw)
  To: Huang, Kai
  Cc: linux-api, linux-mm, qemu-devel, linux-kernel, linux-arch,
	linux-doc, kvm, linux-fsdevel, tglx, jmattson, Lutomirski, Andy,
	pbonzini, ak, kirill.shutemov, david, tabba, Hocko, Michal,
	michael.roth, corbet, bfields, dhildenb, x86, bp, vannapurve,
	rppt, shuah, vkuznets, vbabka, arnd, mail, qperret,
	Christopherson,,
	Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
	hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
	Hansen, Dave, akpm, linmiaohe, Wang, Wei W

On Tue, Dec 13, 2022 at 11:51:25PM +0000, Huang, Kai wrote:
> On Fri, 2022-12-02 at 14:13 +0800, Chao Peng wrote:
> >  
> > -	/* flags is currently not used. */
> > +	/* 'flags' is currently not used. */
> >  	if (attrs->flags)
> >  		return -EINVAL;
> 
> Unintended code change.

Yeah!

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-16 15:09   ` Borislav Petkov
@ 2022-12-19  8:15     ` Chao Peng
  2022-12-19 10:17       ` Borislav Petkov
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-19  8:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 16, 2022 at 04:09:06PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..7f0f5e9f2406 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >  	spin_lock_init(&kvm->mn_invalidate_lock);
> >  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >  	xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +	xa_init(&kvm->mem_attr_array);
> > +#endif
> 
> 	if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
> 		...
> 
> would at least remove the ugly ifdeffery.
> 
> Or you could create wrapper functions for that xa_init() and
> xa_destroy() and put the ifdeffery in there.

Agreed.

> 
> > @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> >  }
> >  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
> >  
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> 
> I guess that function should have a verb in the name:
> 
> kvm_get_supported_mem_attributes()

Right!
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> 
> Dunno, shouldn't those issue some sort of an error message so that the
> caller knows where it failed? Or at least return different retvals which
> signal what the problem is?

Tightening up the error numbers a bit:

        if (attrs->flags)
                return -ENXIO;
        if (attrs->attributes & ~supported_attrs)
                return -EOPNOTSUPP;
        if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
            attrs->size == 0)
                return -EINVAL;
        if (attrs->address + attrs->size < attrs->address)
                return -E2BIG;

Chao
> 
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-19  7:53     ` Chao Peng
@ 2022-12-19  8:48       ` Huang, Kai
  2022-12-20  7:22         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Huang, Kai @ 2022-12-19  8:48 UTC (permalink / raw)
  To: chao.p.peng
  Cc: tglx, linux-arch, kvm, Wang, Wei W, jmattson, Lutomirski, Andy,
	ak, kirill.shutemov, david, qemu-devel, tabba, Hocko, Michal,
	michael.roth, corbet, linux-fsdevel, dhildenb, bfields,
	linux-kernel, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
	mail, linux-api, qperret, arnd, pbonzini, ddutile,
	naoya.horiguchi, Christopherson,,
	Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, steven.price, Hansen,
	Dave, linux-doc, akpm, linmiaohe

On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > 
> > [...]
> > 
> > > +
> > > +	/*
> > > +	 * These pages are currently unmovable so don't place them into
> > > movable
> > > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > +	 */
> > > +	mapping = memfd->f_mapping;
> > > +	mapping_set_unevictable(mapping);
> > > +	mapping_set_gfp_mask(mapping,
> > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > 
> > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > non-
> > movable zones, but doesn't necessarily prevent page from being migrated.  My
> > first glance is you need to implement either a_ops->migrate_folio() or just
> > get_page() after faulting in the page to prevent.
> 
> The current api restrictedmem_get_page() already does this, after the
> caller calling it, it holds a reference to the page. The caller then
> decides when to call put_page() appropriately.

I tried to dig some history. Perhaps I am missing something, but it seems Kirill
said in v9 that this code doesn't prevent page migration, and we need to
increase page refcount in restrictedmem_get_page():

https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/

But looking at this series it seems restrictedmem_get_page() in this v10 is
identical to the one in v9 (except v10 uses 'folio' instead of 'page')?

Anyway, if this is not fixed, then it should be fixed.  Otherwise, a comment at
the place where the page refcount is increased would help people understand
that page migration is actually prevented.


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-19  8:15     ` Chao Peng
@ 2022-12-19 10:17       ` Borislav Petkov
  2022-12-20  7:24         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Borislav Petkov @ 2022-12-19 10:17 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> Tamping down with error number a bit:
> 
>         if (attrs->flags)
>                 return -ENXIO;
>         if (attrs->attributes & ~supported_attrs)
>                 return -EOPNOTSUPP;
>         if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
>             attrs->size == 0)
>                 return -EINVAL;
>         if (attrs->address + attrs->size < attrs->address)
>                 return -E2BIG;

Yap, better.

I guess you should add those to the documentation of the ioctl too
so that people can find out why it fails. Or, well, they can look
at the code directly too but still... imagine some blurb about
user-friendliness here...

:-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-02  6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-12-05  9:03   ` Fuad Tabba
  2022-12-08  8:37   ` Xiaoyao Li
@ 2022-12-19 14:36   ` Borislav Petkov
  2022-12-20  7:43     ` Chao Peng
  2023-01-05 11:23   ` Jarkko Sakkinen
  3 siblings, 1 reply; 398+ messages in thread
From: Borislav Petkov @ 2022-12-19 14:36 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow

valueless?

I can't parse that.

> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.

bookmarked?

> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
> 
> The extended memslot can still have the userspace_addr(hva). When use, a

"When un use, ..."

...

> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
>  	select INTERVAL_TREE
>  	select HAVE_KVM_PM_NOTIFIER if PM
>  	select HAVE_KVM_MEMORY_ATTRIBUTES
> +	select HAVE_KVM_RESTRICTED_MEM if X86_64
> +	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM

Those deps here look weird.

RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
it.

Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
X86_64 - you need that functionality when the respective guest support
is enabled in KVM.

Then, looking forward into your patchset, I'm not sure you even
need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
less Kconfig items because we have waay too many.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-19  8:48       ` Huang, Kai
@ 2022-12-20  7:22         ` Chao Peng
  2022-12-20  8:33           ` Huang, Kai
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-20  7:22 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, linux-arch, kvm, Wang, Wei W, jmattson, Lutomirski, Andy,
	ak, kirill.shutemov, david, qemu-devel, tabba, Hocko, Michal,
	michael.roth, corbet, linux-fsdevel, dhildenb, bfields,
	linux-kernel, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
	mail, linux-api, qperret, arnd, pbonzini, ddutile,
	naoya.horiguchi, Christopherson,,
	Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, steven.price, Hansen,
	Dave, linux-doc, akpm, linmiaohe

On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > 
> > > [...]
> > > 
> > > > +
> > > > +	/*
> > > > +	 * These pages are currently unmovable so don't place them into
> > > > movable
> > > > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > +	 */
> > > > +	mapping = memfd->f_mapping;
> > > > +	mapping_set_unevictable(mapping);
> > > > +	mapping_set_gfp_mask(mapping,
> > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > 
> > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > non-
> > > movable zones, but doesn't necessarily prevent page from being migrated.  My
> > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > get_page() after faulting in the page to prevent.
> > 
> > The current api restrictedmem_get_page() already does this, after the
> > caller calling it, it holds a reference to the page. The caller then
> > decides when to call put_page() appropriately.
> 
> I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> said in v9 that this code doesn't prevent page migration, and we need to
> increase page refcount in restrictedmem_get_page():
> 
> https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> 
> But looking at this series it seems restrictedmem_get_page() in this v10 is
> identical to the one in v9 (except v10 uses 'folio' instead of 'page')?

restrictedmem_get_page() has been increasing the page refcount since several
versions ago, so no change is needed in v10. You probably missed my reply:

https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/

The current solution is clear: unless we have a better approach, we will
let the restrictedmem user (KVM in this case) hold the refcount to
prevent page migration.

Thanks,
Chao
> 
> Anyway if this is not fixed, then it should be fixed.  Otherwise, a comment at
> the place where page refcount is increased will be helpful to help people
> understand page migration is actually prevented.
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-19 10:17       ` Borislav Petkov
@ 2022-12-20  7:24         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-20  7:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Dec 19, 2022 at 11:17:22AM +0100, Borislav Petkov wrote:
> On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> > Tamping down with error number a bit:
> > 
> >         if (attrs->flags)
> >                 return -ENXIO;
> >         if (attrs->attributes & ~supported_attrs)
> >                 return -EOPNOTSUPP;
> >         if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
> >             attrs->size == 0)
> >                 return -EINVAL;
> >         if (attrs->address + attrs->size < attrs->address)
> >                 return -E2BIG;
> 
> Yap, better.
> 
> I guess you should add those to the documentation of the ioctl too
> so that people can find out why it fails. Or, well, they can look
> at the code directly too but still... imagine some blurb about
> user-friendliness here...

Thanks for the reminder. Yes, the KVM API doc is the right place to put this
documentation.
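
Maybe something like the below for Documentation/virt/kvm/api.rst (the exact
wording is only a rough sketch):

        Errors:

          ==========  ===================================================
          ENXIO       'flags' is not zero
          EOPNOTSUPP  'attributes' contains unsupported bits
          EINVAL      'address'/'size' is not page aligned or 'size' is 0
          E2BIG       'address' + 'size' overflows
          ==========  ===================================================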

Thanks,
Chao
> 
> :-)
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-19 14:36   ` Borislav Petkov
@ 2022-12-20  7:43     ` Chao Peng
  2022-12-20  9:55       ` Borislav Petkov
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2022-12-20  7:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Dec 19, 2022 at 03:36:28PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> 
> valueless?
> 
> I can't parse that.

It's unnecessary and ...

> 
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> 
> bookmarked?

userspace is restricted from accessing the memory content in the fd.

> 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr(hva). When use, a
> 
> "When un use, ..."

When both userspace_addr and restricted_fd/offset were used, ...

> 
> ...
> 
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> >  	select INTERVAL_TREE
> >  	select HAVE_KVM_PM_NOTIFIER if PM
> >  	select HAVE_KVM_MEMORY_ATTRIBUTES
> > +	select HAVE_KVM_RESTRICTED_MEM if X86_64
> > +	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> 
> Those deps here look weird.
> 
> RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
> it.

RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

> 
> Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
> X86_64 - you need that functionality when the respective guest support
> is enabled in KVM.

Letting the actual feature (e.g. TDX or pKVM) select it or add the dependency
sounds like a viable and clearer solution. Sean, let me know your opinion.
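
E.g., roughly (a purely hypothetical sketch, the host-side TDX/pKVM Kconfig
symbols below don't exist in this series):

        config INTEL_TDX_HOST_KVM
                bool "Hypothetical TDX host-side support in KVM"
                depends on KVM_INTEL && X86_64
                select RESTRICTEDMEM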

> 
> Then, looking forward into your patchset, I'm not sure you even
> need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
> CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
> less Kconfig items because we have waay too many.

The only reason to add another HAVE_KVM_RESTRICTED_MEM is that some code only
works for 64-bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
that.

[*] https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/

Thanks,
Chao
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-20  7:22         ` Chao Peng
@ 2022-12-20  8:33           ` Huang, Kai
  2022-12-21 13:39             ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Huang, Kai @ 2022-12-20  8:33 UTC (permalink / raw)
  To: chao.p.peng
  Cc: tglx, linux-arch, kvm, jmattson, Lutomirski, Andy, ak,
	kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
	michael.roth, corbet, bfields, dhildenb, linux-kernel,
	linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
	mail, ddutile, qperret, arnd, pbonzini, vannapurve,
	naoya.horiguchi, Christopherson,,
	Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > +
> > > > > +	/*
> > > > > +	 * These pages are currently unmovable so don't place them into
> > > > > movable
> > > > > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > +	 */
> > > > > +	mapping = memfd->f_mapping;
> > > > > +	mapping_set_unevictable(mapping);
> > > > > +	mapping_set_gfp_mask(mapping,
> > > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > 
> > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > non-
> > > > movable zones, but doesn't necessarily prevent page from being migrated.  My
> > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > get_page() after faulting in the page to prevent.
> > > 
> > > The current api restrictedmem_get_page() already does this, after the
> > > caller calling it, it holds a reference to the page. The caller then
> > > decides when to call put_page() appropriately.
> > 
> > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > said in v9 that this code doesn't prevent page migration, and we need to
> > increase page refcount in restrictedmem_get_page():
> > 
> > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > 
> > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> 
> restrictedmem_get_page() increases page refcount several versions ago so
> no change in v10 is needed. You probably missed my reply:
> 
> https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/

But for the non-restricted-mem case, it is correct for KVM to decrease the
page's refcount after setting up the mapping in the secondary MMU; otherwise
the page will be pinned by KVM for a normal VM (since KVM uses GUP to get the
page).

So what we are expecting is: if the page comes from restricted mem, then KVM
cannot decrease the refcount, whereas for a normal page obtained via GUP, KVM
should.
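
I.e., conceptually something like (a sketch of the expectation, not actual
code):

        if (!fault->is_private)
                /* normal GUP'ed page: drop the reference as KVM does today */
                kvm_release_pfn_clean(fault->pfn);
        /* restrictedmem-backed page: keep the reference to block migration */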

> 
> The current solution is clear: unless we have better approach, we will
> let restrictedmem user (KVM in this case) to hold the refcount to
> prevent page migration.
> 

OK.  Will leave to others :)


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-20  7:43     ` Chao Peng
@ 2022-12-20  9:55       ` Borislav Petkov
  2022-12-21 13:42         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Borislav Petkov @ 2022-12-20  9:55 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

Which basically means that RESTRICTEDMEM should simply depend on KVM.
Because you can't know upfront whether KVM will run a TDX guest or a SNP
guest and so on.

Which then means that RESTRICTEDMEM will practically end up always
enabled in KVM HV configs.

> The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> that.

This is what I mean by "we have too many Kconfig items". :-\

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-20  8:33           ` Huang, Kai
@ 2022-12-21 13:39             ` Chao Peng
  2022-12-22  0:37               ` Huang, Kai
  2022-12-22 18:15               ` Sean Christopherson
  0 siblings, 2 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-21 13:39 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, linux-arch, kvm, jmattson, Lutomirski, Andy, ak,
	kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
	michael.roth, corbet, bfields, dhildenb, linux-kernel,
	linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
	mail, ddutile, qperret, arnd, pbonzini, vannapurve,
	naoya.horiguchi, Christopherson,,
	Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * These pages are currently unmovable so don't place them into
> > > > > > movable
> > > > > > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > +	 */
> > > > > > +	mapping = memfd->f_mapping;
> > > > > > +	mapping_set_unevictable(mapping);
> > > > > > +	mapping_set_gfp_mask(mapping,
> > > > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > > 
> > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > > non-
> > > > > movable zones, but doesn't necessarily prevent page from being migrated.  My
> > > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > > get_page() after faulting in the page to prevent.
> > > > 
> > > > The current api restrictedmem_get_page() already does this, after the
> > > > caller calling it, it holds a reference to the page. The caller then
> > > > decides when to call put_page() appropriately.
> > > 
> > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > > said in v9 that this code doesn't prevent page migration, and we need to
> > > increase page refcount in restrictedmem_get_page():
> > > 
> > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > > 
> > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> > 
> > restrictedmem_get_page() increases page refcount several versions ago so
> > no change in v10 is needed. You probably missed my reply:
> > 
> > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
> 
> But for non-restricted-mem case, it is correct for KVM to decrease page's
> refcount after setting up mapping in the secondary mmu, otherwise the page will
> be pinned by KVM for normal VM (since KVM uses GUP to get the page).

That's true. Actually this is even true for the restrictedmem case: most
likely we will still need kvm_release_pfn_clean() in the KVM generic code.
On one hand, other restrictedmem users like pKVM may not require page
pinning at all. On the other hand, see below.

> 
> So what we are expecting is: for KVM if the page comes from restricted mem, then
> KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.

I argue that this page pinning (or page migration prevention) is not
tied to where the page comes from, but rather to how the page will be
used. Whether the page is restrictedmem-backed or GUP()-backed, once
it's used by the current version of TDX, the page pinning is needed. So
such page migration prevention is really a TDX thing, not even a
KVM-generic thing (that's why I think we don't need to change the
existing logic of kvm_release_pfn_clean()). Wouldn't it be better to let
the TDX code (or whoever requires it) increase/decrease the refcount
when it populates/drops the secure EPT entries? This is exactly what the
current TDX code does:

get_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217

put_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
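
For illustration, a minimal sketch of that pattern (the function names
here are hypothetical; the real code is behind the links above):

/* Pin the page only while it is mapped in the secure EPT. */
static int tdx_sept_map_private_page(kvm_pfn_t pfn)
{
	get_page(pfn_to_page(pfn));	/* block migration while mapped */
	/* issue the SEAMCALL that installs the secure EPT entry here */
	return 0;
}

static void tdx_sept_unmap_private_page(kvm_pfn_t pfn)
{
	/* issue the SEAMCALL that removes the secure EPT entry here */
	put_page(pfn_to_page(pfn));	/* mapping gone, drop the pin */
}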

Thanks,
Chao
> 
> > 
> > The current solution is clear: unless we have better approach, we will
> > let restrictedmem user (KVM in this case) to hold the refcount to
> > prevent page migration.
> > 
> 
> OK.  Will leave to others :)
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-20  9:55       ` Borislav Petkov
@ 2022-12-21 13:42         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-21 13:42 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Dec 20, 2022 at 10:55:44AM +0100, Borislav Petkov wrote:
> On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> > RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.
> 
> Which basically means that RESTRICTEDMEM should simply depend on KVM.
> Because you can't know upfront whether KVM will run a TDX guest or a SNP
> guest and so on.
> 
> Which then means that RESTRICTEDMEM will practically end up always
> enabled in KVM HV configs.

That's right, CONFIG_RESTRICTEDMEM is always selected for supported KVM
architectures (currently x86_64).

> 
> > The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> > works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> > that.
> 
> This is what I mean with "we have too many Kconfig items". :-\

Yes, I agree. One way to remove it is probably to additionally check
CONFIG_64BIT instead.
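
A minimal sketch of what that could look like (the helper name is
hypothetical; only IS_ENABLED() and the two config symbols are real):

/* Hypothetical helper: gate the 64-bit-only restrictedmem paths on
 * CONFIG_64BIT directly instead of a dedicated Kconfig symbol. */
static inline bool kvm_restrictedmem_supported(void)
{
	return IS_ENABLED(CONFIG_RESTRICTEDMEM) && IS_ENABLED(CONFIG_64BIT);
}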

Thanks,
Chao
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-21 13:39             ` Chao Peng
@ 2022-12-22  0:37               ` Huang, Kai
  2022-12-23  8:20                 ` Chao Peng
  2023-01-23 14:03                 ` Vlastimil Babka
  2022-12-22 18:15               ` Sean Christopherson
  1 sibling, 2 replies; 398+ messages in thread
From: Huang, Kai @ 2022-12-22  0:37 UTC (permalink / raw)
  To: chao.p.peng
  Cc: tglx, linux-arch, kvm, jmattson, Hocko, Michal, pbonzini, ak,
	Lutomirski, Andy, linux-fsdevel, tabba, david, michael.roth,
	kirill.shutemov, corbet, qemu-devel, dhildenb, bfields,
	linux-kernel, x86, bp, ddutile, rppt, shuah, vkuznets, vbabka,
	mail, naoya.horiguchi, qperret, arnd, linux-api, yu.c.zhang,
	Christopherson,,
	Sean, wanpengli, vannapurve, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > [...]
> > > > > > > > > > > > 
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +	/*
> > > > > > > > > > > > > > +	 * These pages are currently unmovable so don't place them into
> > > > > > > > > > > > > > movable
> > > > > > > > > > > > > > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > > +	mapping = memfd->f_mapping;
> > > > > > > > > > > > > > +	mapping_set_unevictable(mapping);
> > > > > > > > > > > > > > +	mapping_set_gfp_mask(mapping,
> > > > > > > > > > > > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > > > > > > > > > 
> > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > > > > > > > > > non-
> > > > > > > > > > > > movable zones, but doesn't necessarily prevent page from being migrated.  My
> > > > > > > > > > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > > > > > > > > > get_page() after faulting in the page to prevent.
> > > > > > > > > > 
> > > > > > > > > > The current api restrictedmem_get_page() already does this, after the
> > > > > > > > > > caller calling it, it holds a reference to the page. The caller then
> > > > > > > > > > decides when to call put_page() appropriately.
> > > > > > > > 
> > > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > > > > > > > said in v9 that this code doesn't prevent page migration, and we need to
> > > > > > > > increase page refcount in restrictedmem_get_page():
> > > > > > > > 
> > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > > > > > > > 
> > > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> > > > > > 
> > > > > > restrictedmem_get_page() increases page refcount several versions ago so
> > > > > > no change in v10 is needed. You probably missed my reply:
> > > > > > 
> > > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
> > > > 
> > > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > 
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.

OK. Agreed.

> > 
> > > > 
> > > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> > 
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()). 
> > 

This essentially boils down to who "owns" page migration handling, and sadly,
page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
migration by itself -- it's just a passive receiver.

For normal pages, page migration is totally done by the core-kernel (i.e. it
unmaps page from VMA, allocates a new page, and uses migrate_page() or a_ops-
>migrate_page() to actually migrate the page).

In the sense of TDX, conceptually it should be done in the same way. The more
important thing is: yes KVM can use get_page() to prevent page migration, but
when KVM wants to support it, KVM cannot just remove get_page(), as the core-
kernel will still just do migrate_page() which won't work for TDX (given
restricted_memfd doesn't have a_ops->migrate_page() implemented).

So I think the restricted_memfd filesystem should own page migration handling,
(i.e. by implementing a_ops->migrate_page() to either just reject page migration
or somehow support it).
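
A rough sketch of that option (illustrative only, not the series' code;
written against the current a_ops->migrate_folio() interface rather than
the older migrate_page() name):

static int restrictedmem_migrate_folio(struct address_space *mapping,
				       struct folio *dst, struct folio *src,
				       enum migrate_mode mode)
{
	/* Hard-reject migration until restrictedmem grows real support. */
	return -EBUSY;
}

static const struct address_space_operations restrictedmem_aops = {
	.migrate_folio	= restrictedmem_migrate_folio,
};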

To support page migration, it may require KVM's help in case of TDX (the
TDH.MEM.PAGE.RELOCATE SEAMCALL requires "GPA" and "level" of EPT mapping, which
are only available in KVM), but that doesn't make KVM to own the handling of
page migration.


> > Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
> > 
> > get_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217
> > 
> > put_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
> > 

As explained above, I think doing so in KVM is wrong: it can prevent by using
get_page(), but you cannot simply remove it to support page migration.

Sean also said similar thing when reviewing v8 KVM TDX series and I also agree:

https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@google.com/
https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@intel.com/


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-21 13:39             ` Chao Peng
  2022-12-22  0:37               ` Huang, Kai
@ 2022-12-22 18:15               ` Sean Christopherson
  2022-12-23  0:50                 ` Huang, Kai
                                   ` (2 more replies)
  1 sibling, 3 replies; 398+ messages in thread
From: Sean Christopherson @ 2022-12-22 18:15 UTC (permalink / raw)
  To: Chao Peng
  Cc: Huang, Kai, tglx, linux-arch, kvm, jmattson, Lutomirski, Andy,
	ak, kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
	michael.roth, corbet, bfields, dhildenb, linux-kernel,
	linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
	mail, ddutile, qperret, arnd, pbonzini, vannapurve,
	naoya.horiguchi, wanpengli, yu.c.zhang, hughd, aarcange, mingo,
	hpa, Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Wed, Dec 21, 2022, Chao Peng wrote:
> On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> 
> That's true. Actually even true for restrictedmem case, most likely we
> will still need the kvm_release_pfn_clean() for KVM generic code. On one
> side, other restrictedmem users like pKVM may not require page pinning
> at all. On the other side, see below.
> 
> > 
> > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.

No, requiring the user (KVM) to guard against lack of support for page migration
in restricted mem is a terrible API.  It's totally fine for restricted mem to not
support page migration until there's a use case, but punting the problem to KVM
is not acceptable.  Restricted mem itself doesn't yet support page migration,
e.g. explosions would occur even if KVM wanted to allow migration since there is
no notification to invalidate existing mappings.

> I argue that this page pinning (or page migration prevention) is not
> tied to where the page comes from, instead related to how the page will
> be used. Whether the page is restrictedmem backed or GUP() backed, once
> it's used by current version of TDX then the page pinning is needed. So
> such page migration prevention is really TDX thing, even not KVM generic
> thing (that's why I think we don't need change the existing logic of
> kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> requires that) to increase/decrease the refcount when it populates/drops
> the secure EPT entries? This is exactly what the current TDX code does:

I agree that whether or not migration is supported should be controllable by the
user, but I strongly disagree on punting refcount management to KVM (or TDX).
The whole point of restricted mem is to support technologies like TDX and SNP,
accommodating their special needs for things like page migration should be part of
the API, not some footnote in the documentation.

It's not difficult to let the user communicate support for page migration, e.g.
if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
to signal support (or lack thereof) for page migration.  NULL == no migration,
non-NULL == migration allowed.
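
Something along these lines, purely illustrative (the existing callback
names are paraphrased from the restrictedmem series; the ->migrate hook
and its signature are hypothetical):

struct restrictedmem_notifier_ops {
	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
				 pgoff_t start, pgoff_t end);
	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
			       pgoff_t start, pgoff_t end);
	/* NULL == no page migration support, non-NULL == migration allowed. */
	int (*migrate)(struct restrictedmem_notifier *notifier,
		       pgoff_t start, pgoff_t end);
};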

We know that supporting page migration in TDX and SNP is possible, and we know
that page migration will require a dedicated API since the backing store can't
memcpy() the page.  I don't see any reason to ignore that eventuality.

But again, unless I'm missing something, that's a future problem because restricted
mem doesn't yet support page migration regardless of the downstream user.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-22 18:15               ` Sean Christopherson
@ 2022-12-23  0:50                 ` Huang, Kai
  2022-12-23  8:24                 ` Chao Peng
  2023-01-23 15:43                 ` Kirill A. Shutemov
  2 siblings, 0 replies; 398+ messages in thread
From: Huang, Kai @ 2022-12-23  0:50 UTC (permalink / raw)
  To: Christopherson,, Sean, chao.p.peng
  Cc: tglx, linux-arch, kvm, jmattson, Hocko, Michal, pbonzini, ak,
	Lutomirski, Andy, linux-fsdevel, tabba, david, michael.roth,
	kirill.shutemov, corbet, qemu-devel, dhildenb, bfields,
	linux-kernel, x86, bp, ddutile, rppt, shuah, vkuznets, vbabka,
	mail, naoya.horiguchi, qperret, arnd, linux-api, yu.c.zhang,
	aarcange, wanpengli, vannapurve, hughd, mingo, hpa, Nakajima,
	Jun, jlayton, joro, linux-mm, Wang, Wei W, steven.price,
	linux-doc, Hansen, Dave, akpm, linmiaohe

On Thu, 2022-12-22 at 18:15 +0000, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > 
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> > 
> > > 
> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> 
> No, requiring the user (KVM) to guard against lack of support for page migration
> in restricted mem is a terrible API.  It's totally fine for restricted mem to not
> support page migration until there's a use case, but punting the problem to KVM
> is not acceptable.  Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there is
> no notification to invalidate existing mappings.
> 
> 

Yes totally agree (I also replied separately).

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-22  0:37               ` Huang, Kai
@ 2022-12-23  8:20                 ` Chao Peng
  2023-01-23 14:03                 ` Vlastimil Babka
  1 sibling, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-23  8:20 UTC (permalink / raw)
  To: Huang, Kai
  Cc: tglx, linux-arch, kvm, jmattson, Hocko, Michal, pbonzini, ak,
	Lutomirski, Andy, linux-fsdevel, tabba, david, michael.roth,
	kirill.shutemov, corbet, qemu-devel, dhildenb, bfields,
	linux-kernel, x86, bp, ddutile, rppt, shuah, vkuznets, vbabka,
	mail, naoya.horiguchi, qperret, arnd, linux-api, yu.c.zhang,
	Christopherson,,
	Sean, wanpengli, vannapurve, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Thu, Dec 22, 2022 at 12:37:19AM +0000, Huang, Kai wrote:
> On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote:
> > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > [...]
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +	/*
> > > > > > > > > > > > > > > +	 * These pages are currently unmovable so don't place them into
> > > > > > > > > > > > > > > movable
> > > > > > > > > > > > > > > +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > > > +	mapping = memfd->f_mapping;
> > > > > > > > > > > > > > > +	mapping_set_unevictable(mapping);
> > > > > > > > > > > > > > > +	mapping_set_gfp_mask(mapping,
> > > > > > > > > > > > > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > > > > > > > > > > non-
> > > > > > > > > > > > > movable zones, but doesn't necessarily prevent page from being migrated.  My
> > > > > > > > > > > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > > > > > > > > > > get_page() after faulting in the page to prevent.
> > > > > > > > > > > 
> > > > > > > > > > > The current api restrictedmem_get_page() already does this, after the
> > > > > > > > > > > caller calling it, it holds a reference to the page. The caller then
> > > > > > > > > > > decides when to call put_page() appropriately.
> > > > > > > > > 
> > > > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > > > > > > > > said in v9 that this code doesn't prevent page migration, and we need to
> > > > > > > > > increase page refcount in restrictedmem_get_page():
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > > > > > > > > 
> > > > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> > > > > > > 
> > > > > > > restrictedmem_get_page() increases page refcount several versions ago so
> > > > > > > no change in v10 is needed. You probably missed my reply:
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
> > > > > 
> > > > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > > 
> > > That's true. Actually even true for restrictedmem case, most likely we
> > > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > > side, other restrictedmem users like pKVM may not require page pinning
> > > at all. On the other side, see below.
> 
> OK. Agreed.
> 
> > > 
> > > > > 
> > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> > > 
> > > I argue that this page pinning (or page migration prevention) is not
> > > tied to where the page comes from, instead related to how the page will
> > > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > > it's used by current version of TDX then the page pinning is needed. So
> > > such page migration prevention is really TDX thing, even not KVM generic
> > > thing (that's why I think we don't need change the existing logic of
> > > kvm_release_pfn_clean()). 
> > > 
> 
> This essentially boils down to who "owns" page migration handling, and sadly,
> page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
> migration by itself -- it's just a passive receiver.

No, I'm not talking about the page migration handling itself; I know
page migration requires coordination between core-mm and KVM. I'm more
concerned about the page migration prevention here. This is something we
need to address for TDX before page migration is supported.

> 
> For normal pages, page migration is totally done by the core-kernel (i.e. it
> unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
> >migrate_page() to actually migrate the page).
> 
> In the sense of TDX, conceptually it should be done in the same way. The more
> important thing is: yes KVM can use get_page() to prevent page migration, but
> when KVM wants to support it, KVM cannot just remove get_page(), as the core-
> kernel will still just do migrate_page() which won't work for TDX (given
> restricted_memfd doesn't have a_ops->migrate_page() implemented).
> 
> So I think the restricted_memfd filesystem should own page migration handling,
> (i.e. by implementing a_ops->migrate_page() to either just reject page migration
> or somehow support it).
> 
> To support page migration, it may require KVM's help in case of TDX (the
> TDH.MEM.PAGE.RELOCATE SEAMCALL requires "GPA" and "level" of EPT mapping, which
> are only available in KVM), but that doesn't make KVM to own the handling of
> page migration.
> 
> 
> > > Wouldn't better to let TDX code (or who
> > > requires that) to increase/decrease the refcount when it populates/drops
> > > the secure EPT entries? This is exactly what the current TDX code does:
> > > 
> > > get_page():
> > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217
> > > 
> > > put_page():
> > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
> > > 
> 
> As explained above, I think doing so in KVM is wrong: it can prevent by using
> get_page(), but you cannot simply remove it to support page migration.

Removing get_page() is definitely not enough for page migration support.
But the key question is: for page migration prevention, do we really
have an alternative other than get_page()?

Thanks,
Chao
> 
> Sean also said similar thing when reviewing v8 KVM TDX series and I also agree:
> 
> https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@google.com/
> https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@intel.com/
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-22 18:15               ` Sean Christopherson
  2022-12-23  0:50                 ` Huang, Kai
@ 2022-12-23  8:24                 ` Chao Peng
  2023-01-23 15:43                 ` Kirill A. Shutemov
  2 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2022-12-23  8:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Huang, Kai, tglx, linux-arch, kvm, jmattson, Lutomirski, Andy,
	ak, kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
	michael.roth, corbet, bfields, dhildenb, linux-kernel,
	linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
	mail, ddutile, qperret, arnd, pbonzini, vannapurve,
	naoya.horiguchi, wanpengli, yu.c.zhang, hughd, aarcange, mingo,
	hpa, Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > 
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> > 
> > > 
> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> 
> No, requiring the user (KVM) to guard against lack of support for page migration
> in restricted mem is a terrible API.  It's totally fine for restricted mem to not
> support page migration until there's a use case, but punting the problem to KVM
> is not acceptable.  Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there is
> no notification to invalidate existing mappings.
> 
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
> 
> I agree that whether or not migration is supported should be controllable by the
> user, but I strongly disagree on punting refcount management to KVM (or TDX).
> The whole point of restricted mem is to support technologies like TDX and SNP,
> accommodating their special needs for things like page migration should be part of
> the API, not some footnote in the documentation.

I never doubted that page migration should be part of the restrictedmem
API, but that's not part of the initial implementation, as we all
agreed, right? Then, before that API is introduced, we need to find a
solution to prevent page migration for TDX. Other than refcount
management, do we have any other workable solution?

> 
> It's not difficult to let the user communicate support for page migration, e.g.
> if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
> to signal support (or lack thereof) for page migration.  NULL == no migration,
> non-NULL == migration allowed.

I know.

> 
> We know that supporting page migration in TDX and SNP is possible, and we know
> that page migration will require a dedicated API since the backing store can't
> memcpy() the page.  I don't see any reason to ignore that eventuality.

No, I'm not ignoring it. This is just about the short-term page
migration prevention before that dedicated API is introduced.

> 
> But again, unless I'm missing something, that's a future problem because restricted
> mem doesn't yet support page migration regardless of the downstream user.

It's true that page migration support itself is a future problem, but
page migration prevention is not, since TDX pages need to be pinned
until page migration gets supported.

Thanks,
Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
                     ` (2 preceding siblings ...)
  2022-12-16 15:09   ` Borislav Petkov
@ 2022-12-28  8:28   ` Chenyi Qiang
  2023-01-03  1:39     ` Chao Peng
  2023-01-13 22:02   ` Sean Christopherson
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 398+ messages in thread
From: Chenyi Qiang @ 2022-12-28  8:28 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang



On 12/2/2022 2:13 PM, Chao Peng wrote:
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
> 
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>     a guest memory range.
>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>     memory attributes.
> 
> KVM internally uses xarray to store the per-page memory attributes.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>  arch/x86/kvm/Kconfig           |  1 +
>  include/linux/kvm_host.h       |  3 ++
>  include/uapi/linux/kvm.h       | 17 ++++++++
>  virt/kvm/Kconfig               |  3 ++
>  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>  6 files changed, 163 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
>  5. The kvm_run structure
>  ========================
>  
> @@ -8270,6 +8323,16 @@ structure.
>  When getting the Modified Change Topology Report value, the attr->addr
>  must point to a byte where the value will be stored or retrieved from.
>  
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
>  9. Known KVM API problems
>  =========================
>  
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>  	select SRCU
>  	select INTERVAL_TREE
>  	select HAVE_KVM_PM_NOTIFIER if PM
> +	select HAVE_KVM_MEMORY_ATTRIBUTES
>  	help
>  	  Support hosting fully virtualized guest machines using hardware
>  	  virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>  
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  	struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	struct xarray mem_attr_array;
>  #endif
>  	char stats_id[KVM_STATS_NAME_SIZE];
>  };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_CPU_TOPOLOGY 222
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>  
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
>  config HAVE_KVM_DIRTY_RING
>         bool
>  
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>  # Only strongly ordered architectures can select this, as it doesn't
>  # put any explicit constraint on userspace ordering. They can also
>  # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_init(&kvm->mem_attr_array);
> +#endif
>  
>  	INIT_LIST_HEAD(&kvm->gpc_list);
>  	spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>  		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>  	}
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_destroy(&kvm->mem_attr_array);
> +#endif
>  	cleanup_srcu_struct(&kvm->irq_srcu);
>  	cleanup_srcu_struct(&kvm->srcu);
>  	kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +	return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +

Because guest memory defaults to private, and now this patch stores the
attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of _SHARED, it
would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of boot
time. Maybe it can be optimized somehow in other places? e.g. set mem
attr in advance.

> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-28  8:28   ` Chenyi Qiang
@ 2023-01-03  1:39     ` Chao Peng
  2023-01-03  3:32       ` Wang, Wei W
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-01-03  1:39 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Wed, Dec 28, 2022 at 04:28:01PM +0800, Chenyi Qiang wrote:
...
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> > +
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> 
> Because guest memory defaults to private, and now this patch stores the
> attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of _SHARED, it
> would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of boot
> time. Maybe it can be optimized somehow in other places? e.g. set mem
> attr in advance.

KVM defaults to 'shared' because this ioctl can also be potentially used
by normal VMs and 'shared' sounds a value meaningful for both normal VMs
and confidential VMs. As for more KVM_EXIT_MEMORY_FAULT exits during the
booting time, yes, setting all memory to 'private' for confidential VMs
through this ioctl in userspace before guest launch is an approach for
KVM userspace to 'override' the KVM default and reduce the number of
implicit conversions.
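
As a userspace sketch of that "override" step (assuming kernel headers
from this series; vm_fd, gpa and size are placeholders):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_whole_guest_private(int vm_fd, uint64_t gpa, uint64_t size)
{
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags      = 0,
	};

	/* On success, attrs.size holds whatever was left unset (0 if none). */
	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}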

Thanks,
Chao
> 
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> >  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);

^ permalink raw reply	[flat|nested] 398+ messages in thread

* RE: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-01-03  1:39     ` Chao Peng
@ 2023-01-03  3:32       ` Wang, Wei W
  2023-01-03 23:06         ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Wang, Wei W @ 2023-01-03  3:32 UTC (permalink / raw)
  To: Chao Peng, Qiang, Chenyi
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Christopherson,,
	Sean, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	Lutomirski, Andy, Nakajima, Jun, Hansen, Dave, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	Hocko, Michal

On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > Because guest memory defaults to private, and now this patch stores
> > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> _SHARED,
> > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > mem attr in advance.
> 
> KVM defaults to 'shared' because this ioctl can also be potentially used by
> normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> confidential VMs. 

Do you mean a normal VM could have pages marked private? What's the usage?
(If all the pages are just marked shared for normal VMs, then why do we need it)

> As for more KVM_EXIT_MEMORY_FAULT exits during the
> booting time, yes, setting all memory to 'private' for confidential VMs through
> this ioctl in userspace before guest launch is an approach for KVM userspace to
> 'override' the KVM default and reduce the number of implicit conversions.

Most pages of a confidential VM are likely to be private pages. It seems more efficient
(and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
and defaults to "shared" for normal VMs.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-01-03  3:32       ` Wang, Wei W
@ 2023-01-03 23:06         ` Sean Christopherson
  2023-01-05  4:39           ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-03 23:06 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: Chao Peng, Qiang, Chenyi, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, Lutomirski, Andy, Nakajima, Jun,
	Hansen, Dave, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, Hocko, Michal

On Tue, Jan 03, 2023, Wang, Wei W wrote:
> On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > Because guest memory defaults to private, and now this patch stores
> > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > _SHARED,
> > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > mem attr in advance.
> > 
> > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > confidential VMs. 
> 
> Do you mean a normal VM could have pages marked private? What's the usage?
> (If all the pages are just marked shared for normal VMs, then why do we need it)

No, there are potential use cases for per-page attribute/permissions, e.g. to
make select pages read-only, exec-only, no-exec, etc...

> > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > booting time, yes, setting all memory to 'private' for confidential VMs through
> > this ioctl in userspace before guest launch is an approach for KVM userspace to
> > 'override' the KVM default and reduce the number of implicit conversions.
> 
> Most pages of a confidential VM are likely to be private pages. It seems more efficient
> (and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
> and defaults to "shared" for normal VMs.

If done right, the default shouldn't matter all that much for efficiency.  KVM
needs to be able to efficiently track large ranges regardless of the default,
otherwise the memory overhead and the presumably cost of lookups will be painful.
E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
entries.

Looks like that behavior was changed in v8 in response to feedback[*] that doing
xa_store_range() on a subset of an existing range (entry) would overwrite the
entire existing range (entry), not just the smaller subset.  xa_store_range() does
appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
splitting an existing entry isn't super complex.

Using xa_store() for the very initial implementation is ok, and probably a good
idea since it's more obviously correct and will give us a bisection point.  But
we definitely want a more performant implementation sooner than later.  The hardest
part will likely be merging existing entries, but that can be done separately too,
and is probably lower priority.

E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
metadata in multi-index entries, and finally (3) support merging adjacent entries
with identical values.

[*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=RBWF6-v2CSpabOCpKq5rPz+e-9co7EisoVQ@mail.gmail.com
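
To make steps (1) and (2) above concrete, a rough sketch assuming
gfn-indexed attribute values in kernel context (not the series' code;
error handling trimmed, xa_store_range() needs CONFIG_XARRAY_MULTI):

/* Step (1): 4KiB granularity, one xa_store() per gfn. */
static int set_attrs_per_4k(struct xarray *xa, gfn_t start, gfn_t end, u64 attrs)
{
	gfn_t i;

	for (i = start; i < end; i++) {
		int err = xa_err(xa_store(xa, i, xa_mk_value(attrs),
					  GFP_KERNEL_ACCOUNT));
		if (err)
			return err;
	}
	return 0;
}

/* Step (2): one multi-index entry covering the whole range. */
static int set_attrs_range(struct xarray *xa, gfn_t start, gfn_t end, u64 attrs)
{
	return xa_err(xa_store_range(xa, start, end - 1, xa_mk_value(attrs),
				     GFP_KERNEL_ACCOUNT));
}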

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-01-03 23:06         ` Sean Christopherson
@ 2023-01-05  4:39           ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-01-05  4:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Wang, Wei W, Qiang, Chenyi, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, Lutomirski, Andy, Nakajima, Jun,
	Hansen, Dave, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, Hocko, Michal

On Tue, Jan 03, 2023 at 11:06:37PM +0000, Sean Christopherson wrote:
> On Tue, Jan 03, 2023, Wang, Wei W wrote:
> > On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > > Because guest memory defaults to private, and now this patch stores
> > > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > > _SHARED,
> > > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > > mem attr in advance.
> > > 
> > > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > > confidential VMs. 
> > 
> > Do you mean a normal VM could have pages marked private? What's the usage?
> > (If all the pages are just marked shared for normal VMs, then why do we need it)
> 
> No, there are potential use cases for per-page attribute/permissions, e.g. to
> make select pages read-only, exec-only, no-exec, etc...

Right, normal VMs are not likely to use the private/shared bit. I'm not
sure about pKVM, but perhaps we shouldn't call those 'normal' VMs in
this context. Still, since the ioctl can be used by normal VMs for the
other bits (read-only, exec-only, no-exec, etc.), a default of 'private'
looks strange for them. That's why I default it to 'shared'; for a
confidential guest, userspace can issue another call to this ioctl to
set all the memory to 'private' before guest boot, if a default of
'private' is needed for the guest.

Like Wei mentioned, it's also possible to make the default dependent on
vm_type, but that looks awkward to me from both the API definition and
the implementation perspective, and vm_type has not been introduced at
this point in the series.

> 
> > > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > > booting time, yes, setting all memory to 'private' for confidential VMs through
> > > this ioctl in userspace before guest launch is an approach for KVM userspace to
> > > 'override' the KVM default and reduce the number of implicit conversions.
> > 
> > Most pages of a confidential VM are likely to be private pages. It seems more efficient
> > (and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
> > and defaults to "shared" for normal VMs.
> 
> If done right, the default shouldn't matter all that much for efficiency.  KVM
> needs to be able to efficiently track large ranges regardless of the default,
> otherwise the memory overhead and the presumably cost of lookups will be painful.
> E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
> entries.

I agree, KVM should have the ability to track large ranges efficiently.

> 
> Looks like that behavior was changed in v8 in response to feedback[*] that doing
> xa_store_range() on a subset of an existing range (entry) would overwrite the
> entire existing range (entry), not just the smaller subset.  xa_store_range() does
> appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
> splitting an existing entry isn't super complex.

Yes, xa_store_range() initially looks like a perfect match for us, but
the 'overwriting the entire entry' behavior makes it incorrect for us
when storing a subset of an existing large entry. The xarray lib has
utilities for splitting; the hard part is merging existing entries, as
you also said below. Thanks for pointing out the __filemap_add_folio()
example, the splitting there does not look too complex.

> 
> Using xa_store() for the very initial implementation is ok, and probably a good
> idea since it's more obviously correct and will give us a bisection point.  But
> we definitely want a more performant implementation sooner than later.  The hardest
> part will likely be merging existing entries, but that can be done separately too,
> and is probably lower priority.
> 
> E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
> metadata in multi-index entries, and finally (3) support merging adjacent entries
> with identical values.

This path looks good to me.

Thanks,
Chao
> 
> [*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=RBWF6-v2CSpabOCpKq5rPz+e-9co7EisoVQ@mail.gmail.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2022-12-02  6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
                     ` (2 preceding siblings ...)
  2022-12-19 14:36   ` Borislav Petkov
@ 2023-01-05 11:23   ` Jarkko Sakkinen
  2023-01-06  9:40     ` Chao Peng
  3 siblings, 1 reply; 398+ messages in thread
From: Jarkko Sakkinen @ 2023-01-05 11:23 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
> 
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
> 
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
> 
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries or handle memory errors.
> 
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
> 
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.

Feels a bit hacky IMHO, and more like a completely new feature than
an extension.

Why not just add a new ioctl? The commit message does not address
the most essential design here.

BR, Jarkko

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-12-02  6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
  2022-12-09  9:11   ` Fuad Tabba
@ 2023-01-05 20:38   ` Vishal Annapurve
  2023-01-06  4:13     ` Chao Peng
  2023-01-14  0:01   ` Sean Christopherson
  2023-03-07 19:14   ` Ackerley Tng
  3 siblings, 1 reply; 398+ messages in thread
From: Vishal Annapurve @ 2023-01-05 20:38 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Dec 1, 2022 at 10:20 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> +                                        pgoff_t start, pgoff_t end,
> +                                        gfn_t *gfn_start, gfn_t *gfn_end)
> +{
> +       unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +
> +       if (start > base_pgoff)
> +               *gfn_start = slot->base_gfn + start - base_pgoff;

There should be a check for overflow here in case start is a very big
value. Additional check can look like:
if (start >= base_pgoff + slot->npages)
       return false;

> +       else
> +               *gfn_start = slot->base_gfn;
> +
> +       if (end < base_pgoff + slot->npages)
> +               *gfn_end = slot->base_gfn + end - base_pgoff;

If "end" is smaller than base_pgoff, this can cause overflow and
return the range as valid. There should be additional check:
if (end < base_pgoff)
         return false;


> +       else
> +               *gfn_end = slot->base_gfn + slot->npages;
> +
> +       if (*gfn_start >= *gfn_end)
> +               return false;
> +
> +       return true;
> +}
> +

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-01-05 20:38   ` Vishal Annapurve
@ 2023-01-06  4:13     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-01-06  4:13 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Jan 05, 2023 at 12:38:30PM -0800, Vishal Annapurve wrote:
> On Thu, Dec 1, 2022 at 10:20 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> > +                                        pgoff_t start, pgoff_t end,
> > +                                        gfn_t *gfn_start, gfn_t *gfn_end)
> > +{
> > +       unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> > +
> > +       if (start > base_pgoff)
> > +               *gfn_start = slot->base_gfn + start - base_pgoff;
> 
> There should be a check for overflow here in case start is a very big
> value. Additional check can look like:
> if (start >= base_pgoff + slot->npages)
>        return false;
> 
> > +       else
> > +               *gfn_start = slot->base_gfn;
> > +
> > +       if (end < base_pgoff + slot->npages)
> > +               *gfn_end = slot->base_gfn + end - base_pgoff;
> 
> If "end" is smaller than base_pgoff, this can cause overflow and
> return the range as valid. There should be additional check:
> if (end < base_pgoff)
>          return false;

Thanks! Both are good catches. The improved code:

static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
					 pgoff_t start, pgoff_t end,
					 gfn_t *gfn_start, gfn_t *gfn_end)
{
	unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;

	if (start >= base_pgoff + slot->npages)
		return false;
	else if (start <= base_pgoff)
		*gfn_start = slot->base_gfn;
	else
		*gfn_start = start - base_pgoff + slot->base_gfn;

	if (end <= base_pgoff)
		return false;
	else if (end >= base_pgoff + slot->npages)
		*gfn_end = slot->base_gfn + slot->npages;
	else
		*gfn_end = end - base_pgoff + slot->base_gfn;

	if (*gfn_start >= *gfn_end)
		return false;

	return true;
}

Thanks,
Chao
> 
> 
> > +       else
> > +               *gfn_end = slot->base_gfn + slot->npages;
> > +
> > +       if (*gfn_start >= *gfn_end)
> > +               return false;
> > +
> > +       return true;
> > +}
> > +

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-05 11:23   ` Jarkko Sakkinen
@ 2023-01-06  9:40     ` Chao Peng
  2023-01-09 19:32       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-01-06  9:40 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with a special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. Allowing userspace to access guest private memory has no
> > value and can sometimes cause problems. This new KVM memslot extension
> > allows guest private memory to be provided through a restrictedmem-backed
> > file descriptor (fd), and userspace is restricted from accessing the
> > memory in that fd.
> > 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr (hva). When in use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to the guest is maintained by other KVM code.
> > 
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM of memory changes,
> > so that KVM can invalidate its page table entries or handle memory errors.
> > 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> > 
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> 
> Feels a bit hacky IMHO, and more like a completely new feature than
> an extension.
> 
> Why not just add a new ioctl? The commit message does not address
> the most essential design here.

Yes, people can always choose to add a new ioctl for this kind of change
and the balance point here is we want to also avoid 'too many ioctls' if
the functionalities are similar.  The '_ext' variant reuses all the
existing fields in the 'normal' variant and most importantly KVM
internally can reuse most of the code. I certainly can add some words in
the commit message to explain this design choice.

Thanks,
Chao
> 
> BR, Jarkko

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-06  9:40     ` Chao Peng
@ 2023-01-09 19:32       ` Sean Christopherson
  2023-01-10  9:14         ` Chao Peng
  2023-01-20 23:28         ` Jarkko Sakkinen
  0 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-09 19:32 UTC (permalink / raw)
  To: Chao Peng
  Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Jan 06, 2023, Chao Peng wrote:
> On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> > 
> > Feels a bit hacky IMHO, and more like a completely new feature than
> > an extension.
> > 
> > Why not just add a new ioctl? The commit message does not address
> > the most essential design here.
> 
> Yes, people can always choose to add a new ioctl for this kind of change
> and the balance point here is we want to also avoid 'too many ioctls' if
> the functionalities are similar.  The '_ext' variant reuses all the
> existing fields in the 'normal' variant and most importantly KVM
> internally can reuse most of the code. I certainly can add some words in
> the commit message to explain this design choice.

After seeing the userspace side of this, I agree with Jarkko; overloading
KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
itself.

It feels absolutely ridiculous, but I think the best option is to do:

#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
					 struct kvm_userspace_memory_region2)

/* for KVM_SET_USER_MEMORY_REGION2 */
struct kvm_user_mem_region2 {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size;
	__u64 userspace_addr;
	__u64 restricted_offset;
	__u32 restricted_fd;
	__u32 pad1;
	__u64 pad2[14];
};

And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.

Regarding the userspace side of things, please include Vishal's selftests in v11,
it's impossible to properly review the uAPI changes without seeing the userspace
side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
massage it into a set of patches that you can incorporate into your series.

[*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-09 19:32       ` Sean Christopherson
@ 2023-01-10  9:14         ` Chao Peng
  2023-01-10 22:51           ` Vishal Annapurve
                             ` (2 more replies)
  2023-01-20 23:28         ` Jarkko Sakkinen
  1 sibling, 3 replies; 398+ messages in thread
From: Chao Peng @ 2023-01-10  9:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > 
> > > Feels a bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > > 
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> > 
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar.  The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
> 
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> itself.

How is the size validation being bogus? I don't quite follow. Then we
will use kvm_userspace_memory_region2 as the KVM internal alias, right?
I see similar examples use different functions to handle different
versions but it does look easier if we use alias for this function.

> 
> It feels absolutely ridiculous, but I think the best option is to do:
> 
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> 					 struct kvm_userspace_memory_region2)

Just curious, is 0x49 a safe number we can use?

> 
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_user_mem_region2 {
> 	__u32 slot;
> 	__u32 flags;
> 	__u64 guest_phys_addr;
> 	__u64 memory_size;
> 	__u64 userspace_addr;
> 	__u64 restricted_offset;
> 	__u32 restricted_fd;
> 	__u32 pad1;
> 	__u64 pad2[14];
> };
> 
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.

Okay, I agree that from the KVM userspace API perspective this is more
consistent with similar existing examples. I see several of them.

I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
ioctl.
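
On the userspace side I imagine something like the below (purely
illustrative: the struct, ioctl and cap here are only proposals in this
thread and don't exist in <linux/kvm.h> yet, so names and values may well
change):

static int set_private_memslot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
			       void *hva, int restricted_fd)
{
	struct kvm_userspace_memory_region2 region = {
		.slot = slot,
		.flags = KVM_MEM_PRIVATE,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = (__u64)(unsigned long)hva,
		.restricted_fd = restricted_fd,
		.restricted_offset = 0,
	};

	/* Hypothetical cap name, assuming the new ioctl gets its own cap. */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_USER_MEMORY_REGION2) <= 0)
		return -1;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}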

> 
> Regarding the userspace side of things, please include Vishal's selftests in v11,
> it's impossible to properly review the uAPI changes without seeing the userspace
> side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.

Previously I included Vishal's selftests in the github repo, but did not
include them in this patch series. It's OK for me to incorporate them
directly into this series and review them together, if Vishal is fine with that.

Chao
> 
> [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-10  9:14         ` Chao Peng
@ 2023-01-10 22:51           ` Vishal Annapurve
  2023-01-13 22:37           ` Sean Christopherson
  2023-01-20 23:42           ` Jarkko Sakkinen
  2 siblings, 0 replies; 398+ messages in thread
From: Vishal Annapurve @ 2023-01-10 22:51 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, Jarkko Sakkinen, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-arch, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
	Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, wei.w.wang

On Tue, Jan 10, 2023 at 1:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Regarding the userspace side of things, please include Vishal's selftests in v11,
> > it's impossible to properly review the uAPI changes without seeing the userspace
> > side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> > massage it into a set of patches that you can incorporate into your series.
>
> Previously I included Vishal's selftests in the github repo, but did not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review them together, if Vishal is fine with that.
>

Yeah, I am ok with incorporating selftest patches into this series and
reviewing them together.

Regards,
Vishal

> Chao
> >
> > [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
  2022-12-06 14:57   ` Fuad Tabba
  2022-12-13 23:49   ` Huang, Kai
@ 2023-01-13 21:54   ` Sean Christopherson
  2023-01-17 12:41     ` Chao Peng
  2023-02-22  2:07     ` Alexey Kardashevskiy
  2023-01-30  5:26   ` Ackerley Tng
                     ` (3 subsequent siblings)
  6 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 21:54 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> The system call is currently wired up for x86 arch.

Building on other architectures (except for arm64 for some reason) yields:

  CALL    /.../scripts/checksyscalls.sh
  <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp]

Do we care?  It's the only such warning, which makes me think we either need to
wire this up for all architectures, or explicitly document that it's unsupported.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

...

> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..c2700c5daa43
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H

Missing

 #define _LINUX_RESTRICTEDMEM_H

which causes fireworks if restrictedmem.h is included more than once.
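
I.e. the usual pattern (sketch):

 #ifndef _LINUX_RESTRICTEDMEM_H
 #define _LINUX_RESTRICTEDMEM_H

 /* ... declarations ... */

 #endif /* _LINUX_RESTRICTEDMEM_H */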

> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>

...

> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +					 struct page **pagep, int *order)
> +{
> +	return -1;

This should be a proper -errno, though in the current incarnation of things it's
a moot point because no stub is needed.  KVM can (and should) easily provide its
own stub for this one.

> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return false;
> +}
> +
> +static inline void restrictedmem_error_page(struct page *page,
> +					    struct address_space *mapping)
> +{
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */

...

> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..56953c204e5c
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,318 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {

Any objection to simply calling this "restrictedmem"?  And then using either "rm"
or "rmem" for local variable names?  I kept reading "data" as the underyling data
being written to the page, as opposed to the metadata describing the restrictedmem
instance.

> +	struct mutex lock;
> +	struct file *memfd;
> +	struct list_head notifiers;
> +};
> +
> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> +					   pgoff_t start, pgoff_t end)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);

This can be a r/w semaphore instead of a mutex, that way punching holes at multiple
points in the file can at least run the notifiers in parallel.  The actual allocation
by shmem will still be serialized, but I think it's worth the simple optimization
since zapping and flushing in KVM may be somewhat slow.

> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		notifier->ops->invalidate_start(notifier, start, end);

Two major design issues that we overlooked long ago:

  1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
     VM with a large number of convertible memslots that are all backed by a
     single large restrictedmem instance, then converting a single page will
     result in a linear walk through all memslots.  I don't expect anyone to
     actually do something silly like that, but I also never expected there to be
     a legitimate usecase for thousands of memslots.

  2. This approach fails to provide the ability for KVM to ensure a guest has
     exclusive access to a page.  As discussed in the past, the kernel can rely
     on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
     only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
     e.g. SEV, there is value in ensuring a 1:1 association.

     And probably more importantly, relying on hardware for SNP and TDX yields a
     poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
     page is exclusive to a guest, i.e. if userspace can hand out the same page
     from a restrictedmem instance to multiple VMs, then failure will occur only
     when KVM tries to assign the page to the second VM.  That will happen deep
     in KVM, which means KVM needs to gracefully handle such errors, and it means
     that KVM's ABI effectively allows plumbing garbage into its memslots.

Rather than use a simple list of notifiers, this appears to be yet another
opportunity to use an xarray.  Supporting sharing of restrictedmem will be
non-trivial, but IMO we should punt that to the future since it's still unclear
exactly how sharing will work.

An xarray will solve #1 by notifying only the consumers (memslots) that are bound
to the affected range.

And for #2, it's relatively straightforward (knock wood) to detect existing
entries, i.e. if the user wants exclusive access to memory, then the bind operation
can be rejected if there's an existing entry.

VERY lightly tested code snippet at the bottom (will provide link to fully worked
code in cover letter).


> +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> +				     loff_t offset, loff_t len)
> +{
> +	int ret;
> +	pgoff_t start, end;
> +	struct file *memfd = data->memfd;
> +
> +	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +		return -EINVAL;
> +
> +	start = offset >> PAGE_SHIFT;
> +	end = (offset + len) >> PAGE_SHIFT;
> +
> +	restrictedmem_invalidate_start(data, start, end);
> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	restrictedmem_invalidate_end(data, start, end);

The lock needs to be held for the entire duration of the hole punch, i.e. needs to
be taken before invalidate_start() and released after invalidate_end().  If a user
(un)binds/(un)registers after invalidate_start(), it will see an unpaired notification,
e.g. could leave KVM with incorrect notifier counts.

> +
> +	return ret;
> +}

What I ended up with for an xarray-based implementation.  I'm very flexible on
names and whatnot, these are just what made sense to me.

static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
				     loff_t offset, loff_t len)
{
	struct restrictedmem_notifier *notifier;
	struct file *memfd = rm->memfd;
	unsigned long index;
	pgoff_t start, end;
	int ret;

	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
		return -EINVAL;

	start = offset >> PAGE_SHIFT;
	end = (offset + len) >> PAGE_SHIFT;

	/*
	 * Bindings must be stable across invalidation to ensure the start+end
	 * are balanced.
	 */
	down_read(&rm->lock);

	xa_for_each_range(&rm->bindings, index, notifier, start, end)
		notifier->ops->invalidate_start(notifier, start, end);

	ret = memfd->f_op->fallocate(memfd, mode, offset, len);

	xa_for_each_range(&rm->bindings, index, notifier, start, end)
		notifier->ops->invalidate_end(notifier, start, end);

	up_read(&rm->lock);

	return ret;
}

int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
		       struct restrictedmem_notifier *notifier, bool exclusive)
{
	struct restrictedmem *rm = file->f_mapping->private_data;
	int ret = -EINVAL;

	down_write(&rm->lock);

	/* Non-exclusive mappings are not yet implemented. */
	if (!exclusive)
		goto out_unlock;

	if (!xa_empty(&rm->bindings)) {
		if (exclusive != rm->exclusive)
			goto out_unlock;

		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
			goto out_unlock;
	}

	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
	rm->exclusive = exclusive;
	ret = 0;
out_unlock:
	up_write(&rm->lock);
	return ret;
}
EXPORT_SYMBOL_GPL(restrictedmem_bind);

void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
			  struct restrictedmem_notifier *notifier)
{
	struct restrictedmem *rm = file->f_mapping->private_data;

	down_write(&rm->lock);
	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
	synchronize_rcu();
	up_write(&rm->lock);
}
EXPORT_SYMBOL_GPL(restrictedmem_unbind);

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
                     ` (3 preceding siblings ...)
  2022-12-28  8:28   ` Chenyi Qiang
@ 2023-01-13 22:02   ` Sean Christopherson
  2023-01-17  3:21   ` Binbin Wu
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 22:02 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>  	select SRCU
>  	select INTERVAL_TREE
>  	select HAVE_KVM_PM_NOTIFIER if PM
> +	select HAVE_KVM_MEMORY_ATTRIBUTES

I would prefer to call this KVM_GENERIC_MEMORY_ATTRIBUTES.  Similar to
KVM_GENERIC_HARDWARE_ENABLING, ARM does need/have hardware enabling, it just
doesn't want KVM's generic implementation.  In this case, pKVM does support memory
attributes, but uses stage-2 tables to track ownership and doesn't need/want the
overhead of the generic implementation.

>  	help

...

> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)

I think we should carve out bits 0-2 for RWX, but I don't think we should actually
define them until they're actually accepted by KVM.
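
I.e. (sketch) keep only the PRIVATE attribute in the uapi header for now,
with bits 0-2 documented as reserved for future RWX attributes:

  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)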

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)

Nit, no need for "supported_attrs", just consume kvm_supported_mem_attributes()
directly.
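
I.e. (sketch):

	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
		return -EINVAL;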

> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);

Peeking forward multiple patches, this needs to take kvm->slots_lock, not kvm->lock.
There's a bug in the lpage_disallowed patch that I believe can most easily be
solved by making this mutually exclusive with memslot changes.

When a memslot is created, KVM needs to walk through the attributes to detect
whether or not the attributes are identical for the entire slot.  To avoid races,
that means taking slots_lock.

The alternative would be to query the attributes when adjusting the hugepage level
and avoid lpage_disallowed entirely, but in the (very brief) time I've thought
about this I haven't come up with a way to do that in a performant manner.

> +	for (i = start; i < end; i++)

Curly braces needed on the for-loop.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-10  9:14         ` Chao Peng
  2023-01-10 22:51           ` Vishal Annapurve
@ 2023-01-13 22:37           ` Sean Christopherson
  2023-01-17 12:42             ` Chao Peng
  2023-01-20 23:42           ` Jarkko Sakkinen
  2 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 22:37 UTC (permalink / raw)
  To: Chao Peng
  Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 10, 2023, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > > 
> > > > Feels a bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > > 
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > > 
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar.  The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> > 
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > itself.
> 
> How is the size validation being bogus? I don't quite follow.

The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
in this case) in the ioctl() number, and that information is visible to userspace
via _IOCTL_SIZE().  Attempting to take a larger size can mess up sanity checks,
e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
passed an "extended" struct.

	#define kvm_do_ioctl(fd, cmd, arg)						\
	({										\
		kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));	\
		ioctl(fd, cmd, arg);							\
	})
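
Standalone illustration (just a userspace sketch, not part of any patch) of
what the assert is checking: the expected payload size is baked into the
ioctl number, so an "extended" struct can never match the old number.

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	/* Both values come straight from the uapi header. */
	printf("_IOC_SIZE(KVM_SET_USER_MEMORY_REGION) = %u\n",
	       (unsigned int)_IOC_SIZE(KVM_SET_USER_MEMORY_REGION));
	printf("sizeof(struct kvm_userspace_memory_region) = %zu\n",
	       sizeof(struct kvm_userspace_memory_region));
	return 0;
}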

> Then we will use kvm_userspace_memory_region2 as the KVM internal alias,
> right?

Yep.

> I see similar examples use different functions to handle different versions
> but it does look easier if we use alias for this function.
> 
> > 
> > It feels absolutely ridiculous, but I think the best option is to do:
> > 
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> > 					 struct kvm_userspace_memory_region2)
> 
> Just curious, is 0x49 a safe number we can use?

Yes?  So long as it's not used by KVM, it's safe.  AFAICT, it's unused.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
  2022-12-02  6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
                     ` (2 preceding siblings ...)
  2022-12-13 23:51   ` Huang, Kai
@ 2023-01-13 22:50   ` Sean Christopherson
  3 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 22:50 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -785,11 +786,12 @@ struct kvm {
>  
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  	struct mmu_notifier mmu_notifier;
> +#endif
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
>  	gfn_t mmu_invalidate_range_start;
>  	gfn_t mmu_invalidate_range_end;
> -#endif

Blech.  The existing code is a bit ugly, and trying to extend for this use case
makes things even worse.

Rather than use the base MMU_NOTIFIER Kconfig and an arbitrary define, I think we
should first add a proper Kconfig, e.g. KVM_GENERIC_MMU_NOTIFIER, to replace the
combination.  E.g

	config KVM_GENERIC_MMU_NOTIFIER
	       select MMU_NOTIFIER
	       bool

and then all architectures that currently #define KVM_ARCH_WANT_MMU_NOTIFIER can
simply select the Kconfig, which is everything except s390.  "GENERIC" again because
s390 does select MMU_NOTIFIER and actually registers its own notifier for s390's
version of protected VMs (at least, I think that's what its "pv" stands for).

And then later down the line in this series, when the attributes and private mem
needs to tie into the notifiers, we can do:


	config KVM_GENERIC_MEMORY_ATTRIBUTES
	       select KVM_GENERIC_MMU_NOTIFIER
	       bool

I.e. that way this patch doesn't need to partially expose KVM's notifier stuff
and can instead just keep the soon-to-be-existing KVM_GENERIC_MMU_NOTIFIER.

Taking a dependency on KVM_GENERIC_MMU_NOTIFIER for KVM_GENERIC_MEMORY_ATTRIBUTES
makes sense, because AFAICT, changing any type of attribute, e.g. RWX bits, is
going to necessitate unmapping the affected gfn range.

>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
>  	struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);

The reference to private memory belongs in a later patch.  More below.

> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_gfn_range gfn_range;
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	struct kvm_memslot_iter iter;
> +	int i;
> +	int r = 0;

The return from kvm_unmap_gfn_range() is a bool, this should be:

	bool flush = false;

> +
> +	gfn_range.pte = __pte(0);
> +	gfn_range.may_block = true;
> +
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		slots = __kvm_memslots(kvm, i);
> +
> +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> +			slot = iter.slot;
> +			gfn_range.start = max(start, slot->base_gfn);
> +			gfn_range.end = min(end, slot->base_gfn + slot->npages);
> +			if (gfn_range.start >= gfn_range.end)
> +				continue;
> +			gfn_range.slot = slot;
> +
> +			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +		}
> +	}
> +
> +	if (r)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> +
>  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  					   struct kvm_memory_attributes *attrs)
>  {
>  	gfn_t start, end;
>  	unsigned long i;
>  	void *entry;
> +	int idx;
>  	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>  
> -	/* flags is currently not used. */
> +	/* 'flags' is currently not used. */

Kind of a spurious change.

>  	if (attrs->flags)
>  		return -EINVAL;
>  	if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  
>  	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>  
> +	if (kvm_arch_has_private_mem(kvm)) {

I think we should assume that any future attributes will necessitate unmapping
and invalidation, i.e. drop the private mem check.  That allows introducing
kvm_arch_has_private_mem() in a later patch that is more directly related to
private memory.

> +		KVM_MMU_LOCK(kvm);
> +		kvm_mmu_invalidate_begin(kvm);
> +		kvm_mmu_invalidate_range_add(kvm, start, end);
> +		KVM_MMU_UNLOCK(kvm);
> +	}
> +
>  	mutex_lock(&kvm->lock);
>  	for (i = start; i < end; i++)
>  		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  			break;
>  	mutex_unlock(&kvm->lock);
>  
> +	if (kvm_arch_has_private_mem(kvm)) {
> +		idx = srcu_read_lock(&kvm->srcu);

Mostly for reference, this goes away if slots_lock is used instead of kvm->lock.

> +		KVM_MMU_LOCK(kvm);
> +		if (i > start)
> +			kvm_unmap_mem_range(kvm, start, i);
> +		kvm_mmu_invalidate_end(kvm);
> +		KVM_MMU_UNLOCK(kvm);
> +		srcu_read_unlock(&kvm->srcu, idx);
> +	}
> +
>  	attrs->address = i << PAGE_SHIFT;
>  	attrs->size = (end - i) << PAGE_SHIFT;
>  
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2022-12-02  6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
  2022-12-05 22:49   ` Isaku Yamahata
@ 2023-01-13 23:12   ` Sean Christopherson
  2023-01-13 23:16   ` Sean Christopherson
  2 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:12 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 283cbb83d6ae..7772ab37ac89 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -38,6 +38,7 @@
>  #include <asm/hyperv-tlfs.h>
>  
>  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES

No need for this, I think we should just make it mandatory to implement the
arch hook when CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES=y.  If another arch gains
support for mem attributes and doesn't need the hook, then we can simply add a
weak helper (or maybe add a #define then if we feel that's the way to go).
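
E.g. the weak-helper flavor would just be (sketch):

void __weak kvm_arch_set_memory_attributes(struct kvm *kvm,
					   struct kvm_memory_slot *slot,
					   unsigned long attrs,
					   gfn_t start, gfn_t end)
{
}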

>  #define KVM_MAX_VCPUS 1024
>  
> @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
>  #endif
>  };
>  
> +/*
> + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> + * level. The remaining bits are used as a reference count.
> + */
> +#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)

Similar to the need to unmap, I think we should just say "mixed" and ignore the
private vs. shared, i.e. make this a flag for all memory attributes.

> +#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)

"MAX" is technically correct, but it's more of a mask.  I think we can make it a
moot point though.  There's no need to mask the count, we just want to assert that
adjusting the counting doesn't change the flag.

I would also say throw these defines into mmu.c, at least pending the bug fix
for kvm_alloc_memslot_metadata() (more on that below).

>  struct kvm_lpage_info {
>  	int disallow_lpage;
>  };
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2c70b5afa3e..2190fd8c95c0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
>  {
>  	struct kvm_lpage_info *linfo;
>  	int i;
> +	int disallow_count;
>  
>  	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
>  		linfo = lpage_info_slot(gfn, slot, i);
> +
> +		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> +		WARN_ON(disallow_count + count < 0 ||
> +			disallow_count > KVM_LPAGE_COUNT_MAX - count);
> +
>  		linfo->disallow_lpage += count;
> -		WARN_ON(linfo->disallow_lpage < 0);

It's been a long week so don't trust my math, but I believe this can simply be:

		old = linfo->disallow_lpage;
		linfo->disallow_lpage += count;

		WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
>  	}
>  }
>  
> @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  	if (kvm->arch.nx_huge_page_recovery_thread)
>  		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
> +
> +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> +{
> +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> +			    int level, bool mixed)
> +{
> +	struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> +
> +	if (mixed)
> +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +	else
> +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> +{
> +	bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +
> +	if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> +		if (!expect_private)
> +			return false;
> +	} else if (expect_private)
> +		return false;

This is messy.  If we drop the private vs. shared specifity, this can go away if
we add a helper to get attributes

	static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
	{
		return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
	}

and then we can do


		if (KVM_BUG_ON(gfn != xas.xa_index, kvm) ||
		    attrs != kvm_get_memory_attributes(kvm, gfn)) {
			mixed = true;
			break;
		}

and

		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
		    attrs != kvm_get_memory_attributes(kvm, gfn))
			return true;


> +
> +	return true;
> +}
> +
> +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> +			       gfn_t start, gfn_t end)
> +{
> +	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	gfn_t gfn = start;
> +	void *entry;
> +	bool mixed = false;
> +
> +	rcu_read_lock();
> +	entry = xas_load(&xas);
> +	while (gfn < end) {
> +		if (xas_retry(&xas, entry))
> +			continue;
> +
> +		KVM_BUG_ON(gfn != xas.xa_index, kvm);

As above, I think it's worth bailing immediately if there's a mismatch.

> +
> +		if (!is_expected_attr_entry(entry, attrs)) {
> +			mixed = true;
> +			break;
> +		}
> +
> +		entry = xas_next(&xas);
> +		gfn++;
> +	}
> +
> +	rcu_read_unlock();
> +	return mixed;
> +}
> +
> +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,

s/mem_attrs_mixed/has_mixed_attrs to make it clear this is querying, not setting.
And has_mixed_attrs_2m() above.

> +			    int level, unsigned long attrs,
> +			    gfn_t start, gfn_t end)
> +{
> +	unsigned long gfn;
> +
> +	if (level == PG_LEVEL_2M)
> +		return mem_attrs_mixed_2m(kvm, attrs, start, end);
> +
> +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))

Curly braces needed on the for-loop.

> +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> +		    !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> +					    attrs))
> +			return true;
> +	return false;
> +}
> +
> +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> +						  struct kvm_memory_slot *slot,
> +						  unsigned long attrs,
> +						  gfn_t start, gfn_t end)
> +{
> +	unsigned long pages, mask;
> +	gfn_t gfn, gfn_end, first, last;
> +	int level;
> +	bool mixed;
> +
> +	/*
> +	 * The sequence matters here: we set the higher level basing on the
> +	 * lower level's scanning result.
> +	 */
> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +		pages = KVM_PAGES_PER_HPAGE(level);
> +		mask = ~(pages - 1);
> +		first = start & mask;
> +		last = (end - 1) & mask;
> +
> +		/*
> +		 * We only need to scan the head and tail page, for middle pages
> +		 * we know they will not be mixed.
> +		 */
> +		gfn = max(first, slot->base_gfn);
> +		gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> +		linfo_set_mixed(gfn, slot, level, mixed);
> +
> +		if (first == last)
> +			return;
> +
> +		for (gfn = first + pages; gfn < last; gfn += pages)
> +			linfo_set_mixed(gfn, slot, level, false);
> +
> +		gfn = last;
> +		gfn_end = min(last + pages, slot->base_gfn + slot->npages);
> +		mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> +		linfo_set_mixed(gfn, slot, level, mixed);
> +	}
> +}
> +
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> +				    struct kvm_memory_slot *slot,
> +				    unsigned long attrs,
> +				    gfn_t start, gfn_t end)
> +{
> +	if (kvm_slot_can_be_private(slot))

Make this an early return optimization, with a comment explaining that KVM x86
doesn't yet support other attributes.

	/*
	 * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
	 * the slot if the slot will never consume the PRIVATE attribute.
	 */
	if (!kvm_slot_can_be_private(slot))
		return;


> +		kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
> +						      start, end);
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a07380f8d3c..5aefcff614d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
>  			linfo[lpages - 1].disallow_lpage = 1;
>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> +		if (kvm_slot_can_be_private(slot))
> +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;

I would rather reject the memslot if the gfn has lesser alignment than the offset.
I'm totally ok with this approach _if_ there's a use case.  Until such a use case
presents itself, I would rather be conservative from a uAPI perspective.

>  		/*
>  		 * If the gfn and userspace address are not aligned wrt each
>  		 * other, disable large page support for this slot.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3331c0c92838..25099c94e770 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -592,6 +592,11 @@ struct kvm_memory_slot {
>  	struct restrictedmem_notifier notifier;
>  };
>  
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> +	return slot && (slot->flags & KVM_MEM_PRIVATE);

KVM_MEM_PRIVATE should really be defined only when private memory is exposed to
userspace.  For this patch, even though it means we have untestable code, I think
it makes sense to "return false".
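
I.e. for this patch (sketch):

static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
{
	/* KVM_MEM_PRIVATE is not exposed to userspace yet. */
	return false;
}
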

> +}
> +
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
>  {
>  	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> +				    struct kvm_memory_slot *slot,
> +				    unsigned long attrs,
> +				    gfn_t start, gfn_t end);
> +#else
> +static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> +						  struct kvm_memory_slot *slot,
> +						  unsigned long attrs,
> +						  gfn_t start, gfn_t end)
> +{
> +}
> +#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */

As above, no stub is necessary.

>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4e1e1e113bf0..e107afea32f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>  	return 0;
>  }
>  
> -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)

Feedback for an earlier patch (to avoid churn): this should be kvm_mem_attrs_changed()
or so now that this does more than just unmap.

> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> +				unsigned long attrs)

Weird nit.  I think we should keep the prototypes for kvm_mem_attrs_changed()
and kvm_arch_set_memory_attributes() somewhat similar, i.e. squeeze in @attrs
before @start.
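
I.e. (sketch):

static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
				  gfn_t start, gfn_t end);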

>  {
>  	struct kvm_gfn_range gfn_range;
>  	struct kvm_memory_slot *slot;
> @@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
>  			gfn_range.slot = slot;
>  
>  			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +
> +			kvm_arch_set_memory_attributes(kvm, slot, attrs,
> +						       gfn_range.start,
> +						       gfn_range.end);
>  		}
>  	}
>  
> @@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  		idx = srcu_read_lock(&kvm->srcu);
>  		KVM_MMU_LOCK(kvm);
>  		if (i > start)
> -			kvm_unmap_mem_range(kvm, start, i);
> +			kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
>  		kvm_mmu_invalidate_end(kvm);
>  		KVM_MMU_UNLOCK(kvm);
>  		srcu_read_unlock(&kvm->srcu, idx);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-12-02  6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
  2022-12-06 15:47   ` Fuad Tabba
@ 2023-01-13 23:13   ` Sean Christopherson
  1 sibling, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:13 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 99352170c130..d9edb14ce30b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 0)

Unless there's a reason not to, we should use bit 3 to match the attributes.
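
I.e. (sketch), so the exit flag lines up with KVM_MEMORY_ATTRIBUTE_PRIVATE:

  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)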

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2022-12-02  6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
  2022-12-05 22:49   ` Isaku Yamahata
  2023-01-13 23:12   ` Sean Christopherson
@ 2023-01-13 23:16   ` Sean Christopherson
  2023-01-28 13:54     ` Chao Peng
  2 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:16 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a07380f8d3c..5aefcff614d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
>  			linfo[lpages - 1].disallow_lpage = 1;
>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> +		if (kvm_slot_can_be_private(slot))
> +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
>  		/*
>  		 * If the gfn and userspace address are not aligned wrt each
>  		 * other, disable large page support for this slot.

Forgot to talk about the bug.  This code needs to handle the scenario where a
memslot is created with existing, non-uniform attributes.  It might be a bit ugly
(I didn't even try to write the code), but it's definitely possible, and since
memslot updates are already slow I think it's best to handle things here.

In the meantime, I added this so we don't forget to fix it before merging.

#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
	pr_crit_once("FIXME: Walk the memory attributes of the slot and set the mixed status appropriately");
#endif
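
For reference, a very rough sketch of what that walk could look like at
memslot creation time (untested; attrs_mixed()/linfo_set_mixed() are assumed
helper names, not existing KVM code):

static void kvm_memslot_init_mixed_attrs(struct kvm *kvm,
					 struct kvm_memory_slot *slot)
{
	int level;

	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
		gfn_t gfn;

		for (gfn = ALIGN_DOWN(slot->base_gfn, nr_pages);
		     gfn < slot->base_gfn + slot->npages; gfn += nr_pages) {
			gfn_t start = max_t(gfn_t, gfn, slot->base_gfn);
			gfn_t end = min_t(gfn_t, gfn + nr_pages,
					  slot->base_gfn + slot->npages);

			/* Mark the range so KVM won't create a large page. */
			if (attrs_mixed(kvm, slot, level, start, end))
				linfo_set_mixed(gfn, slot, level, true);
		}
	}
}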


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
  2022-12-02  6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
  2022-12-08  2:29   ` Yuan Yao
  2022-12-09  9:01   ` Fuad Tabba
@ 2023-01-13 23:29   ` Sean Christopherson
  2 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:29 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  			return -EIO;
>  	}
>  
> +	if (r == RET_PF_USER)
> +		return 0;
> +
>  	if (r < 0)
>  		return r;
>  	if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 */
>  		if (sp->role.direct &&
>  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> -							       PG_LEVEL_NUM)) {
> +							       PG_LEVEL_NUM,
> +							       false)) {

Passing %false is incorrect.  It might not cause problems because KVM currently
doesn't allow modifying private memslots (that likely needs to change to allow
dirty logging), but it's wrong since nothing guarantees KVM is operating on SPTEs
for shared memory.

One option would be to take the patches from the TDX series that add a "private"
flag to the shadow page role, but I'd rather not add the role until it's truly
necessary.

For now, I think we can do this without impacting performance of guests that don't
support private memory.

int kvm_mmu_max_mapping_level(struct kvm *kvm,
			      const struct kvm_memory_slot *slot, gfn_t gfn,
			      int max_level)
{
	bool is_private = kvm_slot_can_be_private(slot) &&
			  kvm_mem_is_private(kvm, gfn);

	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
}
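
With a wrapper like that, the kvm_mmu_zap_collapsible_spte() hunk above can
presumably stay on the three-argument form, e.g. (sketch only):

		if (sp->role.direct &&
		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
							       PG_LEVEL_NUM)) {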

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
>  }
>  #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>  
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{

This code, i.e. the generic KVM changes, belongs in a separate patch.  It'll be
small, but I want to separate x86's page fault changes from the restrictedmem
support being added to common KVM.

This should also short-circuit based on CONFIG_HAVE_KVM_RESTRICTED_MEM, though
I would name that CONFIG_KVM_PRIVATE_MEMORY since in KVM's world, it's all about
private vs. shared at this time.
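
E.g. something along these lines (sketch only; CONFIG_KVM_PRIVATE_MEMORY is
the name suggested above, not an existing Kconfig symbol):

#ifdef CONFIG_KVM_PRIVATE_MEMORY
static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
{
	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
	       KVM_MEMORY_ATTRIBUTE_PRIVATE;
}
#else
static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
{
	return false;
}
#endif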

> +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> +	       KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> +	int ret;
> +	struct page *page;
> +	pgoff_t index = gfn - slot->base_gfn +
> +			(slot->restricted_offset >> PAGE_SHIFT);
> +
> +	ret = restrictedmem_get_page(slot->restricted_file, index,
> +				     &page, order);

This needs to handle errors.  If "ret" is non-zero, "page" is garbage.
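
E.g. (untested):

	ret = restrictedmem_get_page(slot->restricted_file, index,
				     &page, order);
	if (ret)
		return ret;

	*pfn = page_to_pfn(page);
	return 0;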

> +	*pfn = page_to_pfn(page);
> +	return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #endif
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-12-02  6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
  2022-12-09  9:11   ` Fuad Tabba
  2023-01-05 20:38   ` Vishal Annapurve
@ 2023-01-14  0:01   ` Sean Christopherson
  2023-01-17 13:12     ` Chao Peng
  2023-01-28 14:00     ` Chao Peng
  2023-03-07 19:14   ` Ackerley Tng
  3 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-14  0:01 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
>  			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> +			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;

Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
guest have a paravirt interface for handling memory errors without killing the
host.

> +			r = 0;
> +			goto out;
> +		}
>  	}


> @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>  			mem->memory_size))
>  		return -EINVAL;
> +	if (mem->flags & KVM_MEM_PRIVATE &&
> +		(mem->restricted_offset & (PAGE_SIZE - 1) ||

Align indentation.

> +		 mem->restricted_offset > U64_MAX - mem->memory_size))

Strongly prefer to use similar logic to existing code that detects wraps:

		mem->restricted_offset + mem->memory_size < mem->restricted_offset

This is also where I'd like to add the "gfn is aligned to offset" check, though
my brain is too fried to figure that out right now.
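
Purely as a strawman for that check (untested, 2MiB used for illustration):

	if (mem->flags & KVM_MEM_PRIVATE &&
	    (mem->guest_phys_addr ^ mem->restricted_offset) & (SZ_2M - 1))
		return -EINVAL;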

> +		return -EINVAL;
>  	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
>  		return -EINVAL;
>  	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>  			return -EINVAL;
>  	} else { /* Modify an existing slot. */
> +		/* Private memslots are immutable, they can only be deleted. */

I'm 99% certain I suggested this, but if we're going to make these memslots
immutable, then we should straight up disallow dirty logging, otherwise we'll
end up with a bizarre uAPI.
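
E.g. something like (sketch only):

	if ((mem->flags & KVM_MEM_PRIVATE) &&
	    (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
		return -EINVAL;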

> +		if (mem->flags & KVM_MEM_PRIVATE)
> +			return -EINVAL;
>  		if ((mem->userspace_addr != old->userspace_addr) ||
>  		    (npages != old->npages) ||
>  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	new->npages = npages;
>  	new->flags = mem->flags;
>  	new->userspace_addr = mem->userspace_addr;
> +	if (mem->flags & KVM_MEM_PRIVATE) {
> +		new->restricted_file = fget(mem->restricted_fd);
> +		if (!new->restricted_file ||
> +		    !file_is_restrictedmem(new->restricted_file)) {
> +			r = -EINVAL;
> +			goto out;
> +		}
> +		new->restricted_offset = mem->restricted_offset;
> +	}
> +
> +	new->kvm = kvm;

Set this above, just so that the code flows better.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (8 preceding siblings ...)
  2022-12-02  6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2023-01-14  0:37 ` Sean Christopherson
  2023-01-16 13:48   ` Kirill A. Shutemov
                     ` (4 more replies)
  2023-02-16  5:13 ` Mike Rapoport
  2023-04-17 15:40 ` Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM) Sean Christopherson
  11 siblings, 5 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-14  0:37 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Dec 02, 2022, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
> 
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
>   - 01: mm change, target for mm tree
>   - 02-09: KVM change, target for KVM tree

A version with all of my feedback, plus reworked versions of Vishal's selftest,
is available here:

  git@github.com:sean-jc/linux.git x86/upm_base_support

It compiles and passes the selftest, but it's otherwise barely tested.  There are
a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
a WIP.

As for next steps, can you (handwaving all of the TDX folks) take a look at what
I pushed and see if there's anything horrifically broken, and that it still works
for TDX?

Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
(and I mean that).

On my side, the two things on my mind are (a) tests and (b) downstream dependencies
(SEV and TDX).  For tests, I want to build a list of tests that are required for
merging so that the criteria for merging are clear, and so that if the list is large
(haven't thought much yet), the work of writing and running tests can be distributed.

Regarding downstream dependencies, before this lands, I want to pull in all the
TDX and SNP series and see how everything fits together.  Specifically, I want to
make sure that we don't end up with a uAPI that necessitates ugly code, and that we
don't miss an opportunity to make things simpler.  The patches in the SNP series to
add "legacy" SEV support for UPM in particular made me slightly rethink some minor
details.  Nothing remotely major, but something that needs attention since it'll
be uAPI.

I'm off Monday, so it'll be at least Tuesday before I make any more progress on
my side.

Thanks!

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
@ 2023-01-16 13:48   ` Kirill A. Shutemov
  2023-01-17 13:19   ` Chao Peng
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 398+ messages in thread
From: Kirill A. Shutemov @ 2023-01-16 13:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> > 
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-09: KVM change, target for KVM tree
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

Minor build fix:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eb5336ccc65..4a9e9fa2552a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7211,8 +7211,8 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
 	int level;
 	bool mixed;
 
-	lockdep_assert_held_write(kvm->mmu_lock);
-	lockdep_assert_held(kvm->slots_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	lockdep_assert_held(&kvm->slots_lock);
 
 	/*
 	 * KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 467916943c73..4ef60ba7eb1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2304,7 +2304,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-	lockdep_assert_held(kvm->mmu_lock);
+	lockdep_assert_held(&kvm->mmu_lock);
 
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
                     ` (4 preceding siblings ...)
  2023-01-13 22:02   ` Sean Christopherson
@ 2023-01-17  3:21   ` Binbin Wu
  2023-01-17 13:30     ` Chao Peng
  2023-02-09  7:25   ` Isaku Yamahata
  2023-05-19 17:32   ` Nicolas Saenz Julienne
  7 siblings, 1 reply; 398+ messages in thread
From: Binbin Wu @ 2023-01-17  3:21 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang


On 12/2/2022 2:13 PM, Chao Peng wrote:
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>    - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>      a guest memory range.
>    - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>      memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>   Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>   arch/x86/kvm/Kconfig           |  1 +
>   include/linux/kvm_host.h       |  3 ++
>   include/uapi/linux/kvm.h       | 17 ++++++++

Should the changes introduced in this file also be added to
tools/include/uapi/linux/kvm.h?



>   virt/kvm/Kconfig               |  3 ++
>   virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>   6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>   The "pad" and "reserved" fields may be used for future extensions and should be
>   set to 0s by userspace.
>   
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
>   5. The kvm_run structure
>   ========================
>   
> @@ -8270,6 +8323,16 @@ structure.
>   When getting the Modified Change Topology Report value, the attr->addr
>   must point to a byte where the value will be stored or retrieved from.
>   
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
>   9. Known KVM API problems
>   =========================
>   
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>   	select SRCU
>   	select INTERVAL_TREE
>   	select HAVE_KVM_PM_NOTIFIER if PM
> +	select HAVE_KVM_MEMORY_ATTRIBUTES
>   	help
>   	  Support hosting fully virtualized guest machines using hardware
>   	  virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>   
>   #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>   	struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	struct xarray mem_attr_array;
>   #endif
>   	char stats_id[KVM_STATS_NAME_SIZE];
>   };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_S390_CPU_TOPOLOGY 222
>   #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>   #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
>   /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>   #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>   
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
>   #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
>   config HAVE_KVM_DIRTY_RING
>          bool
>   
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>   # Only strongly ordered architectures can select this, as it doesn't
>   # put any explicit constraint on userspace ordering. They can also
>   # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>   	spin_lock_init(&kvm->mn_invalidate_lock);
>   	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>   	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_init(&kvm->mem_attr_array);
> +#endif
>   
>   	INIT_LIST_HEAD(&kvm->gpc_list);
>   	spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>   		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>   		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>   	}
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_destroy(&kvm->mem_attr_array);
> +#endif
>   	cleanup_srcu_struct(&kvm->irq_srcu);
>   	cleanup_srcu_struct(&kvm->srcu);
>   	kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>   }
>   #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>   
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +	return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>   struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>   {
>   	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   #ifdef CONFIG_HAVE_KVM_MSI
>   	case KVM_CAP_SIGNAL_MSI:
>   #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
>   #ifdef CONFIG_HAVE_KVM_IRQFD
>   	case KVM_CAP_IRQFD:
>   	case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
>   		break;
>   	}
>   #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> +		u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +			goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_SET_MEMORY_ATTRIBUTES: {
> +		struct kvm_memory_attributes attrs;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> +		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> +			r = -EFAULT;
> +		break;
> +	}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
>   	case KVM_CREATE_DEVICE: {
>   		struct kvm_create_device cd;
>   

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-13 21:54   ` Sean Christopherson
@ 2023-01-17 12:41     ` Chao Peng
  2023-01-17 16:34       ` Sean Christopherson
  2023-02-22  2:07     ` Alexey Kardashevskiy
  1 sibling, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-01-17 12:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > The system call is currently wired up for x86 arch.
> 
> Building on other architectures (except for arm64 for some reason) yields:
> 
>   CALL    /.../scripts/checksyscalls.sh
>   <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp]
> 
> Do we care?  It's the only such warning, which makes me think we either need to
> wire this up for all architectures, or explicitly document that it's unsupported.

I'm a bit conservative and prefer enabling it only on x86, where we know the
exact use case. As for the warning, we can get rid of it by changing
scripts/checksyscalls.sh, just like __IGNORE_memfd_secret:

https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org
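
i.e. add something like the below to the script's ignore list (illustrative;
the exact placement is assumed):

  #define __IGNORE_memfd_restricted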

> 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> ...
> 
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..c2700c5daa43
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> 
> Missing
> 
>  #define _LINUX_RESTRICTEDMEM_H
> 
> which causes fireworks if restrictedmem.h is included more than once.
> 
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> 
> ...
> 
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +					 struct page **pagep, int *order)
> > +{
> > +	return -1;
> 
> This should be a proper -errno, though in the current incarnation of things it's
> a moot point because no stub is needed.  KVM can (and should) easily provide its
> own stub for this one.
> 
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline void restrictedmem_error_page(struct page *page,
> > +					    struct address_space *mapping)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> 
> ...
> 
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..56953c204e5c
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,318 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> 
> Any objection to simply calling this "restrictedmem"?  And then using either "rm"
> or "rmem" for local variable names?  I kept reading "data" as the underyling data
> being written to the page, as opposed to the metadata describing the restrictedmem
> instance.
> 
> > +	struct mutex lock;
> > +	struct file *memfd;
> > +	struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> > +					   pgoff_t start, pgoff_t end)
> > +{
> > +	struct restrictedmem_notifier *notifier;
> > +
> > +	mutex_lock(&data->lock);
> 
> This can be a r/w semaphore instead of a mutex, that way punching holes at multiple
> points in the file can at least run the notifiers in parallel.  The actual allocation
> by shmem will still be serialized, but I think it's worth the simple optimization
> since zapping and flushing in KVM may be somewhat slow.
> 
> > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > +		notifier->ops->invalidate_start(notifier, start, end);
> 
> Two major design issues that we overlooked long ago:
> 
>   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
>      VM with a large number of convertible memslots that are all backed by a
>      single large restrictedmem instance, then converting a single page will
>      result in a linear walk through all memslots.  I don't expect anyone to
>      actually do something silly like that, but I also never expected there to be
>      a legitimate usecase for thousands of memslots.
> 
>   2. This approach fails to provide the ability for KVM to ensure a guest has
>      exclusive access to a page.  As discussed in the past, the kernel can rely
>      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
>      only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
>      e.g. SEV, there is value in ensuring a 1:1 association.
> 
>      And probably more importantly, relying on hardware for SNP and TDX yields a
>      poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
>      page is exclusive to a guest, i.e. if userspace can hand out the same page
>      from a restrictedmem instance to multiple VMs, then failure will occur only
>      when KVM tries to assign the page to the second VM.  That will happen deep
>      in KVM, which means KVM needs to gracefully handle such errors, and it means
>      that KVM's ABI effectively allows plumbing garbage into its memslots.

It may not be a valid usage, but in my TDX environment I do hit the
issue below.

kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22

Slot#2 ('SMRAM') is actually an alias into system memory (Slot#0) in QEMU,
and slot#2 fails due to the exclusive check below.

Currently I have changed the QEMU code to mark these alias slots as shared
instead of private, but I'm not 100% confident this is the correct fix.

> 
> Rather than use a simple list of notifiers, this appears to be yet another
> opportunity to use an xarray.  Supporting sharing of restrictedmem will be
> non-trivial, but IMO we should punt that to the future since it's still unclear
> exactly how sharing will work.
> 
> An xarray will solve #1 by notifying only the consumers (memslots) that are bound
> to the affected range.
> 
> And for #2, it's relatively straightforward (knock wood) to detect existing
> entries, i.e. if the user wants exclusive access to memory, then the bind operation
> can be rejected if there's an existing entry.
> 
> VERY lightly tested code snippet at the bottom (will provide link to fully worked
> code in cover letter).
> 
> 
> > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> > +				     loff_t offset, loff_t len)
> > +{
> > +	int ret;
> > +	pgoff_t start, end;
> > +	struct file *memfd = data->memfd;
> > +
> > +	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +		return -EINVAL;
> > +
> > +	start = offset >> PAGE_SHIFT;
> > +	end = (offset + len) >> PAGE_SHIFT;
> > +
> > +	restrictedmem_invalidate_start(data, start, end);
> > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +	restrictedmem_invalidate_end(data, start, end);
> 
> The lock needs to be held for the entire duration of the hole punch, i.e. needs to
> be taken before invalidate_start() and released after invalidate_end().  If a user
> (un)binds/(un)registers after invalidate_state(), it will see an unpaired notification,
> e.g. could leave KVM with incorrect notifier counts.
> 
> > +
> > +	return ret;
> > +}
> 
> What I ended up with for an xarray-based implementation.  I'm very flexible on
> names and whatnot, these are just what made sense to me.
> 
> static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> 				     loff_t offset, loff_t len)
> {
> 	struct restrictedmem_notifier *notifier;
> 	struct file *memfd = rm->memfd;
> 	unsigned long index;
> 	pgoff_t start, end;
> 	int ret;
> 
> 	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> 		return -EINVAL;
> 
> 	start = offset >> PAGE_SHIFT;
> 	end = (offset + len) >> PAGE_SHIFT;
> 
> 	/*
> 	 * Bindings must stable across invalidation to ensure the start+end
> 	 * are balanced.
> 	 */
> 	down_read(&rm->lock);
> 
> 	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> 		notifier->ops->invalidate_start(notifier, start, end);
> 
> 	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> 
> 	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> 		notifier->ops->invalidate_end(notifier, start, end);
> 
> 	up_read(&rm->lock);
> 
> 	return ret;
> }
> 
> int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> 		       struct restrictedmem_notifier *notifier, bool exclusive)
> {
> 	struct restrictedmem *rm = file->f_mapping->private_data;
> 	int ret = -EINVAL;
> 
> 	down_write(&rm->lock);
> 
> 	/* Non-exclusive mappings are not yet implemented. */
> 	if (!exclusive)
> 		goto out_unlock;
> 
> 	if (!xa_empty(&rm->bindings)) {
> 		if (exclusive != rm->exclusive)
> 			goto out_unlock;
> 
> 		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> 			goto out_unlock;
> 	}
> 
> 	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> 	rm->exclusive = exclusive;
> 	ret = 0;
> out_unlock:
> 	up_write(&rm->lock);
> 	return ret;
> }
> EXPORT_SYMBOL_GPL(restrictedmem_bind);
> 
> void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> 			  struct restrictedmem_notifier *notifier)
> {
> 	struct restrictedmem *rm = file->f_mapping->private_data;
> 
> 	down_write(&rm->lock);
> 	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> 	synchronize_rcu();
> 	up_write(&rm->lock);
> }
> EXPORT_SYMBOL_GPL(restrictedmem_unbind);

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-13 22:37           ` Sean Christopherson
@ 2023-01-17 12:42             ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-01-17 12:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Jan 13, 2023 at 10:37:39PM +0000, Sean Christopherson wrote:
> On Tue, Jan 10, 2023, Chao Peng wrote:
> > On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > > To make future maintenance easy, internally use a binary compatible
> > > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > > '_ext' variants.
> > > > > 
> > > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > > an extension.
> > > > > 
> > > > > Why not just add a new ioctl? The commit message does not address
> > > > > the most essential design here.
> > > > 
> > > > Yes, people can always choose to add a new ioctl for this kind of change
> > > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > > the functionalities are similar.  The '_ext' variant reuses all the
> > > > existing fields in the 'normal' variant and most importantly KVM
> > > > internally can reuse most of the code. I certainly can add some words in
> > > > the commit message to explain this design choice.
> > > 
> > > After seeing the userspace side of this, I agree with Jarkko; overloading
> > > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> > > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > > itself.
> > 
> > How is the size validation being bogus? I don't quite follow.
> 
> The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
> in this case) in the ioctl() number, and that information is visible to userspace
> via _IOC_SIZE().  Attempting to pass a larger size can mess up sanity checks,
> e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
> passed an "extended" struct.
> 
> 	#define kvm_do_ioctl(fd, cmd, arg)						\
> 	({										\
> 		kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));	\
> 		ioctl(fd, cmd, arg);							\
> 	})

Got it. Thanks for the explanation.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-01-14  0:01   ` Sean Christopherson
@ 2023-01-17 13:12     ` Chao Peng
  2023-01-17 19:35       ` Sean Christopherson
  2023-01-28 14:00     ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-01-17 13:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  
> >  		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> >  			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > +
> > +		if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > +			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> 
> Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
> MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> guest have a paravirt interface for handling memory errors without killing the
> host.

I agree shutdown is not the correct choice. I see you made the change below:

send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)

The MCE may happen in any thread other than the KVM thread, so sending a
signal to the 'current' thread may not be the expected behavior. Also, how
can userspace tell whether the MCE is on a shared page or a private page?
Do we care?

> 
> > +			r = 0;
> > +			goto out;
> > +		}
> >  	}
> 
> 
> > @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> >  			mem->memory_size))
> >  		return -EINVAL;
> > +	if (mem->flags & KVM_MEM_PRIVATE &&
> > +		(mem->restricted_offset & (PAGE_SIZE - 1) ||
> 
> Align indentation.
> 
> > +		 mem->restricted_offset > U64_MAX - mem->memory_size))
> 
> Strongly prefer to use similar logic to existing code that detects wraps:
> 
> 		mem->restricted_offset + mem->memory_size < mem->restricted_offset
> 
> This is also where I'd like to add the "gfn is aligned to offset" check, though
> my brain is too fried to figure that out right now.
> 
> > +		return -EINVAL;
> >  	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> >  		return -EINVAL;
> >  	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> >  			return -EINVAL;
> >  	} else { /* Modify an existing slot. */
> > +		/* Private memslots are immutable, they can only be deleted. */
> 
> I'm 99% certain I suggested this, but if we're going to make these memslots
> immutable, then we should straight up disallow dirty logging, otherwise we'll
> end up with a bizarre uAPI.

But in my mind dirty logging will be needed in the near future, once
live migration gets supported?

> 
> > +		if (mem->flags & KVM_MEM_PRIVATE)
> > +			return -EINVAL;
> >  		if ((mem->userspace_addr != old->userspace_addr) ||
> >  		    (npages != old->npages) ||
> >  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	new->npages = npages;
> >  	new->flags = mem->flags;
> >  	new->userspace_addr = mem->userspace_addr;
> > +	if (mem->flags & KVM_MEM_PRIVATE) {
> > +		new->restricted_file = fget(mem->restricted_fd);
> > +		if (!new->restricted_file ||
> > +		    !file_is_restrictedmem(new->restricted_file)) {
> > +			r = -EINVAL;
> > +			goto out;
> > +		}
> > +		new->restricted_offset = mem->restricted_offset;

I see you changed the slot->restricted_offset type from loff_t to gfn_t and
used pgoff_t when doing the restrictedmem_bind/unbind(). Using a page
index is reasonable inside KVM and sounds simpler than loff_t. But we
also need to initialize it to a page index here, as well as make changes
in two other cases. This is needed when restricted_offset != 0.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 547b92215002..49e375e78f30 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
                                             gfn_t gfn, kvm_pfn_t *pfn,
                                             int *order)
 {
-       pgoff_t index = gfn - slot->base_gfn +
-                       (slot->restricted_offset >> PAGE_SHIFT);
+       pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;
        struct page *page;
        int ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 01db35ddd5b3..7439bdcb0d04 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
                                         pgoff_t start, pgoff_t end,
                                         gfn_t *gfn_start, gfn_t *gfn_end)
 {
-       unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+       unsigned long base_pgoff = slot->restricted_offset;
 
        if (start > base_pgoff)
                *gfn_start = slot->base_gfn + start - base_pgoff;
@@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
                        r = -EINVAL;
                        goto out;
                }
-               new->restricted_offset = mem->restricted_offset;
+               new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;
        }
 
        r = kvm_set_memslot(kvm, old, new, change);

Chao
> > +	}
> > +
> > +	new->kvm = kvm;
> 
> Set this above, just so that the code flows better.

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
  2023-01-16 13:48   ` Kirill A. Shutemov
@ 2023-01-17 13:19   ` Chao Peng
  2023-01-17 14:32   ` Fuad Tabba
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-01-17 13:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> > 
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-09: KVM change, target for KVM tree
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.

Thanks very much for doing this. Almost all of your comments are well
received, except for two cases that need more discussion, which I have
replied to individually.

> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?

I have integrated this into my local TDX repo, with some changes (as I
replied individually), and the new code basically still works with TDX.

I have also asked other TDX folks to take a look.

> 
> Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> (and I mean that).
> 
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
> 
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together.  Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler.  The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details.  Nothing remotely major, but something that needs attention since it'll
> be uAPI.
> 
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.

Appreciate your effort. As for the next steps, if you see something we
can do in parallel, feel free to let me know.

Thanks,
Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-01-17  3:21   ` Binbin Wu
@ 2023-01-17 13:30     ` Chao Peng
  2023-01-17 17:25       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-01-17 13:30 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
> 
> On 12/2/2022 2:13 PM, Chao Peng wrote:
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> > 
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >    - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >      a guest memory range.
> >    - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >      memory attributes.
> > 
> > KVM internally uses xarray to store the per-page memory attributes.
> > 
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> >   Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> >   arch/x86/kvm/Kconfig           |  1 +
> >   include/linux/kvm_host.h       |  3 ++
> >   include/uapi/linux/kvm.h       | 17 ++++++++
> 
> Should the changes introduced in this file also need to be added in
> tools/include/uapi/linux/kvm.h ?

Yes, I think so. But I'm hesitant about whether to include it in this patch
or not. I see many commits sync the kernel kvm.h to the tools copy. It looks
like that is done periodically and with a 'pull' model.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
  2023-01-16 13:48   ` Kirill A. Shutemov
  2023-01-17 13:19   ` Chao Peng
@ 2023-01-17 14:32   ` Fuad Tabba
  2023-01-19 11:13   ` Isaku Yamahata
  2023-01-24 16:08   ` Liam Merwick
  4 siblings, 0 replies; 398+ messages in thread
From: Fuad Tabba @ 2023-01-17 14:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang

Hi Sean,

On Sat, Jan 14, 2023 at 12:38 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
>   git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> (and I mean that).

Thanks for sharing this. I've had a look at the patches, and have
ported them to work with pKVM. At a high level, the new interface
seems fine and it works with the arm64/pKVM port. I have a couple of
comments regarding some of the details, but they can wait until v11 is
posted.

Cheers,
/fuad



> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX).  For tests, I want to build a list of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together.  Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler.  The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details.  Nothing remotely major, but something that needs attention since it'll
> be uAPI.
>
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.
>
> Thanks!

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-17 12:41     ` Chao Peng
@ 2023-01-17 16:34       ` Sean Christopherson
  2023-01-18  8:16         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-17 16:34 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 17, 2023, Chao Peng wrote:
> On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > +		notifier->ops->invalidate_start(notifier, start, end);
> > 
> > Two major design issues that we overlooked long ago:
> > 
> >   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
> >      VM with a large number of convertible memslots that are all backed by a
> >      single large restrictedmem instance, then converting a single page will
> >      result in a linear walk through all memslots.  I don't expect anyone to
> >      actually do something silly like that, but I also never expected there to be
> >      a legitimate usecase for thousands of memslots.
> > 
> >   2. This approach fails to provide the ability for KVM to ensure a guest has
> >      exclusive access to a page.  As discussed in the past, the kernel can rely
> >      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> >      only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
> >      e.g. SEV, there is value in ensuring a 1:1 association.
> > 
> >      And probably more importantly, relying on hardware for SNP and TDX yields a
> >      poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
> >      page is exclusive to a guest, i.e. if userspace can hand out the same page
> >      from a restrictedmem instance to multiple VMs, then failure will occur only
> >      when KVM tries to assign the page to the second VM.  That will happen deep
> >      in KVM, which means KVM needs to gracefully handle such errors, and it means
> >      that KVM's ABI effectively allows plumbing garbage into its memslots.
> 
> It may not be a valid usage, but in my TDX environment I do hit the issue
> below.
> 
> kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> 
> Slot#2 ('SMRAM') is actually an alias into system memory (Slot#0) in QEMU,
> and slot#2 fails due to the exclusive check below.
> 
> Currently I changed the QEMU code to mark these alias slots as shared
> instead of private, but I'm not 100% confident this is the correct fix.

That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.

Actually, KVM should enforce that by disallowing SMM memslots for TDX guests.
Ditto for SNP guests and UPM-backed SEV and SEV-ES guests.  I think it probably
even makes sense to introduce that restriction in the base UPM support, e.g.
something like the below.  That would unnecessarily prevent emulating SMM for
KVM_X86_PROTECTED_VM types that aren't encrypted, but IMO that's an acceptable
limitation until there's an actual use case for KVM_X86_PROTECTED_VM guests beyond
SEV (my thought is that KVM_X86_PROTECTED_VM will mostly be a vehicle for selftests
and UPM-based SEV and SEV-ES guests).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48b7bdad1e0a..0a8aac821cb0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4357,6 +4357,14 @@ bool kvm_arch_has_private_mem(struct kvm *kvm)
        return kvm->arch.vm_type != KVM_X86_DEFAULT_VM;
 }
 
+int kvm_arch_nr_address_spaces(struct kvm *kvm)
+{
+       if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM)
+               return 1;
+
+       return KVM_ADDRESS_SPACE_NUM;
+}
+
 static bool kvm_is_vm_type_supported(unsigned long type)
 {
        return type == KVM_X86_DEFAULT_VM ||
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 97801d81ee42..e0a3fc819fe5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2126,7 +2126,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
             mem->restricted_offset + mem->memory_size < mem->restricted_offset ||
             0 /* TODO: require gfn be aligned with restricted offset */))
                return -EINVAL;
-       if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+       if (as_id >= kvm_arch_nr_address_spaces(kvm) || id >= KVM_MEM_SLOTS_NUM)
                return -EINVAL;
        if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
                return -EINVAL;
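
For the generic side, I'd imagine a weak default roughly like this so that
non-x86 architectures keep their current behavior; this is a sketch on top of
the diff above, not something I've written or tested:

/* include/linux/kvm_host.h */
int kvm_arch_nr_address_spaces(struct kvm *kvm);

/* virt/kvm/kvm_main.c */
int __weak kvm_arch_nr_address_spaces(struct kvm *kvm)
{
	/* Architectures without private memory keep all address spaces. */
	return KVM_ADDRESS_SPACE_NUM;
}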


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-01-17 13:30     ` Chao Peng
@ 2023-01-17 17:25       ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: Chao Peng
  Cc: Binbin Wu, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 17, 2023, Chao Peng wrote:
> On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
> > 
> > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > > In confidential computing usages, whether a page is private or shared is
> > > necessary information for KVM to perform operations like page fault
> > > handling, page zapping etc. There are other potential use cases for
> > > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > > or exec-only, etc.) without having to modify memslots.
> > > 
> > > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > > userspace to operate on the per-page memory attributes.
> > >    - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > >      a guest memory range.
> > >    - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > >      memory attributes.
> > > 
> > > KVM internally uses xarray to store the per-page memory attributes.
> > > 
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > > ---
> > >   Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> > >   arch/x86/kvm/Kconfig           |  1 +
> > >   include/linux/kvm_host.h       |  3 ++
> > >   include/uapi/linux/kvm.h       | 17 ++++++++
> > 
> > Should the changes introduced in this file also need to be added in
> > tools/include/uapi/linux/kvm.h ?
> 
> Yes I think.

I'm not sure how Paolo or others feel, but my preference is to never update KVM's
uapi headers in tools/ in KVM's tree.  Nothing KVM-related in tools/ actually
relies on the headers being copied into tools/, e.g. KVM selftests pulls KVM's
headers from the .../usr/include/ directory that's populated by `make headers_install`.

Perf's tooling is what actually "needs" the headers to be copied into tools/, so
my preference is to let the tools/perf maintainers deal with the headache of keeping
everything up-to-date.

> But I'm hesitant about whether to include it in this patch. I see many commits
> sync the kernel kvm.h to the tools' copy. It looks like that is done periodically
> and with a 'pull' model.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-01-17 13:12     ` Chao Peng
@ 2023-01-17 19:35       ` Sean Christopherson
  2023-01-18  8:23         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-17 19:35 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 17, 2023, Chao Peng wrote:
> On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >  
> > >  		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> > >  			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > > +
> > > +		if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > > +			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> > 
> > Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
> > MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> > guest have a paravirt interface for handling memory errors without killing the
> > host.
> 
> Agree shutdown is not the correct choice. I see you made below change:
> 
> send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)
> 
> The MCE may happen in any thread other than the KVM thread, so sending a
> signal to the 'current' thread may not be the expected behavior.

This is already true today, e.g. a #MC in memory that is mapped into the guest can
be triggered by a host access.  Hrm, but in this case we actually have a KVM
instance, and we know that the #MC is relevant to the KVM instance, so I agree
that signaling 'current' is kludgy.

>  Also, how can userspace tell whether the MCE is on a shared page or a private
>  page? Do we care?

We care.  I was originally thinking we could require userspace to keep track of
things, but that's quite prescriptive and flawed, e.g. could race with conversions.

One option would be to KVM_EXIT_MEMORY_FAULT, and then wire up a generic (not x86
specific) KVM request to exit to userspace, e.g.

		/* KVM_EXIT_MEMORY_FAULT */
		struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
#define KVM_MEMORY_EXIT_FLAG_HW_ERROR	(1ULL << 4)
			__u64 flags;
			__u64 gpa;
			__u64 size;
		} memory;

But I'm not sure that's the correct approach.  It kinda feels like we're reinventing
the wheel.  It seems like restrictedmem_get_page() _must_ be able to reject attempts
to get a poisoned page, i.e. restrictedmem_get_page() should yield KVM_PFN_ERR_HWPOISON.
Assuming that's the case, then I believe KVM simply needs to zap SPTEs in response
to an error notification in order to force vCPUs to fault on the poisoned page.
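
Roughly, I'm thinking the private fault path would translate the backing store's
error directly, e.g. (a sketch only; the wrapper name below is made up, only
kvm_restricted_mem_get_pfn() comes from the series):

static int restricted_mem_faultin_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
				      kvm_pfn_t *pfn)
{
	int order, r;

	r = kvm_restricted_mem_get_pfn(slot, gfn, pfn, &order);
	if (r == -EHWPOISON) {
		/* Never map a poisoned page, surface the error instead. */
		*pfn = KVM_PFN_ERR_HWPOISON;
		return 0;
	}
	return r;
}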

> > > +		return -EINVAL;
> > >  	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > >  		return -EINVAL;
> > >  	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > >  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > >  			return -EINVAL;
> > >  	} else { /* Modify an existing slot. */
> > > +		/* Private memslots are immutable, they can only be deleted. */
> > 
> > I'm 99% certain I suggested this, but if we're going to make these memslots
> > immutable, then we should straight up disallow dirty logging, otherwise we'll
> > end up with a bizarre uAPI.
> 
> But in my mind dirty logging will be needed fairly soon, when live migration
> gets supported?

Ya, but if/when live migration support is added, private memslots will no longer
be immutable as userspace will want to enable dirty logging only when a VM is
being migrated, i.e. something will need to change.

Given that it looks like we have clear line of sight to SEV+UPM guests, my
preference would be to allow toggling dirty logging from the get-go.  It doesn't
necessarily have to be in the first patch, e.g. KVM could initially reject
KVM_MEM_LOG_DIRTY_PAGES + KVM_MEM_PRIVATE and then add support separately to make
the series easier to review, test, and bisect.

static int check_memory_region_flags(struct kvm *kvm,
				     const struct kvm_userspace_memory_region2 *mem)
{
	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

	if (kvm_arch_has_private_mem(kvm) &&
	    !(mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
		valid_flags |= KVM_MEM_PRIVATE;


	...
}

> > > +		if (mem->flags & KVM_MEM_PRIVATE)
> > > +			return -EINVAL;
> > >  		if ((mem->userspace_addr != old->userspace_addr) ||
> > >  		    (npages != old->npages) ||
> > >  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > >  	new->npages = npages;
> > >  	new->flags = mem->flags;
> > >  	new->userspace_addr = mem->userspace_addr;
> > > +	if (mem->flags & KVM_MEM_PRIVATE) {
> > > +		new->restricted_file = fget(mem->restricted_fd);
> > > +		if (!new->restricted_file ||
> > > +		    !file_is_restrictedmem(new->restricted_file)) {
> > > +			r = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +		new->restricted_offset = mem->restricted_offset;
> 
> I see you changed the slot->restricted_offset type from loff_t to gfn_t and
> used pgoff_t when doing the restrictedmem_bind/unbind(). Using a page
> index is reasonable for KVM internally and sounds simpler than loff_t. But
> we also need to initialize it to a page index here, as well as in the
> other two cases. This is needed when restricted_offset != 0.

Oof.  I'm pretty sure I completely missed that loff_t is used for byte offsets,
whereas pgoff_t is a frame index. 

Given that the restrictedmem APIs take pgoff_t, I definitely think it makes sense
to use the index, but I'm very tempted to store pgoff_t instead of gfn_t, and name
the field "index" to help connect the dots to the rest of the kernel, where "pgoff_t index"
is quite common.

And looking at those bits again, we should wrap all of the restrictedmem fields
with CONFIG_KVM_PRIVATE_MEM.  It'll require minor tweaks to __kvm_set_memory_region(),
but I think it will yield cleaner code (and internal APIs) overall.

And wrap the three fields in an anonymous struct?  E.g. this is a little more
verbose (restrictedmem instead of restricted), but at first glance it doesn't seem
to cause widespread line length issues.

#ifdef CONFIG_KVM_PRIVATE_MEM
	struct {
		struct file *file;
		pgoff_t index;
		struct restrictedmem_notifier notifier;
	} restrictedmem;
#endif
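
With that layout, the pfn helper would end up looking something like this (a
sketch; assumes restrictedmem_get_page() keeps the file + pgoff_t + page +
order signature from the series):

static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
					     gfn_t gfn, kvm_pfn_t *pfn,
					     int *order)
{
	/* 'index' is a page frame offset (pgoff_t), not a byte offset. */
	pgoff_t index = gfn - slot->base_gfn + slot->restrictedmem.index;
	struct page *page;
	int ret;

	ret = restrictedmem_get_page(slot->restrictedmem.file, index,
				     &page, order);
	if (ret)
		return ret;

	*pfn = page_to_pfn(page);
	return 0;
}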

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 547b92215002..49e375e78f30 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
>                                              gfn_t gfn, kvm_pfn_t *pfn,
>                                              int *order)
>  {
> -       pgoff_t index = gfn - slot->base_gfn +
> -                       (slot->restricted_offset >> PAGE_SHIFT);
> +       pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;
>         struct page *page;
>         int ret;
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 01db35ddd5b3..7439bdcb0d04 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
>                                          pgoff_t start, pgoff_t end,
>                                          gfn_t *gfn_start, gfn_t *gfn_end)
>  {
> -       unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +       unsigned long base_pgoff = slot->restricted_offset;
>  
>         if (start > base_pgoff)
>                 *gfn_start = slot->base_gfn + start - base_pgoff;
> @@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>                         r = -EINVAL;
>                         goto out;
>                 }
> -               new->restricted_offset = mem->restricted_offset;
> +               new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;
>         }
>  
>         r = kvm_set_memslot(kvm, old, new, change);
> 
> Chao
> > > +	}
> > > +
> > > +	new->kvm = kvm;
> > 
> > Set this above, just so that the code flows better.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-17 16:34       ` Sean Christopherson
@ 2023-01-18  8:16         ` Chao Peng
  2023-01-18 10:17           ` Isaku Yamahata
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-01-18  8:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Chao Peng wrote:
> > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > > +		notifier->ops->invalidate_start(notifier, start, end);
> > > 
> > > Two major design issues that we overlooked long ago:
> > > 
> > >   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
> > >      VM with a large number of convertible memslots that are all backed by a
> > >      single large restrictedmem instance, then converting a single page will
> > >      result in a linear walk through all memslots.  I don't expect anyone to
> > >      actually do something silly like that, but I also never expected there to be
> > >      a legitimate usecase for thousands of memslots.
> > > 
> > >   2. This approach fails to provide the ability for KVM to ensure a guest has
> > >      exclusive access to a page.  As discussed in the past, the kernel can rely
> > >      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > >      only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
> > >      e.g. SEV, there is value in ensuring a 1:1 association.
> > > 
> > >      And probably more importantly, relying on hardware for SNP and TDX yields a
> > >      poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
> > >      page is exclusive to a guest, i.e. if userspace can hand out the same page
> > >      from a restrictedmem instance to multiple VMs, then failure will occur only
> > >      when KVM tries to assign the page to the second VM.  That will happen deep
> > >      in KVM, which means KVM needs to gracefully handle such errors, and it means
> > >      that KVM's ABI effectively allows plumbing garbage into its memslots.
> > 
> > It may not be a valid usage, but in my TDX environment I do meet below
> > issue.
> > 
> > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> > 
> > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
> > and slot#2 fails due to below exclusive check.
> > 
> > Currently I changed QEMU code to mark these alias slots as shared
> > instead of private but I'm not 100% confident this is correct fix.
> 
> That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
> be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.

Thanks for the confirmation. As long as we only bind one notifier for
each address, using an xarray does make things simple.

Chao

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-01-17 19:35       ` Sean Christopherson
@ 2023-01-18  8:23         ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-01-18  8:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 17, 2023 at 07:35:58PM +0000, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Chao Peng wrote:
> > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > >  
> > > >  		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> > > >  			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > > > +
> > > > +		if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > > > +			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> > > 
> > > Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
> > > MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> > > guest have a paravirt interface for handling memory errors without killing the
> > > host.
> > 
> > Agree shutdown is not the correct choice. I see you made below change:
> > 
> > send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)
> > 
> > The MCE may happen in any thread than KVM thread, sending siginal to
> > 'current' thread may not be the expected behavior.
> 
> This is already true today, e.g. a #MC in memory that is mapped into the guest can
> be triggered by a host access.  Hrm, but in this case we actually have a KVM
> instance, and we know that the #MC is relevant to the KVM instance, so I agree
> that signaling 'current' is kludgy.
> 
> >  Also how userspace can tell is the MCE on the shared page or private page?
> >  Do we care?
> 
> We care.  I was originally thinking we could require userspace to keep track of
> things, but that's quite prescriptive and flawed, e.g. could race with conversions.
> 
> One option would be to KVM_EXIT_MEMORY_FAULT, and then wire up a generic (not x86
> specific) KVM request to exit to userspace, e.g.
> 
> 		/* KVM_EXIT_MEMORY_FAULT */
> 		struct {
> #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> #define KVM_MEMORY_EXIT_FLAG_HW_ERROR	(1ULL << 4)
> 			__u64 flags;
> 			__u64 gpa;
> 			__u64 size;
> 		} memory;
> 
> But I'm not sure that's the correct approach.  It kinda feels like we're reinventing
> the wheel.  It seems like restrictedmem_get_page() _must_ be able to reject attempts
> to get a poisoned page, i.e. restrictedmem_get_page() should yield KVM_PFN_ERR_HWPOISON.

Yes, I see there is -EHWPOISON handling in hva_to_pfn() for shared
memory. It makes sense to do something similar for private pages.

> Assuming that's the case, then I believe KVM simply needs to zap SPTEs in response
> to an error notification in order to force vCPUs to fault on the poisoned page.

Agree, this is what we should do anyway.

> 
> > > > +		return -EINVAL;
> > > >  	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > > >  		return -EINVAL;
> > > >  	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > > > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > >  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > > >  			return -EINVAL;
> > > >  	} else { /* Modify an existing slot. */
> > > > +		/* Private memslots are immutable, they can only be deleted. */
> > > 
> > > I'm 99% certain I suggested this, but if we're going to make these memslots
> > > immutable, then we should straight up disallow dirty logging, otherwise we'll
> > > end up with a bizarre uAPI.
> > 
> > But in my mind dirty logging will be needed in the very short time, when
> > live migration gets supported?
> 
> Ya, but if/when live migration support is added, private memslots will no longer
> be immutable as userspace will want to enable dirty logging only when a VM is
> being migrated, i.e. something will need to change.
> 
> Given that it looks like we have clear line of sight to SEV+UPM guests, my
> preference would be to allow toggling dirty logging from the get-go.  It doesn't
> necessarily have to be in the first patch, e.g. KVM could initially reject
> KVM_MEM_LOG_DIRTY_PAGES + KVM_MEM_PRIVATE and then add support separately to make
> the series easier to review, test, and bisect.
> 
> static int check_memory_region_flags(struct kvm *kvm,
> 				     const struct kvm_userspace_memory_region2 *mem)
> {
> 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> 
> 	if (kvm_arch_has_private_mem(kvm) &&
> 	    !(mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
> 		valid_flags |= KVM_MEM_PRIVATE;

Adding this limitation is OK with me. It's not too hard to remove it when
live migration gets added.

> 
> 
> 	...
> }
> 
> > > > +		if (mem->flags & KVM_MEM_PRIVATE)
> > > > +			return -EINVAL;
> > > >  		if ((mem->userspace_addr != old->userspace_addr) ||
> > > >  		    (npages != old->npages) ||
> > > >  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > > > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > >  	new->npages = npages;
> > > >  	new->flags = mem->flags;
> > > >  	new->userspace_addr = mem->userspace_addr;
> > > > +	if (mem->flags & KVM_MEM_PRIVATE) {
> > > > +		new->restricted_file = fget(mem->restricted_fd);
> > > > +		if (!new->restricted_file ||
> > > > +		    !file_is_restrictedmem(new->restricted_file)) {
> > > > +			r = -EINVAL;
> > > > +			goto out;
> > > > +		}
> > > > +		new->restricted_offset = mem->restricted_offset;
> > 
> > I see you changed slot->restricted_offset type from loff_t to gfn_t and
> > used pgoff_t when doing the restrictedmem_bind/unbind(). Using page
> > index is reasonable KVM internally and sounds simpler than loff_t. But
> > we also need initialize it to page index here as well as changes in
> > another two cases. This is needed when restricted_offset != 0.
> 
> Oof.  I'm pretty sure I completely missed that loff_t is used for byte offsets,
> whereas pgoff_t is a frame index. 
> 
> Given that the restrictedmem APIs take pgoff_t, I definitely think it makes sense
> to use the index, but I'm very tempted to store pgoff_t instead of gfn_t, and name
> the field "index" to help connect the dots to the rest of the kernel, where "pgoff_t index"
> is quite common.
> 
> And looking at those bits again, we should wrap all of the restrictedmem fields
> with CONFIG_KVM_PRIVATE_MEM.  It'll require minor tweaks to __kvm_set_memory_region(),
> but I think will yield cleaner code (and internal APIs) overall.
> 
> And wrap the three fields in an anonymous struct?  E.g. this is a little more
> verbose (restrictedmem instead of restricted), but at first glance it doesn't seem
> to cause widespread line length issues.
> 
> #ifdef CONFIG_KVM_PRIVATE_MEM
> 	struct {
> 		struct file *file;
> 		pgoff_t index;
> 		struct restrictedmem_notifier notifier;
> 	} restrictedmem;
> #endif

Looks better.

Thanks,
Chao
> 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 547b92215002..49e375e78f30 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> >                                              gfn_t gfn, kvm_pfn_t *pfn,
> >                                              int *order)
> >  {
> > -       pgoff_t index = gfn - slot->base_gfn +
> > -                       (slot->restricted_offset >> PAGE_SHIFT);
> > +       pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;
> >         struct page *page;
> >         int ret;
> >  
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 01db35ddd5b3..7439bdcb0d04 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> >                                          pgoff_t start, pgoff_t end,
> >                                          gfn_t *gfn_start, gfn_t *gfn_end)
> >  {
> > -       unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> > +       unsigned long base_pgoff = slot->restricted_offset;
> >  
> >         if (start > base_pgoff)
> >                 *gfn_start = slot->base_gfn + start - base_pgoff;
> > @@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >                         r = -EINVAL;
> >                         goto out;
> >                 }
> > -               new->restricted_offset = mem->restricted_offset;
> > +               new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;
> >         }
> >  
> >         r = kvm_set_memslot(kvm, old, new, change);
> > 
> > Chao
> > > > +	}
> > > > +
> > > > +	new->kvm = kvm;
> > > 
> > > Set this above, just so that the code flows better.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-18  8:16         ` Chao Peng
@ 2023-01-18 10:17           ` Isaku Yamahata
  0 siblings, 0 replies; 398+ messages in thread
From: Isaku Yamahata @ 2023-01-18 10:17 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang, isaku.yamahata

On Wed, Jan 18, 2023 at 04:16:41PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Chao Peng wrote:
> > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > > > +		notifier->ops->invalidate_start(notifier, start, end);
> > > > 
> > > > Two major design issues that we overlooked long ago:
> > > > 
> > > >   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
> > > >      VM with a large number of convertible memslots that are all backed by a
> > > >      single large restrictedmem instance, then converting a single page will
> > > >      result in a linear walk through all memslots.  I don't expect anyone to
> > > >      actually do something silly like that, but I also never expected there to be
> > > >      a legitimate usecase for thousands of memslots.
> > > > 
> > > >   2. This approach fails to provide the ability for KVM to ensure a guest has
> > > >      exclusive access to a page.  As discussed in the past, the kernel can rely
> > > >      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > > >      only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
> > > >      e.g. SEV, there is value in ensuring a 1:1 association.
> > > > 
> > > >      And probably more importantly, relying on hardware for SNP and TDX yields a
> > > >      poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
> > > >      page is exclusive to a guest, i.e. if userspace can hand out the same page
> > > >      from a restrictedmem instance to multiple VMs, then failure will occur only
> > > >      when KVM tries to assign the page to the second VM.  That will happen deep
> > > >      in KVM, which means KVM needs to gracefully handle such errors, and it means
> > > >      that KVM's ABI effectively allows plumbing garbage into its memslots.
> > > 
> > > It may not be a valid usage, but in my TDX environment I do meet below
> > > issue.
> > > 
> > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> > > 
> > > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
> > > and slot#2 fails due to below exclusive check.
> > > 
> > > Currently I changed QEMU code to mark these alias slots as shared
> > > instead of private but I'm not 100% confident this is correct fix.
> > 
> > That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
> > be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
> 
> Thanks for the confirmation. As long as we only bind one notifier for
> each address, using xarray does make things simple.

In the past, I had patches for QEMU to disable PAM and SMRAM, but they were
dropped for simplicity because SMRAM/PAM are disabled in the reset state with
an unused memslot registered. The TDX guest BIOS (TDVF or EDK2) doesn't enable
them. Now we can revive those patches.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
                     ` (2 preceding siblings ...)
  2023-01-17 14:32   ` Fuad Tabba
@ 2023-01-19 11:13   ` Isaku Yamahata
  2023-01-19 15:25     ` Sean Christopherson
  2023-01-24 16:08   ` Liam Merwick
  4 siblings, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2023-01-19 11:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang, isaku.yamahata

On Sat, Jan 14, 2023 at 12:37:59AM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> > 
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-09: KVM change, target for KVM tree
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
> 
> Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> (and I mean that).
> 
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX).  For tests, I want to build a list of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
> 
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together.  Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler.  The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details.  Nothing remotely major, but something that needs attention since it'll
> be uAPI.

Although I'm still debugging with TDX KVM, I needed the following.
kvm_faultin_pfn() is called without mmu_lock held.  The race to change
private/shared is handled by mmu_seq.  Maybe a dedicated function is needed
only for kvm_faultin_pfn().

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02be5e1cba1e..38699ca75ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2322,7 +2322,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-       lockdep_assert_held(&kvm->mmu_lock);
+       // lockdep_assert_held(&kvm->mmu_lock);
 
        return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
 }


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-19 11:13   ` Isaku Yamahata
@ 2023-01-19 15:25     ` Sean Christopherson
  2023-01-19 22:37       ` Isaku Yamahata
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-19 15:25 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > > 
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > >   - 01: mm change, target for mm tree
> > >   - 02-09: KVM change, target for KVM tree
> > 
> > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > is available here:
> > 
> >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > 
> > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > a WIP.
> > 
> > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > I pushed and see if there's anything horrifically broken, and that it still works
> > for TDX?
> > 
> > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > (and I mean that).
> > 
> > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> > merging so that the criteria for merging are clear, and so that if the list is large
> > (haven't thought much yet), the work of writing and running tests can be distributed.
> > 
> > Regarding downstream dependencies, before this lands, I want to pull in all the
> > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > details.  Nothing remotely major, but something that needs attention since it'll
> > be uAPI.
> 
> Although I'm still debugging with TDX KVM, I needed the following.
> kvm_faultin_pfn() is called without mmu_lock held.  The race to change
> private/shared is handled by mmu_seq.  Maybe a dedicated function is needed
> only for kvm_faultin_pfn().

Gah, you're not on the other thread where this was discussed[*].  Simply deleting
the lockdep assertion is safe; for guest types that rely on the attributes to
define shared vs. private, KVM rechecks the attributes under the protection of
mmu_seq.
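
Concretely, the recheck amounts to something like this (a sketch; the helper
name is made up, kvm_mem_is_private() is from the series and
mmu_invalidate_retry() already exists in kvm_host.h):

static bool private_fault_is_stale(struct kvm_vcpu *vcpu,
				   struct kvm_page_fault *fault,
				   unsigned long mmu_seq)
{
	/*
	 * Re-read the attribute after resolving the pfn; if it changed, or if
	 * any invalidation ran in the meantime, retry the fault instead of
	 * installing a mapping based on stale data.
	 */
	return fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn) ||
	       mmu_invalidate_retry(vcpu->kvm, mmu_seq);
}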

I'll get a fixed version pushed out today.

[*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-19 15:25     ` Sean Christopherson
@ 2023-01-19 22:37       ` Isaku Yamahata
  2023-01-24  1:27         ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2023-01-19 22:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang

On Thu, Jan 19, 2023 at 03:25:08PM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > > 
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >   - 01: mm change, target for mm tree
> > > >   - 02-09: KVM change, target for KVM tree
> > > 
> > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > is available here:
> > > 
> > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > 
> > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > a WIP.
> > > 
> > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > I pushed and see if there's anything horrifically broken, and that it still works
> > > for TDX?
> > > 
> > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > > (and I mean that).
> > > 
> > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> > > merging so that the criteria for merging are clear, and so that if the list is large
> > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > 
> > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > details.  Nothing remotely major, but something that needs attention since it'll
> > > be uAPI.
> > 
> > Although I'm still debuging with TDX KVM, I needed the following.
> > kvm_faultin_pfn() is called without mmu_lock held.  the race to change
> > private/shared is handled by mmu_seq.  Maybe dedicated function only for
> > kvm_faultin_pfn().
> 
> Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> the lockdep assertion is safe, for guest types that rely on the attributes to
> define shared vs. private, KVM rechecks the attributes under the protection of
> mmu_seq.
> 
> I'll get a fixed version pushed out today.
> 
> [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com

Now I have TDX KVM working. I've uploaded it to the following:
It's rebased to v6.2-rc3.
        git@github.com:yamahata/linux.git tdx/upm
        git@github.com:yamahata/qemu.git tdx/upm

kvm_mmu_do_page_fault() needs the following change.
kvm_mem_is_private() queries mem_attr_array, and kvm_faultin_pfn() also uses
kvm_mem_is_private(), so the shared-private check in kvm_faultin_pfn() doesn't
make sense. This change would belong to the TDX KVM patches, though.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0da8e27e0..f45ac438bbf4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -430,7 +430,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                .max_level = vcpu->kvm->arch.tdp_max_page_level,
                .req_level = PG_LEVEL_4K,
                .goal_level = PG_LEVEL_4K,
-               .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+               .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
        };
        int r;
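
For reference, the TDX-side helper is roughly the following (a sketch;
kvm_gfn_shared_mask() is assumed to come from the TDX KVM patches):

static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
{
	/* TDX encodes shared vs. private in a GPA bit, not in attributes. */
	gpa_t shared_mask = gfn_to_gpa(kvm_gfn_shared_mask(kvm));

	return shared_mask && !(gpa & shared_mask);
}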


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-09 19:32       ` Sean Christopherson
  2023-01-10  9:14         ` Chao Peng
@ 2023-01-20 23:28         ` Jarkko Sakkinen
  1 sibling, 0 replies; 398+ messages in thread
From: Jarkko Sakkinen @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > 
> > > Feels bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > > 
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> > 
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar.  The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
> 
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> itself.
> 
> It feels absolutely ridiculous, but I think the best option is to do:
> 
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> 					 struct kvm_userspace_memory_region2)
> 
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_userspace_memory_region2 {
> 	__u32 slot;
> 	__u32 flags;
> 	__u64 guest_phys_addr;
> 	__u64 memory_size;
> 	__u64 userspace_addr;
> 	__u64 restricted_offset;
> 	__u32 restricted_fd;
> 	__u32 pad1;
> 	__u64 pad2[14];
> }
> 
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
> 
> Regarding the userspace side of things, please include Vishal's selftests in v11,
> it's impossible to properly review the uAPI changes without seeing the userspace
> side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.
> 
> [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com

+1

BR, Jarkko

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
  2023-01-10  9:14         ` Chao Peng
  2023-01-10 22:51           ` Vishal Annapurve
  2023-01-13 22:37           ` Sean Christopherson
@ 2023-01-20 23:42           ` Jarkko Sakkinen
  2 siblings, 0 replies; 398+ messages in thread
From: Jarkko Sakkinen @ 2023-01-20 23:42 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 10, 2023 at 05:14:32PM +0800, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > > 
> > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > > 
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > > 
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar.  The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> > 
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > itself.
> 
> How is the size validation being bogus? I don't quite follow. Then we
> will use kvm_userspace_memory_region2 as the KVM internal alias, right?
> I see similar examples use different functions to handle different
> versions but it does look easier if we use alias for this function.
> 
> > 
> > It feels absolutely ridiculous, but I think the best option is to do:
> > 
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> > 					 struct kvm_userspace_memory_region2)
> 
> Just interesting, is 0x49 a safe number we can use? 
> 
> > 
> > /* for KVM_SET_USER_MEMORY_REGION2 */
> > struct kvm_user_mem_region2 {
> > 	__u32 slot;
> > 	__u32 flags;
> > 	__u64 guest_phys_addr;
> > 	__u64 memory_size;
> > 	__u64 userspace_addr;
> > 	__u64 restricted_offset;
> > 	__u32 restricted_fd;
> > 	__u32 pad1;
> > 	__u64 pad2[14];
> > }
> > 
> > And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
> 
> Okay, agree from KVM userspace API perspective this is more consistent
> with similar existing examples. I see several of them.
> 
> I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
> ioctl.

The current API in the patch set is trivial for C userspace, but for a
more "constrained" language such as Rust a new ioctl would be easier to
adapt to.

> > 
> > Regarding the userspace side of things, please include Vishal's selftests in v11,
> > it's impossible to properly review the uAPI changes without seeing the userspace
> > side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> > massage it into a set of patches that you can incorporate into your series.
> 
> Previously I included Vishal's selftests in the github repo, but not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review together if Vishal is fine.
> 
> Chao
> > 
> > [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com

BR, Jarkko

^ permalink raw reply	[flat|nested] 398+ messages in thread
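
As a sketch of the capability check Chao mentions: userspace would probe via KVM_CHECK_EXTENSION before using the new ioctl. The capability name and number below are placeholders, since nothing had been allocated at this point in the thread.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder -- no capability name or number was allocated yet. */
#ifndef KVM_CAP_SET_USER_MEMORY_REGION2
#define KVM_CAP_SET_USER_MEMORY_REGION2	230
#endif

static int have_set_user_memory_region2(int vm_fd)
{
	/* KVM_CHECK_EXTENSION returns > 0 if the capability is supported. */
	return ioctl(vm_fd, KVM_CHECK_EXTENSION,
		     KVM_CAP_SET_USER_MEMORY_REGION2) > 0;
}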

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-22  0:37               ` Huang, Kai
  2022-12-23  8:20                 ` Chao Peng
@ 2023-01-23 14:03                 ` Vlastimil Babka
  2023-01-23 15:18                   ` Kirill A. Shutemov
  2023-01-23 23:01                   ` Huang, Kai
  1 sibling, 2 replies; 398+ messages in thread
From: Vlastimil Babka @ 2023-01-23 14:03 UTC (permalink / raw)
  To: Huang, Kai, chao.p.peng
  Cc: tglx, linux-arch, kvm, jmattson, Hocko, Michal, pbonzini, ak,
	Lutomirski, Andy, linux-fsdevel, tabba, david, michael.roth,
	kirill.shutemov, corbet, qemu-devel, dhildenb, bfields,
	linux-kernel, x86, bp, ddutile, rppt, shuah, vkuznets, mail,
	naoya.horiguchi, qperret, arnd, linux-api, yu.c.zhang,
	Christopherson,,
	Sean, wanpengli, vannapurve, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On 12/22/22 01:37, Huang, Kai wrote:
>>> I argue that this page pinning (or page migration prevention) is not
>>> tied to where the page comes from, instead related to how the page will
>>> be used. Whether the page is restrictedmem backed or GUP() backed, once
>>> it's used by current version of TDX then the page pinning is needed. So
>>> such page migration prevention is really TDX thing, even not KVM generic
>>> thing (that's why I think we don't need change the existing logic of
>>> kvm_release_pfn_clean()). 
>>>
> This essentially boils down to who "owns" page migration handling, and sadly,
> page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
> migration by itself -- it's just a passive receiver.
> 
> For normal pages, page migration is totally done by the core-kernel (i.e. it
> unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
>> migrate_page() to actually migrate the page).
> In the sense of TDX, conceptually it should be done in the same way. The more
> important thing is: yes KVM can use get_page() to prevent page migration, but
> when KVM wants to support it, KVM cannot just remove get_page(), as the core-
> kernel will still just do migrate_page() which won't work for TDX (given
> restricted_memfd doesn't have a_ops->migrate_page() implemented).
> 
> So I think the restricted_memfd filesystem should own page migration handling,
> (i.e. by implementing a_ops->migrate_page() to either just reject page migration
> or somehow support it).

While this thread seems to be settled on refcounts already, just wanted
to point out that it wouldn't be ideal to prevent migrations by
a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
by memory compaction) by isolating the pages for migration and then
releasing them after the callback rejects it (at least we wouldn't waste
time creating and undoing migration entries in the userspace page tables
as there's no mmap). Elevated refcount on the other hand is detected
very early in compaction so no isolation is attempted, so from that
aspect it's optimal.

^ permalink raw reply	[flat|nested] 398+ messages in thread
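
To make the comparison concrete, a minimal sketch of the elevated-refcount approach the thread is leaning towards: the TDX side simply keeps a reference on the page for as long as it is mapped into the secondary MMU, which is what lets compaction skip the page before any isolation work is done. The helper names below are made up for the sketch; only get_page()/put_page() are real.

#include <linux/mm.h>

/* Sketch only: hypothetical helpers, the "pin" is just an extra reference. */
static void tdx_pin_guest_page(struct page *page)
{
	/*
	 * page_count() now exceeds page_mapcount(), so compaction's early
	 * check (quoted later in this thread) bails out before the page is
	 * ever isolated for migration.
	 */
	get_page(page);
}

static void tdx_unpin_guest_page(struct page *page)
{
	/* Drop the pin once the page is removed from the secondary MMU. */
	put_page(page);
}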

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-23 14:03                 ` Vlastimil Babka
@ 2023-01-23 15:18                   ` Kirill A. Shutemov
  2023-02-13 14:23                     ` Vlastimil Babka
  2023-01-23 23:01                   ` Huang, Kai
  1 sibling, 1 reply; 398+ messages in thread
From: Kirill A. Shutemov @ 2023-01-23 15:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Huang, Kai, chao.p.peng, tglx, linux-arch, kvm, jmattson, Hocko,
	Michal, pbonzini, ak, Lutomirski, Andy, linux-fsdevel, tabba,
	david, michael.roth, kirill.shutemov, corbet, qemu-devel,
	dhildenb, bfields, linux-kernel, x86, bp, ddutile, rppt, shuah,
	vkuznets, mail, naoya.horiguchi, qperret, arnd, linux-api,
	yu.c.zhang, Christopherson,,
	Sean, wanpengli, vannapurve, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Mon, Jan 23, 2023 at 03:03:45PM +0100, Vlastimil Babka wrote:
> On 12/22/22 01:37, Huang, Kai wrote:
> >>> I argue that this page pinning (or page migration prevention) is not
> >>> tied to where the page comes from, instead related to how the page will
> >>> be used. Whether the page is restrictedmem backed or GUP() backed, once
> >>> it's used by current version of TDX then the page pinning is needed. So
> >>> such page migration prevention is really TDX thing, even not KVM generic
> >>> thing (that's why I think we don't need change the existing logic of
> >>> kvm_release_pfn_clean()). 
> >>>
> > This essentially boils down to who "owns" page migration handling, and sadly,
> > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
> > migration by itself -- it's just a passive receiver.
> > 
> > For normal pages, page migration is totally done by the core-kernel (i.e. it
> > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
> >> migrate_page() to actually migrate the page).
> > In the sense of TDX, conceptually it should be done in the same way. The more
> > important thing is: yes KVM can use get_page() to prevent page migration, but
> > when KVM wants to support it, KVM cannot just remove get_page(), as the core-
> > kernel will still just do migrate_page() which won't work for TDX (given
> > restricted_memfd doesn't have a_ops->migrate_page() implemented).
> > 
> > So I think the restricted_memfd filesystem should own page migration handling,
> > (i.e. by implementing a_ops->migrate_page() to either just reject page migration
> > or somehow support it).
> 
> While this thread seems to be settled on refcounts already, just wanted
> to point out that it wouldn't be ideal to prevent migrations by
> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
> by memory compaction) by isolating the pages for migration and then
> releasing them after the callback rejects it (at least we wouldn't waste
> time creating and undoing migration entries in the userspace page tables
> as there's no mmap). Elevated refcount on the other hand is detected
> very early in compaction so no isolation is attempted, so from that
> aspect it's optimal.

Hm. Do we need a new hook in a_ops to check if the page is migratable
before going down the longer path to migrate_page()?

Or maybe add AS_UNMOVABLE?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-22 18:15               ` Sean Christopherson
  2022-12-23  0:50                 ` Huang, Kai
  2022-12-23  8:24                 ` Chao Peng
@ 2023-01-23 15:43                 ` Kirill A. Shutemov
  2023-02-13 11:43                   ` Vlastimil Babka
  2023-02-13 13:10                   ` Michael Roth
  2 siblings, 2 replies; 398+ messages in thread
From: Kirill A. Shutemov @ 2023-01-23 15:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, Huang, Kai, tglx, linux-arch, kvm, jmattson,
	Lutomirski, Andy, ak, kirill.shutemov, Hocko, Michal, qemu-devel,
	tabba, david, michael.roth, corbet, bfields, dhildenb,
	linux-kernel, linux-fsdevel, x86, bp, linux-api, rppt, shuah,
	vkuznets, vbabka, mail, ddutile, qperret, arnd, pbonzini,
	vannapurve, naoya.horiguchi, wanpengli, yu.c.zhang, hughd,
	aarcange, mingo, hpa, Nakajima, Jun, jlayton, joro, linux-mm,
	Wang, Wei W, steven.price, linux-doc, Hansen, Dave, akpm,
	linmiaohe

On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > 
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> > 
> > > 
> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> 
> No, requiring the user (KVM) to guard against lack of support for page migration
> in restricted mem is a terrible API.  It's totally fine for restricted mem to not
> support page migration until there's a use case, but punting the problem to KVM
> is not acceptable.  Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there is
> no notification to invalidate existing mappings.

I tried to find a way to hook into the migration path from restrictedmem.
It is not easy because from the core-mm PoV a restrictedmem page is just
yet another shmem page.

It is somewhat dubious, but I think it should be safe to override
mapping->a_ops for the shmem mapping.

It also eliminates the need for special treatment of restrictedmem pages
in the memory-failure code.

shmem_mapping() uses ->a_ops to detect a shmem mapping. Modify the
implementation so it still returns true for restrictedmem pages.

Build-tested only.

Any comments?

diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
index 6fddb08f03cc..73ded3c3bad1 100644
--- a/include/linux/restrictedmem.h
+++ b/include/linux/restrictedmem.h
@@ -36,8 +36,6 @@ static inline bool file_is_restrictedmem(struct file *file)
 	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
 }
 
-void restrictedmem_error_page(struct page *page, struct address_space *mapping);
-
 #else
 
 static inline bool file_is_restrictedmem(struct file *file)
@@ -45,11 +43,6 @@ static inline bool file_is_restrictedmem(struct file *file)
 	return false;
 }
 
-static inline void restrictedmem_error_page(struct page *page,
-					    struct address_space *mapping)
-{
-}
-
 #endif /* CONFIG_RESTRICTEDMEM */
 
 #endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d500ea967dc7..a4af160f37e4 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
 #include <linux/percpu_counter.h>
 #include <linux/xattr.h>
 #include <linux/fs_parser.h>
+#include <linux/magic.h>
 
 /* inode in-kernel data */
 
@@ -75,10 +76,9 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
 #ifdef CONFIG_SHMEM
-extern const struct address_space_operations shmem_aops;
 static inline bool shmem_mapping(struct address_space *mapping)
 {
-	return mapping->a_ops == &shmem_aops;
+	return mapping->host->i_sb->s_magic == TMPFS_MAGIC;
 }
 #else
 static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f91b444e471e..145bb561ddb3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -62,7 +62,6 @@
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
-#include <linux/restrictedmem.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -941,8 +940,6 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 		goto out;
 	}
 
-	restrictedmem_error_page(p, mapping);
-
 	/*
 	 * The shmem page is kept in page cache instead of truncating
 	 * so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 15c52301eeb9..d0ca609b82cb 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -189,6 +189,51 @@ static struct file *restrictedmem_file_create(struct file *memfd)
 	return file;
 }
 
+static int restricted_error_remove_page(struct address_space *mapping,
+					struct page *page)
+{
+	struct super_block *sb = restrictedmem_mnt->mnt_sb;
+	struct inode *inode, *next;
+	pgoff_t start, end;
+
+	start = page->index;
+	end = start + thp_nr_pages(page);
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+		struct restrictedmem *rm = inode->i_mapping->private_data;
+		struct restrictedmem_notifier *notifier;
+		struct file *memfd = rm->memfd;
+		unsigned long index;
+
+		if (memfd->f_mapping != mapping)
+			continue;
+
+		xa_for_each_range(&rm->bindings, index, notifier, start, end)
+			notifier->ops->error(notifier, start, end);
+		break;
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+
+	return 0;
+}
+
+#ifdef CONFIG_MIGRATION
+static int restricted_folio(struct address_space *mapping, struct folio *dst,
+			    struct folio *src, enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+#endif
+
+static struct address_space_operations restricted_aops = {
+	.dirty_folio	= noop_dirty_folio,
+	.error_remove_page = restricted_error_remove_page,
+#ifdef CONFIG_MIGRATION
+	.migrate_folio	= restricted_folio,
+#endif
+};
+
 SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
 {
 	struct file *file, *restricted_file;
@@ -209,6 +254,8 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	file->f_flags |= O_LARGEFILE;
 
+	file->f_mapping->a_ops = &restricted_aops;
+
 	restricted_file = restrictedmem_file_create(file);
 	if (IS_ERR(restricted_file)) {
 		err = PTR_ERR(restricted_file);
@@ -293,31 +340,3 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
 }
 EXPORT_SYMBOL_GPL(restrictedmem_get_page);
 
-void restrictedmem_error_page(struct page *page, struct address_space *mapping)
-{
-	struct super_block *sb = restrictedmem_mnt->mnt_sb;
-	struct inode *inode, *next;
-	pgoff_t start, end;
-
-	if (!shmem_mapping(mapping))
-		return;
-
-	start = page->index;
-	end = start + thp_nr_pages(page);
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		struct restrictedmem *rm = inode->i_mapping->private_data;
-		struct restrictedmem_notifier *notifier;
-		struct file *memfd = rm->memfd;
-		unsigned long index;
-
-		if (memfd->f_mapping != mapping)
-			continue;
-
-		xa_for_each_range(&rm->bindings, index, notifier, start, end)
-			notifier->ops->error(notifier, start, end);
-		break;
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-}
diff --git a/mm/shmem.c b/mm/shmem.c
index c1d8b8a1aa3b..3df4d95784b9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -231,7 +231,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages)
 }
 
 static const struct super_operations shmem_ops;
-const struct address_space_operations shmem_aops;
+static const struct address_space_operations shmem_aops;
 static const struct file_operations shmem_file_operations;
 static const struct inode_operations shmem_inode_operations;
 static const struct inode_operations shmem_dir_inode_operations;
@@ -3894,7 +3894,7 @@ static int shmem_error_remove_page(struct address_space *mapping,
 	return 0;
 }
 
-const struct address_space_operations shmem_aops = {
+static const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.dirty_folio	= noop_dirty_folio,
 #ifdef CONFIG_TMPFS
@@ -3906,7 +3906,6 @@ const struct address_space_operations shmem_aops = {
 #endif
 	.error_remove_page = shmem_error_remove_page,
 };
-EXPORT_SYMBOL(shmem_aops);
 
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-23 14:03                 ` Vlastimil Babka
  2023-01-23 15:18                   ` Kirill A. Shutemov
@ 2023-01-23 23:01                   ` Huang, Kai
  2023-01-23 23:38                     ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: Huang, Kai @ 2023-01-23 23:01 UTC (permalink / raw)
  To: chao.p.peng, vbabka
  Cc: tglx, linux-arch, kvm, jmattson, Lutomirski, Andy, ak,
	kirill.shutemov, Hocko, Michal, tabba, qemu-devel, david,
	michael.roth, corbet, linux-kernel, dhildenb, bfields,
	linux-fsdevel, x86, bp, ddutile, rppt, shuah, vkuznets,
	naoya.horiguchi, linux-api, qperret, arnd, pbonzini, Annapurve,
	Vishal, mail, Christopherson,,
	Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote:
> On 12/22/22 01:37, Huang, Kai wrote:
> > > > I argue that this page pinning (or page migration prevention) is not
> > > > tied to where the page comes from, instead related to how the page will
> > > > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > > > it's used by current version of TDX then the page pinning is needed. So
> > > > such page migration prevention is really TDX thing, even not KVM generic
> > > > thing (that's why I think we don't need change the existing logic of
> > > > kvm_release_pfn_clean()). 
> > > > 
> > This essentially boils down to who "owns" page migration handling, and sadly,
> > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
> > migration by itself -- it's just a passive receiver.
> > 
> > For normal pages, page migration is totally done by the core-kernel (i.e. it
> > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
> > > migrate_page() to actually migrate the page).
> > In the sense of TDX, conceptually it should be done in the same way. The more
> > important thing is: yes KVM can use get_page() to prevent page migration, but
> > when KVM wants to support it, KVM cannot just remove get_page(), as the core-
> > kernel will still just do migrate_page() which won't work for TDX (given
> > restricted_memfd doesn't have a_ops->migrate_page() implemented).
> > 
> > So I think the restricted_memfd filesystem should own page migration handling,
> > (i.e. by implementing a_ops->migrate_page() to either just reject page migration
> > or somehow support it).
> 
> While this thread seems to be settled on refcounts already, 
> 

I am not sure, but I will let Sean/Paolo decide.

> just wanted
> to point out that it wouldn't be ideal to prevent migrations by
> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
> by memory compaction) by isolating the pages for migration and then
> releasing them after the callback rejects it (at least we wouldn't waste
> time creating and undoing migration entries in the userspace page tables
> as there's no mmap). Elevated refcount on the other hand is detected
> very early in compaction so no isolation is attempted, so from that
> aspect it's optimal.

I am probably missing something, but IIUC the refcount check also happens
at the very last stage of page migration, for instance:

	migrate_folio(...) ->
		migrate_folio_extra(..., 0 /* extra_count */) ->
			folio_migrate_mapping(...).

And it is folio_migrate_mapping() that does the actual refcount comparison,
which also happens at a very late stage:

int folio_migrate_mapping(struct address_space *mapping,
                struct folio *newfolio, struct folio *folio, int extra_count)
{
	...
        int expected_count = folio_expected_refs(mapping, folio) + extra_count;

        if (!mapping) {
                /* Anonymous page without mapping */
                if (folio_ref_count(folio) != expected_count)
                        return -EAGAIN;

		....
                return MIGRATEPAGE_SUCCESS;
        }

	....
        if (!folio_ref_freeze(folio, expected_count)) {
                xas_unlock_irq(&xas);
                return -EAGAIN;
        }
	...
}



^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-23 23:01                   ` Huang, Kai
@ 2023-01-23 23:38                     ` Sean Christopherson
  2023-01-24  7:51                       ` Vlastimil Babka
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-23 23:38 UTC (permalink / raw)
  To: Huang, Kai
  Cc: chao.p.peng, vbabka, tglx, linux-arch, kvm, jmattson, Lutomirski,
	Andy, ak, kirill.shutemov, Hocko, Michal, tabba, qemu-devel,
	david, michael.roth, corbet, linux-kernel, dhildenb, bfields,
	linux-fsdevel, x86, bp, ddutile, rppt, shuah, vkuznets,
	naoya.horiguchi, linux-api, qperret, arnd, pbonzini, Annapurve,
	Vishal, mail, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On Mon, Jan 23, 2023, Huang, Kai wrote:
> On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote:
> > On 12/22/22 01:37, Huang, Kai wrote:
> > > > > I argue that this page pinning (or page migration prevention) is not
> > > > > tied to where the page comes from, instead related to how the page will
> > > > > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > > > > it's used by current version of TDX then the page pinning is needed. So
> > > > > such page migration prevention is really TDX thing, even not KVM generic
> > > > > thing (that's why I think we don't need change the existing logic of
> > > > > kvm_release_pfn_clean()). 
> > > > > 
> > > This essentially boils down to who "owns" page migration handling, and sadly,
> > > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
> > > migration by itself -- it's just a passive receiver.
> > > 
> > > For normal pages, page migration is totally done by the core-kernel (i.e. it
> > > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
> > > > migrate_page() to actually migrate the page).
> > > In the sense of TDX, conceptually it should be done in the same way. The more
> > > important thing is: yes KVM can use get_page() to prevent page migration, but
> > > when KVM wants to support it, KVM cannot just remove get_page(), as the core-
> > > kernel will still just do migrate_page() which won't work for TDX (given
> > > restricted_memfd doesn't have a_ops->migrate_page() implemented).
> > > 
> > > So I think the restricted_memfd filesystem should own page migration handling,
> > > (i.e. by implementing a_ops->migrate_page() to either just reject page migration
> > > or somehow support it).
> > 
> > While this thread seems to be settled on refcounts already, 
> > 
> 
> I am not sure but will let Sean/Paolo to decide.

My preference is whatever is most performant without being hideous :-)

> > just wanted
> > to point out that it wouldn't be ideal to prevent migrations by
> > a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
> > by memory compaction) by isolating the pages for migration and then
> > releasing them after the callback rejects it (at least we wouldn't waste
> > time creating and undoing migration entries in the userspace page tables
> > as there's no mmap). Elevated refcount on the other hand is detected
> > very early in compaction so no isolation is attempted, so from that
> > aspect it's optimal.
> 
> I am probably missing something,

Heh, me too, I could have sworn that using refcounts was the least efficient way
to block migration.

> but IIUC the checking of refcount happens at very last stage of page migration too 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-19 22:37       ` Isaku Yamahata
@ 2023-01-24  1:27         ` Sean Christopherson
  2023-02-08 12:24           ` Isaku Yamahata
                             ` (2 more replies)
  0 siblings, 3 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-01-24  1:27 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > crash the running host system, this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > > 
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > reviews are always welcome.
> > > > >   - 01: mm change, target for mm tree
> > > > >   - 02-09: KVM change, target for KVM tree
> > > > 
> > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > is available here:
> > > > 
> > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > 
> > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > a WIP.
> > > > 
> > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > for TDX?
> > > > 
> > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > > > (and I mean that).
> > > > 
> > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > 
> > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > details.  Nothing remotely major, but something that needs attention since it'll
> > > > be uAPI.
> > > 
> > > Although I'm still debuging with TDX KVM, I needed the following.
> > > kvm_faultin_pfn() is called without mmu_lock held.  the race to change
> > > private/shared is handled by mmu_seq.  Maybe dedicated function only for
> > > kvm_faultin_pfn().
> > 
> > Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> > the lockdep assertion is safe, for guest types that rely on the attributes to
> > define shared vs. private, KVM rechecks the attributes under the protection of
> > mmu_seq.
> > 
> > I'll get a fixed version pushed out today.
> > 
> > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> 
> Now I have tdx kvm working. I've uploaded at the followings.
> It's rebased to v6.2-rc3.
>         git@github.com:yamahata/linux.git tdx/upm
>         git@github.com:yamahata/qemu.git tdx/upm

And I finally got a working, building version updated and pushed out (again to):

  git@github.com:sean-jc/linux.git x86/upm_base_support

Took longer than expected to get the memslot restrictions sussed out.  I'm done
working on the code for now; my plan is to come back to it+TDX+SNP in 2-3 weeks
to resolve any remaining todos (that no one else tackles) and to do the whole
"merge the world" exercise.

> kvm_mmu_do_page_fault() needs the following change.
> kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
> kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> make sense. This change would belong to TDX KVM patches, though.

Yeah, SNP needs similar treatment.  Sorting that out is high up on the todo list.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-23 23:38                     ` Sean Christopherson
@ 2023-01-24  7:51                       ` Vlastimil Babka
  0 siblings, 0 replies; 398+ messages in thread
From: Vlastimil Babka @ 2023-01-24  7:51 UTC (permalink / raw)
  To: Sean Christopherson, Huang, Kai
  Cc: chao.p.peng, tglx, linux-arch, kvm, jmattson, Lutomirski, Andy,
	ak, kirill.shutemov, Hocko, Michal, tabba, qemu-devel, david,
	michael.roth, corbet, linux-kernel, dhildenb, bfields,
	linux-fsdevel, x86, bp, ddutile, rppt, shuah, vkuznets,
	naoya.horiguchi, linux-api, qperret, arnd, pbonzini, Annapurve,
	Vishal, mail, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On 1/24/23 00:38, Sean Christopherson wrote:
> On Mon, Jan 23, 2023, Huang, Kai wrote:
>> On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote:
>>> On 12/22/22 01:37, Huang, Kai wrote:
>>>>>> I argue that this page pinning (or page migration prevention) is not
>>>>>> tied to where the page comes from, instead related to how the page will
>>>>>> be used. Whether the page is restrictedmem backed or GUP() backed, once
>>>>>> it's used by current version of TDX then the page pinning is needed. So
>>>>>> such page migration prevention is really TDX thing, even not KVM generic
>>>>>> thing (that's why I think we don't need change the existing logic of
>>>>>> kvm_release_pfn_clean()). 
>>>>>>
>>>> This essentially boils down to who "owns" page migration handling, and sadly,
>>>> page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
>>>> migration by itself -- it's just a passive receiver.
>>>>
>>>> For normal pages, page migration is totally done by the core-kernel (i.e. it
>>>> unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
>>>>> migrate_page() to actually migrate the page).
>>>> In the sense of TDX, conceptually it should be done in the same way. The more
>>>> important thing is: yes KVM can use get_page() to prevent page migration, but
>>>> when KVM wants to support it, KVM cannot just remove get_page(), as the core-
>>>> kernel will still just do migrate_page() which won't work for TDX (given
>>>> restricted_memfd doesn't have a_ops->migrate_page() implemented).
>>>>
>>>> So I think the restricted_memfd filesystem should own page migration handling,
>>>> (i.e. by implementing a_ops->migrate_page() to either just reject page migration
>>>> or somehow support it).
>>>
>>> While this thread seems to be settled on refcounts already, 
>>>
>>
>> I am not sure but will let Sean/Paolo to decide.
> 
> My preference is whatever is most performant without being hideous :-)
> 
>>> just wanted
>>> to point out that it wouldn't be ideal to prevent migrations by
>>> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
>>> by memory compaction) by isolating the pages for migration and then
>>> releasing them after the callback rejects it (at least we wouldn't waste
>>> time creating and undoing migration entries in the userspace page tables
>>> as there's no mmap). Elevated refcount on the other hand is detected
>>> very early in compaction so no isolation is attempted, so from that
>>> aspect it's optimal.
>>
>> I am probably missing something,
> 
> Heh, me too, I could have sworn that using refcounts was the least efficient way
> to block migration.

Well, I admit that due to my experience with it, I mostly consider
migration from the memory compaction POV, which is a significant user of
migration on random pages that is not requested by userspace actions on
specific ranges.

And compaction has in isolate_migratepages_block():

/*
 * Migration will fail if an anonymous page is pinned in memory,
 * so avoid taking lru_lock and isolating it unnecessarily in an
 * admittedly racy check.
 */
mapping = page_mapping(page);
if (!mapping && page_count(page) > page_mapcount(page))
        goto isolate_fail;

so that prevents migration of pages with elevated refcount very early,
before they are even isolated, so before migrate_pages() is called.

But it's true there are other sources of "random page migration" - numa
balancing, demotion in lieu of reclaim... and I'm not sure if they all
have such an early check too.

Anyway, whatever is decided to be a better approach than elevated refcounts
would ideally also be checked before isolation, as that's the most
efficient way.

>> but IIUC the checking of refcount happens at very last stage of page migration too 


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
                     ` (3 preceding siblings ...)
  2023-01-19 11:13   ` Isaku Yamahata
@ 2023-01-24 16:08   ` Liam Merwick
  2023-01-25  0:20     ` Sean Christopherson
  4 siblings, 1 reply; 398+ messages in thread
From: Liam Merwick @ 2023-01-24 16:08 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang, Liam Merwick

On 14/01/2023 00:37, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1]. If a TDX host accesses
>> TDX-protected guest memory, machine check can happen which can further
>> crash the running host system, this is terrible for multi-tenant
>> configurations. The host accesses include those from KVM userspace like
>> QEMU. This series addresses KVM userspace induced crash by introducing
>> new mm and KVM interfaces so KVM userspace can still manage guest memory
>> via a fd-based approach, but it can never access the guest memory
>> content.
>>
>> The patch series touches both core mm and KVM code. I appreciate
>> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
>> reviews are always welcome.
>>    - 01: mm change, target for mm tree
>>    - 02-09: KVM change, target for KVM tree
> 
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
> 
>    git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
> 

When running LTP (https://github.com/linux-test-project/ltp) on the v10
bits (and also with Sean's branch above) I encounter the following NULL
pointer dereference with testcases/kernel/syscalls/madvise/madvise01
(100% reproducible).

It appears that in restrictedmem_error_page() 
inode->i_mapping->private_data is NULL
in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
but I don't know why.


[ 5365.177168] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 5365.178881] #PF: supervisor read access in kernel mode
[ 5365.180006] #PF: error_code(0x0000) - not-present page
[ 5365.181322] PGD 8000000109dad067 P4D 8000000109dad067 PUD 107707067 PMD 0
[ 5365.183474] Oops: 0000 [#1] PREEMPT SMP PTI
[ 5365.184792] CPU: 0 PID: 22086 Comm: madvise01 Not tainted 6.1.0-1.el8.seanjcupm.x86_64 #1
[ 5365.186572] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
[ 5365.188816] RIP: 0010:restrictedmem_error_page+0xc7/0x1b0
[ 5365.190081] Code: 99 00 48 8b 55 00 48 8b 02 48 8d 8a e8 fe ff ff 48 2d 18 01 00 00 48 39 d5 0f 84 8a 00 00 00 48 8b 51 30 48 8b 92 b8 00 00 00 <48> 8b 4a 28 4c 39 b1 d8 00 00 00 74 22 48 8b 88 18 01 00 00 48 8d
[ 5365.193984] RSP: 0018:ffff9b7343c07d80 EFLAGS: 00010206
[ 5365.195142] RAX: ffff8e5b410cfc70 RBX: 0000000000000001 RCX: ffff8e5b4048e580
[ 5365.196888] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5365.198399] RBP: ffff8e5b410cfd88 R08: 0000000000000000 R09: 0000000000000000
[ 5365.200200] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5365.201843] R13: ffff8e5b410cfd80 R14: ffff8e5b47cc7618 R15: ffffd49d44c05080
[ 5365.203472] FS:  00007fc96de9b5c0(0000) GS:ffff8e5deda00000(0000) knlGS:0000000000000000
[ 5365.205485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5365.206791] CR2: 0000000000000028 CR3: 000000012047e002 CR4: 0000000000170ef0
[ 5365.208131] Call Trace:
[ 5365.208752]  <TASK>
[ 5365.209229]  me_pagecache_clean+0x58/0x100
[ 5365.210196]  identify_page_state+0x84/0xd0
[ 5365.211180]  memory_failure+0x231/0x8b0
[ 5365.212148]  madvise_inject_error.cold+0x8d/0xa4
[ 5365.213317]  do_madvise+0x363/0x3a0
[ 5365.214177]  __x64_sys_madvise+0x2c/0x40
[ 5365.215159]  do_syscall_64+0x3f/0xa0
[ 5365.216016]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5365.217130] RIP: 0033:0x7fc96d8399ab
[ 5365.217953] Code: 73 01 c3 48 8b 0d dd 54 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 54 38 00 f7 d8 64 89 01 48
[ 5365.222323] RSP: 002b:00007fff62a99b18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[ 5365.224026] RAX: ffffffffffffffda RBX: 000000000041ce00 RCX: 00007fc96d8399ab
[ 5365.225375] RDX: 0000000000000064 RSI: 000000000000a000 RDI: 00007fc96de8e000
[ 5365.226999] RBP: 00007fc96de9b540 R08: 0000000000000001 R09: 0000000000415c80
[ 5365.228641] R10: 0000000000000001 R11: 0000000000000206 R12: 0000000000000008
[ 5365.230074] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Regards,
Liam

> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
> 
> Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> (and I mean that).
> 
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
> 
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together.  Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler.  The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details.  Nothing remotely major, but something that needs attention since it'll
> be uAPI.
> 
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.
> 
> Thanks!
> 


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-24 16:08   ` Liam Merwick
@ 2023-01-25  0:20     ` Sean Christopherson
  2023-01-25 12:53       ` Kirill A. Shutemov
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-01-25  0:20 UTC (permalink / raw)
  To: Liam Merwick
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 24, 2023, Liam Merwick wrote:
> On 14/01/2023 00:37, Sean Christopherson wrote:
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > > 
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > >    - 01: mm change, target for mm tree
> > >    - 02-09: KVM change, target for KVM tree
> > 
> > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > is available here:
> > 
> >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > 
> > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > a WIP.
> > 
> 
> When running LTP (https://github.com/linux-test-project/ltp) on the v10
> bits (and also with Sean's branch above) I encounter the following NULL
> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> (100% reproducible).
> 
> It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> is NULL
> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> but I don't know why.

Kirill, can you take a look?  Or pass the buck to someone who can? :-)

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-25  0:20     ` Sean Christopherson
@ 2023-01-25 12:53       ` Kirill A. Shutemov
  2023-01-25 16:01         ` Liam Merwick
  2023-04-13  1:07         ` Sean Christopherson
  0 siblings, 2 replies; 398+ messages in thread
From: Kirill A. Shutemov @ 2023-01-25 12:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Liam Merwick, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang

On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> On Tue, Jan 24, 2023, Liam Merwick wrote:
> > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > > 
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >    - 01: mm change, target for mm tree
> > > >    - 02-09: KVM change, target for KVM tree
> > > 
> > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > is available here:
> > > 
> > >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > > 
> > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > a WIP.
> > > 
> > 
> > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > bits (and also with Sean's branch above) I encounter the following NULL
> > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > (100% reproducible).
> > 
> > It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> > is NULL
> > in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> > but I don't know why.
> 
> Kirill, can you take a look?  Or pass the buck to someone who can? :-)

The patch below should help.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 15c52301eeb9..39ada985c7c0 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		struct restrictedmem *rm = inode->i_mapping->private_data;
 		struct restrictedmem_notifier *notifier;
-		struct file *memfd = rm->memfd;
+		struct restrictedmem *rm;
 		unsigned long index;
+		struct file *memfd;
 
-		if (memfd->f_mapping != mapping)
+		if (atomic_read(&inode->i_count))
 			continue;
 
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		rm = inode->i_mapping->private_data;
+		memfd = rm->memfd;
+
+		if (memfd->f_mapping != mapping) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		spin_unlock(&inode->i_lock);
+
 		xa_for_each_range(&rm->bindings, index, notifier, start, end)
 			notifier->ops->error(notifier, start, end);
 		break;
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-25 12:53       ` Kirill A. Shutemov
@ 2023-01-25 16:01         ` Liam Merwick
  2023-04-13  1:07         ` Sean Christopherson
  1 sibling, 0 replies; 398+ messages in thread
From: Liam Merwick @ 2023-01-25 16:01 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang, Liam Merwick

On 25/01/2023 12:53, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
>> On Tue, Jan 24, 2023, Liam Merwick wrote:
>>> On 14/01/2023 00:37, Sean Christopherson wrote:
>>>> On Fri, Dec 02, 2022, Chao Peng wrote:
...
>>>
>>> When running LTP (https://github.com/linux-test-project/ltp) on the v10
>>> bits (and also with Sean's branch above) I encounter the following NULL
>>> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
>>> (100% reproducible).
>>>
>>> It appears that in restrictedmem_error_page() inode->i_mapping->private_data
>>> is NULL
>>> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
>>> but I don't know why.
>>
>> Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> 
> The patch below should help.

Thanks, this works for me.

Regards,
Liam

> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>   
>   	spin_lock(&sb->s_inode_list_lock);
>   	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		struct restrictedmem *rm = inode->i_mapping->private_data;
>   		struct restrictedmem_notifier *notifier;
> -		struct file *memfd = rm->memfd;
> +		struct restrictedmem *rm;
>   		unsigned long index;
> +		struct file *memfd;
>   
> -		if (memfd->f_mapping != mapping)
> +		if (atomic_read(&inode->i_count))
>   			continue;
>   
> +		spin_lock(&inode->i_lock);
> +		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +
> +		rm = inode->i_mapping->private_data;
> +		memfd = rm->memfd;
> +
> +		if (memfd->f_mapping != mapping) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		spin_unlock(&inode->i_lock);
> +
>   		xa_for_each_range(&rm->bindings, index, notifier, start, end)
>   			notifier->ops->error(notifier, start, end);
>   		break;


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
  2023-01-13 23:16   ` Sean Christopherson
@ 2023-01-28 13:54     ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-01-28 13:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Fri, Jan 13, 2023 at 11:16:27PM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9a07380f8d3c..5aefcff614d2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> >  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> >  			linfo[lpages - 1].disallow_lpage = 1;
> >  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > +		if (kvm_slot_can_be_private(slot))
> > +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> >  		/*
> >  		 * If the gfn and userspace address are not aligned wrt each
> >  		 * other, disable large page support for this slot.
> 
> Forgot to talk about the bug.  This code needs to handle the scenario where a
> memslot is created with existing, non-uniform attributes.  It might be a bit ugly
> (I didn't even try to write the code), but it's definitely possible, and since
> memslot updates are already slow I think it's best to handle things here.
> 
> In the meantime, I added this so we don't forget to fix it before merging.
> 
> #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> 	pr_crit_once("FIXME: Walk the memory attributes of the slot and set the mixed status appropriately");
> #endif

Here is the code to fix it (based on your latest github repo).

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e552374f2357..609ff1cba9c5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2195,4 +2195,9 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
 	 KVM_X86_QUIRK_FIX_HYPERCALL_INSN |	\
 	 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+void kvm_memory_attributes_create_memslot(struct kvm *kvm,
+					  struct kvm_memory_slot *slot);
+#endif
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eda615f3951c..8833d7201e41 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7201,10 +7201,11 @@ static bool has_mixed_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return false;
 }
 
-void kvm_arch_set_memory_attributes(struct kvm *kvm,
-				    struct kvm_memory_slot *slot,
-				    unsigned long attrs,
-				    gfn_t start, gfn_t end)
+static void kvm_update_lpage_mixed_flag(struct kvm *kvm,
+					struct kvm_memory_slot *slot,
+					bool set_attrs,
+					unsigned long attrs,
+					gfn_t start, gfn_t end)
 {
 	unsigned long pages, mask;
 	gfn_t gfn, gfn_end, first, last;
@@ -7231,25 +7232,53 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
 		first = start & mask;
 		last = (end - 1) & mask;
 
-		/*
-		 * We only need to scan the head and tail page, for middle pages
-		 * we know they will not be mixed.
-		 */
+		/* head page */
 		gfn = max(first, slot->base_gfn);
 		gfn_end = min(first + pages, slot->base_gfn + slot->npages);
+		if (!set_attrs)
+			attrs = kvm_get_memory_attributes(kvm, gfn);
 		mixed = has_mixed_attrs(kvm, slot, level, attrs, gfn, gfn_end);
 		linfo_update_mixed(gfn, slot, level, mixed);
 
 		if (first == last)
 			return;
 
-		for (gfn = first + pages; gfn < last; gfn += pages)
-			linfo_update_mixed(gfn, slot, level, false);
+		/* middle pages */
+		for (gfn = first + pages; gfn < last; gfn += pages) {
+			if (set_attrs) {
+				mixed = false;
+			} else {
+				gfn_end = gfn + pages;
+				attrs = kvm_get_memory_attributes(kvm, gfn);
+				mixed = has_mixed_attrs(kvm, slot, level, attrs,
+							gfn, gfn_end);
+			}
+			linfo_update_mixed(gfn, slot, level, mixed);
+		}
 
+		/* tail page */
 		gfn = last;
 		gfn_end = min(last + pages, slot->base_gfn + slot->npages);
+		if (!set_attrs)
+			attrs = kvm_get_memory_attributes(kvm, gfn);
 		mixed = has_mixed_attrs(kvm, slot, level, attrs, gfn, gfn_end);
 		linfo_update_mixed(gfn, slot, level, mixed);
 	}
 }
+
+void kvm_arch_set_memory_attributes(struct kvm *kvm,
+				    struct kvm_memory_slot *slot,
+				    unsigned long attrs,
+				    gfn_t start, gfn_t end)
+{
+	kvm_update_lpage_mixed_flag(kvm, slot, true, attrs, start, end);
+}
+
+void kvm_memory_attributes_create_memslot(struct kvm *kvm,
+					  struct kvm_memory_slot *slot)
+{
+
+	kvm_update_lpage_mixed_flag(kvm, slot, false, 0, slot->base_gfn,
+				    slot->base_gfn + slot->npages);
+}
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 268c3d16894d..c1074aecf2d0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12443,7 +12443,7 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 	}
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
-	pr_crit_once("FIXME: Walk the memory attributes of the slot and set the mixed status appropriately");
+	kvm_memory_attributes_create_memslot(kvm, slot);
 #endif
 
 	if (kvm_page_track_create_memslot(kvm, slot, npages))
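
As a side note for reviewers, here is a tiny standalone userspace sketch of
how the head/middle/tail split above partitions a GFN range. It is not part
of the patch; the 2MB level and the GFN values are picked arbitrarily, and
the clamping against the memslot boundaries is ignored for brevity.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* KVM_PAGES_PER_HPAGE() at the 2MB level: 512 4KB pages */
	const uint64_t pages = 512, mask = ~(pages - 1);
	uint64_t start = 0x100a, end = 0x1f05;	/* arbitrary GFN range */
	uint64_t first = start & mask, last = (end - 1) & mask;
	uint64_t gfn;

	printf("head  : [%#llx, %#llx)\n", (unsigned long long)first,
	       (unsigned long long)(first + pages));
	for (gfn = first + pages; gfn < last; gfn += pages)
		printf("middle: [%#llx, %#llx)\n", (unsigned long long)gfn,
		       (unsigned long long)(gfn + pages));
	printf("tail  : [%#llx, %#llx)\n", (unsigned long long)last,
	       (unsigned long long)(last + pages));
	return 0;
}

Only the head and tail large pages can be partially covered by the range
being updated, which is why the set_attrs path above can assume the middle
pages are not mixed.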

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-01-14  0:01   ` Sean Christopherson
  2023-01-17 13:12     ` Chao Peng
@ 2023-01-28 14:00     ` Chao Peng
  2023-03-08  0:13       ` Ackerley Tng
  1 sibling, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-01-28 14:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
... 
> Strongly prefer to use similar logic to existing code that detects wraps:
> 
> 		mem->restricted_offset + mem->memory_size < mem->restricted_offset
> 
> This is also where I'd like to add the "gfn is aligned to offset" check, though
> my brain is too fried to figure that out right now.

I used count_trailing_zeros() for this TODO; I'm not sure we have a better
approach.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index afc8c26fa652..fd34c5f7cd2f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -56,6 +56,7 @@
 #include <asm/processor.h>
 #include <asm/ioctl.h>
 #include <linux/uaccess.h>
+#include <linux/count_zeros.h>
 
 #include "coalesced_mmio.h"
 #include "async_pf.h"
@@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
 	return false;
 }
 
+/*
+ * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
+ */
+static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
+{
+	if (!offset)
+		return true;
+	if (!gpa)
+		return false;
+
+	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
+}
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -2128,7 +2142,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	if (mem->flags & KVM_MEM_PRIVATE &&
 	    (mem->restrictedmem_offset & (PAGE_SIZE - 1) ||
 	     mem->restrictedmem_offset + mem->memory_size < mem->restrictedmem_offset ||
-	     0 /* TODO: require gfn be aligned with restricted offset */))
+	     !kvm_check_rmem_offset_alignment(mem->restrictedmem_offset,
+					      mem->guest_phys_addr)))
 		return -EINVAL;
 	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
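
To illustrate the rule the helper encodes, here is a throwaway userspace
sketch. It is not part of the patch; __builtin_ctzll() stands in for the
kernel's count_trailing_zeros().

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors kvm_check_rmem_offset_alignment(): ALIGNMENT(offset) >= ALIGNMENT(gpa) */
static bool offset_aligned_to_gpa(uint64_t offset, uint64_t gpa)
{
	if (!offset)
		return true;
	if (!gpa)
		return false;
	return __builtin_ctzll(offset) >= __builtin_ctzll(gpa);
}

int main(void)
{
	/* 2MB-aligned offset backing a 2MB-aligned gpa: allowed, prints 1 */
	printf("%d\n", offset_aligned_to_gpa(0x200000, 0x40200000));
	/* 4KB-aligned offset backing a 2MB-aligned gpa: rejected, prints 0 */
	printf("%d\n", offset_aligned_to_gpa(0x1000, 0x200000));
	return 0;
}

The intent is that a gpa with a given (huge)page alignment is backed at a
restrictedmem offset with at least the same alignment, so that large page
mappings remain possible.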


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                     ` (2 preceding siblings ...)
  2023-01-13 21:54   ` Sean Christopherson
@ 2023-01-30  5:26   ` Ackerley Tng
  2023-01-30  6:04     ` Wang, Wei W
  2023-02-16  9:51   ` Nikunj A. Dadhania
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-01-30  5:26 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, pbonzini, corbet, seanjc,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, chao.p.peng, kirill.shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb, qperret,
	tabba, michael.roth, mhocko, wei.w.wang


> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> +				 const struct path *path, struct kstat *stat,
> +				 u32 request_mask, unsigned int query_flags)
> +{
> +	struct inode *inode = d_inode(path->dentry);
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +
> +	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +					     request_mask, query_flags);

Instead of calling shmem's getattr() with path, we should be using the
memfd's path.

Otherwise, shmem's getattr() will use restrictedmem's inode instead of
shmem's inode. The private fields will be of the wrong type, and the
host will crash when shmem_is_huge() dereferences SHMEM_SB(inode->i_sb)->huge,
since inode->i_sb->s_fs_info is NULL for the restrictedmem superblock.

Here's the patch:

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 37191cd9eed1..06b72d593bd8 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -84,7 +84,7 @@ static int restrictedmem_getattr(struct user_namespace  
*mnt_userns,
  	struct restrictedmem *rm = inode->i_mapping->private_data;
  	struct file *memfd = rm->memfd;

-	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+	return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
  					     request_mask, query_flags);
  }

> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> +				 struct dentry *dentry, struct iattr *attr)
> +{
> +	struct inode *inode = d_inode(dentry);
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (attr->ia_valid & ATTR_SIZE) {
> +		if (memfd->f_inode->i_size)
> +			return -EPERM;
> +
> +		if (!PAGE_ALIGNED(attr->ia_size))
> +			return -EINVAL;
> +	}
> +
> +	ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +					    file_dentry(memfd), attr);
> +	return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> +	.getattr = restrictedmem_getattr,
> +	.setattr = restrictedmem_setattr,
> +};

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* RE: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-30  5:26   ` Ackerley Tng
@ 2023-01-30  6:04     ` Wang, Wei W
  0 siblings, 0 replies; 398+ messages in thread
From: Wang, Wei W @ 2023-01-30  6:04 UTC (permalink / raw)
  To: Ackerley Tng, Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, pbonzini, corbet,
	Christopherson,,
	Sean, vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, Annapurve, Vishal,
	yu.c.zhang, chao.p.peng, kirill.shutemov, Lutomirski, Andy,
	Nakajima, Jun, Hansen, Dave, ak, david, aarcange, ddutile,
	dhildenb, qperret, tabba, michael.roth, Hocko, Michal

On Monday, January 30, 2023 1:26 PM, Ackerley Tng wrote:
> 
> > +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> > +				 const struct path *path, struct kstat *stat,
> > +				 u32 request_mask, unsigned int query_flags)
> {
> > +	struct inode *inode = d_inode(path->dentry);
> > +	struct restrictedmem_data *data = inode->i_mapping-
> >private_data;
> > +	struct file *memfd = data->memfd;
> > +
> > +	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> > +					     request_mask, query_flags);
> 
> Instead of calling shmem's getattr() with path, we should be using the
> memfd's path.
> 
> Otherwise, shmem's getattr() will use restrictedmem's inode instead of
> shmem's inode. The private fields will be of the wrong type, and the host will
> crash when shmem_is_huge() dereferences SHMEM_SB(inode->i_sb)->huge, since
> inode->i_sb->s_fs_info is NULL for the restrictedmem's superblock.
> 
> Here's the patch:
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c index
> 37191cd9eed1..06b72d593bd8 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -84,7 +84,7 @@ static int restrictedmem_getattr(struct user_namespace
> *mnt_userns,
>   	struct restrictedmem *rm = inode->i_mapping->private_data;
>   	struct file *memfd = rm->memfd;
> 
> -	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +	return memfd->f_inode->i_op->getattr(mnt_userns, &memfd-
> >f_path, stat,
>   					     request_mask, query_flags);
>   }
> 

Nice catch. I also encountered this issue during my work.
The fix can be further enforced on the shmem side:

index c301487be5fb..d850c0190359 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -472,8 +472,9 @@ bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
                   pgoff_t index, bool shmem_huge_force)
 {
        loff_t i_size;
+       struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

-       if (!S_ISREG(inode->i_mode))
+       if (!sbinfo || !S_ISREG(inode->i_mode))
                return false;
        if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
            test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
@@ -485,7 +486,7 @@ bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
        if (shmem_huge == SHMEM_HUGE_DENY)
                return false;

-       switch (SHMEM_SB(inode->i_sb)->huge) {
+       switch (sbinfo->huge) {
        case SHMEM_HUGE_ALWAYS:
                return true;
        case SHMEM_HUGE_WITHIN_SIZE:

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-24  1:27         ` Sean Christopherson
@ 2023-02-08 12:24           ` Isaku Yamahata
  2023-02-13 13:01           ` Michael Roth
  2023-04-17 14:37           ` Chao Peng
  2 siblings, 0 replies; 398+ messages in thread
From: Isaku Yamahata @ 2023-02-08 12:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang

On Tue, Jan 24, 2023 at 01:27:50AM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > >   - 01: mm change, target for mm tree
> > > > > >   - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > > 
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > > 
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > > > > (and I mean that).
> > > > > 
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > > 
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details.  Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > > 
> > > > Although I'm still debugging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held.  The race to change
> > > > private/shared is handled by mmu_seq.  Maybe a dedicated function only for
> > > > kvm_faultin_pfn().
> > > 
> > > Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> > > the lockdep assertion is safe, for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > > 
> > > I'll get a fixed version pushed out today.
> > > 
> > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> > 
> > Now I have tdx kvm working. I've uploaded at the followings.
> > It's rebased to v6.2-rc3.
> >         git@github.com:yamahata/linux.git tdx/upm
> >         git@github.com:yamahata/qemu.git tdx/upm
> 
> And I finally got a working, building version updated and pushed out (again to):
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 

OK, I rebased the TDX part onto the updated branch.
        git@github.com:yamahata/linux.git tdx/upm
        git@github.com:yamahata/qemu.git tdx/upm

Now it's v6.2-rc7 based.
QEMU needs more patches to avoid registering a memory slot for SMM.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
                     ` (5 preceding siblings ...)
  2023-01-17  3:21   ` Binbin Wu
@ 2023-02-09  7:25   ` Isaku Yamahata
  2023-02-10  0:35     ` Sean Christopherson
  2023-05-19 17:32   ` Nicolas Saenz Julienne
  7 siblings, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2023-02-09  7:25 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang, isaku.yamahata

On Fri, Dec 02, 2022 at 02:13:40PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +

If the memslot isn't private, it should return an error if the private attribute is set.
Something like the following check is needed.

+       if (attrs->flags & KVM_MEM_PRIVATE) {
+               /* non-private memory slot doesn't allow KVM_MEM_PRIVATE */
+               for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+                       struct kvm_memslot_iter iter;
+                       struct kvm_memslots *slots;
+
+                       slots = __kvm_memslots(kvm, i);
+                       kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+                               if (!kvm_slot_can_be_private(iter.slot)) {
+                                       mutex_unlock(&kvm->slots_lock);
+                                       return -EINVAL;
+                               }
+                       }
+               }
+       }
+


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-02-09  7:25   ` Isaku Yamahata
@ 2023-02-10  0:35     ` Sean Christopherson
  2023-02-13 23:53       ` Isaku Yamahata
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-02-10  0:35 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> > +
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> 
> If memslot isn't private, it should return error if private attribute is set.

Why?  I'd rather keep the two things separate.  If we enforce this sort of thing
at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
KVM_SET_USER_MEMORY_REGION.
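
To make that cost concrete: the symmetric check that KVM_SET_USER_MEMORY_REGION
would then need is something along these lines. This is a purely hypothetical
sketch, the helper name is made up; kvm_get_memory_attributes() and
KVM_MEMORY_ATTRIBUTE_PRIVATE are the ones introduced by this series.

/* Reject a non-private memslot over a GFN range already marked private. */
static bool kvm_range_has_private_attr(struct kvm *kvm, gfn_t start, gfn_t end)
{
	gfn_t gfn;

	for (gfn = start; gfn < end; gfn++) {
		if (kvm_get_memory_attributes(kvm, gfn) &
		    KVM_MEMORY_ATTRIBUTE_PRIVATE)
			return true;
	}
	return false;
}

__kvm_set_memory_region() would have to call this for every memslot created
without KVM_MEM_PRIVATE, in addition to the attribute-side check suggested
above.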

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-23 15:43                 ` Kirill A. Shutemov
@ 2023-02-13 11:43                   ` Vlastimil Babka
  2023-02-13 13:10                   ` Michael Roth
  1 sibling, 0 replies; 398+ messages in thread
From: Vlastimil Babka @ 2023-02-13 11:43 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sean Christopherson
  Cc: Chao Peng, Huang, Kai, tglx, linux-arch, kvm, jmattson,
	Lutomirski, Andy, ak, kirill.shutemov, Hocko, Michal, qemu-devel,
	tabba, david, michael.roth, corbet, bfields, dhildenb,
	linux-kernel, linux-fsdevel, x86, bp, linux-api, rppt, shuah,
	vkuznets, mail, ddutile, qperret, arnd, pbonzini, vannapurve,
	naoya.horiguchi, wanpengli, yu.c.zhang, hughd, aarcange, mingo,
	hpa, Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On 1/23/23 16:43, Kirill A. Shutemov wrote:
> On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote:
>> On Wed, Dec 21, 2022, Chao Peng wrote:
>> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
>> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
>> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
>> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
>> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
>> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
>> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
>> > 
>> > That's true. Actually even true for restrictedmem case, most likely we
>> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
>> > side, other restrictedmem users like pKVM may not require page pinning
>> > at all. On the other side, see below.
>> > 
>> > > 
>> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
>> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
>> 
>> No, requiring the user (KVM) to guard against lack of support for page migration
>> in restricted mem is a terrible API.  It's totally fine for restricted mem to not
>> support page migration until there's a use case, but punting the problem to KVM
>> is not acceptable.  Restricted mem itself doesn't yet support page migration,
>> e.g. explosions would occur even if KVM wanted to allow migration since there is
>> no notification to invalidate existing mappings.
> 
> I tried to find a way to hook into the migration path from restrictedmem. It
> is not easy because from the core-mm PoV the restrictedmem page is just yet
> another shmem page.
> 
> It is somewhat dubious, but I think it should be safe to override
> mapping->a_ops for the shmem mapping.
> 
> It also eliminates need in special treatment for the restrictedmem pages
> from memory-failure code.
> 
> shmem_mapping() uses ->a_ops to detect shmem mapping. Modify the
> implementation to still be true for restrictedmem pages.
> 
> Build tested only.
> 
> Any comments?
> 
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> index 6fddb08f03cc..73ded3c3bad1 100644
> --- a/include/linux/restrictedmem.h
> +++ b/include/linux/restrictedmem.h
> @@ -36,8 +36,6 @@ static inline bool file_is_restrictedmem(struct file *file)
>  	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
>  }
>  
> -void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> -
>  #else
>  
>  static inline bool file_is_restrictedmem(struct file *file)
> @@ -45,11 +43,6 @@ static inline bool file_is_restrictedmem(struct file *file)
>  	return false;
>  }
>  
> -static inline void restrictedmem_error_page(struct page *page,
> -					    struct address_space *mapping)
> -{
> -}
> -
>  #endif /* CONFIG_RESTRICTEDMEM */
>  
>  #endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index d500ea967dc7..a4af160f37e4 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>  #include <linux/percpu_counter.h>
>  #include <linux/xattr.h>
>  #include <linux/fs_parser.h>
> +#include <linux/magic.h>
>  
>  /* inode in-kernel data */
>  
> @@ -75,10 +76,9 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
>  		unsigned long len, unsigned long pgoff, unsigned long flags);
>  extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
>  #ifdef CONFIG_SHMEM
> -extern const struct address_space_operations shmem_aops;
>  static inline bool shmem_mapping(struct address_space *mapping)
>  {
> -	return mapping->a_ops == &shmem_aops;
> +	return mapping->host->i_sb->s_magic == TMPFS_MAGIC;

Alternatively just check a_ops against two possible values? Fewer chained
dereferences, no-op with !CONFIG_RESTRICTEDMEM, maybe Hugh would be less
unhappy with that.
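
For reference, a minimal sketch of that alternative. It assumes shmem_aops
stays exported and restricted_aops is also made visible to shmem_fs.h, which
the patch above does not currently do.

static inline bool shmem_mapping(struct address_space *mapping)
{
	if (mapping->a_ops == &shmem_aops)
		return true;
#ifdef CONFIG_RESTRICTEDMEM
	if (mapping->a_ops == &restricted_aops)
		return true;
#endif
	return false;
}

With !CONFIG_RESTRICTEDMEM the second compare disappears and this is the same
single pointer comparison as today.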

Besides that, IIRC Michael Roth mentioned that this approach for preventing
migration would be simpler for SNP than the refcount elevation? Do I recall
right and should this be pursued then?

>  }
>  #else
>  static inline bool shmem_mapping(struct address_space *mapping)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f91b444e471e..145bb561ddb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -62,7 +62,6 @@
>  #include <linux/page-isolation.h>
>  #include <linux/pagewalk.h>
>  #include <linux/shmem_fs.h>
> -#include <linux/restrictedmem.h>
>  #include "swap.h"
>  #include "internal.h"
>  #include "ras/ras_event.h"
> @@ -941,8 +940,6 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
>  		goto out;
>  	}
>  
> -	restrictedmem_error_page(p, mapping);
> -
>  	/*
>  	 * The shmem page is kept in page cache instead of truncating
>  	 * so is expected to have an extra refcount after error-handling.
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..d0ca609b82cb 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -189,6 +189,51 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>  	return file;
>  }
>  
> +static int restricted_error_remove_page(struct address_space *mapping,
> +					struct page *page)
> +{
> +	struct super_block *sb = restrictedmem_mnt->mnt_sb;
> +	struct inode *inode, *next;
> +	pgoff_t start, end;
> +
> +	start = page->index;
> +	end = start + thp_nr_pages(page);
> +
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> +		struct restrictedmem *rm = inode->i_mapping->private_data;
> +		struct restrictedmem_notifier *notifier;
> +		struct file *memfd = rm->memfd;
> +		unsigned long index;
> +
> +		if (memfd->f_mapping != mapping)
> +			continue;
> +
> +		xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +			notifier->ops->error(notifier, start, end);
> +		break;
> +	}
> +	spin_unlock(&sb->s_inode_list_lock);
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_MIGRATION
> +static int restricted_folio(struct address_space *mapping, struct folio *dst,
> +			    struct folio *src, enum migrate_mode mode)
> +{
> +	return -EBUSY;
> +}
> +#endif
> +
> +static struct address_space_operations restricted_aops = {
> +	.dirty_folio	= noop_dirty_folio,
> +	.error_remove_page = restricted_error_remove_page,
> +#ifdef CONFIG_MIGRATION
> +	.migrate_folio	= restricted_folio,
> +#endif
> +};
> +
>  SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  {
>  	struct file *file, *restricted_file;
> @@ -209,6 +254,8 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>  	file->f_flags |= O_LARGEFILE;
>  
> +	file->f_mapping->a_ops = &restricted_aops;
> +
>  	restricted_file = restrictedmem_file_create(file);
>  	if (IS_ERR(restricted_file)) {
>  		err = PTR_ERR(restricted_file);
> @@ -293,31 +340,3 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
>  }
>  EXPORT_SYMBOL_GPL(restrictedmem_get_page);
>  
> -void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> -{
> -	struct super_block *sb = restrictedmem_mnt->mnt_sb;
> -	struct inode *inode, *next;
> -	pgoff_t start, end;
> -
> -	if (!shmem_mapping(mapping))
> -		return;
> -
> -	start = page->index;
> -	end = start + thp_nr_pages(page);
> -
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		struct restrictedmem *rm = inode->i_mapping->private_data;
> -		struct restrictedmem_notifier *notifier;
> -		struct file *memfd = rm->memfd;
> -		unsigned long index;
> -
> -		if (memfd->f_mapping != mapping)
> -			continue;
> -
> -		xa_for_each_range(&rm->bindings, index, notifier, start, end)
> -			notifier->ops->error(notifier, start, end);
> -		break;
> -	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -}
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c1d8b8a1aa3b..3df4d95784b9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -231,7 +231,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages)
>  }
>  
>  static const struct super_operations shmem_ops;
> -const struct address_space_operations shmem_aops;
> +static const struct address_space_operations shmem_aops;
>  static const struct file_operations shmem_file_operations;
>  static const struct inode_operations shmem_inode_operations;
>  static const struct inode_operations shmem_dir_inode_operations;
> @@ -3894,7 +3894,7 @@ static int shmem_error_remove_page(struct address_space *mapping,
>  	return 0;
>  }
>  
> -const struct address_space_operations shmem_aops = {
> +static const struct address_space_operations shmem_aops = {
>  	.writepage	= shmem_writepage,
>  	.dirty_folio	= noop_dirty_folio,
>  #ifdef CONFIG_TMPFS
> @@ -3906,7 +3906,6 @@ const struct address_space_operations shmem_aops = {
>  #endif
>  	.error_remove_page = shmem_error_remove_page,
>  };
> -EXPORT_SYMBOL(shmem_aops);
>  
>  static const struct file_operations shmem_file_operations = {
>  	.mmap		= shmem_mmap,


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-24  1:27         ` Sean Christopherson
  2023-02-08 12:24           ` Isaku Yamahata
@ 2023-02-13 13:01           ` Michael Roth
  2023-02-21 12:11             ` Chao Peng
  2023-04-17 14:37           ` Chao Peng
  2 siblings, 1 reply; 398+ messages in thread
From: Michael Roth @ 2023-02-13 13:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, wei.w.wang

On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > >   - 01: mm change, target for mm tree
> > > > > >   - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > > 
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > > 
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > > > > (and I mean that).
> > > > > 
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > > 
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details.  Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > > 
> > > > Although I'm still debugging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held.  The race to change
> > > > private/shared is handled by mmu_seq.  Maybe a dedicated function only for
> > > > kvm_faultin_pfn().
> > > 
> > > Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> > > the lockdep assertion is safe, for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > > 
> > > I'll get a fixed version pushed out today.
> > > 
> > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> > 
> > Now I have tdx kvm working. I've uploaded at the followings.
> > It's rebased to v6.2-rc3.
> >         git@github.com:yamahata/linux.git tdx/upm
> >         git@github.com:yamahata/qemu.git tdx/upm
> 
> And I finally got a working, building version updated and pushed out (again to):
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> Took longer than expected to get the memslot restrictions sussed out.  I'm done
> working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.
> 
> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
> 
> Yeah, SNP needs similar treatment.  Sorting that out is high up on the todo list.

Hi Sean,

We've rebased the SEV+SNP support onto your updated UPM base support
tree and things seem to be working okay, but we needed some fixups on
top of the base support to get things working, along with one workaround
for an issue that hasn't been root-caused yet:

  https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip

  *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
  *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
  *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
  *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
  *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
  *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes

We plan to post an updated RFC for v8 soon, but also wanted to share
the staging tree in case you end up looking at the UPM integration aspects
before then.

-Mike

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-23 15:43                 ` Kirill A. Shutemov
  2023-02-13 11:43                   ` Vlastimil Babka
@ 2023-02-13 13:10                   ` Michael Roth
  1 sibling, 0 replies; 398+ messages in thread
From: Michael Roth @ 2023-02-13 13:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sean Christopherson, Chao Peng, Huang, Kai, tglx, linux-arch,
	kvm, jmattson, Lutomirski, Andy, ak, kirill.shutemov, Hocko,
	Michal, qemu-devel, tabba, david, corbet, bfields, dhildenb,
	linux-kernel, linux-fsdevel, x86, bp, linux-api, rppt, shuah,
	vkuznets, vbabka, mail, ddutile, qperret, arnd, pbonzini,
	vannapurve, naoya.horiguchi, wanpengli, yu.c.zhang, hughd,
	aarcange, mingo, hpa, Nakajima, Jun, jlayton, joro, linux-mm,
	Wang, Wei W, steven.price, linux-doc, Hansen, Dave, akpm,
	linmiaohe

On Mon, Jan 23, 2023 at 06:43:34PM +0300, Kirill A. Shutemov wrote:
> On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote:
> > On Wed, Dec 21, 2022, Chao Peng wrote:
> > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > > 
> > > That's true. Actually even true for restrictedmem case, most likely we
> > > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > > side, other restrictedmem users like pKVM may not require page pinning
> > > at all. On the other side, see below.
> > > 
> > > > 
> > > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> > 
> > No, requiring the user (KVM) to guard against lack of support for page migration
> > in restricted mem is a terrible API.  It's totally fine for restricted mem to not
> > support page migration until there's a use case, but punting the problem to KVM
> > is not acceptable.  Restricted mem itself doesn't yet support page migration,
> > e.g. explosions would occur even if KVM wanted to allow migration since there is
> > no notification to invalidate existing mappings.
> 
> I tried to find a way to hook into the migration path from restrictedmem. It
> is not easy because from the core-mm PoV the restrictedmem page is just yet
> another shmem page.
> 
> It is somewhat dubious, but I think it should be safe to override
> mapping->a_ops for the shmem mapping.
> 
> It also eliminates need in special treatment for the restrictedmem pages
> from memory-failure code.
> 
> shmem_mapping() uses ->a_ops to detect shmem mapping. Modify the
> implementation to still be true for restrictedmem pages.
> 
> Build tested only.
> 
> Any comments?

Hi Kirill,

We've been testing your approach to handle pinning for the SNP+UPM
implementation and haven't noticed any problems so far:

  (based on top of Sean's updated UPM v10 tree)
  https://github.com/mdroth/linux/commit/f780033e6812a01f8732060605d941474fee2bd6

Prior to your patch we also tried elevating refcount via
restrictedmem_get_page() for cases where shmem_get_folio(..., SGP_NOALLOC)
indicates the page hasn't been allocated yet, and that approach also
seems to work, but there are potential races and other ugliness that
make your approach seem a lot cleaner.

-Mike

> 
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> index 6fddb08f03cc..73ded3c3bad1 100644
> --- a/include/linux/restrictedmem.h
> +++ b/include/linux/restrictedmem.h
> @@ -36,8 +36,6 @@ static inline bool file_is_restrictedmem(struct file *file)
>  	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
>  }
>  
> -void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> -
>  #else
>  
>  static inline bool file_is_restrictedmem(struct file *file)
> @@ -45,11 +43,6 @@ static inline bool file_is_restrictedmem(struct file *file)
>  	return false;
>  }
>  
> -static inline void restrictedmem_error_page(struct page *page,
> -					    struct address_space *mapping)
> -{
> -}
> -
>  #endif /* CONFIG_RESTRICTEDMEM */
>  
>  #endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index d500ea967dc7..a4af160f37e4 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>  #include <linux/percpu_counter.h>
>  #include <linux/xattr.h>
>  #include <linux/fs_parser.h>
> +#include <linux/magic.h>
>  
>  /* inode in-kernel data */
>  
> @@ -75,10 +76,9 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
>  		unsigned long len, unsigned long pgoff, unsigned long flags);
>  extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
>  #ifdef CONFIG_SHMEM
> -extern const struct address_space_operations shmem_aops;
>  static inline bool shmem_mapping(struct address_space *mapping)
>  {
> -	return mapping->a_ops == &shmem_aops;
> +	return mapping->host->i_sb->s_magic == TMPFS_MAGIC;
>  }
>  #else
>  static inline bool shmem_mapping(struct address_space *mapping)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f91b444e471e..145bb561ddb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -62,7 +62,6 @@
>  #include <linux/page-isolation.h>
>  #include <linux/pagewalk.h>
>  #include <linux/shmem_fs.h>
> -#include <linux/restrictedmem.h>
>  #include "swap.h"
>  #include "internal.h"
>  #include "ras/ras_event.h"
> @@ -941,8 +940,6 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
>  		goto out;
>  	}
>  
> -	restrictedmem_error_page(p, mapping);
> -
>  	/*
>  	 * The shmem page is kept in page cache instead of truncating
>  	 * so is expected to have an extra refcount after error-handling.
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..d0ca609b82cb 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -189,6 +189,51 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>  	return file;
>  }
>  
> +static int restricted_error_remove_page(struct address_space *mapping,
> +					struct page *page)
> +{
> +	struct super_block *sb = restrictedmem_mnt->mnt_sb;
> +	struct inode *inode, *next;
> +	pgoff_t start, end;
> +
> +	start = page->index;
> +	end = start + thp_nr_pages(page);
> +
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> +		struct restrictedmem *rm = inode->i_mapping->private_data;
> +		struct restrictedmem_notifier *notifier;
> +		struct file *memfd = rm->memfd;
> +		unsigned long index;
> +
> +		if (memfd->f_mapping != mapping)
> +			continue;
> +
> +		xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +			notifier->ops->error(notifier, start, end);
> +		break;
> +	}
> +	spin_unlock(&sb->s_inode_list_lock);
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_MIGRATION
> +static int restricted_folio(struct address_space *mapping, struct folio *dst,
> +			    struct folio *src, enum migrate_mode mode)
> +{
> +	return -EBUSY;
> +}
> +#endif
> +
> +static struct address_space_operations restricted_aops = {
> +	.dirty_folio	= noop_dirty_folio,
> +	.error_remove_page = restricted_error_remove_page,
> +#ifdef CONFIG_MIGRATION
> +	.migrate_folio	= restricted_folio,
> +#endif
> +};
> +
>  SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  {
>  	struct file *file, *restricted_file;
> @@ -209,6 +254,8 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>  	file->f_flags |= O_LARGEFILE;
>  
> +	file->f_mapping->a_ops = &restricted_aops;
> +
>  	restricted_file = restrictedmem_file_create(file);
>  	if (IS_ERR(restricted_file)) {
>  		err = PTR_ERR(restricted_file);
> @@ -293,31 +340,3 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
>  }
>  EXPORT_SYMBOL_GPL(restrictedmem_get_page);
>  
> -void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> -{
> -	struct super_block *sb = restrictedmem_mnt->mnt_sb;
> -	struct inode *inode, *next;
> -	pgoff_t start, end;
> -
> -	if (!shmem_mapping(mapping))
> -		return;
> -
> -	start = page->index;
> -	end = start + thp_nr_pages(page);
> -
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		struct restrictedmem *rm = inode->i_mapping->private_data;
> -		struct restrictedmem_notifier *notifier;
> -		struct file *memfd = rm->memfd;
> -		unsigned long index;
> -
> -		if (memfd->f_mapping != mapping)
> -			continue;
> -
> -		xa_for_each_range(&rm->bindings, index, notifier, start, end)
> -			notifier->ops->error(notifier, start, end);
> -		break;
> -	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -}
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c1d8b8a1aa3b..3df4d95784b9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -231,7 +231,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages)
>  }
>  
>  static const struct super_operations shmem_ops;
> -const struct address_space_operations shmem_aops;
> +static const struct address_space_operations shmem_aops;
>  static const struct file_operations shmem_file_operations;
>  static const struct inode_operations shmem_inode_operations;
>  static const struct inode_operations shmem_dir_inode_operations;
> @@ -3894,7 +3894,7 @@ static int shmem_error_remove_page(struct address_space *mapping,
>  	return 0;
>  }
>  
> -const struct address_space_operations shmem_aops = {
> +static const struct address_space_operations shmem_aops = {
>  	.writepage	= shmem_writepage,
>  	.dirty_folio	= noop_dirty_folio,
>  #ifdef CONFIG_TMPFS
> @@ -3906,7 +3906,6 @@ const struct address_space_operations shmem_aops = {
>  #endif
>  	.error_remove_page = shmem_error_remove_page,
>  };
> -EXPORT_SYMBOL(shmem_aops);
>  
>  static const struct file_operations shmem_file_operations = {
>  	.mmap		= shmem_mmap,
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-23 15:18                   ` Kirill A. Shutemov
@ 2023-02-13 14:23                     ` Vlastimil Babka
  0 siblings, 0 replies; 398+ messages in thread
From: Vlastimil Babka @ 2023-02-13 14:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Kai, chao.p.peng, tglx, linux-arch, kvm, jmattson, Hocko,
	Michal, pbonzini, ak, Lutomirski, Andy, linux-fsdevel, tabba,
	david, michael.roth, kirill.shutemov, corbet, qemu-devel,
	dhildenb, bfields, linux-kernel, x86, bp, ddutile, rppt, shuah,
	vkuznets, mail, naoya.horiguchi, qperret, arnd, linux-api,
	yu.c.zhang, Christopherson,,
	Sean, wanpengli, vannapurve, hughd, aarcange, mingo, hpa,
	Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
	steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe

On 1/23/23 16:18, Kirill A. Shutemov wrote:
> On Mon, Jan 23, 2023 at 03:03:45PM +0100, Vlastimil Babka wrote:
>> On 12/22/22 01:37, Huang, Kai wrote:
>> >>> I argue that this page pinning (or page migration prevention) is not
>> >>> tied to where the page comes from, instead related to how the page will
>> >>> be used. Whether the page is restrictedmem backed or GUP() backed, once
>> >>> it's used by the current version of TDX then page pinning is needed. So
>> >>> such page migration prevention is really a TDX thing, not even a KVM-generic
>> >>> thing (that's why I think we don't need to change the existing logic of
>> >>> kvm_release_pfn_clean()). 
>> >>>
>> > This essentially boils down to who "owns" page migration handling, and sadly,
>> > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
>> > migration by itself -- it's just a passive receiver.
>> > 
>> > For normal pages, page migration is totally done by the core-kernel (i.e. it
>> > unmaps page from VMA, allocates a new page, and uses migrate_page() or a_ops-
>> >> migrate_page() to actually migrate the page).
>> > In the sense of TDX, conceptually it should be done in the same way. The more
>> > important thing is: yes KVM can use get_page() to prevent page migration, but
>> > when KVM wants to support it, KVM cannot just remove get_page(), as the core-
>> > kernel will still just do migrate_page() which won't work for TDX (given
>> > restricted_memfd doesn't have a_ops->migrate_page() implemented).
>> > 
>> > So I think the restricted_memfd filesystem should own page migration handling,
>> > (i.e. by implementing a_ops->migrate_page() to either just reject page migration
>> > or somehow support it).
>> 
>> While this thread seems to be settled on refcounts already, just wanted
>> to point out that it wouldn't be ideal to prevent migrations by
>> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
>> by memory compaction) by isolating the pages for migration and then
>> releasing them after the callback rejects it (at least we wouldn't waste
>> time creating and undoing migration entries in the userspace page tables
>> as there's no mmap). Elevated refcount on the other hand is detected
>> very early in compaction so no isolation is attempted, so from that
>> aspect it's optimal.
> 
> Hm. Do we need a new hook in a_ops to check if the page is migratable
> before going down the longer path to migrate_page()?
> 
> Or maybe add AS_UNMOVABLE?

AS_UNMOVABLE should indeed allow a test in e.g. compaction to decide that
the page is not worth isolating in the first place.
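
A rough sketch of that idea, purely hypothetical: AS_UNMOVABLE does not exist
in this tree and the names below are made up for illustration.

/* include/linux/pagemap.h: a new address_space flag plus helpers */
#define AS_UNMOVABLE	8	/* pages in this mapping are never migratable */

static inline void mapping_set_unmovable(struct address_space *mapping)
{
	set_bit(AS_UNMOVABLE, &mapping->flags);
}

static inline bool mapping_unmovable(struct address_space *mapping)
{
	return test_bit(AS_UNMOVABLE, &mapping->flags);
}

/* mm/restrictedmem.c: tag the mapping once, when memfd_restricted() creates it */
	mapping_set_unmovable(file->f_mapping);

/* mm/compaction.c: reject the page before any isolation work is done */
	if (mapping && mapping_unmovable(mapping))
		goto isolate_fail;

Like the elevated refcount, the page would then be rejected before compaction
spends any effort isolating it.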

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-02-10  0:35     ` Sean Christopherson
@ 2023-02-13 23:53       ` Isaku Yamahata
  2023-02-14 18:07         ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2023-02-13 23:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang

On Fri, Feb 10, 2023 at 12:35:30AM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> > On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > 
> > > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > +					   struct kvm_memory_attributes *attrs)
> > > +{
> > > +	gfn_t start, end;
> > > +	unsigned long i;
> > > +	void *entry;
> > > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > +
> > > +	/* flags is currently not used. */
> > > +	if (attrs->flags)
> > > +		return -EINVAL;
> > > +	if (attrs->attributes & ~supported_attrs)
> > > +		return -EINVAL;
> > > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > > +		return -EINVAL;
> > > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > > +		return -EINVAL;
> > > +
> > > +	start = attrs->address >> PAGE_SHIFT;
> > > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > +
> > > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > +
> > > +	mutex_lock(&kvm->lock);
> > > +	for (i = start; i < end; i++)
> > > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > +				    GFP_KERNEL_ACCOUNT)))
> > > +			break;
> > > +	mutex_unlock(&kvm->lock);
> > > +
> > > +	attrs->address = i << PAGE_SHIFT;
> > > +	attrs->size = (end - i) << PAGE_SHIFT;
> > > +
> > > +	return 0;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > +
> > 
> > If memslot isn't private, it should return error if private attribute is set.
> 
> Why?  I'd rather keep the two things separate.  If we enforce this sort of thing
> at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
> KVM_SET_USER_MEMORY_REGION.

For device assignment via shared GPA, a non-private memory slot needs to be
allowed.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-02-13 23:53       ` Isaku Yamahata
@ 2023-02-14 18:07         ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-02-14 18:07 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Feb 13, 2023, Isaku Yamahata wrote:
> On Fri, Feb 10, 2023 at 12:35:30AM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> > > On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > 
> > > > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > +					   struct kvm_memory_attributes *attrs)
> > > > +{
> > > > +	gfn_t start, end;
> > > > +	unsigned long i;
> > > > +	void *entry;
> > > > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > > +
> > > > +	/* flags is currently not used. */
> > > > +	if (attrs->flags)
> > > > +		return -EINVAL;
> > > > +	if (attrs->attributes & ~supported_attrs)
> > > > +		return -EINVAL;
> > > > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > > > +		return -EINVAL;
> > > > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > > > +		return -EINVAL;
> > > > +
> > > > +	start = attrs->address >> PAGE_SHIFT;
> > > > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > > +
> > > > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > > +
> > > > +	mutex_lock(&kvm->lock);
> > > > +	for (i = start; i < end; i++)
> > > > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > > +				    GFP_KERNEL_ACCOUNT)))
> > > > +			break;
> > > > +	mutex_unlock(&kvm->lock);
> > > > +
> > > > +	attrs->address = i << PAGE_SHIFT;
> > > > +	attrs->size = (end - i) << PAGE_SHIFT;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > > +
> > > 
> > > If memslot isn't private, it should return error if private attribute is set.
> > 
> > Why?  I'd rather keep the two things separate.  If we enforce this sort of thing
> > at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
> > KVM_SET_USER_MEMORY_REGION.
> 
> For device assignment via shared GPA, a non-private memory slot needs to be
> allowed.

That doesn't say anything about why setting attributes needs to poke into the
memslot.  The page fault path already kicks out to userspace if there's a
discrepancy between the attributes and the memslot, why is that insufficient?
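
For illustration, a rough sketch of the kind of check being described
(hedged: kvm_mem_is_private() is an assumed helper over the per-gfn
attribute xarray, not necessarily this series' API;
kvm_slot_can_be_private() is the name the series uses for "memslot has a
restrictedmem fd bound"):

static bool kvm_fault_needs_userspace_exit(struct kvm *kvm,
					   struct kvm_memory_slot *slot,
					   gfn_t gfn, bool fault_is_private)
{
	/* Assumed helper: reads KVM_MEMORY_ATTRIBUTE_PRIVATE for this gfn. */
	bool attr_private = kvm_mem_is_private(kvm, gfn);

	/* Attribute says private but the slot cannot back private memory. */
	if (attr_private && !kvm_slot_can_be_private(slot))
		return true;

	/* Access type disagrees with the recorded attribute: let userspace convert. */
	return fault_is_private != attr_private;
}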

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (9 preceding siblings ...)
  2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
@ 2023-02-16  5:13 ` Mike Rapoport
  2023-02-16  9:41   ` David Hildenbrand
  2023-04-17 15:40 ` Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM) Sean Christopherson
  11 siblings, 1 reply; 398+ messages in thread
From: Mike Rapoport @ 2023-02-16  5:13 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

Hi,

On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.

Sorry for jumping late.

Unless I'm missing something, hibernation will also cause a machine check
when there is TDX-protected memory in the system. When hibernation
creates the memory snapshot it essentially walks all physical pages and saves
their contents, so for TDX memory this will trigger a machine check, right?
 
>  Documentation/virt/kvm/api.rst         | 125 ++++++-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/asm/kvm_host.h        |   9 +
>  arch/x86/kvm/Kconfig                   |   3 +
>  arch/x86/kvm/mmu/mmu.c                 | 205 ++++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
>  arch/x86/kvm/mmu/mmutrace.h            |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c             |   2 +-
>  arch/x86/kvm/x86.c                     |  17 +-
>  include/linux/kvm_host.h               | 103 +++++-
>  include/linux/restrictedmem.h          |  71 ++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/kvm.h               |  53 +++
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/memory-failure.c                    |   3 +
>  mm/restrictedmem.c                     | 318 +++++++++++++++++
>  virt/kvm/Kconfig                       |   6 +
>  virt/kvm/kvm_main.c                    | 469 +++++++++++++++++++++----
>  23 files changed, 1323 insertions(+), 93 deletions(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-02-16  5:13 ` Mike Rapoport
@ 2023-02-16  9:41   ` David Hildenbrand
  2023-02-22 21:53     ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2023-02-16  9:41 UTC (permalink / raw)
  To: Mike Rapoport, Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, wei.w.wang

On 16.02.23 06:13, Mike Rapoport wrote:
> Hi,
> 
> On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
>> This patch series implements KVM guest private memory for confidential
>> computing scenarios like Intel TDX[1]. If a TDX host accesses
>> TDX-protected guest memory, machine check can happen which can further
>> crash the running host system, this is terrible for multi-tenant
>> configurations. The host accesses include those from KVM userspace like
>> QEMU. This series addresses KVM userspace induced crash by introducing
>> new mm and KVM interfaces so KVM userspace can still manage guest memory
>> via a fd-based approach, but it can never access the guest memory
>> content.
> 
> Sorry for jumping late.
> 
> Unless I'm missing something, hibernation will also cause a machine check
> when there is TDX-protected memory in the system. When hibernation
> creates the memory snapshot it essentially walks all physical pages and saves
> their contents, so for TDX memory this will trigger a machine check, right?

I recall bringing that up in the past (also memory access due to kdump, 
/proc/kcore) and was told that the main focus for now is preventing 
unprivileged users from crashing the system, that is, not mapping such 
memory into user space (e.g., QEMU). In the long run, we'll want to 
handle such pages also properly in the other events where the kernel 
might access them.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                     ` (3 preceding siblings ...)
  2023-01-30  5:26   ` Ackerley Tng
@ 2023-02-16  9:51   ` Nikunj A. Dadhania
  2023-03-20 19:08     ` Michael Roth
  2023-04-13 15:25   ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Christian Brauner
  2023-04-13 17:22   ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Ackerley Tng
  6 siblings, 1 reply; 398+ messages in thread
From: Nikunj A. Dadhania @ 2023-02-16  9:51 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang


> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +	struct restrictedmem_data *data;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	struct file *file;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return ERR_PTR(-ENOMEM);
> +
> +	data->memfd = memfd;
> +	mutex_init(&data->lock);
> +	INIT_LIST_HEAD(&data->notifiers);
> +
> +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +	if (IS_ERR(inode)) {
> +		kfree(data);
> +		return ERR_CAST(inode);
> +	}

alloc_anon_inode() uses new_inode_pseudo() to get the inode. As per the comment, the new inode
is not added to the superblock's s_inodes list.

/**
 *	new_inode_pseudo 	- obtain an inode
 *	@sb: superblock
 *
 *	Allocates a new inode for given superblock.
 *	Inode wont be chained in superblock s_inodes list
 *	This means :
 *	- fs can't be unmount
 *	- quotas, fsnotify, writeback can't work
 */

So restrictedmem_error_page() will not find the inode, as it was never added to the s_inodes list.

We might need to add the inode after allocating it:

	inode_sb_list_add(inode);

> +void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> +{
> +	struct super_block *sb = restrictedmem_mnt->mnt_sb;
> +	struct inode *inode, *next;
> +
> +	if (!shmem_mapping(mapping))
> +		return;
> +
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> +		struct restrictedmem_data *data = inode->i_mapping->private_data;
> +		struct file *memfd = data->memfd;
> +
> +		if (memfd->f_mapping == mapping) {
> +			pgoff_t start, end;
> +
> +			spin_unlock(&sb->s_inode_list_lock);
> +
> +			start = page->index;
> +			end = start + thp_nr_pages(page);
> +			restrictedmem_notifier_error(data, start, end);
> +			return;
> +		}
> +	}
> +	spin_unlock(&sb->s_inode_list_lock);
> +}
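
A minimal sketch of where the suggested call would go in
restrictedmem_file_create(), given the code quoted above (illustrative
only, not necessarily the final fix):

	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
	if (IS_ERR(inode)) {
		kfree(data);
		return ERR_CAST(inode);
	}
	/*
	 * Suggested addition: chain the pseudo inode onto sb->s_inodes so
	 * restrictedmem_error_page() can find it when walking the list on
	 * a memory failure.
	 */
	inode_sb_list_add(inode);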

Regards
Nikunj

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-02-13 13:01           ` Michael Roth
@ 2023-02-21 12:11             ` Chao Peng
  2023-03-23  1:27               ` Michael Roth
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-02-21 12:11 UTC (permalink / raw)
  To: Michael Roth
  Cc: Sean Christopherson, Isaku Yamahata, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, wei.w.wang

> Hi Sean,
> 
> We've rebased the SEV+SNP support onto your updated UPM base support
> tree and things seem to be working okay, but we needed some fixups on
> top of the base support get things working, along with 1 workaround
> for an issue that hasn't been root-caused yet:
> 
>   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> 
>   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
>   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check

What I'm seeing is Slot#3 gets added first and then deleted. When it
gets added, Slot#0 already has the same range bound to restrictedmem, so
it triggers the exclusive check. This check is exactly what the current code is for.

>   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
>   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations

As many kernel APIs treat 'end' as exclusive, I would rather keep using an
exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
and notifier callbacks) but fix it internally in restrictedmem, e.g. in
all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
See below for the change.

>   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations

Subtracting slot->restrictedmem.index for start/end in
restrictedmem_get_gfn_range() is the correct fix.

>   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> 
> We plan to post an updated RFC for v8 soon, but also wanted to share
> the staging tree in case you end up looking at the UPM integration aspects
> before then.
> 
> -Mike

This is the restrictedmem fix to solve 'end' being stored and checked in xarray:

--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
         */
        down_read(&rm->lock);
 
-       xa_for_each_range(&rm->bindings, index, notifier, start, end)
+       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                notifier->ops->invalidate_start(notifier, start, end);
 
        ret = memfd->f_op->fallocate(memfd, mode, offset, len);
 
-       xa_for_each_range(&rm->bindings, index, notifier, start, end)
+       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                notifier->ops->invalidate_end(notifier, start, end);
 
        up_read(&rm->lock);
@@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
                }
                spin_unlock(&inode->i_lock);
 
-               xa_for_each_range(&rm->bindings, index, notifier, start, end)
+               xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
                        notifier->ops->error(notifier, start, end);
                break;
        }
@@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
                if (exclusive != rm->exclusive)
                        goto out_unlock;
 
-               if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
+               if (exclusive &&
+                   xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
                        goto out_unlock;
        }
 
-       xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
+       xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
        rm->exclusive = exclusive;
        ret = 0;
 out_unlock:
@@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
        struct restrictedmem *rm = file->f_mapping->private_data;
 
        down_write(&rm->lock);
-       xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
+       xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
        synchronize_rcu();
        up_write(&rm->lock);
 }

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-01-13 21:54   ` Sean Christopherson
  2023-01-17 12:41     ` Chao Peng
@ 2023-02-22  2:07     ` Alexey Kardashevskiy
  2023-02-24  5:42       ` Chao Peng
  1 sibling, 1 reply; 398+ messages in thread
From: Alexey Kardashevskiy @ 2023-02-22  2:07 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
	Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On 14/1/23 08:54, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
>> The system call is currently wired up for x86 arch.
> 
> Building on other architectures (except for arm64 for some reason) yields:
> 
>    CALL    /.../scripts/checksyscalls.sh
>    <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp]
> 
> Do we care?  It's the only such warning, which makes me think we either need to
> wire this up for all architectures, or explicitly document that it's unsupported.
> 
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> ---
> 
> ...
> 
>> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
>> new file mode 100644
>> index 000000000000..c2700c5daa43
>> --- /dev/null
>> +++ b/include/linux/restrictedmem.h
>> @@ -0,0 +1,71 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _LINUX_RESTRICTEDMEM_H
> 
> Missing
> 
>   #define _LINUX_RESTRICTEDMEM_H
> 
> which causes fireworks if restrictedmem.h is included more than once.
> 
>> +#include <linux/file.h>
>> +#include <linux/magic.h>
>> +#include <linux/pfn_t.h>
> 
> ...
> 
>> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
>> +					 struct page **pagep, int *order)
>> +{
>> +	return -1;
> 
> This should be a proper -errno, though in the current incarnation of things it's
> a moot point because no stub is needed.  KVM can (and should) easily provide its
> own stub for this one.
> 
>> +}
>> +
>> +static inline bool file_is_restrictedmem(struct file *file)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline void restrictedmem_error_page(struct page *page,
>> +					    struct address_space *mapping)
>> +{
>> +}
>> +
>> +#endif /* CONFIG_RESTRICTEDMEM */
>> +
>> +#endif /* _LINUX_RESTRICTEDMEM_H */
> 
> ...
> 
>> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
>> new file mode 100644
>> index 000000000000..56953c204e5c
>> --- /dev/null
>> +++ b/mm/restrictedmem.c
>> @@ -0,0 +1,318 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include "linux/sbitmap.h"
>> +#include <linux/pagemap.h>
>> +#include <linux/pseudo_fs.h>
>> +#include <linux/shmem_fs.h>
>> +#include <linux/syscalls.h>
>> +#include <uapi/linux/falloc.h>
>> +#include <uapi/linux/magic.h>
>> +#include <linux/restrictedmem.h>
>> +
>> +struct restrictedmem_data {
> 
> Any objection to simply calling this "restrictedmem"?  And then using either "rm"
> or "rmem" for local variable names?  I kept reading "data" as the underyling data
> being written to the page, as opposed to the metadata describing the restrictedmem
> instance.
> 
>> +	struct mutex lock;
>> +	struct file *memfd;
>> +	struct list_head notifiers;
>> +};
>> +
>> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
>> +					   pgoff_t start, pgoff_t end)
>> +{
>> +	struct restrictedmem_notifier *notifier;
>> +
>> +	mutex_lock(&data->lock);
> 
> This can be a r/w semaphore instead of a mutex, that way punching holes at multiple
> points in the file can at least run the notifiers in parallel.  The actual allocation
> by shmem will still be serialized, but I think it's worth the simple optimization
> since zapping and flushing in KVM may be somewhat slow.
> 
>> +	list_for_each_entry(notifier, &data->notifiers, list) {
>> +		notifier->ops->invalidate_start(notifier, start, end);
> 
> Two major design issues that we overlooked long ago:
> 
>    1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
>       VM with a large number of convertible memslots that are all backed by a
>       single large restrictedmem instance, then converting a single page will
>       result in a linear walk through all memslots.  I don't expect anyone to
>       actually do something silly like that, but I also never expected there to be
>       a legitimate usecase for thousands of memslots.
> 
>    2. This approach fails to provide the ability for KVM to ensure a guest has
>       exclusive access to a page.  As discussed in the past, the kernel can rely
>       on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
>       only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
>       e.g. SEV, there is value in ensuring a 1:1 association.
> 
>       And probably more importantly, relying on hardware for SNP and TDX yields a
>       poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
>       page is exclusive to a guest, i.e. if userspace can hand out the same page
>       from a restrictedmem instance to multiple VMs, then failure will occur only
>       when KVM tries to assign the page to the second VM.  That will happen deep
>       in KVM, which means KVM needs to gracefully handle such errors, and it means
>       that KVM's ABI effectively allows plumbing garbage into its memslots.
> 
> Rather than use a simple list of notifiers, this appears to be yet another
> opportunity to use an xarray.  Supporting sharing of restrictedmem will be
> non-trivial, but IMO we should punt that to the future since it's still unclear
> exactly how sharing will work.
> 
> An xarray will solve #1 by notifying only the consumers (memslots) that are bound
> to the affected range.
> 
> And for #2, it's relatively straightforward (knock wood) to detect existing
> entries, i.e. if the user wants exclusive access to memory, then the bind operation
> can be rejected if there's an existing entry.
> 
> VERY lightly tested code snippet at the bottom (will provide link to fully worked
> code in cover letter).
> 
> 
>> +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
>> +				     loff_t offset, loff_t len)
>> +{
>> +	int ret;
>> +	pgoff_t start, end;
>> +	struct file *memfd = data->memfd;
>> +
>> +	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
>> +		return -EINVAL;
>> +
>> +	start = offset >> PAGE_SHIFT;
>> +	end = (offset + len) >> PAGE_SHIFT;
>> +
>> +	restrictedmem_invalidate_start(data, start, end);
>> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>> +	restrictedmem_invalidate_end(data, start, end);
> 
> The lock needs to be held for the entire duration of the hole punch, i.e. needs to
> be taken before invalidate_start() and released after invalidate_end().  If a user
> (un)binds/(un)registers after invalidate_start(), it will see an unpaired notification,
> e.g. could leave KVM with incorrect notifier counts.
> 
>> +
>> +	return ret;
>> +}
> 
> What I ended up with for an xarray-based implementation.  I'm very flexible on
> names and whatnot, these are just what made sense to me.
> 
> static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> 				     loff_t offset, loff_t len)
> {
> 	struct restrictedmem_notifier *notifier;
> 	struct file *memfd = rm->memfd;
> 	unsigned long index;
> 	pgoff_t start, end;
> 	int ret;
> 
> 	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> 		return -EINVAL;
> 
> 	start = offset >> PAGE_SHIFT;
> 	end = (offset + len) >> PAGE_SHIFT;
> 
> 	/*
> 	 * Bindings must be stable across invalidation to ensure the start+end
> 	 * are balanced.
> 	 */
> 	down_read(&rm->lock);
> 
> 	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> 		notifier->ops->invalidate_start(notifier, start, end);
> 
> 	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> 
> 	xa_for_each_range(&rm->bindings, index, notifier, start, end)
> 		notifier->ops->invalidate_end(notifier, start, end);
> 
> 	up_read(&rm->lock);
> 
> 	return ret;
> }
> 
> int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> 		       struct restrictedmem_notifier *notifier, bool exclusive)
> {
> 	struct restrictedmem *rm = file->f_mapping->private_data;
> 	int ret = -EINVAL;
> 
> 	down_write(&rm->lock);
> 
> 	/* Non-exclusive mappings are not yet implemented. */
> 	if (!exclusive)
> 		goto out_unlock;
> 
> 	if (!xa_empty(&rm->bindings)) {
> 		if (exclusive != rm->exclusive)
> 			goto out_unlock;
> 
> 		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> 			goto out_unlock;
> 	}
> 
> 	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);


|| ld: mm/restrictedmem.o: in function `restrictedmem_bind':
mm/restrictedmem.c|295| undefined reference to `xa_store_range'


This is missing:
===
diff --git a/mm/Kconfig b/mm/Kconfig
index f952d0172080..03aca542c0da 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1087,6 +1087,7 @@ config SECRETMEM
  config RESTRICTEDMEM
         bool
         depends on TMPFS
+       select XARRAY_MULTI
===

Thanks,



> 	rm->exclusive = exclusive;
> 	ret = 0;
> out_unlock:
> 	up_write(&rm->lock);
> 	return ret;
> }
> EXPORT_SYMBOL_GPL(restrictedmem_bind);
> 
> void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> 			  struct restrictedmem_notifier *notifier)
> {
> 	struct restrictedmem *rm = file->f_mapping->private_data;
> 
> 	down_write(&rm->lock);
> 	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> 	synchronize_rcu();
> 	up_write(&rm->lock);
> }
> EXPORT_SYMBOL_GPL(restrictedmem_unbind);

-- 
Alexey


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-02-16  9:41   ` David Hildenbrand
@ 2023-02-22 21:53     ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-02-22 21:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, wei.w.wang

On Thu, Feb 16, 2023, David Hildenbrand wrote:
> On 16.02.23 06:13, Mike Rapoport wrote:
> > Hi,
> > 
> > On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > 
> > Sorry for jumping late.
> > 
> > Unless I'm missing something, hibernation will also cause a machine check
> > when there is TDX-protected memory in the system. When hibernation
> > creates the memory snapshot it essentially walks all physical pages and saves
> > their contents, so for TDX memory this will trigger a machine check, right?

For hibernation specifically, I think that should be handled elsewhere as hibernation
is simply incompatible with TDX, SNP, pKVM, etc. without paravirtualizing the
guest, as none of those technologies support auto-export a la s390.  I suspect
the right approach is to disallow hibernation if KVM is running any protected guests.
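
One possible shape for that, sketched purely for illustration (nothing
like this exists in the series; the protected-VM counter is an
assumption):

/*
 * Illustrative sketch only.  Assumes KVM keeps a count of live protected
 * guests; hibernation is vetoed while any exist, since the snapshot pass
 * would read TDX/SNP-protected pages and machine check.
 */
static atomic_t nr_protected_vms = ATOMIC_INIT(0);

static int kvm_protected_mem_pm_notifier(struct notifier_block *nb,
					 unsigned long state, void *unused)
{
	if (state == PM_HIBERNATION_PREPARE && atomic_read(&nr_protected_vms))
		return NOTIFY_BAD;

	return NOTIFY_DONE;
}

static struct notifier_block kvm_protected_mem_pm_nb = {
	.notifier_call = kvm_protected_mem_pm_notifier,
};

/* At init: register_pm_notifier(&kvm_protected_mem_pm_nb); */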

> I recall bringing that up in the past (also memory access due to kdump,
> /proc/kcore) and was told that the main focus for now is preventing
> unprivileged users from crashing the system, that is, not mapping such
> memory into user space (e.g., QEMU). In the long run, we'll want to handle
> such pages also properly in the other events where the kernel might access
> them.

Ya, unless someone strongly objects, the plan is to essentially treat "attacks"
from privileged users as out of scope for initial support, and then iterate
as needed to fix/enable more features.

FWIW, read accesses, e.g. kdump, should be ok for TDX and SNP as they both play
nice with "bad" reads.  pKVM is a different beast though as I believe any access
to guest private memory will fault.  But my understanding is that this series
would be a big step forward for pKVM, which currently doesn't have any safeguards.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-02-22  2:07     ` Alexey Kardashevskiy
@ 2023-02-24  5:42       ` Chao Peng
  0 siblings, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-02-24  5:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

> > int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> > 		       struct restrictedmem_notifier *notifier, bool exclusive)
> > {
> > 	struct restrictedmem *rm = file->f_mapping->private_data;
> > 	int ret = -EINVAL;
> > 
> > 	down_write(&rm->lock);
> > 
> > 	/* Non-exclusive mappings are not yet implemented. */
> > 	if (!exclusive)
> > 		goto out_unlock;
> > 
> > 	if (!xa_empty(&rm->bindings)) {
> > 		if (exclusive != rm->exclusive)
> > 			goto out_unlock;
> > 
> > 		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> > 			goto out_unlock;
> > 	}
> > 
> > 	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> 
> 
> || ld: mm/restrictedmem.o: in function `restrictedmem_bind':
> mm/restrictedmem.c|295| undefined reference to `xa_store_range'

Right, xa_store_range() is only available for XARRAY_MULTI.

> 
> 
> This is missing:
> ===
> diff --git a/mm/Kconfig b/mm/Kconfig
> index f952d0172080..03aca542c0da 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1087,6 +1087,7 @@ config SECRETMEM
>  config RESTRICTEDMEM
>         bool
>         depends on TMPFS
> +       select XARRAY_MULTI
> ===
> 
> Thanks,
> 
> 
> 
> > 	rm->exclusive = exclusive;
> > 	ret = 0;
> > out_unlock:
> > 	up_write(&rm->lock);
> > 	return ret;
> > }
> > EXPORT_SYMBOL_GPL(restrictedmem_bind);
> > 
> > void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> > 			  struct restrictedmem_notifier *notifier)
> > {
> > 	struct restrictedmem *rm = file->f_mapping->private_data;
> > 
> > 	down_write(&rm->lock);
> > 	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> > 	synchronize_rcu();
> > 	up_write(&rm->lock);
> > }
> > EXPORT_SYMBOL_GPL(restrictedmem_unbind);
> 
> -- 
> Alexey

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-12-02  6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
                     ` (2 preceding siblings ...)
  2023-01-14  0:01   ` Sean Christopherson
@ 2023-03-07 19:14   ` Ackerley Tng
  2023-03-07 20:27     ` Sean Christopherson
  3 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-03-07 19:14 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, pbonzini, corbet, seanjc,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, chao.p.peng, kirill.shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb, qperret,
	tabba, michael.roth, mhocko, wei.w.wang

Chao Peng <chao.p.peng@linux.intel.com> writes:

> Register/unregister private memslot to fd-based memory backing store
> restrictedmem and implement the callbacks for restrictedmem_notifier:
>    - invalidate_start()/invalidate_end() to zap the existing memory
>      mappings in the KVM page table.
>    - error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
>      with KVM_EXIT_SHUTDOWN.

> Expose KVM_MEM_PRIVATE for memslot and KVM_MEMORY_ATTRIBUTE_PRIVATE for
> KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace but either are
> controlled by kvm_arch_has_private_mem() which should be rewritten by
> architecture code.

Could we perhaps rename KVM_MEM_PRIVATE to KVM_MEM_PROTECTED, to be in
line with KVM_X86_PROTECTED_VM?

I feel that a memslot that has the KVM_MEM_PRIVATE flag need not always
be private; It can sometimes be providing memory that is shared and
also accessible from the host.

KVM_MEMORY_ATTRIBUTE_PRIVATE is fine as-is because this flag is set when
the guest memory is meant to be backed by private memory.

KVM_MEMORY_EXIT_FLAG_PRIVATE is also okay because the flag is used to
indicate when the memory error is caused by a private access (as opposed
to a shared access).

kvm_slot_can_be_private() could perhaps be renamed kvm_is_protected_slot()?


> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> ---
>   arch/x86/include/asm/kvm_host.h |   1 +
>   arch/x86/kvm/x86.c              |  13 +++
>   include/linux/kvm_host.h        |   3 +
>   virt/kvm/kvm_main.c             | 179 +++++++++++++++++++++++++++++++-
>   4 files changed, 191 insertions(+), 5 deletions(-)

> diff --git a/arch/x86/include/asm/kvm_host.h  
> b/arch/x86/include/asm/kvm_host.h
> index 7772ab37ac89..27ef31133352 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -114,6 +114,7 @@
>   	KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>   #define KVM_REQ_HV_TLB_FLUSH \
>   	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_MEMORY_MCE		KVM_ARCH_REQ(33)

>   #define CR0_RESERVED_BITS                                               \
>   	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5aefcff614d2..c67e22f3e2ee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6587,6 +6587,13 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned  
> long state)
>   }
>   #endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */

> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +void kvm_arch_memory_mce(struct kvm *kvm)
> +{
> +	kvm_make_all_cpus_request(kvm, KVM_REQ_MEMORY_MCE);
> +}
> +#endif
> +
>   static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
>   {
>   	struct kvm_clock_data data = { 0 };
> @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu  
> *vcpu)

>   		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
>   			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> +			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> +			r = 0;
> +			goto out;
> +		}
>   	}

>   	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 153842bb33df..f032d878e034 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -590,6 +590,7 @@ struct kvm_memory_slot {
>   	struct file *restricted_file;
>   	loff_t restricted_offset;
>   	struct restrictedmem_notifier notifier;
> +	struct kvm *kvm;
>   };

>   static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot  
> *slot)
> @@ -2363,6 +2364,8 @@ static inline int kvm_restricted_mem_get_pfn(struct  
> kvm_memory_slot *slot,
>   	*pfn = page_to_pfn(page);
>   	return ret;
>   }
> +
> +void kvm_arch_memory_mce(struct kvm *kvm);
>   #endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */

>   #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e107afea32f0..ac835fc77273 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -936,6 +936,121 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)

>   #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> +					 pgoff_t start, pgoff_t end,
> +					 gfn_t *gfn_start, gfn_t *gfn_end)
> +{
> +	unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +
> +	if (start > base_pgoff)
> +		*gfn_start = slot->base_gfn + start - base_pgoff;
> +	else
> +		*gfn_start = slot->base_gfn;
> +
> +	if (end < base_pgoff + slot->npages)
> +		*gfn_end = slot->base_gfn + end - base_pgoff;
> +	else
> +		*gfn_end = slot->base_gfn + slot->npages;
> +
> +	if (*gfn_start >= *gfn_end)
> +		return false;
> +
> +	return true;
> +}
> +
> +static void kvm_restrictedmem_invalidate_begin(struct  
> restrictedmem_notifier *notifier,
> +					       pgoff_t start, pgoff_t end)
> +{
> +	struct kvm_memory_slot *slot = container_of(notifier,
> +						    struct kvm_memory_slot,
> +						    notifier);
> +	struct kvm *kvm = slot->kvm;
> +	gfn_t gfn_start, gfn_end;
> +	struct kvm_gfn_range gfn_range;
> +	int idx;
> +
> +	if (!restrictedmem_range_is_valid(slot, start, end,
> +					  &gfn_start, &gfn_end))
> +		return;
> +
> +	gfn_range.start = gfn_start;
> +	gfn_range.end = gfn_end;
> +	gfn_range.slot = slot;
> +	gfn_range.pte = __pte(0);
> +	gfn_range.may_block = true;
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	KVM_MMU_LOCK(kvm);
> +
> +	kvm_mmu_invalidate_begin(kvm);
> +	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> +	if (kvm_unmap_gfn_range(kvm, &gfn_range))
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	KVM_MMU_UNLOCK(kvm);
> +	srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +static void kvm_restrictedmem_invalidate_end(struct  
> restrictedmem_notifier *notifier,
> +					     pgoff_t start, pgoff_t end)
> +{
> +	struct kvm_memory_slot *slot = container_of(notifier,
> +						    struct kvm_memory_slot,
> +						    notifier);
> +	struct kvm *kvm = slot->kvm;
> +	gfn_t gfn_start, gfn_end;
> +
> +	if (!restrictedmem_range_is_valid(slot, start, end,
> +					  &gfn_start, &gfn_end))
> +		return;
> +
> +	KVM_MMU_LOCK(kvm);
> +	kvm_mmu_invalidate_end(kvm);
> +	KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_restrictedmem_error(struct restrictedmem_notifier  
> *notifier,
> +				    pgoff_t start, pgoff_t end)
> +{
> +	struct kvm_memory_slot *slot = container_of(notifier,
> +						    struct kvm_memory_slot,
> +						    notifier);
> +	kvm_arch_memory_mce(slot->kvm);
> +}
> +
> +static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops  
> = {
> +	.invalidate_start = kvm_restrictedmem_invalidate_begin,
> +	.invalidate_end = kvm_restrictedmem_invalidate_end,
> +	.error = kvm_restrictedmem_error,
> +};
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot  
> *slot)
> +{
> +	slot->notifier.ops = &kvm_restrictedmem_notifier_ops;
> +	restrictedmem_register_notifier(slot->restricted_file, &slot->notifier);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot  
> *slot)
> +{
> +	restrictedmem_unregister_notifier(slot->restricted_file,
> +					  &slot->notifier);
> +}
> +
> +#else /* !CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot  
> *slot)
> +{
> +	WARN_ON_ONCE(1);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot  
> *slot)
> +{
> +	WARN_ON_ONCE(1);
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>   #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>   static int kvm_pm_notifier_call(struct notifier_block *bl,
>   				unsigned long state,
> @@ -980,6 +1095,11 @@ static void kvm_destroy_dirty_bitmap(struct  
> kvm_memory_slot *memslot)
>   /* This does not remove the slot from struct kvm_memslots data  
> structures */
>   static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot  
> *slot)
>   {
> +	if (slot->flags & KVM_MEM_PRIVATE) {
> +		kvm_restrictedmem_unregister(slot);
> +		fput(slot->restricted_file);
> +	}
> +
>   	kvm_destroy_dirty_bitmap(slot);

>   	kvm_arch_free_memslot(kvm, slot);
> @@ -1551,10 +1671,14 @@ static void kvm_replace_memslot(struct kvm *kvm,
>   	}
>   }

> -static int check_memory_region_flags(const struct kvm_user_mem_region  
> *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> +				     const struct kvm_user_mem_region *mem)
>   {
>   	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

> +	if (kvm_arch_has_private_mem(kvm))
> +		valid_flags |= KVM_MEM_PRIVATE;
> +
>   #ifdef __KVM_HAVE_READONLY_MEM
>   	valid_flags |= KVM_MEM_READONLY;
>   #endif
> @@ -1630,6 +1754,9 @@ static int kvm_prepare_memory_region(struct kvm  
> *kvm,
>   {
>   	int r;

> +	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +		kvm_restrictedmem_register(new);
> +
>   	/*
>   	 * If dirty logging is disabled, nullify the bitmap; the old bitmap
>   	 * will be freed on "commit".  If logging is enabled in both old and
> @@ -1658,6 +1785,9 @@ static int kvm_prepare_memory_region(struct kvm  
> *kvm,
>   	if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
>   		kvm_destroy_dirty_bitmap(new);

> +	if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +		kvm_restrictedmem_unregister(new);
> +
>   	return r;
>   }

> @@ -1963,7 +2093,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	int as_id, id;
>   	int r;

> -	r = check_memory_region_flags(mem);
> +	r = check_memory_region_flags(kvm, mem);
>   	if (r)
>   		return r;

> @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>   			mem->memory_size))
>   		return -EINVAL;
> +	if (mem->flags & KVM_MEM_PRIVATE &&
> +		(mem->restricted_offset & (PAGE_SIZE - 1) ||
> +		 mem->restricted_offset > U64_MAX - mem->memory_size))
> +		return -EINVAL;
>   	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
>   		return -EINVAL;
>   	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>   			return -EINVAL;
>   	} else { /* Modify an existing slot. */
> +		/* Private memslots are immutable, they can only be deleted. */
> +		if (mem->flags & KVM_MEM_PRIVATE)
> +			return -EINVAL;
>   		if ((mem->userspace_addr != old->userspace_addr) ||
>   		    (npages != old->npages) ||
>   		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	new->npages = npages;
>   	new->flags = mem->flags;
>   	new->userspace_addr = mem->userspace_addr;
> +	if (mem->flags & KVM_MEM_PRIVATE) {
> +		new->restricted_file = fget(mem->restricted_fd);
> +		if (!new->restricted_file ||
> +		    !file_is_restrictedmem(new->restricted_file)) {
> +			r = -EINVAL;
> +			goto out;
> +		}
> +		new->restricted_offset = mem->restricted_offset;
> +	}
> +
> +	new->kvm = kvm;

>   	r = kvm_set_memslot(kvm, old, new, change);
>   	if (r)
> -		kfree(new);
> +		goto out;
> +
> +	return 0;
> +
> +out:
> +	if (new->restricted_file)
> +		fput(new->restricted_file);
> +	kfree(new);
>   	return r;
>   }
>   EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> @@ -2351,6 +2506,8 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm  
> *kvm,
>   #ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
>   static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>   {
> +	if (kvm_arch_has_private_mem(kvm))
> +		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
>   	return 0;
>   }

> @@ -4822,16 +4979,28 @@ static long kvm_vm_ioctl(struct file *filp,
>   	}
>   	case KVM_SET_USER_MEMORY_REGION: {
>   		struct kvm_user_mem_region mem;
> -		unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +		unsigned int flags_offset = offsetof(typeof(mem), flags);
> +		unsigned long size;
> +		u32 flags;

>   		kvm_sanity_check_user_mem_region_alias();

> +		memset(&mem, 0, sizeof(mem));
> +
>   		r = -EFAULT;
> +		if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> +			goto out;
> +
> +		if (flags & KVM_MEM_PRIVATE)
> +			size = sizeof(struct kvm_userspace_memory_region_ext);
> +		else
> +			size = sizeof(struct kvm_userspace_memory_region);
> +
>   		if (copy_from_user(&mem, argp, size))
>   			goto out;

>   		r = -EINVAL;
> -		if (mem.flags & KVM_MEM_PRIVATE)
> +		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
>   			goto out;

>   		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-03-07 19:14   ` Ackerley Tng
@ 2023-03-07 20:27     ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-03-07 20:27 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, pbonzini, corbet,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, kirill.shutemov, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, qperret, tabba, michael.roth,
	mhocko, wei.w.wang

Please trim your replies so that readers don't need to scan through a hundred or
so lines of quotes just to confirm there's nothing there.

On Tue, Mar 07, 2023, Ackerley Tng wrote:
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > Register/unregister private memslot to fd-based memory backing store
> > restrictedmem and implement the callbacks for restrictedmem_notifier:
> >    - invalidate_start()/invalidate_end() to zap the existing memory
> >      mappings in the KVM page table.
> >    - error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
> >      with KVM_EXIT_SHUTDOWN.
> 
> > Expose KVM_MEM_PRIVATE for memslot and KVM_MEMORY_ATTRIBUTE_PRIVATE for
> > KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace but either are
> > controlled by kvm_arch_has_private_mem() which should be rewritten by
> > architecture code.
> 
> Could we perhaps rename KVM_MEM_PRIVATE to KVM_MEM_PROTECTED, to be in
> line with KVM_X86_PROTECTED_VM?
> 
> I feel that a memslot that has the KVM_MEM_PRIVATE flag need not always
> be private; It can sometimes be providing memory that is shared and
> also accessible from the host.
> 
> KVM_MEMORY_ATTRIBUTE_PRIVATE is fine as-is because this flag is set when
> the guest memory is meant to be backed by private memory.
> 
> KVM_MEMORY_EXIT_FLAG_PRIVATE is also okay because the flag is used to
> indicate when the memory error is caused by a private access (as opposed
> to a shared access).
> 
> kvm_slot_can_be_private() could perhaps be renamed kvm_is_protected_slot()?

No to this suggestion.  I agree that KVM_MEM_PRIVATE is a bad name, but
kvm_is_protected_slot() is just as wrong.  The _only_ thing that the flag controls
is whether whether or not the memslot has an fd that is bound to restricted memory.
The memslot itself is not protected in any way, and if the entire memslot is mapped
shared, then the data backed by the memslot isn't protected either.

What about KVM_MEM_CAN_BE_PRIVATE?  KVM_MEM_PRIVATIZABLE is more succinct, but
AFAICT that's a made up word, and IMO is unnecessarily fancy.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-01-28 14:00     ` Chao Peng
@ 2023-03-08  0:13       ` Ackerley Tng
  2023-03-08  7:40         ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-03-08  0:13 UTC (permalink / raw)
  To: Chao Peng
  Cc: seanjc, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, pbonzini, corbet, vkuznets,
	wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, kirill.shutemov, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, qperret, tabba, michael.roth,
	mhocko, wei.w.wang

Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
>> On Fri, Dec 02, 2022, Chao Peng wrote:
> ...
>> Strongly prefer to use similar logic to existing code that detects wraps:

>> 		mem->restricted_offset + mem->memory_size < mem->restricted_offset

>> This is also where I'd like to add the "gfn is aligned to offset" check,  
>> though
>> my brain is too fried to figure that out right now.

> Used count_trailing_zeros() for this TODO, unsure we have other better
> approach.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index afc8c26fa652..fd34c5f7cd2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -56,6 +56,7 @@
>   #include <asm/processor.h>
>   #include <asm/ioctl.h>
>   #include <linux/uaccess.h>
> +#include <linux/count_zeros.h>

>   #include "coalesced_mmio.h"
>   #include "async_pf.h"
> @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct  
> kvm_memslots *slots, int id,
>   	return false;
>   }

> +/*
> + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> + */
> +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> +{
> +	if (!offset)
> +		return true;
> +	if (!gpa)
> +		return false;
> +
> +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));

Perhaps we could do something like

#define lowest_set_bit(val) (val & -val)

and use

return lowest_set_bit(offset) >= lowest_set_bit(gpa);
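
FWIW the two formulations agree for non-zero inputs; a tiny standalone
check (userspace, with __builtin_ctzll() standing in for the kernel's
count_trailing_zeros(), and ignoring the zero special cases the kernel
version handles explicitly):

#include <assert.h>
#include <stdint.h>

/* count_trailing_zeros() flavour. */
static int aligned_by_ctz(uint64_t offset, uint64_t gpa)
{
	return __builtin_ctzll(offset) >= __builtin_ctzll(gpa);
}

/* lowest-set-bit flavour. */
static int aligned_by_lowbit(uint64_t offset, uint64_t gpa)
{
	return (offset & -offset) >= (gpa & -gpa);
}

int main(void)
{
	/* 2MiB-aligned offset backing a 1MiB-aligned gpa: accepted. */
	assert(aligned_by_ctz(0x200000, 0x100000) && aligned_by_lowbit(0x200000, 0x100000));
	/* 4KiB-aligned offset backing a 2MiB-aligned gpa: rejected. */
	assert(!aligned_by_ctz(0x1000, 0x200000) && !aligned_by_lowbit(0x1000, 0x200000));
	return 0;
}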

Please help me to understand: why must ALIGNMENT(offset) >=
ALIGNMENT(gpa)? Why is it not sufficient to have both gpa and offset be
aligned to PAGE_SIZE?

> +}
> +
>   /*
>    * Allocate some memory and give it an address in the guest physical  
> address
>    * space.
> @@ -2128,7 +2142,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	if (mem->flags & KVM_MEM_PRIVATE &&
>   	    (mem->restrictedmem_offset & (PAGE_SIZE - 1) ||
>   	     mem->restrictedmem_offset + mem->memory_size <  
> mem->restrictedmem_offset ||
> -	     0 /* TODO: require gfn be aligned with restricted offset */))
> +	     !kvm_check_rmem_offset_alignment(mem->restrictedmem_offset,
> +					      mem->guest_phys_addr)))
>   		return -EINVAL;
>   	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
>   		return -EINVAL;

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-03-08  0:13       ` Ackerley Tng
@ 2023-03-08  7:40         ` Chao Peng
  2023-03-23  0:41           ` Isaku Yamahata
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-03-08  7:40 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: seanjc, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, pbonzini, corbet, vkuznets,
	wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, kirill.shutemov, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, qperret, tabba, michael.roth,
	mhocko, wei.w.wang

On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > ...
> > > Strongly prefer to use similar logic to existing code that detects wraps:
> 
> > > 		mem->restricted_offset + mem->memory_size < mem->restricted_offset
> 
> > > This is also where I'd like to add the "gfn is aligned to offset"
> > > check, though
> > > my brain is too fried to figure that out right now.
> 
> > Used count_trailing_zeros() for this TODO, unsure we have other better
> > approach.
> 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index afc8c26fa652..fd34c5f7cd2f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -56,6 +56,7 @@
> >   #include <asm/processor.h>
> >   #include <asm/ioctl.h>
> >   #include <linux/uaccess.h>
> > +#include <linux/count_zeros.h>
> 
> >   #include "coalesced_mmio.h"
> >   #include "async_pf.h"
> > @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
> > kvm_memslots *slots, int id,
> >   	return false;
> >   }
> 
> > +/*
> > + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > + */
> > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > +{
> > +	if (!offset)
> > +		return true;
> > +	if (!gpa)
> > +		return false;
> > +
> > +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> 
> Perhaps we could do something like
> 
> #define lowest_set_bit(val) (val & -val)
> 
> and use
> 
> return lowest_set_bit(offset) >= lowest_set_bit(gpa);

I see the kernel already has fls64(); that looks like what we need ;)

> 
> Please help me to understand: why must ALIGNMENT(offset) >=
> ALIGNMENT(gpa)? Why is it not sufficient to have both gpa and offset be
> aligned to PAGE_SIZE?

Yes, it's sufficient. Here we just want to be conservative on the uAPI,
as Sean explained at [1]:

  I would rather reject memslot if the gfn has lesser alignment than the
  offset. I'm totally ok with this approach _if_ there's a use case. 
  Until such a use case presents itself, I would rather be conservative
  from a uAPI perspective.

[1] https://lore.kernel.org/all/Y8HldeHBrw+OOZVm@google.com/

Chao
> 
> > +}
> > +
> >   /*
> >    * Allocate some memory and give it an address in the guest physical
> > address
> >    * space.
> > @@ -2128,7 +2142,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >   	if (mem->flags & KVM_MEM_PRIVATE &&
> >   	    (mem->restrictedmem_offset & (PAGE_SIZE - 1) ||
> >   	     mem->restrictedmem_offset + mem->memory_size <
> > mem->restrictedmem_offset ||
> > -	     0 /* TODO: require gfn be aligned with restricted offset */))
> > +	     !kvm_check_rmem_offset_alignment(mem->restrictedmem_offset,
> > +					      mem->guest_phys_addr)))
> >   		return -EINVAL;
> >   	if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
> >   		return -EINVAL;

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2023-02-16  9:51   ` Nikunj A. Dadhania
@ 2023-03-20 19:08     ` Michael Roth
  0 siblings, 0 replies; 398+ messages in thread
From: Michael Roth @ 2023-03-20 19:08 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
	Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, wei.w.wang

On Thu, Feb 16, 2023 at 03:21:21PM +0530, Nikunj A. Dadhania wrote:
> 
> > +static struct file *restrictedmem_file_create(struct file *memfd)
> > +{
> > +	struct restrictedmem_data *data;
> > +	struct address_space *mapping;
> > +	struct inode *inode;
> > +	struct file *file;
> > +
> > +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +	if (!data)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	data->memfd = memfd;
> > +	mutex_init(&data->lock);
> > +	INIT_LIST_HEAD(&data->notifiers);
> > +
> > +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > +	if (IS_ERR(inode)) {
> > +		kfree(data);
> > +		return ERR_CAST(inode);
> > +	}
> 
> alloc_anon_inode() uses new_inode_pseudo() to get the inode. As per the comment, the new
> inode is not added to the superblock's s_inodes list.

Another issue somewhat related to alloc_anon_inode() is that the shmem code
in some cases assumes the inode struct was allocated via shmem_alloc_inode(),
which allocates a struct shmem_inode_info: a superset of struct inode with
additional fields for things like spinlocks.

These additional fields don't get allocated/initialized in the case of
restrictedmem, so when restrictedmem_getattr() tries to pass the inode on to
the shmem handler, it can cause a crash.
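
A rough userspace sketch of the failure mode (simplified stand-in structs,
not the real kernel definitions): shmem code converts the inode back to its
containing shmem_inode_info, as SHMEM_I() does via container_of(), so an
inode that was never embedded in one yields a pointer into unrelated memory:

  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct inode { long fields[16]; };               /* stand-in for struct inode */
  struct shmem_inode_info {
  	int lock;                                /* extra state, e.g. a spinlock */
  	struct inode vfs_inode;                  /* the VFS inode is embedded here */
  };

  /* container_of()-style conversion, in the spirit of SHMEM_I() in the kernel. */
  #define SHMEM_I(ino) \
  	((struct shmem_inode_info *)((char *)(ino) - offsetof(struct shmem_inode_info, vfs_inode)))

  int main(void)
  {
  	/* Like alloc_anon_inode(): a bare inode, not embedded in a shmem_inode_info. */
  	struct inode *anon = malloc(sizeof(*anon));

  	/* Only valid when the inode really lives inside a shmem_inode_info;
  	 * here it points before the allocation, i.e. into garbage. */
  	struct shmem_inode_info *info = SHMEM_I(anon);
  	printf("inode %p, bogus shmem_inode_info %p\n", (void *)anon, (void *)info);
  	free(anon);
  	return 0;
  }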

For instance, the following trace was seen when executing 'sudo lsof' while a
process/guest was running with an open memfd FD:

    [24393.121409] general protection fault, probably for non-canonical address 0xfe9fb182fea3f077: 0000 [#1] PREEMPT SMP NOPTI
    [24393.133546] CPU: 2 PID: 590073 Comm: lsof Tainted: G            E      6.1.0-rc4-upm10b-host-snp-v8b+ #4
    [24393.144125] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1009B 05/14/2022
    [24393.153150] RIP: 0010:native_queued_spin_lock_slowpath+0x3a3/0x3e0
    [24393.160049] Code: f3 90 41 8b 04 24 85 c0 74 ea eb f4 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 00 41 04 00 48 03 04 d5 e0 ea 8b 82 <48> 89 18 8b 43 08 85 c0 75 09 f3 90 8b 43 08 85 c0 74 f7 48 8b 13
    [24393.181004] RSP: 0018:ffffc9006b6a3cf8 EFLAGS: 00010086
    [24393.186832] RAX: fe9fb182fea3f077 RBX: ffff889fcc144100 RCX: 0000000000000000
    [24393.194793] RDX: 0000000000003ffe RSI: ffffffff827acde9 RDI: ffffc9006b6a3cdf
    [24393.202751] RBP: ffffc9006b6a3d20 R08: 0000000000000001 R09: 0000000000000000
    [24393.210710] R10: 0000000000000000 R11: 000000000000ffff R12: ffff888179fa50e0
    [24393.218670] R13: ffff889fcc144100 R14: 00000000000c0000 R15: 00000000000c0000
    [24393.226629] FS:  00007f9440f45400(0000) GS:ffff889fcc100000(0000) knlGS:0000000000000000
    [24393.235692] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [24393.242101] CR2: 000055c55a9cf088 CR3: 0008000220e9c003 CR4: 0000000000770ee0
    [24393.250059] PKRU: 55555554
    [24393.253073] Call Trace:
    [24393.255797]  <TASK>
    [24393.258133]  do_raw_spin_lock+0xc4/0xd0
    [24393.262410]  _raw_spin_lock_irq+0x50/0x70
    [24393.266880]  ? shmem_getattr+0x4c/0xf0
    [24393.271060]  shmem_getattr+0x4c/0xf0
    [24393.275044]  restrictedmem_getattr+0x34/0x40
    [24393.279805]  vfs_getattr_nosec+0xbd/0xe0
    [24393.284178]  vfs_getattr+0x37/0x50
    [24393.287971]  vfs_statx+0xa0/0x150
    [24393.291668]  vfs_fstatat+0x59/0x80
    [24393.295462]  __do_sys_newstat+0x35/0x70
    [24393.299739]  __x64_sys_newstat+0x16/0x20
    [24393.304111]  do_syscall_64+0x3b/0x90
    [24393.308098]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

As a workaround we've been doing the following, but it's probably not the
proper fix:

  https://github.com/AMDESE/linux/commit/0378116b5c4e373295c9101727f2cb5112d6b1f4

-Mike


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-03-08  7:40         ` Chao Peng
@ 2023-03-23  0:41           ` Isaku Yamahata
  2023-03-24  2:10             ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Isaku Yamahata @ 2023-03-23  0:41 UTC (permalink / raw)
  To: Chao Peng
  Cc: Ackerley Tng, seanjc, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, pbonzini, corbet,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, kirill.shutemov, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, qperret, tabba, michael.roth,
	mhocko, wei.w.wang, isaku.yamahata

On Wed, Mar 08, 2023 at 03:40:26PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
> > Chao Peng <chao.p.peng@linux.intel.com> writes:
> > 
> > > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > ...
> > > > Strongly prefer to use similar logic to existing code that detects wraps:
> > 
> > > > 		mem->restricted_offset + mem->memory_size < mem->restricted_offset
> > 
> > > > This is also where I'd like to add the "gfn is aligned to offset"
> > > > check, though
> > > > my brain is too fried to figure that out right now.
> > 
> > > Used count_trailing_zeros() for this TODO, unsure we have other better
> > > approach.
> > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index afc8c26fa652..fd34c5f7cd2f 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -56,6 +56,7 @@
> > >   #include <asm/processor.h>
> > >   #include <asm/ioctl.h>
> > >   #include <linux/uaccess.h>
> > > +#include <linux/count_zeros.h>
> > 
> > >   #include "coalesced_mmio.h"
> > >   #include "async_pf.h"
> > > @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
> > > kvm_memslots *slots, int id,
> > >   	return false;
> > >   }
> > 
> > > +/*
> > > + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > > + */
> > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > > +{
> > > +	if (!offset)
> > > +		return true;
> > > +	if (!gpa)
> > > +		return false;
> > > +
> > > +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));

This check doesn't work as expected. For example, with offset = 2GB and
gpa = 4GB this check fails.
I came up with the following.

From ec87e25082f0497431b732702fae82c6a05071bf Mon Sep 17 00:00:00 2001
Message-Id: <ec87e25082f0497431b732702fae82c6a05071bf.1679531995.git.isaku.yamahata@intel.com>
From: Isaku Yamahata <isaku.yamahata@intel.com>
Date: Wed, 22 Mar 2023 15:32:56 -0700
Subject: [PATCH] KVM: Relax alignment check for restricted mem

kvm_check_rmem_offset_alignment() only compares offset alignment against
GPA alignment.  However, the alignment actually required for the offset
depends on the architecture.  For x86 it can be 1G, 2M or 4K, so even if
the GPA is aligned to more than 1G, only 1G alignment is required for the
offset.

Without this patch, gpa=4G, offset=2G results in failure of memory slot
creation.

Fixes: edc8814b2c77 ("KVM: Require gfn be aligned with restricted offset")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
 virt/kvm/kvm_main.c             |  9 ++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88e11dd3afde..03af44650f24 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -16,6 +16,7 @@
 #include <linux/irq_work.h>
 #include <linux/irq.h>
 #include <linux/workqueue.h>
+#include <linux/count_zeros.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
@@ -143,6 +144,20 @@
 #define KVM_HPAGE_MASK(x)	(~(KVM_HPAGE_SIZE(x) - 1))
 #define KVM_PAGES_PER_HPAGE(x)	(KVM_HPAGE_SIZE(x) / PAGE_SIZE)
 
+#define kvm_arch_required_alignment	kvm_arch_required_alignment
+static inline int kvm_arch_required_alignment(u64 gpa)
+{
+	int zeros = count_trailing_zeros(gpa);
+
+	WARN_ON_ONCE(!PAGE_ALIGNED(gpa));
+	if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_1G))
+		return KVM_HPAGE_SHIFT(PG_LEVEL_1G);
+	else if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_2M))
+		return KVM_HPAGE_SHIFT(PG_LEVEL_2M);
+
+	return PAGE_SHIFT;
+}
+
 #define KVM_MEMSLOT_PAGES_TO_MMU_PAGES_RATIO 50
 #define KVM_MIN_ALLOC_MMU_PAGES 64UL
 #define KVM_MMU_HASH_SHIFT 12
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c9c4eef457b0..f4ff96171d24 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2113,6 +2113,13 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
 	return false;
 }
 
+#ifndef kvm_arch_required_alignment
+__weak int kvm_arch_required_alignment(u64 gpa)
+{
+	return PAGE_SHIFT;
+}
+#endif
+
 /*
  * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
  */
@@ -2123,7 +2130,7 @@ static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
 	if (!gpa)
 		return false;
 
-	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
+	return !!(count_trailing_zeros(offset) >= kvm_arch_required_alignment(gpa));
 }
 
 /*
-- 
2.25.1



-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-02-21 12:11             ` Chao Peng
@ 2023-03-23  1:27               ` Michael Roth
  2023-03-24  2:13                 ` Chao Peng
  2023-04-12 22:01                 ` Sean Christopherson
  0 siblings, 2 replies; 398+ messages in thread
From: Michael Roth @ 2023-03-23  1:27 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, Isaku Yamahata, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, wei.w.wang

On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > Hi Sean,
> > 
> > We've rebased the SEV+SNP support onto your updated UPM base support
> > tree and things seem to be working okay, but we needed some fixups on
> > top of the base support get things working, along with 1 workaround
> > for an issue that hasn't been root-caused yet:
> > 
> >   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > 
> >   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> >   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
> 
> What I'm seeing is that Slot#3 gets added first and then deleted. When it
> gets added, Slot#0 already has the same range bound to restrictedmem, so it
> triggers the exclusive check. This check is exactly what the current code is for.

With the following change in QEMU, we no longer trigger this check:

  diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
  index 20da121374..849b5de469 100644
  --- a/hw/pci-host/q35.c
  +++ b/hw/pci-host/q35.c
  @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
       memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
                                mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
                                MCH_HOST_BRIDGE_SMRAM_C_SIZE);
  +    memory_region_set_enabled(&mch->open_high_smram, false);
       memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
                                           &mch->open_high_smram, 1);
  -    memory_region_set_enabled(&mch->open_high_smram, false);

I'm not sure if QEMU is actually doing something wrong here though or if
this check is putting tighter restrictions on userspace than what was
expected before. Will look into it more.

> 
> >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> 
> As many kernel APIs treat 'end' as exclusive, I would rather keep using
> exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind
> and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> See below for the change.

Yes I did feel like I was fighting the kernel a bit on that; your
suggestion seems like it would be a better fit.

> 
> >   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> 
> Subtracting slot->restrictedmem.index for start/end in
> restrictedmem_get_gfn_range() is the correct fix.
> 
> >   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > 
> > We plan to post an updated RFC for v8 soon, but also wanted to share
> > the staging tree in case you end up looking at the UPM integration aspects
> > before then.
> > 
> > -Mike
> 
> This is the restrictedmem fix to solve 'end' being stored and checked in xarray:

Looks good.

Thanks!

-Mike

> 
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
>          */
>         down_read(&rm->lock);
>  
> -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>                 notifier->ops->invalidate_start(notifier, start, end);
>  
>         ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>  
> -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>                 notifier->ops->invalidate_end(notifier, start, end);
>  
>         up_read(&rm->lock);
> @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
>                 }
>                 spin_unlock(&inode->i_lock);
>  
> -               xa_for_each_range(&rm->bindings, index, notifier, start, end)
> +               xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
>                         notifier->ops->error(notifier, start, end);
>                 break;
>         }
> @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>                 if (exclusive != rm->exclusive)
>                         goto out_unlock;
>  
> -               if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> +               if (exclusive &&
> +                   xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
>                         goto out_unlock;
>         }
>  
> -       xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> +       xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
>         rm->exclusive = exclusive;
>         ret = 0;
>  out_unlock:
> @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
>         struct restrictedmem *rm = file->f_mapping->private_data;
>  
>         down_write(&rm->lock);
> -       xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> +       xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
>         synchronize_rcu();
>         up_write(&rm->lock);
>  }

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-03-23  0:41           ` Isaku Yamahata
@ 2023-03-24  2:10             ` Chao Peng
  2023-03-24  2:29               ` Xiaoyao Li
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-03-24  2:10 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Ackerley Tng, seanjc, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, pbonzini, corbet,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, kirill.shutemov, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, qperret, tabba, michael.roth,
	mhocko, wei.w.wang

On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
> On Wed, Mar 08, 2023 at 03:40:26PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
> > > Chao Peng <chao.p.peng@linux.intel.com> writes:
> > > 
> > > > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > ...
> > > > > Strongly prefer to use similar logic to existing code that detects wraps:
> > > 
> > > > > 		mem->restricted_offset + mem->memory_size < mem->restricted_offset
> > > 
> > > > > This is also where I'd like to add the "gfn is aligned to offset"
> > > > > check, though
> > > > > my brain is too fried to figure that out right now.
> > > 
> > > > Used count_trailing_zeros() for this TODO, unsure we have other better
> > > > approach.
> > > 
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index afc8c26fa652..fd34c5f7cd2f 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -56,6 +56,7 @@
> > > >   #include <asm/processor.h>
> > > >   #include <asm/ioctl.h>
> > > >   #include <linux/uaccess.h>
> > > > +#include <linux/count_zeros.h>
> > > 
> > > >   #include "coalesced_mmio.h"
> > > >   #include "async_pf.h"
> > > > @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
> > > > kvm_memslots *slots, int id,
> > > >   	return false;
> > > >   }
> > > 
> > > > +/*
> > > > + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > > > + */
> > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > > > +{
> > > > +	if (!offset)
> > > > +		return true;
> > > > +	if (!gpa)
> > > > +		return false;
> > > > +
> > > > +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> 
> This check doesn't work as expected. For example, with offset = 2GB and
> gpa = 4GB this check fails.

This case is expected to fail as Sean initially suggested[*]:
  I would rather reject memslot if the gfn has lesser alignment than
  the offset. I'm totally ok with this approach _if_ there's a use case.
  Until such a use case presents itself, I would rather be conservative
  from a uAPI perspective.

I understand that we put a tighter restriction on this, but if you see that
such a restriction is really a big issue for real usage, rather than a
theoretical problem, then we can loosen the check here. At that point, though,
the code below is kind of x86 specific and may need improvement.

BTW, in the latest code, I replaced count_trailing_zeros() with fls64():
  return !!(fls64(offset) >= fls64(gpa));

[*] https://lore.kernel.org/all/Y8HldeHBrw+OOZVm@google.com/

Chao
> I come up with the following.
> 
> >From ec87e25082f0497431b732702fae82c6a05071bf Mon Sep 17 00:00:00 2001
> Message-Id: <ec87e25082f0497431b732702fae82c6a05071bf.1679531995.git.isaku.yamahata@intel.com>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> Date: Wed, 22 Mar 2023 15:32:56 -0700
> Subject: [PATCH] KVM: Relax alignment check for restricted mem
> 
> kvm_check_rmem_offset_alignment() only checks based on offset alignment
> and GPA alignment.  However, the actual alignment for offset depends
> on architecture.  For x86 case, it can be 1G, 2M or 4K.  So even if
> GPA is aligned for 1G+, only 1G-alignment is required for offset.
> 
> Without this patch, gpa=4G, offset=2G results in failure of memory slot
> creation.
> 
> Fixes: edc8814b2c77 ("KVM: Require gfn be aligned with restricted offset")
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
>  virt/kvm/kvm_main.c             |  9 ++++++++-
>  2 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88e11dd3afde..03af44650f24 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -16,6 +16,7 @@
>  #include <linux/irq_work.h>
>  #include <linux/irq.h>
>  #include <linux/workqueue.h>
> +#include <linux/count_zeros.h>
>  
>  #include <linux/kvm.h>
>  #include <linux/kvm_para.h>
> @@ -143,6 +144,20 @@
>  #define KVM_HPAGE_MASK(x)	(~(KVM_HPAGE_SIZE(x) - 1))
>  #define KVM_PAGES_PER_HPAGE(x)	(KVM_HPAGE_SIZE(x) / PAGE_SIZE)
>  
> +#define kvm_arch_required_alignment	kvm_arch_required_alignment
> +static inline int kvm_arch_required_alignment(u64 gpa)
> +{
> +	int zeros = count_trailing_zeros(gpa);
> +
> +	WARN_ON_ONCE(!PAGE_ALIGNED(gpa));
> +	if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_1G))
> +		return KVM_HPAGE_SHIFT(PG_LEVEL_1G);
> +	else if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_2M))
> +		return KVM_HPAGE_SHIFT(PG_LEVEL_2M);
> +
> +	return PAGE_SHIFT;
> +}
> +
>  #define KVM_MEMSLOT_PAGES_TO_MMU_PAGES_RATIO 50
>  #define KVM_MIN_ALLOC_MMU_PAGES 64UL
>  #define KVM_MMU_HASH_SHIFT 12
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c9c4eef457b0..f4ff96171d24 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2113,6 +2113,13 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>  	return false;
>  }
>  
> +#ifndef kvm_arch_required_alignment
> +__weak int kvm_arch_required_alignment(u64 gpa)
> +{
> +	return PAGE_SHIFT
> +}
> +#endif
> +
>  /*
>   * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
>   */
> @@ -2123,7 +2130,7 @@ static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
>  	if (!gpa)
>  		return false;
>  
> -	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> +	return !!(count_trailing_zeros(offset) >= kvm_arch_required_alignment(gpa));
>  }
>  
>  /*
> -- 
> 2.25.1
> 
> 
> 
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-03-23  1:27               ` Michael Roth
@ 2023-03-24  2:13                 ` Chao Peng
  2023-04-12 22:01                 ` Sean Christopherson
  1 sibling, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-03-24  2:13 UTC (permalink / raw)
  To: Michael Roth
  Cc: Sean Christopherson, Isaku Yamahata, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, wei.w.wang

On Wed, Mar 22, 2023 at 08:27:37PM -0500, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > Hi Sean,
> > > 
> > > We've rebased the SEV+SNP support onto your updated UPM base support
> > > tree and things seem to be working okay, but we needed some fixups on
> > > top of the base support get things working, along with 1 workaround
> > > for an issue that hasn't been root-caused yet:
> > > 
> > >   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > > 
> > >   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning implementation
> > >   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check
> > 
> > What I'm seeing is that Slot#3 gets added first and then deleted. When it
> > gets added, Slot#0 already has the same range bound to restrictedmem, so it
> > triggers the exclusive check. This check is exactly what the current code is for.
> 
> With the following change in QEMU, we no longer trigger this check:
> 
>   diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>   index 20da121374..849b5de469 100644
>   --- a/hw/pci-host/q35.c
>   +++ b/hw/pci-host/q35.c
>   @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
>        memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
>                                 mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
>                                 MCH_HOST_BRIDGE_SMRAM_C_SIZE);
>   +    memory_region_set_enabled(&mch->open_high_smram, false);
>        memory_region_add_subregion_overlap(mch->system_memory, 0xfeda0000,
>                                            &mch->open_high_smram, 1);
>   -    memory_region_set_enabled(&mch->open_high_smram, false);
> 
> I'm not sure if QEMU is actually doing something wrong here though or if
> this check is putting tighter restrictions on userspace than what was
> expected before. Will look into it more.

I don't think the above QEMU change is acceptable upstream. It may break
functionality for 'normal' VMs.

The UPM check does put a tighter restriction on userspace: you can't bind the
same fd range to more than one memslot. SMRAM in QEMU, however, violates this
restriction. The right 'fix' is to disable SMM in QEMU for UPM usages rather
than trying to work around it. There is more discussion in the link below:

  https://lore.kernel.org/all/Y8bOB7VuVIsxoMcn@google.com/

Chao

> 
> > 
> > >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> > 
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind
> > and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> > all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
> 
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.
> 
> > 
> > >   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> > 
> > Subtracting slot->restrictedmem.index for start/end in
> > restrictedmem_get_gfn_range() is the correct fix.
> > 
> > >   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > > 
> > > We plan to post an updated RFC for v8 soon, but also wanted to share
> > > the staging tree in case you end up looking at the UPM integration aspects
> > > before then.
> > > 
> > > -Mike
> > 
> > This is the restrictedmem fix to solve 'end' being stored and checked in xarray:
> 
> Looks good.
> 
> Thanks!
> 
> -Mike
> 
> > 
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> >          */
> >         down_read(&rm->lock);
> >  
> > -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >                 notifier->ops->invalidate_start(notifier, start, end);
> >  
> >         ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >  
> > -       xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +       xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >                 notifier->ops->invalidate_end(notifier, start, end);
> >  
> >         up_read(&rm->lock);
> > @@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct address_space *mapping,
> >                 }
> >                 spin_unlock(&inode->i_lock);
> >  
> > -               xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +               xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> >                         notifier->ops->error(notifier, start, end);
> >                 break;
> >         }
> > @@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> >                 if (exclusive != rm->exclusive)
> >                         goto out_unlock;
> >  
> > -               if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> > +               if (exclusive &&
> > +                   xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
> >                         goto out_unlock;
> >         }
> >  
> > -       xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> > +       xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
> >         rm->exclusive = exclusive;
> >         ret = 0;
> >  out_unlock:
> > @@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> >         struct restrictedmem *rm = file->f_mapping->private_data;
> >  
> >         down_write(&rm->lock);
> > -       xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> > +       xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
> >         synchronize_rcu();
> >         up_write(&rm->lock);
> >  }

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-03-24  2:10             ` Chao Peng
@ 2023-03-24  2:29               ` Xiaoyao Li
  2023-03-28 10:41                 ` Chao Peng
  0 siblings, 1 reply; 398+ messages in thread
From: Xiaoyao Li @ 2023-03-24  2:29 UTC (permalink / raw)
  To: Chao Peng, Isaku Yamahata
  Cc: Ackerley Tng, seanjc, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, pbonzini, corbet,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, kirill.shutemov, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, qperret, tabba, michael.roth,
	mhocko, wei.w.wang

On 3/24/2023 10:10 AM, Chao Peng wrote:
> On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
>> On Wed, Mar 08, 2023 at 03:40:26PM +0800,
>> Chao Peng <chao.p.peng@linux.intel.com> wrote:
>>
>>> On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
>>>> Chao Peng <chao.p.peng@linux.intel.com> writes:
>>>>
>>>>> On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
>>>>>> On Fri, Dec 02, 2022, Chao Peng wrote:
>>>>> ...
>>>>>> Strongly prefer to use similar logic to existing code that detects wraps:
>>>>
>>>>>> 		mem->restricted_offset + mem->memory_size < mem->restricted_offset
>>>>
>>>>>> This is also where I'd like to add the "gfn is aligned to offset"
>>>>>> check, though
>>>>>> my brain is too fried to figure that out right now.
>>>>
>>>>> Used count_trailing_zeros() for this TODO, unsure we have other better
>>>>> approach.
>>>>
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index afc8c26fa652..fd34c5f7cd2f 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -56,6 +56,7 @@
>>>>>    #include <asm/processor.h>
>>>>>    #include <asm/ioctl.h>
>>>>>    #include <linux/uaccess.h>
>>>>> +#include <linux/count_zeros.h>
>>>>
>>>>>    #include "coalesced_mmio.h"
>>>>>    #include "async_pf.h"
>>>>> @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
>>>>> kvm_memslots *slots, int id,
>>>>>    	return false;
>>>>>    }
>>>>
>>>>> +/*
>>>>> + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
>>>>> + */
>>>>> +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
>>>>> +{
>>>>> +	if (!offset)
>>>>> +		return true;
>>>>> +	if (!gpa)
>>>>> +		return false;
>>>>> +
>>>>> +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
>>
>> This check doesn't work expected. For example, offset = 2GB, gpa=4GB
>> this check fails.
> 
> This case is expected to fail as Sean initially suggested[*]:
>    I would rather reject memslot if the gfn has lesser alignment than
>    the offset. I'm totally ok with this approach _if_ there's a use case.
>    Until such a use case presents itself, I would rather be conservative
>    from a uAPI perspective.
> 
> I understand that we put tighter restriction on this but if you see such
> restriction is really a big issue for real usage, instead of a
> theoretical problem, then we can loosen the check here. But at that time
> below code is kind of x86 specific and may need improve.
> 
> BTW, in latest code, I replaced count_trailing_zeros() with fls64():
>    return !!(fls64(offset) >= fls64(gpa));

wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ?

> [*] https://lore.kernel.org/all/Y8HldeHBrw+OOZVm@google.com/
> 
> Chao
>> I come up with the following.
>>
>> >From ec87e25082f0497431b732702fae82c6a05071bf Mon Sep 17 00:00:00 2001
>> Message-Id: <ec87e25082f0497431b732702fae82c6a05071bf.1679531995.git.isaku.yamahata@intel.com>
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>> Date: Wed, 22 Mar 2023 15:32:56 -0700
>> Subject: [PATCH] KVM: Relax alignment check for restricted mem
>>
>> kvm_check_rmem_offset_alignment() only checks based on offset alignment
>> and GPA alignment.  However, the actual alignment for offset depends
>> on architecture.  For x86 case, it can be 1G, 2M or 4K.  So even if
>> GPA is aligned for 1G+, only 1G-alignment is required for offset.
>>
>> Without this patch, gpa=4G, offset=2G results in failure of memory slot
>> creation.
>>
>> Fixes: edc8814b2c77 ("KVM: Require gfn be aligned with restricted offset")
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
>>   virt/kvm/kvm_main.c             |  9 ++++++++-
>>   2 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 88e11dd3afde..03af44650f24 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -16,6 +16,7 @@
>>   #include <linux/irq_work.h>
>>   #include <linux/irq.h>
>>   #include <linux/workqueue.h>
>> +#include <linux/count_zeros.h>
>>   
>>   #include <linux/kvm.h>
>>   #include <linux/kvm_para.h>
>> @@ -143,6 +144,20 @@
>>   #define KVM_HPAGE_MASK(x)	(~(KVM_HPAGE_SIZE(x) - 1))
>>   #define KVM_PAGES_PER_HPAGE(x)	(KVM_HPAGE_SIZE(x) / PAGE_SIZE)
>>   
>> +#define kvm_arch_required_alignment	kvm_arch_required_alignment
>> +static inline int kvm_arch_required_alignment(u64 gpa)
>> +{
>> +	int zeros = count_trailing_zeros(gpa);
>> +
>> +	WARN_ON_ONCE(!PAGE_ALIGNED(gpa));
>> +	if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_1G))
>> +		return KVM_HPAGE_SHIFT(PG_LEVEL_1G);
>> +	else if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_2M))
>> +		return KVM_HPAGE_SHIFT(PG_LEVEL_2M);
>> +
>> +	return PAGE_SHIFT;
>> +}
>> +
>>   #define KVM_MEMSLOT_PAGES_TO_MMU_PAGES_RATIO 50
>>   #define KVM_MIN_ALLOC_MMU_PAGES 64UL
>>   #define KVM_MMU_HASH_SHIFT 12
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index c9c4eef457b0..f4ff96171d24 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2113,6 +2113,13 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>>   	return false;
>>   }
>>   
>> +#ifndef kvm_arch_required_alignment
>> +__weak int kvm_arch_required_alignment(u64 gpa)
>> +{
>> +	return PAGE_SHIFT
>> +}
>> +#endif
>> +
>>   /*
>>    * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
>>    */
>> @@ -2123,7 +2130,7 @@ static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
>>   	if (!gpa)
>>   		return false;
>>   
>> -	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
>> +	return !!(count_trailing_zeros(offset) >= kvm_arch_required_alignment(gpa));
>>   }
>>   
>>   /*
>> -- 
>> 2.25.1
>>
>>
>>
>> -- 
>> Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-03-24  2:29               ` Xiaoyao Li
@ 2023-03-28 10:41                 ` Chao Peng
  2023-04-14 21:08                   ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-03-28 10:41 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Isaku Yamahata, Ackerley Tng, seanjc, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-arch, linux-api, linux-doc,
	qemu-devel, pbonzini, corbet, vkuznets, wanpengli, jmattson,
	joro, tglx, mingo, bp, arnd, naoya.horiguchi, linmiaohe, x86,
	hpa, hughd, jlayton, bfields, akpm, shuah, rppt, steven.price,
	mail, vbabka, vannapurve, yu.c.zhang, kirill.shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, qperret, tabba, michael.roth, mhocko, wei.w.wang

On Fri, Mar 24, 2023 at 10:29:25AM +0800, Xiaoyao Li wrote:
> On 3/24/2023 10:10 AM, Chao Peng wrote:
> > On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
> > > On Wed, Mar 08, 2023 at 03:40:26PM +0800,
> > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > 
> > > > On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
> > > > > Chao Peng <chao.p.peng@linux.intel.com> writes:
> > > > > 
> > > > > > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > ...
> > > > > > > Strongly prefer to use similar logic to existing code that detects wraps:
> > > > > 
> > > > > > > 		mem->restricted_offset + mem->memory_size < mem->restricted_offset
> > > > > 
> > > > > > > This is also where I'd like to add the "gfn is aligned to offset"
> > > > > > > check, though
> > > > > > > my brain is too fried to figure that out right now.
> > > > > 
> > > > > > Used count_trailing_zeros() for this TODO, unsure we have other better
> > > > > > approach.
> > > > > 
> > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > > index afc8c26fa652..fd34c5f7cd2f 100644
> > > > > > --- a/virt/kvm/kvm_main.c
> > > > > > +++ b/virt/kvm/kvm_main.c
> > > > > > @@ -56,6 +56,7 @@
> > > > > >    #include <asm/processor.h>
> > > > > >    #include <asm/ioctl.h>
> > > > > >    #include <linux/uaccess.h>
> > > > > > +#include <linux/count_zeros.h>
> > > > > 
> > > > > >    #include "coalesced_mmio.h"
> > > > > >    #include "async_pf.h"
> > > > > > @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
> > > > > > kvm_memslots *slots, int id,
> > > > > >    	return false;
> > > > > >    }
> > > > > 
> > > > > > +/*
> > > > > > + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > > > > > + */
> > > > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > > > > > +{
> > > > > > +	if (!offset)
> > > > > > +		return true;
> > > > > > +	if (!gpa)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> > > 
> > > This check doesn't work expected. For example, offset = 2GB, gpa=4GB
> > > this check fails.
> > 
> > This case is expected to fail as Sean initially suggested[*]:
> >    I would rather reject memslot if the gfn has lesser alignment than
> >    the offset. I'm totally ok with this approach _if_ there's a use case.
> >    Until such a use case presents itself, I would rather be conservative
> >    from a uAPI perspective.
> > 
> > I understand that we put tighter restriction on this but if you see such
> > restriction is really a big issue for real usage, instead of a
> > theoretical problem, then we can loosen the check here. But at that time
> > below code is kind of x86 specific and may need improve.
> > 
> > BTW, in latest code, I replaced count_trailing_zeros() with fls64():
> >    return !!(fls64(offset) >= fls64(gpa));
> 
> wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ?

As the function comment explains, here we want to return true when
ALIGNMENT(offset) >= ALIGNMENT(gpa), so '>=' is what we need.

It's worth clarifying that Sean's original suggestion actually states the
opposite. He said 'reject memslot if the gfn has lesser alignment than the
offset', but I wonder whether that was his intent, since if
ALIGNMENT(offset) < ALIGNMENT(gpa) it would not be possible to map the page
as a largepage. Consider the following config:

  gpa=2M, offset=1M

In this case KVM tries to map the gpa at 2M as a 2M hugepage, but the
physical page at offset 1M in the private_fd cannot provide a 2M page due to
misalignment.

But as we discussed in the off-list thread, we do find a real use case
showing this check is too strict: QEMU immediately fails when launching a
guest with more than 2G of memory. In that case QEMU splits the guest memory
space into two slots:

  Slot#1(ram_below_4G): gpa=0x0, offset=0x0, size=2G
  Slot#2(ram_above_4G): gpa=4G,  offset=2G,  size=totalsize-2G

This strict alignment check fails for slot#2 because offset (2G) has less
alignment than gpa (4G). To allow this, one solution is to revert to my
previous change in kvm_alloc_memslot_metadata() and disallow hugepages only
when the offset/gpa are not aligned to the corresponding page size.
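
As an illustration (not the actual kvm_alloc_memslot_metadata() change), a
userspace sketch of such a per-level check: a hugepage of a given size is
usable only when gpa and offset are congruent modulo that size, which the
slot#2 layout above satisfies for both 2M and 1G:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Mirrors the spirit of KVM's existing lpage_info alignment checks:
   * hypothetical helper name, for illustration only. */
  static bool hugepage_allowed(uint64_t gpa, uint64_t offset, uint64_t size)
  {
  	return ((gpa ^ offset) & (size - 1)) == 0;
  }

  int main(void)
  {
  	uint64_t gpa = 4ULL << 30, offset = 2ULL << 30;  /* Slot#2 above */

  	printf("2M: %s\n", hugepage_allowed(gpa, offset, 2ULL << 20) ? "allowed" : "disallowed");
  	printf("1G: %s\n", hugepage_allowed(gpa, offset, 1ULL << 30) ? "allowed" : "disallowed");
  	return 0;
  }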

Sean, what do you think?

Chao
> 
> > [*] https://lore.kernel.org/all/Y8HldeHBrw+OOZVm@google.com/
> > 
> > Chao
> > > I come up with the following.
> > > 
> > > >From ec87e25082f0497431b732702fae82c6a05071bf Mon Sep 17 00:00:00 2001
> > > Message-Id: <ec87e25082f0497431b732702fae82c6a05071bf.1679531995.git.isaku.yamahata@intel.com>
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Date: Wed, 22 Mar 2023 15:32:56 -0700
> > > Subject: [PATCH] KVM: Relax alignment check for restricted mem
> > > 
> > > kvm_check_rmem_offset_alignment() only checks based on offset alignment
> > > and GPA alignment.  However, the actual alignment for offset depends
> > > on architecture.  For x86 case, it can be 1G, 2M or 4K.  So even if
> > > GPA is aligned for 1G+, only 1G-alignment is required for offset.
> > > 
> > > Without this patch, gpa=4G, offset=2G results in failure of memory slot
> > > creation.
> > > 
> > > Fixes: edc8814b2c77 ("KVM: Require gfn be aligned with restricted offset")
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > ---
> > >   arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
> > >   virt/kvm/kvm_main.c             |  9 ++++++++-
> > >   2 files changed, 23 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 88e11dd3afde..03af44650f24 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -16,6 +16,7 @@
> > >   #include <linux/irq_work.h>
> > >   #include <linux/irq.h>
> > >   #include <linux/workqueue.h>
> > > +#include <linux/count_zeros.h>
> > >   #include <linux/kvm.h>
> > >   #include <linux/kvm_para.h>
> > > @@ -143,6 +144,20 @@
> > >   #define KVM_HPAGE_MASK(x)	(~(KVM_HPAGE_SIZE(x) - 1))
> > >   #define KVM_PAGES_PER_HPAGE(x)	(KVM_HPAGE_SIZE(x) / PAGE_SIZE)
> > > +#define kvm_arch_required_alignment	kvm_arch_required_alignment
> > > +static inline int kvm_arch_required_alignment(u64 gpa)
> > > +{
> > > +	int zeros = count_trailing_zeros(gpa);
> > > +
> > > +	WARN_ON_ONCE(!PAGE_ALIGNED(gpa));
> > > +	if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_1G))
> > > +		return KVM_HPAGE_SHIFT(PG_LEVEL_1G);
> > > +	else if (zeros >= KVM_HPAGE_SHIFT(PG_LEVEL_2M))
> > > +		return KVM_HPAGE_SHIFT(PG_LEVEL_2M);
> > > +
> > > +	return PAGE_SHIFT;
> > > +}
> > > +
> > >   #define KVM_MEMSLOT_PAGES_TO_MMU_PAGES_RATIO 50
> > >   #define KVM_MIN_ALLOC_MMU_PAGES 64UL
> > >   #define KVM_MMU_HASH_SHIFT 12
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index c9c4eef457b0..f4ff96171d24 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2113,6 +2113,13 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > >   	return false;
> > >   }
> > > +#ifndef kvm_arch_required_alignment
> > > +__weak int kvm_arch_required_alignment(u64 gpa)
> > > +{
> > > +	return PAGE_SHIFT
> > > +}
> > > +#endif
> > > +
> > >   /*
> > >    * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > >    */
> > > @@ -2123,7 +2130,7 @@ static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > >   	if (!gpa)
> > >   		return false;
> > > -	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> > > +	return !!(count_trailing_zeros(offset) >= kvm_arch_required_alignment(gpa));
> > >   }
> > >   /*
> > > -- 
> > > 2.25.1
> > > 
> > > 
> > > 
> > > -- 
> > > Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 398+ messages in thread

* [RFC PATCH v3 0/2] Providing mount in memfd_restricted() syscall
@ 2023-03-31 23:50 Ackerley Tng
  2023-03-31 23:50 ` [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted Ackerley Tng
  2023-03-31 23:50 ` [RFC PATCH v3 2/2] selftests: restrictedmem: Check hugepage-ness of shmem file backing restrictedmem fd Ackerley Tng
  0 siblings, 2 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-03-31 23:50 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, Ackerley Tng

Hello,

This patchset builds upon the memfd_restricted() system call that was
discussed in the ‘KVM: mm: fd-based approach for supporting KVM’ patch
series, at
https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/

The tree can be found at:
https://github.com/googleprodkernel/linux-cc/tree/restrictedmem-provide-mount-fd-rfc-v3

This patchset proposes a modification to the memfd_restricted() syscall
that allows userspace to provide a mount on which the restrictedmem file
will be created before being returned from memfd_restricted().

Allowing userspace to provide a mount lets it control various memory
allocation policies via tmpfs mount options, such as the Transparent
HugePage allocation policy through ‘huge=always/never’ and the NUMA
allocation policy through ‘mpol=local/bind:*’.
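
As a usage illustration (not part of this series), a minimal userspace
sketch of how a VMM might use this; the syscall number below is made up
since the syscall is not upstream, and the mount path is just an example:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Hypothetical number for illustration only; the real value would come
   * from the kernel uapi headers once the series lands. */
  #ifndef __NR_memfd_restricted
  #define __NR_memfd_restricted 451
  #endif
  #define RMFD_USERMNT 0x0001U

  int main(void)
  {
  	/* Root of a tmpfs mounted with e.g. huge=always,mpol=bind:0 */
  	int mount_fd = open("/mnt/restricted_tmpfs", O_RDONLY);
  	if (mount_fd < 0) {
  		perror("open");
  		return 1;
  	}

  	/* Create a restrictedmem fd whose backing file lives on that mount. */
  	int rmem_fd = syscall(__NR_memfd_restricted, RMFD_USERMNT, mount_fd);
  	if (rmem_fd < 0)
  		perror("memfd_restricted");

  	close(mount_fd);
  	return rmem_fd < 0 ? 1 : 0;
  }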

Changes since RFCv2:
+ Tightened semantics to accept only fds of the root of a tmpfs mount,
  as Christian suggested
+ Added permissions check on the inode represented by the fd to guard
  against creation of restrictedmem files on read-only tmpfs
  filesystems or mounts
+ Renamed RMFD_TMPFILE to RMFD_USERMNT to better represent providing a
  userspace mount to create a restrictedmem file on
+ Updated selftests for tighter semantics and added selftests to check
  for permissions

Changes since RFCv1:
+ Use fd to represent mount instead of path string, as Kirill
  suggested. I believe using fds makes this syscall interface more
  aligned with the other syscalls like fsopen(), fsconfig(), and
  fsmount() in terms of using and passing around fds
+ Remove unused variable char *orig_shmem_enabled from selftests

Dependencies:
+ Sean’s iteration of the ‘KVM: mm: fd-based approach for supporting
  KVM’ patch series at
  https://github.com/sean-jc/linux/tree/x86/upm_base_support
+ Proposed fixes for these issues mentioned on the mailing list:
    + https://lore.kernel.org/lkml/diqzzga0fv96.fsf@ackerleytng-cloudtop-sg.c.googlers.com/

Links to earlier patch series:
+ RFC v2: https://lore.kernel.org/lkml/cover.1679428901.git.ackerleytng@google.com/T/
+ RFC v1: https://lore.kernel.org/lkml/cover.1676507663.git.ackerleytng@google.com/T/

---

Ackerley Tng (2):
  mm: restrictedmem: Allow userspace to specify mount for
    memfd_restricted
  selftests: restrictedmem: Check hugepage-ness of shmem file backing
    restrictedmem fd

 include/linux/syscalls.h                      |   2 +-
 include/uapi/linux/restrictedmem.h            |   8 +
 mm/restrictedmem.c                            |  74 ++-
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/restrictedmem/.gitignore        |   3 +
 .../testing/selftests/restrictedmem/Makefile  |  15 +
 .../testing/selftests/restrictedmem/common.c  |   9 +
 .../testing/selftests/restrictedmem/common.h  |   8 +
 .../restrictedmem_hugepage_test.c             | 486 ++++++++++++++++++
 9 files changed, 599 insertions(+), 7 deletions(-)
 create mode 100644 include/uapi/linux/restrictedmem.h
 create mode 100644 tools/testing/selftests/restrictedmem/.gitignore
 create mode 100644 tools/testing/selftests/restrictedmem/Makefile
 create mode 100644 tools/testing/selftests/restrictedmem/common.c
 create mode 100644 tools/testing/selftests/restrictedmem/common.h
 create mode 100644 tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c

--
2.40.0.348.gf938b09366-goog

^ permalink raw reply	[flat|nested] 398+ messages in thread

* [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-03-31 23:50 [RFC PATCH v3 0/2] Providing mount in memfd_restricted() syscall Ackerley Tng
@ 2023-03-31 23:50 ` Ackerley Tng
  2023-04-03  8:21   ` David Hildenbrand
                     ` (2 more replies)
  2023-03-31 23:50 ` [RFC PATCH v3 2/2] selftests: restrictedmem: Check hugepage-ness of shmem file backing restrictedmem fd Ackerley Tng
  1 sibling, 3 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-03-31 23:50 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, Ackerley Tng

By default, the backing shmem file for a restrictedmem fd is created
on shmem's kernel space mount.

With this patch, an optional tmpfs mount can be specified via an fd,
which will be used as the mountpoint for backing the shmem file
associated with a restrictedmem fd.

This will help restrictedmem fds inherit the properties of the
provided tmpfs mounts, for example, hugepage allocation hints, NUMA
binding hints, etc.

Permission checking for the fd passed to memfd_restricted() is modeled
after the openat() syscall, since both of these allow creation of a file
upon a mount/directory.

Permission to reference the mount the fd represents is checked upon fd
creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
etc) and any process that can present memfd_restricted() with a valid
fd is expected to have obtained permission to use the mount
represented by the fd. This behavior is intended to parallel that of
the openat() syscall.

memfd_restricted() will check that the tmpfs superblock is
writable, and that the mount is also writable, before attempting to
create a restrictedmem file on the mount.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/syscalls.h           |  2 +-
 include/uapi/linux/restrictedmem.h |  8 ++++
 mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
 3 files changed, 77 insertions(+), 7 deletions(-)
 create mode 100644 include/uapi/linux/restrictedmem.h

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f9e9e0c820c5..a23c4c385cd3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
-asmlinkage long sys_memfd_restricted(unsigned int flags);
+asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);

 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
new file mode 100644
index 000000000000..22d6f2285f6d
--- /dev/null
+++ b/include/uapi/linux/restrictedmem.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
+#define _UAPI_LINUX_RESTRICTEDMEM_H
+
+/* flags for memfd_restricted */
+#define RMFD_USERMNT		0x0001U
+
+#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index c5d869d8c2d8..f7b62364a31a 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -1,11 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0
-#include "linux/sbitmap.h"
+#include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/pseudo_fs.h>
 #include <linux/shmem_fs.h>
 #include <linux/syscalls.h>
 #include <uapi/linux/falloc.h>
 #include <uapi/linux/magic.h>
+#include <uapi/linux/restrictedmem.h>
 #include <linux/restrictedmem.h>

 struct restrictedmem {
@@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
 	return file;
 }

-SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
+static int restrictedmem_create(struct vfsmount *mount)
 {
 	struct file *file, *restricted_file;
 	int fd, err;

-	if (flags)
-		return -EINVAL;
-
 	fd = get_unused_fd_flags(0);
 	if (fd < 0)
 		return fd;

-	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+	if (mount)
+		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
+	else
+		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+
 	if (IS_ERR(file)) {
 		err = PTR_ERR(file);
 		goto err_fd;
@@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
 	return err;
 }

+static bool is_shmem_mount(struct vfsmount *mnt)
+{
+	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
+}
+
+static bool is_mount_root(struct file *file)
+{
+	return file->f_path.dentry == file->f_path.mnt->mnt_root;
+}
+
+static int restrictedmem_create_on_user_mount(int mount_fd)
+{
+	int ret;
+	struct fd f;
+	struct vfsmount *mnt;
+
+	f = fdget_raw(mount_fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EINVAL;
+	if (!is_mount_root(f.file))
+		goto out;
+
+	mnt = f.file->f_path.mnt;
+	if (!is_shmem_mount(mnt))
+		goto out;
+
+	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
+	if (ret)
+		goto out;
+
+	ret = mnt_want_write(mnt);
+	if (unlikely(ret))
+		goto out;
+
+	ret = restrictedmem_create(mnt);
+
+	mnt_drop_write(mnt);
+out:
+	fdput(f);
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
+{
+	if (flags & ~RMFD_USERMNT)
+		return -EINVAL;
+
+	if (flags == RMFD_USERMNT) {
+		if (mount_fd < 0)
+			return -EINVAL;
+
+		return restrictedmem_create_on_user_mount(mount_fd);
+	} else {
+		return restrictedmem_create(NULL);
+	}
+}
+
 int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
 		       struct restrictedmem_notifier *notifier, bool exclusive)
 {
--
2.40.0.348.gf938b09366-goog

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* [RFC PATCH v3 2/2] selftests: restrictedmem: Check hugepage-ness of shmem file backing restrictedmem fd
  2023-03-31 23:50 [RFC PATCH v3 0/2] Providing mount in memfd_restricted() syscall Ackerley Tng
  2023-03-31 23:50 ` [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted Ackerley Tng
@ 2023-03-31 23:50 ` Ackerley Tng
  2023-04-03  8:24   ` David Hildenbrand
  1 sibling, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-03-31 23:50 UTC (permalink / raw)
  To: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, linmiaohe, luto,
	mail, mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini,
	qperret, rppt, seanjc, shuah, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang, Ackerley Tng

For memfd_restricted() calls without a userspace mount, the backing
file should be created on the kernel-internal shmem mount, and the
size of the backing pages should be as defined by the system-wide
shmem configuration.

If a userspace mount is provided, the size of backing pages should be
as defined in the mount.

Also includes negative tests for invalid inputs, including fds
representing read-only superblocks/mounts.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/restrictedmem/.gitignore        |   3 +
 .../testing/selftests/restrictedmem/Makefile  |  15 +
 .../testing/selftests/restrictedmem/common.c  |   9 +
 .../testing/selftests/restrictedmem/common.h  |   8 +
 .../restrictedmem_hugepage_test.c             | 486 ++++++++++++++++++
 6 files changed, 522 insertions(+)
 create mode 100644 tools/testing/selftests/restrictedmem/.gitignore
 create mode 100644 tools/testing/selftests/restrictedmem/Makefile
 create mode 100644 tools/testing/selftests/restrictedmem/common.c
 create mode 100644 tools/testing/selftests/restrictedmem/common.h
 create mode 100644 tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index f07aef7c592c..44078eeefb79 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -60,6 +60,7 @@ TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
 TARGETS += resctrl
+TARGETS += restrictedmem
 TARGETS += rlimits
 TARGETS += rseq
 TARGETS += rtc
diff --git a/tools/testing/selftests/restrictedmem/.gitignore b/tools/testing/selftests/restrictedmem/.gitignore
new file mode 100644
index 000000000000..2581bcc8ff29
--- /dev/null
+++ b/tools/testing/selftests/restrictedmem/.gitignore
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+restrictedmem_hugepage_test
diff --git a/tools/testing/selftests/restrictedmem/Makefile b/tools/testing/selftests/restrictedmem/Makefile
new file mode 100644
index 000000000000..8e5378d20226
--- /dev/null
+++ b/tools/testing/selftests/restrictedmem/Makefile
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS = $(KHDR_INCLUDES)
+CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -std=gnu99
+
+TEST_GEN_PROGS += restrictedmem_hugepage_test
+
+include ../lib.mk
+
+EXTRA_CLEAN = $(OUTPUT)/common.o
+
+$(OUTPUT)/common.o: common.c
+	$(CC) $(CFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c -ffreestanding $< -o $@
+
+$(TEST_GEN_PROGS): $(OUTPUT)/common.o
diff --git a/tools/testing/selftests/restrictedmem/common.c b/tools/testing/selftests/restrictedmem/common.c
new file mode 100644
index 000000000000..03dac843404f
--- /dev/null
+++ b/tools/testing/selftests/restrictedmem/common.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <sys/syscall.h>
+#include <unistd.h>
+
+int memfd_restricted(unsigned int flags, int mount_fd)
+{
+	return syscall(__NR_memfd_restricted, flags, mount_fd);
+}
diff --git a/tools/testing/selftests/restrictedmem/common.h b/tools/testing/selftests/restrictedmem/common.h
new file mode 100644
index 000000000000..06284ed86baf
--- /dev/null
+++ b/tools/testing/selftests/restrictedmem/common.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef SELFTESTS_RESTRICTEDMEM_COMMON_H
+#define SELFTESTS_RESTRICTEDMEM_COMMON_H
+
+int memfd_restricted(unsigned int flags, int mount_fd);
+
+#endif  // SELFTESTS_RESTRICTEDMEM_COMMON_H
diff --git a/tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c b/tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c
new file mode 100644
index 000000000000..9ed319b83cb8
--- /dev/null
+++ b/tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c
@@ -0,0 +1,486 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE /* for O_PATH */
+#define _POSIX_C_SOURCE /* for PATH_MAX */
+#include <limits.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "linux/restrictedmem.h"
+
+#include "common.h"
+#include "../kselftest_harness.h"
+
+/*
+ * Expect policy to be one of always, within_size, advise, never,
+ * deny, force
+ */
+#define POLICY_BUF_SIZE 12
+
+static int get_hpage_pmd_size(void)
+{
+	FILE *fp;
+	char buf[100];
+	char *ret;
+	int size;
+
+	fp = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
+	if (!fp)
+		return -1;
+
+	ret = fgets(buf, 100, fp);
+	if (ret != buf) {
+		size = -1;
+		goto out;
+	}
+
+	if (sscanf(buf, "%d\n", &size) != 1)
+		size = -1;
+
+out:
+	fclose(fp);
+
+	return size;
+}
+
+static bool is_valid_shmem_thp_policy(char *policy)
+{
+	if (strcmp(policy, "always") == 0)
+		return true;
+	if (strcmp(policy, "within_size") == 0)
+		return true;
+	if (strcmp(policy, "advise") == 0)
+		return true;
+	if (strcmp(policy, "never") == 0)
+		return true;
+	if (strcmp(policy, "deny") == 0)
+		return true;
+	if (strcmp(policy, "force") == 0)
+		return true;
+
+	return false;
+}
+
+static int get_shmem_thp_policy(char *policy)
+{
+	FILE *fp;
+	char buf[100];
+	char *left = NULL;
+	char *right = NULL;
+	int ret = -1;
+
+	fp = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");
+	if (!fp)
+		return -1;
+
+	if (fgets(buf, 100, fp) != buf)
+		goto out;
+
+	/*
+	 * Expect shmem_enabled to be of format like "always within_size advise
+	 * [never] deny force"
+	 */
+	left = memchr(buf, '[', 100);
+	if (!left)
+		goto out;
+
+	right = memchr(buf, ']', 100);
+	if (!right)
+		goto out;
+
+	memcpy(policy, left + 1, right - left - 1);
+
+	ret = !is_valid_shmem_thp_policy(policy);
+
+out:
+	fclose(fp);
+	return ret;
+}
+
+static int write_string_to_file(const char *path, const char *string)
+{
+	FILE *fp;
+	size_t len = strlen(string);
+	int ret = -1;
+
+	fp = fopen(path, "w");
+	if (!fp)
+		return ret;
+
+	if (fwrite(string, 1, len, fp) != len)
+		goto out;
+
+	ret = 0;
+
+out:
+	fclose(fp);
+	return ret;
+}
+
+static int set_shmem_thp_policy(char *policy)
+{
+	int ret = -1;
+	/* +1 for newline */
+	char to_write[POLICY_BUF_SIZE + 1] = { 0 };
+
+	if (!is_valid_shmem_thp_policy(policy))
+		return ret;
+
+	ret = snprintf(to_write, POLICY_BUF_SIZE + 1, "%s\n", policy);
+	if (ret != strlen(policy) + 1)
+		return -1;
+
+	ret = write_string_to_file(
+		"/sys/kernel/mm/transparent_hugepage/shmem_enabled", to_write);
+
+	return ret;
+}
+
+FIXTURE(reset_shmem_enabled)
+{
+	char shmem_enabled[POLICY_BUF_SIZE];
+};
+
+FIXTURE_SETUP(reset_shmem_enabled)
+{
+	memset(self->shmem_enabled, 0, POLICY_BUF_SIZE);
+	ASSERT_EQ(get_shmem_thp_policy(self->shmem_enabled), 0);
+}
+
+FIXTURE_TEARDOWN(reset_shmem_enabled)
+{
+	ASSERT_EQ(set_shmem_thp_policy(self->shmem_enabled), 0);
+}
+
+TEST_F(reset_shmem_enabled, restrictedmem_fstat_shmem_enabled_never)
+{
+	int fd = -1;
+	struct stat stat;
+
+	ASSERT_EQ(set_shmem_thp_policy("never"), 0);
+
+	fd = memfd_restricted(0, -1);
+	ASSERT_GT(fd, 0);
+
+	ASSERT_EQ(fstat(fd, &stat), 0);
+
+	/*
+	 * st_blksize is set based on the superblock's s_blocksize_bits. For
+	 * shmem, this is set to PAGE_SHIFT
+	 */
+	ASSERT_EQ(stat.st_blksize, getpagesize());
+
+	close(fd);
+}
+
+TEST_F(reset_shmem_enabled, restrictedmem_fstat_shmem_enabled_always)
+{
+	int fd = -1;
+	struct stat stat;
+
+	ASSERT_EQ(set_shmem_thp_policy("always"), 0);
+
+	fd = memfd_restricted(0, -1);
+	ASSERT_GT(fd, 0);
+
+	ASSERT_EQ(fstat(fd, &stat), 0);
+
+	ASSERT_EQ(stat.st_blksize, get_hpage_pmd_size());
+
+	close(fd);
+}
+
+TEST(restrictedmem_tmpfile_invalid_fd)
+{
+	int fd = memfd_restricted(RMFD_USERMNT, -2);
+
+	ASSERT_EQ(fd, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+TEST(restrictedmem_tmpfile_fd_not_a_mount)
+{
+	int fd = memfd_restricted(RMFD_USERMNT, STDOUT_FILENO);
+
+	ASSERT_EQ(fd, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+TEST(restrictedmem_tmpfile_not_tmpfs_mount)
+{
+	int fd = -1;
+	int mfd = -1;
+
+	mfd = open("/proc", O_PATH);
+	ASSERT_NE(mfd, -1);
+
+	fd = memfd_restricted(RMFD_USERMNT, mfd);
+
+	ASSERT_EQ(fd, -1);
+	ASSERT_EQ(errno, EINVAL);
+}
+
+FIXTURE(tmpfs_hugepage_sfd)
+{
+	int sfd;
+};
+
+FIXTURE_SETUP(tmpfs_hugepage_sfd)
+{
+	self->sfd = fsopen("tmpfs", 0);
+	ASSERT_NE(self->sfd, -1);
+}
+
+FIXTURE_TEARDOWN(tmpfs_hugepage_sfd)
+{
+	EXPECT_EQ(close(self->sfd), 0);
+}
+
+TEST_F(tmpfs_hugepage_sfd, restrictedmem_fstat_tmpfs_huge_always)
+{
+	int ret = -1;
+	int fd = -1;
+	int mfd = -1;
+	struct stat stat;
+
+	fsconfig(self->sfd, FSCONFIG_SET_STRING, "huge", "always", 0);
+	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+	mfd = fsmount(self->sfd, 0, 0);
+	ASSERT_NE(mfd, -1);
+
+	fd = memfd_restricted(RMFD_USERMNT, mfd);
+	ASSERT_GT(fd, 0);
+
+	/* User can close reference to mount */
+	ret = close(mfd);
+	ASSERT_EQ(ret, 0);
+
+	ret = fstat(fd, &stat);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(stat.st_blksize, get_hpage_pmd_size());
+
+	close(fd);
+}
+
+TEST_F(tmpfs_hugepage_sfd, restrictedmem_fstat_tmpfs_huge_never)
+{
+	int ret = -1;
+	int fd = -1;
+	int mfd = -1;
+	struct stat stat;
+
+	fsconfig(self->sfd, FSCONFIG_SET_STRING, "huge", "never", 0);
+	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+	mfd = fsmount(self->sfd, 0, 0);
+	ASSERT_NE(mfd, -1);
+
+	fd = memfd_restricted(RMFD_USERMNT, mfd);
+	ASSERT_GT(fd, 0);
+
+	/* User can close reference to mount */
+	ret = close(mfd);
+	ASSERT_EQ(ret, 0);
+
+	ret = fstat(fd, &stat);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(stat.st_blksize, getpagesize());
+
+	close(fd);
+}
+
+TEST_F(tmpfs_hugepage_sfd, restrictedmem_check_mount_flags)
+{
+	int ret = -1;
+	int fd = -1;
+	int mfd = -1;
+
+	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+	mfd = fsmount(self->sfd, 0, MOUNT_ATTR_RDONLY);
+	ASSERT_NE(mfd, -1);
+
+	fd = memfd_restricted(RMFD_USERMNT, mfd);
+	ASSERT_EQ(fd, -1);
+	ASSERT_EQ(errno, EROFS);
+
+	ret = close(mfd);
+	ASSERT_EQ(ret, 0);
+}
+
+TEST_F(tmpfs_hugepage_sfd, restrictedmem_check_superblock_flags)
+{
+	int ret = -1;
+	int fd = -1;
+	int mfd = -1;
+
+	fsconfig(self->sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
+	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+
+	mfd = fsmount(self->sfd, 0, 0);
+	ASSERT_NE(mfd, -1);
+
+	fd = memfd_restricted(RMFD_USERMNT, mfd);
+	ASSERT_EQ(fd, -1);
+	ASSERT_EQ(errno, EROFS);
+
+	ret = close(mfd);
+	ASSERT_EQ(ret, 0);
+}
+
+static bool directory_exists(const char *path)
+{
+	struct stat sb;
+
+	return stat(path, &sb) == 0 && S_ISDIR(sb.st_mode);
+}
+
+FIXTURE(tmpfs_hugepage_mount_path)
+{
+	char *mount_path;
+};
+
+FIXTURE_SETUP(tmpfs_hugepage_mount_path)
+{
+	int ret = -1;
+
+	/* /tmp is an FHS-mandated world-writable directory */
+	self->mount_path = "/tmp/restrictedmem-selftest-mnt";
+
+	if (!directory_exists(self->mount_path)) {
+		ret = mkdir(self->mount_path, 0777);
+		ASSERT_EQ(ret, 0);
+	}
+}
+
+FIXTURE_TEARDOWN(tmpfs_hugepage_mount_path)
+{
+	int ret = -1;
+
+	if (!directory_exists(self->mount_path))
+		return;
+
+	ret = umount2(self->mount_path, MNT_FORCE);
+	EXPECT_EQ(ret, 0);
+	if (ret == -1 && errno == EINVAL)
+		fprintf(stderr, "  %s was not mounted\n", self->mount_path);
+
+	ret = rmdir(self->mount_path);
+	EXPECT_EQ(ret, 0);
+	if (ret == -1)
+		fprintf(stderr, "  rmdir(%s) failed: %m\n", self->mount_path);
+}
+
+/*
+ * memfd_restricted() syscall can only be used with the fd of the root of the
+ * mount. When the restrictedmem's fd is open, a user should not be able to
+ * unmount or remove the mounted directory
+ */
+TEST_F(tmpfs_hugepage_mount_path, restrictedmem_umount_rmdir_while_file_open)
+{
+	int ret = -1;
+	int fd = -1;
+	int mfd = -1;
+	struct stat stat;
+
+	ret = mount("name", self->mount_path, "tmpfs", 0, "huge=always");
+	ASSERT_EQ(ret, 0);
+
+	mfd = open(self->mount_path, O_PATH);
+	ASSERT_NE(mfd, -1);
+
+	fd = memfd_restricted(RMFD_USERMNT, mfd);
+	ASSERT_GT(fd, 0);
+
+	/* We don't need this reference to the mount anymore */
+	ret = close(mfd);
+	ASSERT_EQ(ret, 0);
+
+	/* restrictedmem's fd should still be usable */
+	ret = fstat(fd, &stat);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(stat.st_blksize, get_hpage_pmd_size());
+
+	/* User should not be able to unmount directory */
+	ret = umount2(self->mount_path, MNT_FORCE);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, EBUSY);
+
+	ret = rmdir(self->mount_path);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, EBUSY);
+
+	close(fd);
+}
+
+/* The fd of a file on the mount cannot be provided as mount_fd */
+TEST_F(tmpfs_hugepage_mount_path, restrictedmem_provide_fd_of_file)
+{
+	int ret = -1;
+	int fd = -1;
+	int ffd = -1;
+	char tmp_file_path[PATH_MAX] = { 0 };
+
+	ret = mount("name", self->mount_path, "tmpfs", 0, "huge=always");
+	ASSERT_EQ(ret, 0);
+
+	snprintf(tmp_file_path, PATH_MAX, "%s/tmp-file", self->mount_path);
+	ret = write_string_to_file(tmp_file_path, "filler\n");
+	ASSERT_EQ(ret, 0);
+
+	ffd = open(tmp_file_path, O_RDWR);
+	ASSERT_GT(ffd, 0);
+
+	fd = memfd_restricted(RMFD_USERMNT, ffd);
+	ASSERT_LT(fd, 0);
+	ASSERT_EQ(errno, EINVAL);
+
+	ret = close(ffd);
+	ASSERT_EQ(ret, 0);
+
+	close(fd);
+	remove(tmp_file_path);
+}
+
+/* The fd of files on the mount cannot be provided as mount_fd */
+TEST_F(tmpfs_hugepage_mount_path, restrictedmem_provide_fd_of_file_in_subdir)
+{
+	int ret = -1;
+	int fd = -1;
+	int ffd = -1;
+	char tmp_dir_path[PATH_MAX] = { 0 };
+	char tmp_file_path[PATH_MAX] = { 0 };
+
+	ret = mount("name", self->mount_path, "tmpfs", 0, "huge=always");
+	ASSERT_EQ(ret, 0);
+
+	snprintf(tmp_dir_path, PATH_MAX, "%s/tmp-subdir", self->mount_path);
+	ret = mkdir(tmp_dir_path, 0777);
+	ASSERT_EQ(ret, 0);
+
+	snprintf(tmp_file_path, PATH_MAX, "%s/tmp-subdir/tmp-file",
+		 self->mount_path);
+	ret = write_string_to_file(tmp_file_path, "filler\n");
+	ASSERT_EQ(ret, 0);
+
+	ffd = open(tmp_file_path, O_RDWR);
+	ASSERT_NE(ffd, -1);
+
+	fd = memfd_restricted(RMFD_USERMNT, ffd);
+	ASSERT_LT(fd, 0);
+	ASSERT_EQ(errno, EINVAL);
+
+	ret = close(ffd);
+	ASSERT_EQ(ret, 0);
+
+	close(fd);
+	remove(tmp_file_path);
+	rmdir(tmp_dir_path);
+}
+
+TEST_HARNESS_MAIN
-- 
2.40.0.348.gf938b09366-goog


^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-03-31 23:50 ` [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted Ackerley Tng
@ 2023-04-03  8:21   ` David Hildenbrand
  2023-04-05 22:29     ` Ackerley Tng
  2023-04-04  8:25   ` Kirill A. Shutemov
  2023-04-04 13:53   ` Christian Brauner
  2 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2023-04-03  8:21 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-api, linux-arch, linux-doc,
	linux-fsdevel, linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, ddutile, dhildenb, hpa, hughd, jlayton, jmattson,
	joro, jun.nakajima, kirill.shutemov, linmiaohe, luto, mail,
	mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini, qperret,
	rppt, seanjc, shuah, steven.price, tabba, tglx, vannapurve,
	vbabka, vkuznets, wanpengli, wei.w.wang, x86, yu.c.zhang

On 01.04.23 01:50, Ackerley Tng wrote:
> By default, the backing shmem file for a restrictedmem fd is created
> on shmem's kernel space mount.
> 
> With this patch, an optional tmpfs mount can be specified via an fd,
> which will be used as the mountpoint for backing the shmem file
> associated with a restrictedmem fd.
> 
> This will help restrictedmem fds inherit the properties of the
> provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> binding hints, etc.
> 
> Permissions for the fd passed to memfd_restricted() is modeled after
> the openat() syscall, since both of these allow creation of a file
> upon a mount/directory.
> 
> Permission to reference the mount the fd represents is checked upon fd
> creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
> etc) and any process that can present memfd_restricted() with a valid
> fd is expected to have obtained permission to use the mount
> represented by the fd. This behavior is intended to parallel that of
> the openat() syscall.
> 
> memfd_restricted() will check that the tmpfs superblock is
> writable, and that the mount is also writable, before attempting to
> create a restrictedmem file on the mount.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>   include/linux/syscalls.h           |  2 +-
>   include/uapi/linux/restrictedmem.h |  8 ++++
>   mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
>   3 files changed, 77 insertions(+), 7 deletions(-)
>   create mode 100644 include/uapi/linux/restrictedmem.h
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f9e9e0c820c5..a23c4c385cd3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>   asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>   					    unsigned long home_node,
>   					    unsigned long flags);
> -asmlinkage long sys_memfd_restricted(unsigned int flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
> 
>   /*
>    * Architecture-specific system calls
> diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..22d6f2285f6d
> --- /dev/null
> +++ b/include/uapi/linux/restrictedmem.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
> +#define _UAPI_LINUX_RESTRICTEDMEM_H
> +
> +/* flags for memfd_restricted */
> +#define RMFD_USERMNT		0x0001U

I wonder if we can come up with a more expressive prefix than RMFD. 
Sounds more like "rm fd" ;) Maybe it should better match the 
"memfd_restricted" syscall name, like "MEMFD_RSTD_USERMNT".


> +
> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index c5d869d8c2d8..f7b62364a31a 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -1,11 +1,12 @@
>   // SPDX-License-Identifier: GPL-2.0
> -#include "linux/sbitmap.h"

Looks like an unrelated change?

> +#include <linux/namei.h>
>   #include <linux/pagemap.h>
>   #include <linux/pseudo_fs.h>
>   #include <linux/shmem_fs.h>
>   #include <linux/syscalls.h>
>   #include <uapi/linux/falloc.h>
>   #include <uapi/linux/magic.h>
> +#include <uapi/linux/restrictedmem.h>
>   #include <linux/restrictedmem.h>
> 
>   struct restrictedmem {
> @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>   	return file;
>   }
> 
> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +static int restrictedmem_create(struct vfsmount *mount)
>   {
>   	struct file *file, *restricted_file;
>   	int fd, err;
> 
> -	if (flags)
> -		return -EINVAL;
> -
>   	fd = get_unused_fd_flags(0);
>   	if (fd < 0)
>   		return fd;
> 
> -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +	if (mount)
> +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
> +	else
> +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +
>   	if (IS_ERR(file)) {
>   		err = PTR_ERR(file);
>   		goto err_fd;
> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>   	return err;
>   }
> 
> +static bool is_shmem_mount(struct vfsmount *mnt)
> +{
> +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> +}
> +
> +static bool is_mount_root(struct file *file)
> +{
> +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> +}

I'd inline at least that function, it's pretty self-explanatory.

> +
> +static int restrictedmem_create_on_user_mount(int mount_fd)
> +{
> +	int ret;
> +	struct fd f;
> +	struct vfsmount *mnt;
> +
> +	f = fdget_raw(mount_fd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = -EINVAL;
> +	if (!is_mount_root(f.file))
> +		goto out;
> +
> +	mnt = f.file->f_path.mnt;
> +	if (!is_shmem_mount(mnt))
> +		goto out;
> +
> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
> +	if (ret)
> +		goto out;
> +
> +	ret = mnt_want_write(mnt);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	ret = restrictedmem_create(mnt);
> +
> +	mnt_drop_write(mnt);
> +out:
> +	fdput(f);
> +
> +	return ret;
> +}
> +
> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> +{
> +	if (flags & ~RMFD_USERMNT)
> +		return -EINVAL;
> +
> +	if (flags == RMFD_USERMNT) {
> +		if (mount_fd < 0)
> +			return -EINVAL;
> +
> +		return restrictedmem_create_on_user_mount(mount_fd);
> +	} else {
> +		return restrictedmem_create(NULL);
> +	}


You can drop the else case:

if (flags == RMFD_USERMNT) {
	...
	return restrictedmem_create_on_user_mount(mount_fd);
}
return restrictedmem_create(NULL);


I do wonder if you want to properly check for a flag instead of
comparing values. That results in a more natural way to deal with flags:

if (flags & RMFD_USERMNT) {

}
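
Putting those two together, the syscall body could look roughly like
this (untested sketch):

	SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
	{
		if (flags & ~RMFD_USERMNT)
			return -EINVAL;

		if (flags & RMFD_USERMNT) {
			if (mount_fd < 0)
				return -EINVAL;

			return restrictedmem_create_on_user_mount(mount_fd);
		}

		return restrictedmem_create(NULL);
	}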

> +}
> +
>   int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>   		       struct restrictedmem_notifier *notifier, bool exclusive)
>   {

The "memfd_restricted" vs. "restrictedmem" terminology is a bit 
unfortunate, but not your fault here.


I'm not a FS person, but it does look good to me.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 2/2] selftests: restrictedmem: Check hugepage-ness of shmem file backing restrictedmem fd
  2023-03-31 23:50 ` [RFC PATCH v3 2/2] selftests: restrictedmem: Check hugepage-ness of shmem file backing restrictedmem fd Ackerley Tng
@ 2023-04-03  8:24   ` David Hildenbrand
  2023-04-11  1:35     ` Ackerley Tng
  0 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2023-04-03  8:24 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-api, linux-arch, linux-doc,
	linux-fsdevel, linux-kernel, linux-mm, qemu-devel
  Cc: aarcange, ak, akpm, arnd, bfields, bp, chao.p.peng, corbet,
	dave.hansen, ddutile, dhildenb, hpa, hughd, jlayton, jmattson,
	joro, jun.nakajima, kirill.shutemov, linmiaohe, luto, mail,
	mhocko, michael.roth, mingo, naoya.horiguchi, pbonzini, qperret,
	rppt, seanjc, shuah, steven.price, tabba, tglx, vannapurve,
	vbabka, vkuznets, wanpengli, wei.w.wang, x86, yu.c.zhang

On 01.04.23 01:50, Ackerley Tng wrote:
> For memfd_restricted() calls without a userspace mount, the backing
> file should be the shmem mount in the kernel, and the size of backing
> pages should be as defined by system-wide shmem configuration.
> 
> If a userspace mount is provided, the size of backing pages should be
> as defined in the mount.
> 
> Also includes negative tests for invalid inputs, including fds
> representing read-only superblocks/mounts.
> 

When you talk about "hugepage" in this patch, do you mean THP or 
hugetlb? I suspect thp, so it might be better to spell that out. IIRC, 
there are plans to support actual huge pages in the future, at which 
point "hugepage" terminology could be misleading.

> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>   tools/testing/selftests/Makefile              |   1 +
>   .../selftests/restrictedmem/.gitignore        |   3 +
>   .../testing/selftests/restrictedmem/Makefile  |  15 +
>   .../testing/selftests/restrictedmem/common.c  |   9 +
>   .../testing/selftests/restrictedmem/common.h  |   8 +
>   .../restrictedmem_hugepage_test.c             | 486 ++++++++++++++++++
>   6 files changed, 522 insertions(+)
>   create mode 100644 tools/testing/selftests/restrictedmem/.gitignore
>   create mode 100644 tools/testing/selftests/restrictedmem/Makefile
>   create mode 100644 tools/testing/selftests/restrictedmem/common.c
>   create mode 100644 tools/testing/selftests/restrictedmem/common.h
>   create mode 100644 tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index f07aef7c592c..44078eeefb79 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -60,6 +60,7 @@ TARGETS += pstore
>   TARGETS += ptrace
>   TARGETS += openat2
>   TARGETS += resctrl
> +TARGETS += restrictedmem
>   TARGETS += rlimits
>   TARGETS += rseq
>   TARGETS += rtc
> diff --git a/tools/testing/selftests/restrictedmem/.gitignore b/tools/testing/selftests/restrictedmem/.gitignore
> new file mode 100644
> index 000000000000..2581bcc8ff29
> --- /dev/null
> +++ b/tools/testing/selftests/restrictedmem/.gitignore
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +
> +restrictedmem_hugepage_test
> diff --git a/tools/testing/selftests/restrictedmem/Makefile b/tools/testing/selftests/restrictedmem/Makefile
> new file mode 100644
> index 000000000000..8e5378d20226
> --- /dev/null
> +++ b/tools/testing/selftests/restrictedmem/Makefile
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +CFLAGS = $(KHDR_INCLUDES)
> +CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -std=gnu99
> +
> +TEST_GEN_PROGS += restrictedmem_hugepage_test
> +
> +include ../lib.mk
> +
> +EXTRA_CLEAN = $(OUTPUT)/common.o
> +
> +$(OUTPUT)/common.o: common.c
> +	$(CC) $(CFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c -ffreestanding $< -o $@
> +
> +$(TEST_GEN_PROGS): $(OUTPUT)/common.o
> diff --git a/tools/testing/selftests/restrictedmem/common.c b/tools/testing/selftests/restrictedmem/common.c
> new file mode 100644
> index 000000000000..03dac843404f
> --- /dev/null
> +++ b/tools/testing/selftests/restrictedmem/common.c
> @@ -0,0 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +int memfd_restricted(unsigned int flags, int mount_fd)
> +{
> +	return syscall(__NR_memfd_restricted, flags, mount_fd);
> +}
> diff --git a/tools/testing/selftests/restrictedmem/common.h b/tools/testing/selftests/restrictedmem/common.h
> new file mode 100644
> index 000000000000..06284ed86baf
> --- /dev/null
> +++ b/tools/testing/selftests/restrictedmem/common.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#ifndef SELFTESTS_RESTRICTEDMEM_COMMON_H
> +#define SELFTESTS_RESTRICTEDMEM_COMMON_H
> +
> +int memfd_restricted(unsigned int flags, int mount_fd);
> +
> +#endif  // SELFTESTS_RESTRICTEDMEM_COMMON_H
> diff --git a/tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c b/tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c
> new file mode 100644
> index 000000000000..9ed319b83cb8
> --- /dev/null
> +++ b/tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c
> @@ -0,0 +1,486 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#define _GNU_SOURCE /* for O_PATH */
> +#define _POSIX_C_SOURCE /* for PATH_MAX */
> +#include <limits.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <sys/mount.h>
> +#include <sys/stat.h>
> +#include <unistd.h>
> +
> +#include "linux/restrictedmem.h"
> +
> +#include "common.h"
> +#include "../kselftest_harness.h"
> +
> +/*
> + * Expect policy to be one of always, within_size, advise, never,
> + * deny, force
> + */
> +#define POLICY_BUF_SIZE 12
> +
> +static int get_hpage_pmd_size(void)
> +{
> +	FILE *fp;
> +	char buf[100];
> +	char *ret;
> +	int size;
> +
> +	fp = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
> +	if (!fp)
> +		return -1;
> +
> +	ret = fgets(buf, 100, fp);
> +	if (ret != buf) {
> +		size = -1;
> +		goto out;
> +	}
> +
> +	if (sscanf(buf, "%d\n", &size) != 1)
> +		size = -1;
> +
> +out:
> +	fclose(fp);
> +
> +	return size;
> +}
> +
> +static bool is_valid_shmem_thp_policy(char *policy)
> +{
> +	if (strcmp(policy, "always") == 0)
> +		return true;
> +	if (strcmp(policy, "within_size") == 0)
> +		return true;
> +	if (strcmp(policy, "advise") == 0)
> +		return true;
> +	if (strcmp(policy, "never") == 0)
> +		return true;
> +	if (strcmp(policy, "deny") == 0)
> +		return true;
> +	if (strcmp(policy, "force") == 0)
> +		return true;
> +
> +	return false;
> +}
> +
> +static int get_shmem_thp_policy(char *policy)
> +{
> +	FILE *fp;
> +	char buf[100];
> +	char *left = NULL;
> +	char *right = NULL;
> +	int ret = -1;
> +
> +	fp = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");
> +	if (!fp)
> +		return -1;
> +
> +	if (fgets(buf, 100, fp) != buf)
> +		goto out;
> +
> +	/*
> +	 * Expect shmem_enabled to be of format like "always within_size advise
> +	 * [never] deny force"
> +	 */
> +	left = memchr(buf, '[', 100);
> +	if (!left)
> +		goto out;
> +
> +	right = memchr(buf, ']', 100);
> +	if (!right)
> +		goto out;
> +
> +	memcpy(policy, left + 1, right - left - 1);
> +
> +	ret = !is_valid_shmem_thp_policy(policy);
> +
> +out:
> +	fclose(fp);
> +	return ret;
> +}
> +
> +static int write_string_to_file(const char *path, const char *string)
> +{
> +	FILE *fp;
> +	size_t len = strlen(string);
> +	int ret = -1;
> +
> +	fp = fopen(path, "w");
> +	if (!fp)
> +		return ret;
> +
> +	if (fwrite(string, 1, len, fp) != len)
> +		goto out;
> +
> +	ret = 0;
> +
> +out:
> +	fclose(fp);
> +	return ret;
> +}
> +
> +static int set_shmem_thp_policy(char *policy)
> +{
> +	int ret = -1;
> +	/* +1 for newline */
> +	char to_write[POLICY_BUF_SIZE + 1] = { 0 };
> +
> +	if (!is_valid_shmem_thp_policy(policy))
> +		return ret;
> +
> +	ret = snprintf(to_write, POLICY_BUF_SIZE + 1, "%s\n", policy);
> +	if (ret != strlen(policy) + 1)
> +		return -1;
> +
> +	ret = write_string_to_file(
> +		"/sys/kernel/mm/transparent_hugepage/shmem_enabled", to_write);
> +
> +	return ret;
> +}
> +
> +FIXTURE(reset_shmem_enabled)
> +{
> +	char shmem_enabled[POLICY_BUF_SIZE];
> +};
> +
> +FIXTURE_SETUP(reset_shmem_enabled)
> +{
> +	memset(self->shmem_enabled, 0, POLICY_BUF_SIZE);
> +	ASSERT_EQ(get_shmem_thp_policy(self->shmem_enabled), 0);
> +}
> +
> +FIXTURE_TEARDOWN(reset_shmem_enabled)
> +{
> +	ASSERT_EQ(set_shmem_thp_policy(self->shmem_enabled), 0);
> +}
> +
> +TEST_F(reset_shmem_enabled, restrictedmem_fstat_shmem_enabled_never)
> +{
> +	int fd = -1;
> +	struct stat stat;
> +
> +	ASSERT_EQ(set_shmem_thp_policy("never"), 0);
> +
> +	fd = memfd_restricted(0, -1);
> +	ASSERT_GT(fd, 0);
> +
> +	ASSERT_EQ(fstat(fd, &stat), 0);
> +
> +	/*
> +	 * st_blksize is set based on the superblock's s_blocksize_bits. For
> +	 * shmem, this is set to PAGE_SHIFT
> +	 */
> +	ASSERT_EQ(stat.st_blksize, getpagesize());
> +
> +	close(fd);
> +}
> +
> +TEST_F(reset_shmem_enabled, restrictedmem_fstat_shmem_enabled_always)
> +{
> +	int fd = -1;
> +	struct stat stat;
> +
> +	ASSERT_EQ(set_shmem_thp_policy("always"), 0);
> +
> +	fd = memfd_restricted(0, -1);
> +	ASSERT_GT(fd, 0);
> +
> +	ASSERT_EQ(fstat(fd, &stat), 0);
> +
> +	ASSERT_EQ(stat.st_blksize, get_hpage_pmd_size());
> +
> +	close(fd);
> +}
> +
> +TEST(restrictedmem_tmpfile_invalid_fd)
> +{
> +	int fd = memfd_restricted(RMFD_USERMNT, -2);
> +
> +	ASSERT_EQ(fd, -1);
> +	ASSERT_EQ(errno, EINVAL);
> +}
> +
> +TEST(restrictedmem_tmpfile_fd_not_a_mount)
> +{
> +	int fd = memfd_restricted(RMFD_USERMNT, STDOUT_FILENO);
> +
> +	ASSERT_EQ(fd, -1);
> +	ASSERT_EQ(errno, EINVAL);
> +}
> +
> +TEST(restrictedmem_tmpfile_not_tmpfs_mount)
> +{
> +	int fd = -1;
> +	int mfd = -1;
> +
> +	mfd = open("/proc", O_PATH);
> +	ASSERT_NE(mfd, -1);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, mfd);
> +
> +	ASSERT_EQ(fd, -1);
> +	ASSERT_EQ(errno, EINVAL);
> +}
> +
> +FIXTURE(tmpfs_hugepage_sfd)
> +{
> +	int sfd;
> +};
> +
> +FIXTURE_SETUP(tmpfs_hugepage_sfd)
> +{
> +	self->sfd = fsopen("tmpfs", 0);
> +	ASSERT_NE(self->sfd, -1);
> +}
> +
> +FIXTURE_TEARDOWN(tmpfs_hugepage_sfd)
> +{
> +	EXPECT_EQ(close(self->sfd), 0);
> +}
> +
> +TEST_F(tmpfs_hugepage_sfd, restrictedmem_fstat_tmpfs_huge_always)
> +{
> +	int ret = -1;
> +	int fd = -1;
> +	int mfd = -1;
> +	struct stat stat;
> +
> +	fsconfig(self->sfd, FSCONFIG_SET_STRING, "huge", "always", 0);
> +	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +
> +	mfd = fsmount(self->sfd, 0, 0);
> +	ASSERT_NE(mfd, -1);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, mfd);
> +	ASSERT_GT(fd, 0);
> +
> +	/* User can close reference to mount */
> +	ret = close(mfd);
> +	ASSERT_EQ(ret, 0);
> +
> +	ret = fstat(fd, &stat);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(stat.st_blksize, get_hpage_pmd_size());
> +
> +	close(fd);
> +}
> +
> +TEST_F(tmpfs_hugepage_sfd, restrictedmem_fstat_tmpfs_huge_never)
> +{
> +	int ret = -1;
> +	int fd = -1;
> +	int mfd = -1;
> +	struct stat stat;
> +
> +	fsconfig(self->sfd, FSCONFIG_SET_STRING, "huge", "never", 0);
> +	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +
> +	mfd = fsmount(self->sfd, 0, 0);
> +	ASSERT_NE(mfd, -1);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, mfd);
> +	ASSERT_GT(fd, 0);
> +
> +	/* User can close reference to mount */
> +	ret = close(mfd);
> +	ASSERT_EQ(ret, 0);
> +
> +	ret = fstat(fd, &stat);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(stat.st_blksize, getpagesize());
> +
> +	close(fd);
> +}
> +
> +TEST_F(tmpfs_hugepage_sfd, restrictedmem_check_mount_flags)
> +{
> +	int ret = -1;
> +	int fd = -1;
> +	int mfd = -1;
> +
> +	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +
> +	mfd = fsmount(self->sfd, 0, MOUNT_ATTR_RDONLY);
> +	ASSERT_NE(mfd, -1);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, mfd);
> +	ASSERT_EQ(fd, -1);
> +	ASSERT_EQ(errno, EROFS);
> +
> +	ret = close(mfd);
> +	ASSERT_EQ(ret, 0);
> +}
> +
> +TEST_F(tmpfs_hugepage_sfd, restrictedmem_check_superblock_flags)
> +{
> +	int ret = -1;
> +	int fd = -1;
> +	int mfd = -1;
> +
> +	fsconfig(self->sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> +	fsconfig(self->sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +
> +	mfd = fsmount(self->sfd, 0, 0);
> +	ASSERT_NE(mfd, -1);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, mfd);
> +	ASSERT_EQ(fd, -1);
> +	ASSERT_EQ(errno, EROFS);
> +
> +	ret = close(mfd);
> +	ASSERT_EQ(ret, 0);
> +}
> +
> +static bool directory_exists(const char *path)
> +{
> +	struct stat sb;
> +
> +	return stat(path, &sb) == 0 && S_ISDIR(sb.st_mode);
> +}
> +
> +FIXTURE(tmpfs_hugepage_mount_path)
> +{
> +	char *mount_path;
> +};
> +
> +FIXTURE_SETUP(tmpfs_hugepage_mount_path)
> +{
> +	int ret = -1;
> +
> +	/* /tmp is an FHS-mandated world-writable directory */
> +	self->mount_path = "/tmp/restrictedmem-selftest-mnt";
> +
> +	if (!directory_exists(self->mount_path)) {
> +		ret = mkdir(self->mount_path, 0777);
> +		ASSERT_EQ(ret, 0);
> +	}
> +}
> +
> +FIXTURE_TEARDOWN(tmpfs_hugepage_mount_path)
> +{
> +	int ret = -1;
> +
> +	if (!directory_exists(self->mount_path))
> +		return;
> +
> +	ret = umount2(self->mount_path, MNT_FORCE);
> +	EXPECT_EQ(ret, 0);
> +	if (ret == -1 && errno == EINVAL)
> +		fprintf(stderr, "  %s was not mounted\n", self->mount_path);
> +
> +	ret = rmdir(self->mount_path);
> +	EXPECT_EQ(ret, 0);
> +	if (ret == -1)
> +		fprintf(stderr, "  rmdir(%s) failed: %m\n", self->mount_path);
> +}
> +
> +/*
> + * memfd_restricted() syscall can only be used with the fd of the root of the
> + * mount. When the restrictedmem's fd is open, a user should not be able to
> + * unmount or remove the mounted directory
> + */
> +TEST_F(tmpfs_hugepage_mount_path, restrictedmem_umount_rmdir_while_file_open)
> +{
> +	int ret = -1;
> +	int fd = -1;
> +	int mfd = -1;
> +	struct stat stat;
> +
> +	ret = mount("name", self->mount_path, "tmpfs", 0, "huge=always");
> +	ASSERT_EQ(ret, 0);
> +
> +	mfd = open(self->mount_path, O_PATH);
> +	ASSERT_NE(mfd, -1);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, mfd);
> +	ASSERT_GT(fd, 0);
> +
> +	/* We don't need this reference to the mount anymore */
> +	ret = close(mfd);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* restrictedmem's fd should still be usable */
> +	ret = fstat(fd, &stat);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(stat.st_blksize, get_hpage_pmd_size());
> +
> +	/* User should not be able to unmount directory */
> +	ret = umount2(self->mount_path, MNT_FORCE);
> +	ASSERT_EQ(ret, -1);
> +	ASSERT_EQ(errno, EBUSY);
> +
> +	ret = rmdir(self->mount_path);
> +	ASSERT_EQ(ret, -1);
> +	ASSERT_EQ(errno, EBUSY);
> +
> +	close(fd);
> +}
> +
> +/* The fd of a file on the mount cannot be provided as mount_fd */
> +TEST_F(tmpfs_hugepage_mount_path, restrictedmem_provide_fd_of_file)
> +{
> +	int ret = -1;
> +	int fd = -1;
> +	int ffd = -1;
> +	char tmp_file_path[PATH_MAX] = { 0 };
> +
> +	ret = mount("name", self->mount_path, "tmpfs", 0, "huge=always");
> +	ASSERT_EQ(ret, 0);
> +
> +	snprintf(tmp_file_path, PATH_MAX, "%s/tmp-file", self->mount_path);
> +	ret = write_string_to_file(tmp_file_path, "filler\n");
> +	ASSERT_EQ(ret, 0);
> +
> +	ffd = open(tmp_file_path, O_RDWR);
> +	ASSERT_GT(ffd, 0);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, ffd);
> +	ASSERT_LT(fd, 0);
> +	ASSERT_EQ(errno, EINVAL);
> +
> +	ret = close(ffd);
> +	ASSERT_EQ(ret, 0);
> +
> +	close(fd);
> +	remove(tmp_file_path);
> +}
> +
> +/* The fd of files on the mount cannot be provided as mount_fd */
> +TEST_F(tmpfs_hugepage_mount_path, restrictedmem_provide_fd_of_file_in_subdir)
> +{
> +	int ret = -1;
> +	int fd = -1;
> +	int ffd = -1;
> +	char tmp_dir_path[PATH_MAX] = { 0 };
> +	char tmp_file_path[PATH_MAX] = { 0 };
> +
> +	ret = mount("name", self->mount_path, "tmpfs", 0, "huge=always");
> +	ASSERT_EQ(ret, 0);
> +
> +	snprintf(tmp_dir_path, PATH_MAX, "%s/tmp-subdir", self->mount_path);
> +	ret = mkdir(tmp_dir_path, 0777);
> +	ASSERT_EQ(ret, 0);
> +
> +	snprintf(tmp_file_path, PATH_MAX, "%s/tmp-subdir/tmp-file",
> +		 self->mount_path);
> +	ret = write_string_to_file(tmp_file_path, "filler\n");
> +	ASSERT_EQ(ret, 0);
> +
> +	ffd = open(tmp_file_path, O_RDWR);
> +	ASSERT_NE(ffd, -1);
> +
> +	fd = memfd_restricted(RMFD_USERMNT, ffd);
> +	ASSERT_LT(fd, 0);
> +	ASSERT_EQ(errno, EINVAL);
> +
> +	ret = close(ffd);
> +	ASSERT_EQ(ret, 0);
> +
> +	close(fd);
> +	remove(tmp_file_path);
> +	rmdir(tmp_dir_path);
> +}
> +
> +TEST_HARNESS_MAIN

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-03-31 23:50 ` [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted Ackerley Tng
  2023-04-03  8:21   ` David Hildenbrand
@ 2023-04-04  8:25   ` Kirill A. Shutemov
  2023-04-05 22:32     ` Ackerley Tng
  2023-04-04 13:53   ` Christian Brauner
  2 siblings, 1 reply; 398+ messages in thread
From: Kirill A. Shutemov @ 2023-04-04  8:25 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang

On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
> By default, the backing shmem file for a restrictedmem fd is created
> on shmem's kernel space mount.
> 
> With this patch, an optional tmpfs mount can be specified via an fd,
> which will be used as the mountpoint for backing the shmem file
> associated with a restrictedmem fd.
> 
> This will help restrictedmem fds inherit the properties of the
> provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> binding hints, etc.
> 
> Permissions for the fd passed to memfd_restricted() is modeled after
> the openat() syscall, since both of these allow creation of a file
> upon a mount/directory.
> 
> Permission to reference the mount the fd represents is checked upon fd
> creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
> etc) and any process that can present memfd_restricted() with a valid
> fd is expected to have obtained permission to use the mount
> represented by the fd. This behavior is intended to parallel that of
> the openat() syscall.
> 
> memfd_restricted() will check that the tmpfs superblock is
> writable, and that the mount is also writable, before attempting to
> create a restrictedmem file on the mount.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  include/linux/syscalls.h           |  2 +-
>  include/uapi/linux/restrictedmem.h |  8 ++++
>  mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
>  3 files changed, 77 insertions(+), 7 deletions(-)
>  create mode 100644 include/uapi/linux/restrictedmem.h
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f9e9e0c820c5..a23c4c385cd3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> -asmlinkage long sys_memfd_restricted(unsigned int flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
> 
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..22d6f2285f6d
> --- /dev/null
> +++ b/include/uapi/linux/restrictedmem.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
> +#define _UAPI_LINUX_RESTRICTEDMEM_H
> +
> +/* flags for memfd_restricted */
> +#define RMFD_USERMNT		0x0001U
> +
> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index c5d869d8c2d8..f7b62364a31a 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -1,11 +1,12 @@
>  // SPDX-License-Identifier: GPL-2.0
> -#include "linux/sbitmap.h"
> +#include <linux/namei.h>
>  #include <linux/pagemap.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/syscalls.h>
>  #include <uapi/linux/falloc.h>
>  #include <uapi/linux/magic.h>
> +#include <uapi/linux/restrictedmem.h>
>  #include <linux/restrictedmem.h>
> 
>  struct restrictedmem {
> @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>  	return file;
>  }
> 
> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +static int restrictedmem_create(struct vfsmount *mount)
>  {
>  	struct file *file, *restricted_file;
>  	int fd, err;
> 
> -	if (flags)
> -		return -EINVAL;
> -
>  	fd = get_unused_fd_flags(0);
>  	if (fd < 0)
>  		return fd;
> 
> -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +	if (mount)
> +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
> +	else
> +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +
>  	if (IS_ERR(file)) {
>  		err = PTR_ERR(file);
>  		goto err_fd;
> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  	return err;
>  }
> 
> +static bool is_shmem_mount(struct vfsmount *mnt)
> +{
> +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> +}
> +
> +static bool is_mount_root(struct file *file)
> +{
> +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> +}
> +
> +static int restrictedmem_create_on_user_mount(int mount_fd)
> +{
> +	int ret;
> +	struct fd f;
> +	struct vfsmount *mnt;
> +
> +	f = fdget_raw(mount_fd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = -EINVAL;
> +	if (!is_mount_root(f.file))
> +		goto out;
> +
> +	mnt = f.file->f_path.mnt;
> +	if (!is_shmem_mount(mnt))
> +		goto out;
> +
> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);

Why MAY_EXEC?

> +	if (ret)
> +		goto out;
> +
> +	ret = mnt_want_write(mnt);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	ret = restrictedmem_create(mnt);
> +
> +	mnt_drop_write(mnt);
> +out:
> +	fdput(f);
> +
> +	return ret;
> +}

We need review from fs folks. Looks mostly sensible, but I have no
experience in fs.

> +
> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> +{
> +	if (flags & ~RMFD_USERMNT)
> +		return -EINVAL;
> +
> +	if (flags == RMFD_USERMNT) {
> +		if (mount_fd < 0)
> +			return -EINVAL;
> +
> +		return restrictedmem_create_on_user_mount(mount_fd);
> +	} else {
> +		return restrictedmem_create(NULL);
> +	}

Maybe restructure with single restrictedmem_create() call?

	struct vfsmount *mnt = NULL;

	if (flags == RMFD_USERMNT) {
		...
		mnt = ...();
	}

	return restrictedmem_create(mnt);
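
Fleshed out a little (untested sketch; the mount-root/tmpfs/permission
checks from restrictedmem_create_on_user_mount() are elided), that
restructuring might look like:

	SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
	{
		struct fd f = {};
		struct vfsmount *mnt = NULL;
		int ret;

		if (flags & ~RMFD_USERMNT)
			return -EINVAL;

		if (flags & RMFD_USERMNT) {
			f = fdget_raw(mount_fd);
			if (!f.file)
				return -EBADF;

			/* mount-root, tmpfs and permission checks go here */

			mnt = f.file->f_path.mnt;
			ret = mnt_want_write(mnt);
			if (ret)
				goto out;
		}

		ret = restrictedmem_create(mnt);

		if (mnt)
			mnt_drop_write(mnt);
	out:
		fdput(f);
		return ret;
	}
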
> +}
> +
>  int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>  		       struct restrictedmem_notifier *notifier, bool exclusive)
>  {
> --
> 2.40.0.348.gf938b09366-goog

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-03-31 23:50 ` [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted Ackerley Tng
  2023-04-03  8:21   ` David Hildenbrand
  2023-04-04  8:25   ` Kirill A. Shutemov
@ 2023-04-04 13:53   ` Christian Brauner
  2023-04-04 14:58     ` Christian Brauner
  2 siblings, 1 reply; 398+ messages in thread
From: Christian Brauner @ 2023-04-04 13:53 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang

On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
> By default, the backing shmem file for a restrictedmem fd is created
> on shmem's kernel space mount.
> 
> With this patch, an optional tmpfs mount can be specified via an fd,
> which will be used as the mountpoint for backing the shmem file
> associated with a restrictedmem fd.
> 
> This will help restrictedmem fds inherit the properties of the
> provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> binding hints, etc.
> 
> Permissions for the fd passed to memfd_restricted() is modeled after
> the openat() syscall, since both of these allow creation of a file
> upon a mount/directory.
> 
> Permission to reference the mount the fd represents is checked upon fd
> creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
> etc) and any process that can present memfd_restricted() with a valid
> fd is expected to have obtained permission to use the mount
> represented by the fd. This behavior is intended to parallel that of
> the openat() syscall.
> 
> memfd_restricted() will check that the tmpfs superblock is
> writable, and that the mount is also writable, before attempting to
> create a restrictedmem file on the mount.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  include/linux/syscalls.h           |  2 +-
>  include/uapi/linux/restrictedmem.h |  8 ++++
>  mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
>  3 files changed, 77 insertions(+), 7 deletions(-)
>  create mode 100644 include/uapi/linux/restrictedmem.h
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f9e9e0c820c5..a23c4c385cd3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> -asmlinkage long sys_memfd_restricted(unsigned int flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
> 
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..22d6f2285f6d
> --- /dev/null
> +++ b/include/uapi/linux/restrictedmem.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
> +#define _UAPI_LINUX_RESTRICTEDMEM_H
> +
> +/* flags for memfd_restricted */
> +#define RMFD_USERMNT		0x0001U
> +
> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index c5d869d8c2d8..f7b62364a31a 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -1,11 +1,12 @@
>  // SPDX-License-Identifier: GPL-2.0
> -#include "linux/sbitmap.h"
> +#include <linux/namei.h>
>  #include <linux/pagemap.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/syscalls.h>
>  #include <uapi/linux/falloc.h>
>  #include <uapi/linux/magic.h>
> +#include <uapi/linux/restrictedmem.h>
>  #include <linux/restrictedmem.h>
> 
>  struct restrictedmem {
> @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>  	return file;
>  }
> 
> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +static int restrictedmem_create(struct vfsmount *mount)
>  {
>  	struct file *file, *restricted_file;
>  	int fd, err;
> 
> -	if (flags)
> -		return -EINVAL;
> -
>  	fd = get_unused_fd_flags(0);

Any reason the file descriptors aren't O_CLOEXEC by default? I don't
see any reason why we should introduce new fd types that aren't
O_CLOEXEC by default. The "don't mix-and-match" train has already left
the station anyway, as we do have seccomp notifier fds and pidfds, both
of which are O_CLOEXEC by default.
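
(i.e. something like

	fd = get_unused_fd_flags(O_CLOEXEC);

so the new fd doesn't leak across exec by default.)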

>  	if (fd < 0)
>  		return fd;
> 
> -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +	if (mount)
> +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
> +	else
> +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +
>  	if (IS_ERR(file)) {
>  		err = PTR_ERR(file);
>  		goto err_fd;
> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  	return err;
>  }
> 
> +static bool is_shmem_mount(struct vfsmount *mnt)
> +{
> +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;

This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC).

> +}
> +
> +static bool is_mount_root(struct file *file)
> +{
> +	return file->f_path.dentry == file->f_path.mnt->mnt_root;

mount -t tmpfs tmpfs /mnt
touch /mnt/bla
touch /mnt/ble
mount --bind /mnt/bla /mnt/ble
fd = open("/mnt/ble")
fd_restricted = memfd_restricted(fd)

IOW, this doesn't restrict it to the tmpfs root. It only restricts it to
paths that refer to the root of any tmpfs mount. To exclude bind-mounts
that aren't bind-mounts of the whole filesystem you want:

path->dentry == path->mnt->mnt_root && 
path->mnt->mnt_root == path->mnt->mnt_sb->s_root
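
As a sketch (the helper name is made up for illustration), the stricter
check could be folded into something like:

	static bool is_whole_tmpfs_mount_root(const struct path *path)
	{
		return path->dentry == path->mnt->mnt_root &&
		       path->mnt->mnt_root == path->mnt->mnt_sb->s_root &&
		       path->mnt->mnt_sb->s_magic == TMPFS_MAGIC;
	}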

> +}
> +
> +static int restrictedmem_create_on_user_mount(int mount_fd)
> +{
> +	int ret;
> +	struct fd f;
> +	struct vfsmount *mnt;
> +
> +	f = fdget_raw(mount_fd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = -EINVAL;
> +	if (!is_mount_root(f.file))
> +		goto out;
> +
> +	mnt = f.file->f_path.mnt;
> +	if (!is_shmem_mount(mnt))
> +		goto out;
> +
> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);

With the current semantics you're asking whether you have write
permissions on the /mnt/ble file in order to get an answer to the
question of whether you're allowed to create an unlinked restricted
memory file. That doesn't make much sense afaict.

> +	if (ret)
> +		goto out;
> +
> +	ret = mnt_want_write(mnt);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	ret = restrictedmem_create(mnt);
> +
> +	mnt_drop_write(mnt);
> +out:
> +	fdput(f);
> +
> +	return ret;
> +}
> +
> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> +{
> +	if (flags & ~RMFD_USERMNT)
> +		return -EINVAL;
> +
> +	if (flags == RMFD_USERMNT) {

Why do you even need this flag? It seems that @mount_fd being < 0 is
sufficient to indicate that a new restricted memory fd is supposed to be
created in the system instance.

> +		if (mount_fd < 0)
> +			return -EINVAL;
> +
> +		return restrictedmem_create_on_user_mount(mount_fd);
> +	} else {
> +		return restrictedmem_create(NULL);
> +	}
> +}

I have to say that I'm very confused by all of this the more I look at it.

Effectively memfd restricted functions as a wrapper filesystem around
the tmpfs filesystem. This is basically a weird overlay filesystem.
You're allocating tmpfs files that you stash in restrictedmem files. 
I have to say that this seems very hacky. I didn't get this at all at
first.

So what does the caller get if they call statx() on a restricted memfd?
Do they get the device number of the tmpfs mount and the inode numbers
of the tmpfs mount? Because it looks like they would:

static int restrictedmem_getattr(struct user_namespace *mnt_userns,
				 const struct path *path, struct kstat *stat,
				 u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	struct restrictedmem *rm = inode->i_mapping->private_data;
	struct file *memfd = rm->memfd;

	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
					     request_mask, query_flags);

That @memfd would be a struct file allocated in a tmpfs instance, no? So
you'd be calling the inode operation of the tmpfs file meaning that
struct kstat will be filled up with the info from the tmpfs instance.

But then if I call statfs() and check the fstype I would get
RESTRICTEDMEM_MAGIC, no? This is... unorthodox?
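
A hedged userspace illustration of that mismatch, assuming the behaviour
described above and that the RFC's syscall is wired up (the syscall number is
a placeholder, not upstream ABI):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/statfs.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451	/* placeholder for this RFC */
#endif

int main(void)
{
	int fd = syscall(__NR_memfd_restricted, 0, -1);
	struct stat st;
	struct statfs sfs;

	if (fd < 0 || fstat(fd, &st) || fstatfs(fd, &sfs))
		return 1;

	/* st_dev/st_ino come from the tmpfs inode, f_type from restrictedmem. */
	printf("st_dev=%lu st_ino=%lu f_type=%#lx\n",
	       (unsigned long)st.st_dev, (unsigned long)st.st_ino,
	       (unsigned long)sfs.f_type);
	close(fd);
	return 0;
}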

I'm honestly puzzled and this sounds really strange. There must be a
better way to implement all of this.

Shouldn't you try and make this a part of tmpfs proper? Make a really
separate filesystem and add a memfs library that both tmpfs and
restrictedmemfs can use? Add a mount option to tmpfs that makes it a
restricted tmpfs?

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-04-04 13:53   ` Christian Brauner
@ 2023-04-04 14:58     ` Christian Brauner
  2023-04-05 21:58       ` Ackerley Tng
  0 siblings, 1 reply; 398+ messages in thread
From: Christian Brauner @ 2023-04-04 14:58 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang

On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote:
> On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
> > By default, the backing shmem file for a restrictedmem fd is created
> > on shmem's kernel space mount.
> > 
> > With this patch, an optional tmpfs mount can be specified via an fd,
> > which will be used as the mountpoint for backing the shmem file
> > associated with a restrictedmem fd.
> > 
> > This will help restrictedmem fds inherit the properties of the
> > provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> > binding hints, etc.
> > 
> > Permissions for the fd passed to memfd_restricted() are modeled after
> > the openat() syscall, since both of these allow creation of a file
> > upon a mount/directory.
> > 
> > Permission to reference the mount the fd represents is checked upon fd
> > creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
> > etc) and any process that can present memfd_restricted() with a valid
> > fd is expected to have obtained permission to use the mount
> > represented by the fd. This behavior is intended to parallel that of
> > the openat() syscall.
> > 
> > memfd_restricted() will check that the tmpfs superblock is
> > writable, and that the mount is also writable, before attempting to
> > create a restrictedmem file on the mount.
> > 
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> >  include/linux/syscalls.h           |  2 +-
> >  include/uapi/linux/restrictedmem.h |  8 ++++
> >  mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
> >  3 files changed, 77 insertions(+), 7 deletions(-)
> >  create mode 100644 include/uapi/linux/restrictedmem.h
> > 
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index f9e9e0c820c5..a23c4c385cd3 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> >  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> >  					    unsigned long home_node,
> >  					    unsigned long flags);
> > -asmlinkage long sys_memfd_restricted(unsigned int flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
> > 
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..22d6f2285f6d
> > --- /dev/null
> > +++ b/include/uapi/linux/restrictedmem.h
> > @@ -0,0 +1,8 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
> > +#define _UAPI_LINUX_RESTRICTEDMEM_H
> > +
> > +/* flags for memfd_restricted */
> > +#define RMFD_USERMNT		0x0001U
> > +
> > +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > index c5d869d8c2d8..f7b62364a31a 100644
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -1,11 +1,12 @@
> >  // SPDX-License-Identifier: GPL-2.0
> > -#include "linux/sbitmap.h"
> > +#include <linux/namei.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/pseudo_fs.h>
> >  #include <linux/shmem_fs.h>
> >  #include <linux/syscalls.h>
> >  #include <uapi/linux/falloc.h>
> >  #include <uapi/linux/magic.h>
> > +#include <uapi/linux/restrictedmem.h>
> >  #include <linux/restrictedmem.h>
> > 
> >  struct restrictedmem {
> > @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
> >  	return file;
> >  }
> > 
> > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > +static int restrictedmem_create(struct vfsmount *mount)
> >  {
> >  	struct file *file, *restricted_file;
> >  	int fd, err;
> > 
> > -	if (flags)
> > -		return -EINVAL;
> > -
> >  	fd = get_unused_fd_flags(0);
> 
> Any reasons the file descriptors aren't O_CLOEXEC by default? I don't
> see any reasons why we should introduce new fdtypes that aren't
> O_CLOEXEC by default. The "don't mix-and-match" train has already left
> the station anyway as we do have seccomp notifier fds and pidfds both of
> which are O_CLOEXEC by default.
> 
> >  	if (fd < 0)
> >  		return fd;
> > 
> > -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > +	if (mount)
> > +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
> > +	else
> > +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > +
> >  	if (IS_ERR(file)) {
> >  		err = PTR_ERR(file);
> >  		goto err_fd;
> > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> >  	return err;
> >  }
> > 
> > +static bool is_shmem_mount(struct vfsmount *mnt)
> > +{
> > +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> 
> This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC).
> 
> > +}
> > +
> > +static bool is_mount_root(struct file *file)
> > +{
> > +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> 
> mount -t tmpfs tmpfs /mnt
> touch /mnt/bla
> touch /mnt/ble
> mount --bind /mnt/bla /mnt/ble
> fd = open("/mnt/ble")
> fd_restricted = memfd_restricted(fd)
> 
> IOW, this doesn't restrict it to the tmpfs root. It only restricts it to
> paths that refer to the root of any tmpfs mount. To exclude bind-mounts
> that aren't bind-mounts of the whole filesystem you want:
> 
> path->dentry == path->mnt->mnt_root && 
> path->mnt->mnt_root == path->mnt->mnt_sb->s_root
> 
> > +}
> > +
> > +static int restrictedmem_create_on_user_mount(int mount_fd)
> > +{
> > +	int ret;
> > +	struct fd f;
> > +	struct vfsmount *mnt;
> > +
> > +	f = fdget_raw(mount_fd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	ret = -EINVAL;
> > +	if (!is_mount_root(f.file))
> > +		goto out;
> > +
> > +	mnt = f.file->f_path.mnt;
> > +	if (!is_shmem_mount(mnt))
> > +		goto out;
> > +
> > +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
> 
> With the current semantics you're asking whether you have write
> permissions on the /mnt/ble file in order to get an answer to the question
> whether you're allowed to create an unlinked restricted memory file.
> That doesn't make much sense afaict.
> 
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = mnt_want_write(mnt);
> > +	if (unlikely(ret))
> > +		goto out;
> > +
> > +	ret = restrictedmem_create(mnt);
> > +
> > +	mnt_drop_write(mnt);
> > +out:
> > +	fdput(f);
> > +
> > +	return ret;
> > +}
> > +
> > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> > +{
> > +	if (flags & ~RMFD_USERMNT)
> > +		return -EINVAL;
> > +
> > +	if (flags == RMFD_USERMNT) {
> 
> Why do you even need this flag? It seems that @mount_fd being < 0 is
> sufficient to indicate that a new restricted memory fd is supposed to be
> created in the system instance.
> 
> > +		if (mount_fd < 0)
> > +			return -EINVAL;
> > +
> > +		return restrictedmem_create_on_user_mount(mount_fd);
> > +	} else {
> > +		return restrictedmem_create(NULL);
> > +	}
> > +}
> 
> I have to say that I'm very confused by all of this the more I look at it.
> 
> Effectively memfd restricted functions as a wrapper filesystem around
> the tmpfs filesystem. This is basically a weird overlay filesystem.
> You're allocating tmpfs files that you stash in restrictedmem files. 
> I have to say that this seems very hacky. I didn't get this at all at
> first.
> 
> So what does the caller get if they call statx() on a restricted memfd?
> Do they get the device number of the tmpfs mount and the inode numbers
> of the tmpfs mount? Because it looks like they would:
> 
> static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> 				 const struct path *path, struct kstat *stat,
> 				 u32 request_mask, unsigned int query_flags)
> {
> 	struct inode *inode = d_inode(path->dentry);
> 	struct restrictedmem *rm = inode->i_mapping->private_data;
> 	struct file *memfd = rm->memfd;
> 
> 	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,

This is pretty broken btw, because @path refers to a restrictedmem path
which you're passing to a tmpfs iop...

I see that in

	return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
					     request_mask, query_flags);

this is fixed but still, this is... not great.

> 					     request_mask, query_flags);
> 
> That @memfd would be a struct file allocated in a tmpfs instance, no? So
> you'd be calling the inode operation of the tmpfs file meaning that
> struct kstat will be filled up with the info from the tmpfs instance.
> 
> But then if I call statfs() and check the fstype I would get
> RESTRICTEDMEM_MAGIC, no? This is... unorthodox?
> 
> I'm honestly puzzled and this sounds really strange. There must be a
> better way to implement all of this.
> 
> Shouldn't you try and make this a part of tmpfs proper? Make a really
> separate filesystem and add a memfs library that both tmpfs and
> restrictedmemfs can use? Add a mount option to tmpfs that makes it a
> restricted tmpfs?

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-04-04 14:58     ` Christian Brauner
@ 2023-04-05 21:58       ` Ackerley Tng
  2023-04-12  9:59         ` Christian Brauner
  0 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-04-05 21:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang


Thanks again for your review!

Christian Brauner <brauner@kernel.org> writes:
> On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote:
>> On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
>> >
>> > ...
>> >
>> > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>> > +static int restrictedmem_create(struct vfsmount *mount)
>> >  {
>> >  	struct file *file, *restricted_file;
>> >  	int fd, err;
>> >
>> > -	if (flags)
>> > -		return -EINVAL;
>> > -
>> >  	fd = get_unused_fd_flags(0);

>> Any reasons the file descriptors aren't O_CLOEXEC by default? I don't
>> see any reasons why we should introduce new fdtypes that aren't
>> O_CLOEXEC by default. The "don't mix-and-match" train has already left
>> the station anyway as we do have seccomp noitifer fds and pidfds both of
>> which are O_CLOEXEC by default.


Thanks for pointing this out. I agree with using O_CLOEXEC, but didn’t
notice this before. Let us discuss this under the original series at
[1].

>> >  	if (fd < 0)
>> >  		return fd;
>> >
>> > -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
>> > +	if (mount)
>> > +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
>> > +	else
>> > +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
>> > +
>> >  	if (IS_ERR(file)) {
>> >  		err = PTR_ERR(file);
>> >  		goto err_fd;
>> > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>> >  	return err;
>> >  }
>> >
>> > +static bool is_shmem_mount(struct vfsmount *mnt)
>> > +{
>> > +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;

>> This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC).


Will simplify this in the next revision.

>> > +}
>> > +
>> > +static bool is_mount_root(struct file *file)
>> > +{
>> > +	return file->f_path.dentry == file->f_path.mnt->mnt_root;

>> mount -t tmpfs tmpfs /mnt
>> touch /mnt/bla
>> touch /mnt/ble
>> mount --bind /mnt/bla /mnt/ble
>> fd = open("/mnt/ble")
>> fd_restricted = memfd_restricted(fd)

>> IOW, this doesn't restrict it to the tmpfs root. It only restricts it to
>> paths that refer to the root of any tmpfs mount. To exclude bind-mounts
>> that aren't bind-mounts of the whole filesystem you want:

>> path->dentry == path->mnt->mnt_root &&
>> path->mnt->mnt_root == path->mnt->mnt_sb->s_root


Will adopt this in the next revision and add a selftest to check
this. Thanks for pointing this out!
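
A sketch of what such a selftest could look like (flag and a placeholder
syscall number from this RFC; the bind-mount setup from the example above is
assumed to already exist):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451	/* placeholder for this RFC */
#endif
#define RMFD_USERMNT 0x0001U

int main(void)
{
	/* /mnt/ble is a bind-mount of a single tmpfs file, as in the example. */
	int mount_fd = open("/mnt/ble", O_RDONLY);
	long ret;

	if (mount_fd < 0) {
		perror("open");
		return 1;
	}

	ret = syscall(__NR_memfd_restricted, RMFD_USERMNT, mount_fd);

	/* With the stricter root check this is expected to fail with EINVAL. */
	printf("memfd_restricted: ret=%ld errno=%d\n", ret, ret < 0 ? errno : 0);
	close(mount_fd);
	return (ret < 0 && errno == EINVAL) ? 0 : 1;
}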

>> > +}
>> > +
>> > +static int restrictedmem_create_on_user_mount(int mount_fd)
>> > +{
>> > +	int ret;
>> > +	struct fd f;
>> > +	struct vfsmount *mnt;
>> > +
>> > +	f = fdget_raw(mount_fd);
>> > +	if (!f.file)
>> > +		return -EBADF;
>> > +
>> > +	ret = -EINVAL;
>> > +	if (!is_mount_root(f.file))
>> > +		goto out;
>> > +
>> > +	mnt = f.file->f_path.mnt;
>> > +	if (!is_shmem_mount(mnt))
>> > +		goto out;
>> > +
>> > +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);

>> With the current semantics you're asking whether you have write
>> permissions on the /mnt/ble file in order to get an answer to the question
>> whether you're allowed to create an unlinked restricted memory file.
>> That doesn't make much sense afaict.


That's true. Since mnt_want_write() already checks for write permissions
and this syscall creates an unlinked file on the mount, we don't have to
check permissions on the file then. Will remove this in the next
revision!
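
For reference, a hedged sketch of the function with that check dropped
(otherwise unchanged from the RFC):

static int restrictedmem_create_on_user_mount(int mount_fd)
{
	int ret;
	struct fd f;
	struct vfsmount *mnt;

	f = fdget_raw(mount_fd);
	if (!f.file)
		return -EBADF;

	ret = -EINVAL;
	if (!is_mount_root(f.file))
		goto out;

	mnt = f.file->f_path.mnt;
	if (!is_shmem_mount(mnt))
		goto out;

	/*
	 * No file_permission() check: the opened file itself is irrelevant,
	 * and mnt_want_write() below already refuses read-only mounts and
	 * superblocks before an unlinked file is created on the mount.
	 */
	ret = mnt_want_write(mnt);
	if (unlikely(ret))
		goto out;

	ret = restrictedmem_create(mnt);

	mnt_drop_write(mnt);
out:
	fdput(f);

	return ret;
}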

>> > +	if (ret)
>> > +		goto out;
>> > +
>> > +	ret = mnt_want_write(mnt);
>> > +	if (unlikely(ret))
>> > +		goto out;
>> > +
>> > +	ret = restrictedmem_create(mnt);
>> > +
>> > +	mnt_drop_write(mnt);
>> > +out:
>> > +	fdput(f);
>> > +
>> > +	return ret;
>> > +}
>> > +
>> > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
>> > +{
>> > +	if (flags & ~RMFD_USERMNT)
>> > +		return -EINVAL;
>> > +
>> > +	if (flags == RMFD_USERMNT) {

>> Why do you even need this flag? It seems that @mount_fd being < 0 is
>> sufficient to indicate that a new restricted memory fd is supposed to be
>> created in the system instance.


I'm hoping to have this patch series merged after Chao's patch series
introduces the memfd_restricted() syscall [1].

This flag is necessary to indicate the validity of the second argument.

With this flag, we can definitively return an error if the fd is
invalid, which I think is a better experience for the userspace
programmer than if we just silently default to the kernel mount when the
fd provided is invalid.
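
Illustrative call forms under that reasoning (flag and syscall number from
this RFC, not upstream ABI; variable names are placeholders):

	/* Explicit opt-in: mount_fd must be valid, bad fds are reported. */
	fd = syscall(__NR_memfd_restricted, RMFD_USERMNT, tmpfs_mount_fd);

	/* No flag: mount_fd is ignored, the kernel-internal mount is used. */
	fd = syscall(__NR_memfd_restricted, 0, -1);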

>> > +		if (mount_fd < 0)
>> > +			return -EINVAL;
>> > +
>> > +		return restrictedmem_create_on_user_mount(mount_fd);
>> > +	} else {
>> > +		return restrictedmem_create(NULL);
>> > +	}
>> > +}

>> I have to say that I'm very confused by all of this the more I look at  
>> it.

>> Effectively memfd restricted functions as a wrapper filesystem around
>> the tmpfs filesystem. This is basically a weird overlay filesystem.
>> You're allocating tmpfs files that you stash in restrictedmem files.
>> I have to say that this seems very hacky. I didn't get this at all at
>> first.

>> So what does the caller get if they call statx() on a restricted memfd?
>> Do they get the device number of the tmpfs mount and the inode numbers
>> of the tmpfs mount? Because it looks like they would:

>> static int restrictedmem_getattr(struct user_namespace *mnt_userns,
>> 				 const struct path *path, struct kstat *stat,
>> 				 u32 request_mask, unsigned int query_flags)
>> {
>> 	struct inode *inode = d_inode(path->dentry);
>> 	struct restrictedmem *rm = inode->i_mapping->private_data;
>> 	struct file *memfd = rm->memfd;

>> 	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,

> This is pretty broken btw, because @path refers to a restrictedmem path
> which you're passing to a tmpfs iop...

> I see that in

> 	return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
> 					     request_mask, query_flags);

> this is fixed but still, this is... not great.


Thanks, this will be fixed in the next revision by rebasing on Chao's
latest code.

>> 					     request_mask, query_flags);

>> That @memfd would be a struct file allocated in a tmpfs instance, no? So
>> you'd be calling the inode operation of the tmpfs file meaning that
>> struct kstat will be filled up with the info from the tmpfs instance.

>> But then if I call statfs() and check the fstype I would get
>> RESTRICTEDMEM_MAGIC, no? This is... unorthodox?

>> I'm honestly puzzled and this sounds really strange. There must be a
>> better way to implement all of this.

>> Shouldn't you try and make this a part of tmpfs proper? Make a really
>> separate filesystem and add a memfs library that both tmpfs and
>> restrictedmemfs can use? Add a mount option to tmpfs that makes it a
>> restricted tmpfs?

This was discussed earlier in the patch series introducing
memfd_restricted, and this approach was taken to better manage ownership
of the required functionality between the two subsystems. Please see the
discussion beginning at [2].

[1] ->  
https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/.
[2] ->  
https://lore.kernel.org/lkml/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com/

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-04-03  8:21   ` David Hildenbrand
@ 2023-04-05 22:29     ` Ackerley Tng
  0 siblings, 0 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-04-05 22:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, ddutile, dhildenb,
	hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang


Thanks for your review!

David Hildenbrand <david@redhat.com> writes:

> On 01.04.23 01:50, Ackerley Tng wrote:

>> ...

>> diff --git a/include/uapi/linux/restrictedmem.h  
>> b/include/uapi/linux/restrictedmem.h
>> new file mode 100644
>> index 000000000000..22d6f2285f6d
>> --- /dev/null
>> +++ b/include/uapi/linux/restrictedmem.h
>> @@ -0,0 +1,8 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
>> +#define _UAPI_LINUX_RESTRICTEDMEM_H
>> +
>> +/* flags for memfd_restricted */
>> +#define RMFD_USERMNT		0x0001U

> I wonder if we can come up with a more expressive prefix than RMFD.
> Sounds more like "rm fd" ;) Maybe it should better match the
> "memfd_restricted" syscall name, like "MEMFD_RSTD_USERMNT".


RMFD did actually sound vulgar, I'm good with MEMFD_RSTD_USERMNT!

>> +
>> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
>> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
>> index c5d869d8c2d8..f7b62364a31a 100644
>> --- a/mm/restrictedmem.c
>> +++ b/mm/restrictedmem.c
>> @@ -1,11 +1,12 @@
>>    // SPDX-License-Identifier: GPL-2.0
>> -#include "linux/sbitmap.h"

> Looks like an unrelated change?


Will remove this in the next revision.

>> +#include <linux/namei.h>
>>    #include <linux/pagemap.h>
>>    #include <linux/pseudo_fs.h>
>>    #include <linux/shmem_fs.h>
>>    #include <linux/syscalls.h>
>>    #include <uapi/linux/falloc.h>
>>    #include <uapi/linux/magic.h>
>> +#include <uapi/linux/restrictedmem.h>
>>    #include <linux/restrictedmem.h>

>>    struct restrictedmem {
>> @@ -189,19 +190,20 @@ static struct file  
>> *restrictedmem_file_create(struct file *memfd)
>>    	return file;
>>    }

>> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>> +static int restrictedmem_create(struct vfsmount *mount)
>>    {
>>    	struct file *file, *restricted_file;
>>    	int fd, err;

>> -	if (flags)
>> -		return -EINVAL;
>> -
>>    	fd = get_unused_fd_flags(0);
>>    	if (fd < 0)
>>    		return fd;

>> -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
>> +	if (mount)
>> +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
>> +	else
>> +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
>> +
>>    	if (IS_ERR(file)) {
>>    		err = PTR_ERR(file);
>>    		goto err_fd;
>> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>>    	return err;
>>    }

>> +static bool is_shmem_mount(struct vfsmount *mnt)
>> +{
>> +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
>> +}
>> +
>> +static bool is_mount_root(struct file *file)
>> +{
>> +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
>> +}

> I'd inline at least that function, pretty self-explaining.


Will inline this in the next revision.

>> +
>> +static int restrictedmem_create_on_user_mount(int mount_fd)
>> +{
>> +	int ret;
>> +	struct fd f;
>> +	struct vfsmount *mnt;
>> +
>> +	f = fdget_raw(mount_fd);
>> +	if (!f.file)
>> +		return -EBADF;
>> +
>> +	ret = -EINVAL;
>> +	if (!is_mount_root(f.file))
>> +		goto out;
>> +
>> +	mnt = f.file->f_path.mnt;
>> +	if (!is_shmem_mount(mnt))
>> +		goto out;
>> +
>> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = mnt_want_write(mnt);
>> +	if (unlikely(ret))
>> +		goto out;
>> +
>> +	ret = restrictedmem_create(mnt);
>> +
>> +	mnt_drop_write(mnt);
>> +out:
>> +	fdput(f);
>> +
>> +	return ret;
>> +}
>> +
>> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
>> +{
>> +	if (flags & ~RMFD_USERMNT)
>> +		return -EINVAL;
>> +
>> +	if (flags == RMFD_USERMNT) {
>> +		if (mount_fd < 0)
>> +			return -EINVAL;
>> +
>> +		return restrictedmem_create_on_user_mount(mount_fd);
>> +	} else {
>> +		return restrictedmem_create(NULL);
>> +	}


> You can drop the else case:

> if (flags == RMFD_USERMNT) {
> 	...
> 	return restrictedmem_create_on_user_mount(mount_fd);
> }
> return restrictedmem_create(NULL);


I'll be refactoring this to adopt Kirill's suggestion of using a single
restrictedmem_create(mnt) call.


> I do wonder if you want to properly check for a flag instead of
> comparing values. Results in a more natural way to deal with flags:

> if (flags & RMFD_USERMNT) {

> }


Will use this in the next revision.

>> +}
>> +
>>    int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>>    		       struct restrictedmem_notifier *notifier, bool exclusive)
>>    {

> The "memfd_restricted" vs. "restrictedmem" terminology is a bit
> unfortunate, but not your fault here.


> I'm not a FS person, but it does look good to me.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-04-04  8:25   ` Kirill A. Shutemov
@ 2023-04-05 22:32     ` Ackerley Tng
  0 siblings, 0 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-04-05 22:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang


Thanks for reviewing these patches!

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:

>> ...

>> +static int restrictedmem_create_on_user_mount(int mount_fd)
>> +{
>> +	int ret;
>> +	struct fd f;
>> +	struct vfsmount *mnt;
>> +
>> +	f = fdget_raw(mount_fd);
>> +	if (!f.file)
>> +		return -EBADF;
>> +
>> +	ret = -EINVAL;
>> +	if (!is_mount_root(f.file))
>> +		goto out;
>> +
>> +	mnt = f.file->f_path.mnt;
>> +	if (!is_shmem_mount(mnt))
>> +		goto out;
>> +
>> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);

> Why MAY_EXEC?


Christian pointed out that this check does not make sense, I'll be
removing the entire check in the next revision.

>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = mnt_want_write(mnt);
>> +	if (unlikely(ret))
>> +		goto out;
>> +
>> +	ret = restrictedmem_create(mnt);
>> +
>> +	mnt_drop_write(mnt);
>> +out:
>> +	fdput(f);
>> +
>> +	return ret;
>> +}

> We need review from fs folks. Look mostly sensible, but I have no
> experience in fs.

>> +
>> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
>> +{
>> +	if (flags & ~RMFD_USERMNT)
>> +		return -EINVAL;
>> +
>> +	if (flags == RMFD_USERMNT) {
>> +		if (mount_fd < 0)
>> +			return -EINVAL;
>> +
>> +		return restrictedmem_create_on_user_mount(mount_fd);
>> +	} else {
>> +		return restrictedmem_create(NULL);
>> +	}

> Maybe restructure with single restrictedmem_create() call?

> 	struct vfsmount *mnt = NULL;

> 	if (flags == RMFD_USERMNT) {
> 		...
> 		mnt = ...();
> 	}

> 	return restrictedmem_create(mnt);

Will do so in the next revision.
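
A hedged sketch of that restructuring; restrictedmem_get_user_mount() is a
hypothetical helper that would do the root/tmpfs/writability checks and pin
the mount, with the corresponding release elided here:

SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
{
	struct vfsmount *mnt = NULL;

	if (flags & ~RMFD_USERMNT)
		return -EINVAL;

	if (flags & RMFD_USERMNT) {
		if (mount_fd < 0)
			return -EINVAL;

		/* Hypothetical: validate mount_fd, return the pinned mount. */
		mnt = restrictedmem_get_user_mount(mount_fd);
		if (IS_ERR(mnt))
			return PTR_ERR(mnt);
	}

	return restrictedmem_create(mnt);
}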

>> +}
>> +
>>   int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>>   		       struct restrictedmem_notifier *notifier, bool exclusive)
>>   {
>> --
>> 2.40.0.348.gf938b09366-goog

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 2/2] selftests: restrictedmem: Check hugepage-ness of shmem file backing restrictedmem fd
  2023-04-03  8:24   ` David Hildenbrand
@ 2023-04-11  1:35     ` Ackerley Tng
  0 siblings, 0 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-04-11  1:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, ddutile, dhildenb,
	hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang

David Hildenbrand <david@redhat.com> writes:

> On 01.04.23 01:50, Ackerley Tng wrote:
>> For memfd_restricted() calls without a userspace mount, the backing
>> file should be the shmem mount in the kernel, and the size of backing
>> pages should be as defined by system-wide shmem configuration.

>> If a userspace mount is provided, the size of backing pages should be
>> as defined in the mount.

>> Also includes negative tests for invalid inputs, including fds
>> representing read-only superblocks/mounts.


> When you talk about "hugepage" in this patch, do you mean THP or
> hugetlb? I suspect thp, so it might be better to spell that out. IIRC,
> there are plans to support actual huge pages in the future, at which
> point "hugepage" terminology could be misleading.


Thanks for pointing this out! I've replaced references to hugepage with
thp, please see RFC v4 at
https://lore.kernel.org/lkml/cover.1681176340.git.ackerleytng@google.com/T/

>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>    tools/testing/selftests/Makefile              |   1 +
>>    .../selftests/restrictedmem/.gitignore        |   3 +
>>    .../testing/selftests/restrictedmem/Makefile  |  15 +
>>    .../testing/selftests/restrictedmem/common.c  |   9 +
>>    .../testing/selftests/restrictedmem/common.h  |   8 +
>>    .../restrictedmem_hugepage_test.c             | 486 ++++++++++++++++++
>>    6 files changed, 522 insertions(+)
>>    create mode 100644 tools/testing/selftests/restrictedmem/.gitignore
>>    create mode 100644 tools/testing/selftests/restrictedmem/Makefile
>>    create mode 100644 tools/testing/selftests/restrictedmem/common.c
>>    create mode 100644 tools/testing/selftests/restrictedmem/common.h
>>    create mode 100644  
>> tools/testing/selftests/restrictedmem/restrictedmem_hugepage_test.c

>> ...


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-04-05 21:58       ` Ackerley Tng
@ 2023-04-12  9:59         ` Christian Brauner
  2023-04-13 22:53           ` Ackerley Tng
  0 siblings, 1 reply; 398+ messages in thread
From: Christian Brauner @ 2023-04-12  9:59 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang

On Wed, Apr 05, 2023 at 09:58:44PM +0000, Ackerley Tng wrote:
> 
> Thanks again for your review!
> 
> Christian Brauner <brauner@kernel.org> writes:
> > On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote:
> > > On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
> > > >
> > > > ...
> > > >
> > > > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > > > +static int restrictedmem_create(struct vfsmount *mount)
> > > >  {
> > > >  	struct file *file, *restricted_file;
> > > >  	int fd, err;
> > > >
> > > > -	if (flags)
> > > > -		return -EINVAL;
> > > > -
> > > >  	fd = get_unused_fd_flags(0);
> 
> > > Any reasons the file descriptors aren't O_CLOEXEC by default? I don't
> > > see any reasons why we should introduce new fdtypes that aren't
> > > O_CLOEXEC by default. The "don't mix-and-match" train has already left
> > > the station anyway as we do have seccomp notifier fds and pidfds both of
> > > which are O_CLOEXEC by default.
> 
> 
> Thanks for pointing this out. I agree with using O_CLOEXEC, but didn’t
> notice this before. Let us discuss this under the original series at
> [1].
> 
> > > >  	if (fd < 0)
> > > >  		return fd;
> > > >
> > > > -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > > > +	if (mount)
> > > > +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem",
> > > 0, VM_NORESERVE);
> > > > +	else
> > > > +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > > > +
> > > >  	if (IS_ERR(file)) {
> > > >  		err = PTR_ERR(file);
> > > >  		goto err_fd;
> > > > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned
> > > int, flags)
> > > >  	return err;
> > > >  }
> > > >
> > > > +static bool is_shmem_mount(struct vfsmount *mnt)
> > > > +{
> > > > +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> 
> > > This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC).
> 
> 
> Will simplify this in the next revision.
> 
> > > > +}
> > > > +
> > > > +static bool is_mount_root(struct file *file)
> > > > +{
> > > > +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> 
> > > mount -t tmpfs tmpfs /mnt
> > > touch /mnt/bla
> > > touch /mnt/ble
> > > mount --bind /mnt/bla /mnt/ble
> > > fd = open("/mnt/ble")
> > > fd_restricted = memfd_restricted(fd)
> 
> > > IOW, this doesn't restrict it to the tmpfs root. It only restricts it to
> > > paths that refer to the root of any tmpfs mount. To exclude bind-mounts
> > > that aren't bind-mounts of the whole filesystem you want:
> 
> > > path->dentry == path->mnt->mnt_root &&
> > > path->mnt->mnt_root == path->mnt->mnt_sb->s_root
> 
> 
> Will adopt this in the next revision and add a selftest to check
> this. Thanks for pointing this out!
> 
> > > > +}
> > > > +
> > > > +static int restrictedmem_create_on_user_mount(int mount_fd)
> > > > +{
> > > > +	int ret;
> > > > +	struct fd f;
> > > > +	struct vfsmount *mnt;
> > > > +
> > > > +	f = fdget_raw(mount_fd);
> > > > +	if (!f.file)
> > > > +		return -EBADF;
> > > > +
> > > > +	ret = -EINVAL;
> > > > +	if (!is_mount_root(f.file))
> > > > +		goto out;
> > > > +
> > > > +	mnt = f.file->f_path.mnt;
> > > > +	if (!is_shmem_mount(mnt))
> > > > +		goto out;
> > > > +
> > > > +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
> 
> > > With the current semantics you're asking whether you have write
> > > permissions on the /mnt/ble file in order to get an answer to the question
> > > whether you're allowed to create an unlinked restricted memory file.
> > > That doesn't make much sense afaict.
> 
> 
> That's true. Since mnt_want_write() already checks for write permissions
> and this syscall creates an unlinked file on the mount, we don't have to
> check permissions on the file then. Will remove this in the next
> revision!
> 
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > > +	ret = mnt_want_write(mnt);
> > > > +	if (unlikely(ret))
> > > > +		goto out;
> > > > +
> > > > +	ret = restrictedmem_create(mnt);
> > > > +
> > > > +	mnt_drop_write(mnt);
> > > > +out:
> > > > +	fdput(f);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> > > > +{
> > > > +	if (flags & ~RMFD_USERMNT)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (flags == RMFD_USERMNT) {
> 
> > > Why do you even need this flag? It seems that @mount_fd being < 0 is
> > > sufficient to indicate that a new restricted memory fd is supposed to be
> > > created in the system instance.
> 
> 
> I'm hoping to have this patch series merged after Chao's patch series
> introduces the memfd_restricted() syscall [1].

I'm curious, is there an LSFMM session for this?

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-03-23  1:27               ` Michael Roth
  2023-03-24  2:13                 ` Chao Peng
@ 2023-04-12 22:01                 ` Sean Christopherson
  1 sibling, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-12 22:01 UTC (permalink / raw)
  To: Michael Roth
  Cc: Chao Peng, Isaku Yamahata, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, wei.w.wang

On Wed, Mar 22, 2023, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem binding/unbinding
> > >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for issuing invalidations
> > 
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind
> > and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> > all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
> 
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.

Comically belated +1, XArray is the odd one here.
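
A minimal sketch of that convention (names illustrative, not the series'
code): keep an exclusive 'end' in the restrictedmem API and convert to
XArray's inclusive last index only at the call site:

static void restrictedmem_invalidate_range(struct xarray *bindings,
					   pgoff_t start, pgoff_t end)
{
	unsigned long index;
	void *entry;

	/* XArray wants an inclusive last index, hence 'end - 1'. */
	xa_for_each_range(bindings, index, entry, start, end - 1) {
		/* notify/unmap the binding stored in 'entry' */
	}
}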

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-25 12:53       ` Kirill A. Shutemov
  2023-01-25 16:01         ` Liam Merwick
@ 2023-04-13  1:07         ` Sean Christopherson
  2023-04-13 16:04           ` Kirill A. Shutemov
  1 sibling, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-04-13  1:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Liam Merwick, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang

On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, a machine check can happen, which can further
> > > > > crash the running host system; this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > > 
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > reviews are always welcome.
> > > > >    - 01: mm change, target for mm tree
> > > > >    - 02-09: KVM change, target for KVM tree
> > > > 
> > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > is available here:
> > > > 
> > > >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > 
> > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > a WIP.
> > > > 
> > > 
> > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > bits (and also with Sean's branch above) I encounter the following NULL
> > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > (100% reproducible).
> > > 
> > > It appears that in restrictedmem_error_page()
> > > inode->i_mapping->private_data is NULL in the
> > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > don't know why.
> > 
> > Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> 
> The patch below should help.
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>  
>  	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> -		struct restrictedmem *rm = inode->i_mapping->private_data;
>  		struct restrictedmem_notifier *notifier;
> -		struct file *memfd = rm->memfd;
> +		struct restrictedmem *rm;
>  		unsigned long index;
> +		struct file *memfd;
>  
> -		if (memfd->f_mapping != mapping)
> +		if (atomic_read(&inode->i_count))

Kirill, should this be

		if (!atomic_read(&inode->i_count))
			continue;

i.e. skip unreferenced inodes, not skip referenced inodes?

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                     ` (4 preceding siblings ...)
  2023-02-16  9:51   ` Nikunj A. Dadhania
@ 2023-04-13 15:25   ` Christian Brauner
  2023-04-13 22:28     ` Sean Christopherson
  2023-04-13 17:22   ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Ackerley Tng
  6 siblings, 1 reply; 398+ messages in thread
From: Christian Brauner @ 2023-04-13 15:25 UTC (permalink / raw)
  To: Kirill A . Shutemov, Ackerley Tng, Chao Peng
  Cc: Hugh Dickins, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, Gupta, Pankaj, linux-arch, arnd, linmiaohe,
	naoya.horiguchi, tabba, wei.w.wang

On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > On Wed, 6 Jul 2022, Chao Peng wrote:
> > > This is the v7 of this series which tries to implement the fd-based KVM
> > > guest private memory.
> > 
> > Here at last are my reluctant thoughts on this patchset.
> > 
> > fd-based approach for supporting KVM guest private memory: fine.
> > 
> > Use or abuse of memfd and shmem.c: mistaken.
> > 
> > memfd_create() was an excellent way to put together the initial prototype.
> > 
> > But since then, TDX in particular has forced an effort into preventing
> > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > 
> > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> > 
> > What use do you have for a filesystem here?  Almost none.
> > IIUC, what you want is an fd through which QEMU can allocate kernel
> > memory, selectively free that memory, and communicate fd+offset+length
> > to KVM.  And perhaps an interface to initialize a little of that memory
> > from a template (presumably copied from a real file on disk somewhere).
> > 
> > You don't need shmem.c or a filesystem for that!
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping
> is theoretically possible, but I'm not aware of any plans as of now.
> 
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> 
> > Some of these impressions may come from earlier iterations of the
> > patchset (v7 looks better in several ways than v5).  I am probably
> > underestimating the extent to which you have taken on board other
> > usages beyond TDX and SEV private memory, and rightly want to serve
> > them all with similar interfaces: perhaps there is enough justification
> > for shmem there, but I don't see it.  There was mention of userfaultfd
> > in one link: does that provide the justification for using shmem?
> > 
> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. The combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be a separate struct memfile_backing_store.
> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).
> 
> > Here's what I would prefer, and imagine much easier for you to maintain;
> > but I'm no system designer, and may be misunderstanding throughout.
> > 
> > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > the fallocate syscall interface itself) to allocate and free the memory,
> > ioctl for initializing some of it too.  KVM in control of whether that
> > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > in shmem.c, no need for flags, seals, notifications to and fro because
> > KVM is already in control and knows the history.  If shmem actually has
> > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > add, just allocate and free kernel memory directly, recorded in your
> > own xarray.
> 
> I guess a shim layer on top of shmem *can* work. I don't see immediately why
> it would not. But I'm not sure it is the right direction. We risk creating yet
> another parallel VM with its own rules/locking/accounting that is opaque to
> core-mm.

Sorry for necrobumping this thread but I've been reviewing the
memfd_restricted() extension that Ackerley is currently working on. I
was pointed to this thread as this is what the extension is building
on but I'll reply to both threads here.

From a glance at v10, memfd_restricted() is currently implemented as an
in-kernel stacking filesystem. A call to memfd_restricted() creates a
new restricted memfd file and a new unlinked tmpfs file and stashes the
tmpfs file into the memfd file's private data member. It then uses the
tmpfs file's f_ops and i_ops to perform the relevant file and inode
operations. So it has the same callstack as a general stacking
filesystem like overlayfs in some cases:

        memfd_restricted->getattr()
        -> tmpfs->getattr()
        
The extension that Ackerley is now proposing is to allow passing in a
tmpfs file descriptor explicitly to identify the tmpfs instance in which
to allocate the tmpfs file that is stashed in the restricted memfd file.

So in the ->getattr() callstack I mentioned above this patchset
currently does:

        static int restrictedmem_getattr(struct user_namespace *mnt_userns,
                                        const struct path *path, struct kstat *stat,
                                        u32 request_mask, unsigned int query_flags)
        {
               struct inode *inode = d_inode(path->dentry);
               struct restrictedmem_data *data = inode->i_mapping->private_data;
               struct file *memfd = data->memfd;
        
               return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
                                                    request_mask, query_flags);
        }

There's a bug in here that I mentioned in another thread and I see that
Ackerley has mentioned as well in
https://lore.kernel.org/lkml/diqzzga0fv96.fsf@ackerleytng-cloudtop-sg.c.googlers.com
namely that this is passing a restricted memfd struct path to a tmpfs
inode operation which is very wrong.

But also in the current implementation - I mentioned this in the other
thread as well - when you call stat() on a restricted memfd file
descriptor you get all the information about the underlying tmpfs inode.
Specifically this includes the device number and inode number.

But when you call statfs() then you get a report that this is a memfd
restricted filesystem which somehow shares the device number with a
tmpfs instance. That's messy.

Since you're effectively acting like a stacking filesystem, you should
really use the device number of your memfd restricted filesystem. IOW,
something like:

        stat->dev = memfd_restricted_dentry->d_sb->s_dev;

But then you run into trouble if you want to go forward with Ackerley's
extension that allows to explicitly pass in tmpfs fds to
memfd_restricted(). Afaict, two tmpfs instances might allocate the same
inode number. So now the inode and device number pair isn't unique
anymore.

So you might best be served by allocating and reporting your own inode
numbers as well.
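
One possible shape of that, as a hedged sketch (the split between tmpfs
attributes and wrapper identity is an assumption, not the RFC's code):

static int restrictedmem_getattr(struct user_namespace *mnt_userns,
				 const struct path *path, struct kstat *stat,
				 u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	struct restrictedmem *rm = inode->i_mapping->private_data;
	struct file *memfd = rm->memfd;
	int ret;

	/* Let tmpfs fill in size, blocks, timestamps, ... */
	ret = memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
					    request_mask, query_flags);
	if (ret)
		return ret;

	/* ... but report the wrapper's own identity, consistent with statfs(). */
	stat->dev = path->dentry->d_sb->s_dev;
	stat->ino = inode->i_ino;
	return 0;
}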

But if you want to preserve the inode number and device number of the
relevant tmpfs instance but still report memfd restricted as your
filesystem type then I think it's reasonable to ask whether a stacking
implementation really makes sense here.

If you extend memfd_restricted() or even consider extending it in the
future to take tmpfs file descriptors as arguments to identify the tmpfs
instance in which to allocate the underlying tmpfs file for the new
restricted memfd file you should really consider a tmpfs based
implementation.

Because at that point it just feels like a pointless wrapper to get
custom f_ops and i_ops. Plus it's wasteful because you allocate dentries
and inodes that you don't really care about at all.

Just off the top of my head, you might be better served:
* by a new ioctl() on tmpfs instances that
  yield regular tmpfs file descriptors with restricted f_ops and i_ops.
  That's not that different from btrfs subvolumes which effectively are
  directories but are created through an ioctl().
* by a mount option to tmpfs that makes it act
  in this restricted manner then you don't need an ioctl() and can get
  away with regular open calls. Such a tmpfs instance would only create
  regular, restricted memfds.

I think that, especially with the possibility of an extension that allows
you to inherit tmpfs properties by allocating the restricted memfd file in
a specific tmpfs instance, the argument that you're not really making use
of tmpfs has gone out of the window.

> 
> Note that on machines that run TDX guests such memory would likely be the
> bulk of memory use. Treating it as a fringe case may bite us one day.
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov

On Wed, Apr 05, 2023 at 09:58:44PM +0000, Ackerley Tng wrote:
> 
> Thanks again for your review!
> 
> Christian Brauner <brauner@kernel.org> writes:
> > On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote:
> > > On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
> > > >
> > > > ...
> > > >
> > > > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > > > +static int restrictedmem_create(struct vfsmount *mount)
> > > >  {
> > > >  	struct file *file, *restricted_file;
> > > >  	int fd, err;
> > > >
> > > > -	if (flags)
> > > > -		return -EINVAL;
> > > > -
> > > >  	fd = get_unused_fd_flags(0);
> 
> > > Any reasons the file descriptors aren't O_CLOEXEC by default? I don't
> > > see any reasons why we should introduce new fdtypes that aren't
> > > O_CLOEXEC by default. The "don't mix-and-match" train has already left
> > > the station anyway as we do have seccomp notifier fds and pidfds both of
> > > which are O_CLOEXEC by default.
> 
> 
> Thanks for pointing this out. I agree with using O_CLOEXEC, but didn’t
> notice this before. Let us discuss this under the original series at
> [1].
> 
> > > >  	if (fd < 0)
> > > >  		return fd;
> > > >
> > > > -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > > > +	if (mount)
> > > > +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem",
> > > 0, VM_NORESERVE);
> > > > +	else
> > > > +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > > > +
> > > >  	if (IS_ERR(file)) {
> > > >  		err = PTR_ERR(file);
> > > >  		goto err_fd;
> > > > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned
> > > int, flags)
> > > >  	return err;
> > > >  }
> > > >
> > > > +static bool is_shmem_mount(struct vfsmount *mnt)
> > > > +{
> > > > +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> 
> > > This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC).
> 
> 
> Will simplify this in the next revision.
> 
> > > > +}
> > > > +
> > > > +static bool is_mount_root(struct file *file)
> > > > +{
> > > > +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> 
> > > mount -t tmpfs tmpfs /mnt
> > > touch /mnt/bla
> > > touch /mnt/ble
> > > mount --bind /mnt/bla /mnt/ble
> > > fd = open("/mnt/ble")
> > > fd_restricted = memfd_restricted(fd)
> 
> > > IOW, this doesn't restrict it to the tmpfs root. It only restricts it to
> > > paths that refer to the root of any tmpfs mount. To exclude bind-mounts
> > > that aren't bind-mounts of the whole filesystem you want:
> 
> > > path->dentry == path->mnt->mnt_root &&
> > > path->mnt->mnt_root == path->mnt->mnt_sb->s_root
> 
> 
> Will adopt this in the next revision and add a selftest to check
> this. Thanks for pointing this out!
> 
> > > > +}
> > > > +
> > > > +static int restrictedmem_create_on_user_mount(int mount_fd)
> > > > +{
> > > > +	int ret;
> > > > +	struct fd f;
> > > > +	struct vfsmount *mnt;
> > > > +
> > > > +	f = fdget_raw(mount_fd);
> > > > +	if (!f.file)
> > > > +		return -EBADF;
> > > > +
> > > > +	ret = -EINVAL;
> > > > +	if (!is_mount_root(f.file))
> > > > +		goto out;
> > > > +
> > > > +	mnt = f.file->f_path.mnt;
> > > > +	if (!is_shmem_mount(mnt))
> > > > +		goto out;
> > > > +
> > > > +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
> 
> > > With the current semantics you're asking whether you have write
> > > permissions on the /mnt/ble file in order to get an answer to the question
> > > whether you're allowed to create an unlinked restricted memory file.
> > > That doesn't make much sense afaict.
> 
> 
> That's true. Since mnt_want_write() already checks for write permissions
> and this syscall creates an unlinked file on the mount, we don't have to
> check permissions on the file then. Will remove this in the next
> revision!
> 
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > > +	ret = mnt_want_write(mnt);
> > > > +	if (unlikely(ret))
> > > > +		goto out;
> > > > +
> > > > +	ret = restrictedmem_create(mnt);
> > > > +
> > > > +	mnt_drop_write(mnt);
> > > > +out:
> > > > +	fdput(f);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> > > > +{
> > > > +	if (flags & ~RMFD_USERMNT)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (flags == RMFD_USERMNT) {
> 
> > > Why do you even need this flag? It seems that @mount_fd being < 0 is
> > > sufficient to indicate that a new restricted memory fd is supposed to be
> > > created in the system instance.
> 
> 
> I'm hoping to have this patch series merged after Chao's patch series
> introduces the memfd_restricted() syscall [1].
> 
> This flag is necessary to indicate the validity of the second argument.
> 
> With this flag, we can definitively return an error if the fd is
> invalid, which I think is a better experience for the userspace
> programmer than if we just silently default to the kernel mount when the
> fd provided is invalid.
> 
> > > > +		if (mount_fd < 0)
> > > > +			return -EINVAL;
> > > > +
> > > > +		return restrictedmem_create_on_user_mount(mount_fd);
> > > > +	} else {
> > > > +		return restrictedmem_create(NULL);
> > > > +	}
> > > > +}
> 
> > > I have to say that I'm very confused by all of this the more I look
> > > at it.
> 
> > > Effectively memfd restricted functions as a wrapper filesystem around
> > > the tmpfs filesystem. This is basically a weird overlay filesystem.
> > > You're allocating tmpfs files that you stash in restrictedmem files.
> > > I have to say that this seems very hacky. I didn't get this at all at
> > > first.
> 
> > > So what does the caller get if they call statx() on a restricted memfd?
> > > Do they get the device number of the tmpfs mount and the inode numbers
> > > of the tmpfs mount? Because it looks like they would:
> 
> > > static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> > > 				 const struct path *path, struct kstat *stat,
> > > 				 u32 request_mask, unsigned int query_flags)
> > > {
> > > 	struct inode *inode = d_inode(path->dentry);
> > > 	struct restrictedmem *rm = inode->i_mapping->private_data;
> > > 	struct file *memfd = rm->memfd;
> 
> > > 	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> 
> > This is pretty broken btw, because @path refers to a restrictedmem path
> > which you're passing to a tmpfs iop...
> 
> > I see that in
> 
> > 	return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
> > 					     request_mask, query_flags);
> 
> > this is fixed but still, this is... not great.
> 
> 
> Thanks, this will be fixed in the next revision by rebasing on Chao's
> latest code.
> 
> > > 					     request_mask, query_flags);
> 
> > > That @memfd would be a struct file allocated in a tmpfs instance, no? So
> > > you'd be calling the inode operation of the tmpfs file meaning that
> > > struct kstat will be filled up with the info from the tmpfs instance.
> 
> > > But then if I call statfs() and check the fstype I would get
> > > RESTRICTEDMEM_MAGIC, no? This is... unorthodox?
> 
> > > I'm honestly puzzled and this sounds really strange. There must be a
> > > better way to implement all of this.
> 
> > > Shouldn't you try and make this a part of tmpfs proper? Make a really
> > > separate filesystem and add a memfs library that both tmpfs and
> > > restrictedmemfs can use? Add a mount option to tmpfs that makes it a
> > > restricted tmpfs?
> 
> This was discussed earlier in the patch series introducing
> memfd_restricted, and this approach was taken to better manage ownership
> of the required functionality between the two subsystems. Please see the
> discussion beginning at [2].
> 
> [1] -> https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/.
> [2] ->
> https://lore.kernel.org/lkml/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com/

On Fri, Dec 02, 2022 at 02:13:39PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through the new in-kernel interface by a third kernel module.
> 
> memfd_restricted() is useful for scenarios where a file descriptor (fd)
> can be used as an interface into mm but we want to restrict userspace's
> ability to operate on the fd. Initially it is designed to provide
> protections for KVM encrypted guest memory.
> 
> Normally KVM uses memfd memory by mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmapped virtual address to set up the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with a special key for a particular software domain (e.g. a KVM guest)
> and is not expected to be directly accessed by userspace. More precisely,
> userspace access to such encrypted memory may crash the host, so it must
> be prevented.
> 
> memfd_restricted() provides the semantics required for KVM guest
> encrypted memory support: an fd created with memfd_restricted() is used
> as the source of guest memory in a confidential computing environment,
> and KVM can interact directly with core-mm without needing to expose the
> memory content to KVM userspace.
> 
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
> 
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When hole-punched, KVM is notified through the
> invalidate_start/invalidate_end() callbacks and then gets a chance to
> remove any mapped entries for the range from the secondary page tables.
> 
> A machine check can happen for memory pages in the restricted memfd.
> Instead of routing this directly to userspace, we call the error()
> callback that KVM registered; KVM then gets a chance to handle it
> correctly.
> 
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable; this is required for the current
> confidential usage, but it might change in the future.
> 
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bits in 'flags', it can be extended to support other
> restricted semantics in the future.
> 
> The system call is currently wired up for x86 arch.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  71 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/memory-failure.c                    |   3 +
>  mm/restrictedmem.c                     | 318 +++++++++++++++++++++++++
>  11 files changed, 408 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448	i386	process_mrelease	sys_process_mrelease
>  449	i386	futex_waitv		sys_futex_waitv
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	i386	memfd_restricted	sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	memfd_restricted	sys_memfd_restricted
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..c2700c5daa43
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +				 pgoff_t start, pgoff_t end);
> +	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +			       pgoff_t start, pgoff_t end);
> +	void (*error)(struct restrictedmem_notifier *notifier,
> +			       pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +	struct list_head list;
> +	const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +					 struct page **pagep, int *order)
> +{
> +	return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return false;
> +}
> +
> +static inline void restrictedmem_error_page(struct page *page,
> +					    struct address_space *mapping)
> +{
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>  
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
>  
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>  
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 57e1d8c5b505..06b0e1d6b8c1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>  
> +config RESTRICTEDMEM
> +	bool
> +	depends on TMPFS
> +
>  config ANON_VMA_NAME
>  	bool "Anonymous VMA name support"
>  	depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 8e105e5b3e29..bcbb0edf9ba1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 145bb561ddb3..f91b444e471e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -62,6 +62,7 @@
>  #include <linux/page-isolation.h>
>  #include <linux/pagewalk.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/restrictedmem.h>
>  #include "swap.h"
>  #include "internal.h"
>  #include "ras/ras_event.h"
> @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
>  		goto out;
>  	}
>  
> +	restrictedmem_error_page(p, mapping);
> +
>  	/*
>  	 * The shmem page is kept in page cache instead of truncating
>  	 * so is expected to have an extra refcount after error-handling.
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..56953c204e5c
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,318 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +	struct mutex lock;
> +	struct file *memfd;
> +	struct list_head notifiers;
> +};
> +
> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> +					   pgoff_t start, pgoff_t end)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);
> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		notifier->ops->invalidate_start(notifier, start, end);
> +	}
> +	mutex_unlock(&data->lock);
> +}
> +
> +static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
> +					 pgoff_t start, pgoff_t end)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);
> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		notifier->ops->invalidate_end(notifier, start, end);
> +	}
> +	mutex_unlock(&data->lock);
> +}
> +
> +static void restrictedmem_notifier_error(struct restrictedmem_data *data,
> +					 pgoff_t start, pgoff_t end)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);
> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		notifier->ops->error(notifier, start, end);
> +	}
> +	mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +	fput(data->memfd);
> +	kfree(data);
> +	return 0;
> +}
> +
> +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> +				     loff_t offset, loff_t len)
> +{
> +	int ret;
> +	pgoff_t start, end;
> +	struct file *memfd = data->memfd;
> +
> +	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +		return -EINVAL;
> +
> +	start = offset >> PAGE_SHIFT;
> +	end = (offset + len) >> PAGE_SHIFT;
> +
> +	restrictedmem_invalidate_start(data, start, end);
> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	restrictedmem_invalidate_end(data, start, end);
> +
> +	return ret;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +				    loff_t offset, loff_t len)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE)
> +		return restrictedmem_punch_hole(data, mode, offset, len);
> +
> +	return memfd->f_op->fallocate(memfd, mode, offset, len);
> +}
> +
> +static const struct file_operations restrictedmem_fops = {
> +	.release = restrictedmem_release,
> +	.fallocate = restrictedmem_fallocate,
> +};
> +
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> +				 const struct path *path, struct kstat *stat,
> +				 u32 request_mask, unsigned int query_flags)
> +{
> +	struct inode *inode = d_inode(path->dentry);
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +
> +	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +					     request_mask, query_flags);
> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> +				 struct dentry *dentry, struct iattr *attr)
> +{
> +	struct inode *inode = d_inode(dentry);
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (attr->ia_valid & ATTR_SIZE) {
> +		if (memfd->f_inode->i_size)
> +			return -EPERM;
> +
> +		if (!PAGE_ALIGNED(attr->ia_size))
> +			return -EINVAL;
> +	}
> +
> +	ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +					    file_dentry(memfd), attr);
> +	return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> +	.getattr = restrictedmem_getattr,
> +	.setattr = restrictedmem_setattr,
> +};
> +
> +static int restrictedmem_init_fs_context(struct fs_context *fc)
> +{
> +	if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> +		return -ENOMEM;
> +
> +	fc->s_iflags |= SB_I_NOEXEC;
> +	return 0;
> +}
> +
> +static struct file_system_type restrictedmem_fs = {
> +	.owner		= THIS_MODULE,
> +	.name		= "memfd:restrictedmem",
> +	.init_fs_context = restrictedmem_init_fs_context,
> +	.kill_sb	= kill_anon_super,
> +};
> +
> +static struct vfsmount *restrictedmem_mnt;
> +
> +static __init int restrictedmem_init(void)
> +{
> +	restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> +	if (IS_ERR(restrictedmem_mnt))
> +		return PTR_ERR(restrictedmem_mnt);
> +	return 0;
> +}
> +fs_initcall(restrictedmem_init);
> +
> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +	struct restrictedmem_data *data;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	struct file *file;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return ERR_PTR(-ENOMEM);
> +
> +	data->memfd = memfd;
> +	mutex_init(&data->lock);
> +	INIT_LIST_HEAD(&data->notifiers);
> +
> +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +	if (IS_ERR(inode)) {
> +		kfree(data);
> +		return ERR_CAST(inode);
> +	}
> +
> +	inode->i_mode |= S_IFREG;
> +	inode->i_op = &restrictedmem_iops;
> +	inode->i_mapping->private_data = data;
> +
> +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +				 "restrictedmem", O_RDWR,
> +				 &restrictedmem_fops);
> +	if (IS_ERR(file)) {
> +		iput(inode);
> +		kfree(data);
> +		return ERR_CAST(file);
> +	}
> +
> +	file->f_flags |= O_LARGEFILE;
> +
> +	/*
> +	 * These pages are currently unmovable so don't place them into movable
> +	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
> +	 */
> +	mapping = memfd->f_mapping;
> +	mapping_set_unevictable(mapping);
> +	mapping_set_gfp_mask(mapping,
> +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +	return file;
> +}
> +
> +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +{
> +	struct file *file, *restricted_file;
> +	int fd, err;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	fd = get_unused_fd_flags(0);
> +	if (fd < 0)
> +		return fd;
> +
> +	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		goto err_fd;
> +	}
> +	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +	file->f_flags |= O_LARGEFILE;
> +
> +	restricted_file = restrictedmem_file_create(file);
> +	if (IS_ERR(restricted_file)) {
> +		err = PTR_ERR(restricted_file);
> +		fput(file);
> +		goto err_fd;
> +	}
> +
> +	fd_install(fd, restricted_file);
> +	return fd;
> +err_fd:
> +	put_unused_fd(fd);
> +	return err;
> +}
> +
> +void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +	mutex_lock(&data->lock);
> +	list_add(&notifier->list, &data->notifiers);
> +	mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> +
> +void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +	mutex_lock(&data->lock);
> +	list_del(&notifier->list);
> +	mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct folio *folio;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	page = folio_file_page(folio, offset);
> +	*pagep = page;
> +	if (order)
> +		*order = thp_order(compound_head(page));
> +
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> +
> +void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> +{
> +	struct super_block *sb = restrictedmem_mnt->mnt_sb;
> +	struct inode *inode, *next;
> +
> +	if (!shmem_mapping(mapping))
> +		return;
> +
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> +		struct restrictedmem_data *data = inode->i_mapping->private_data;
> +		struct file *memfd = data->memfd;
> +
> +		if (memfd->f_mapping == mapping) {
> +			pgoff_t start, end;
> +
> +			spin_unlock(&sb->s_inode_list_lock);
> +
> +			start = page->index;
> +			end = start + thp_nr_pages(page);
> +			restrictedmem_notifier_error(data, start, end);
> +			return;
> +		}
> +	}
> +	spin_unlock(&sb->s_inode_list_lock);
> +}
> -- 
> 2.25.1
> 

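For reference, below is a minimal userspace sketch of the syscall introduced
above. It is illustrative only and not part of the series; it assumes the
x86-64 syscall number 451 from the patch and sizes the fd with ftruncate(),
which the patch's restrictedmem_setattr() allows for a page-aligned size on
an empty memfd.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451	/* from the x86 syscall tables above */
#endif

int main(void)
{
	/* flags must be 0 in this version of the patch. */
	int fd = syscall(__NR_memfd_restricted, 0);

	if (fd < 0) {
		perror("memfd_restricted");
		return 1;
	}

	/* Size the restricted memfd; read(), write() and mmap() are blocked. */
	if (ftruncate(fd, 2UL << 20)) {
		perror("ftruncate");
		return 1;
	}

	/* The fd would then be handed to KVM as the source of guest private memory. */
	close(fd);
	return 0;
}
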
^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-04-13  1:07         ` Sean Christopherson
@ 2023-04-13 16:04           ` Kirill A. Shutemov
  0 siblings, 0 replies; 398+ messages in thread
From: Kirill A. Shutemov @ 2023-04-13 16:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Liam Merwick, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang

On Wed, Apr 12, 2023 at 06:07:28PM -0700, Sean Christopherson wrote:
> On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> > On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > >    - 01: mm change, target for mm tree
> > > > > >    - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > > 
> > > > >    git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > 
> > > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > > bits (and also with Sean's branch above) I encounter the following NULL
> > > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > > (100% reproducible).
> > > > 
> > > > It appears that in restrictedmem_error_page()
> > > > inode->i_mapping->private_data is NULL in the
> > > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > > don't know why.
> > > 
> > > Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> > 
> > The patch below should help.
> > 
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > index 15c52301eeb9..39ada985c7c0 100644
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> >  
> >  	spin_lock(&sb->s_inode_list_lock);
> >  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > -		struct restrictedmem *rm = inode->i_mapping->private_data;
> >  		struct restrictedmem_notifier *notifier;
> > -		struct file *memfd = rm->memfd;
> > +		struct restrictedmem *rm;
> >  		unsigned long index;
> > +		struct file *memfd;
> >  
> > -		if (memfd->f_mapping != mapping)
> > +		if (atomic_read(&inode->i_count))
> 
> Kirill, should this be
> 
> 		if (!atomic_read(&inode->i_count))
> 			continue;
> 
> i.e. skip unreferenced inodes, not skip referenced inodes?

Ouch. Yes.

But looking at other instances of s_inodes usage, I think we can drop the
check altogether. An inode cannot be completely freed until it is removed
from the s_inodes list.

While there, replace list_for_each_entry_safe() with
list_for_each_entry() as we don't remove anything from the list.

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index 55e99e6c09a1..8e8a4420d3d1 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -194,22 +194,19 @@ static int restricted_error_remove_page(struct address_space *mapping,
 					struct page *page)
 {
 	struct super_block *sb = restrictedmem_mnt->mnt_sb;
-	struct inode *inode, *next;
+	struct inode *inode;
 	pgoff_t start, end;
 
 	start = page->index;
 	end = start + thp_nr_pages(page);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct restrictedmem_notifier *notifier;
 		struct restrictedmem *rm;
 		unsigned long index;
 		struct file *memfd;
 
-		if (atomic_read(&inode->i_count))
-			continue;
-
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                     ` (5 preceding siblings ...)
  2023-04-13 15:25   ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Christian Brauner
@ 2023-04-13 17:22   ` Ackerley Tng
  6 siblings, 0 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-04-13 17:22 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, pbonzini, corbet, seanjc,
	vkuznets, wanpengli, jmattson, joro, tglx, mingo, bp, arnd,
	naoya.horiguchi, linmiaohe, x86, hpa, hughd, jlayton, bfields,
	akpm, shuah, rppt, steven.price, mail, vbabka, vannapurve,
	yu.c.zhang, chao.p.peng, kirill.shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb, qperret,
	tabba, michael.roth, mhocko, wei.w.wang

Chao Peng <chao.p.peng@linux.intel.com> writes:

> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through the new in-kernel interface by a third kernel module.

> ...

> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..56953c204e5c
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,318 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +	struct mutex lock;
> +	struct file *memfd;

Can this be renamed to file, or lower_file (as in stacking filesystems)?

It's a little confusing because this pointer doesn't actually refer to
an fd.

'memfd' is already used by udmabuf to refer to an actual fd [1], which
makes this a little misleading.

[1]  
https://elixir.bootlin.com/linux/v6.2.10/source/tools/testing/selftests/drivers/dma-buf/udmabuf.c#L63

> +	struct list_head notifiers;
> +};
> +
> ...


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2023-04-13 15:25   ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Christian Brauner
@ 2023-04-13 22:28     ` Sean Christopherson
  2023-04-14 22:38       ` Ackerley Tng
  2023-04-19  8:29       ` Christian Brauner
  0 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-13 22:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill A . Shutemov, Ackerley Tng, Chao Peng, Hugh Dickins, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Pankaj Gupta,
	linux-arch, arnd, linmiaohe, naoya.horiguchi, tabba, wei.w.wang

On Thu, Apr 13, 2023, Christian Brauner wrote:
> On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > Here's what I would prefer, and imagine much easier for you to maintain;
> > > but I'm no system designer, and may be misunderstanding throughout.
> > > 
> > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > > the fallocate syscall interface itself) to allocate and free the memory,
> > > ioctl for initializing some of it too.  KVM in control of whether that
> > > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > > in shmem.c, no need for flags, seals, notifications to and fro because
> > > KVM is already in control and knows the history.  If shmem actually has
> > > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > > add, just allocate and free kernel memory directly, recorded in your
> > > own xarray.
> > 
> > I guess shim layer on top of shmem *can* work. I don't see immediately why
> > it would not. But I'm not sure it is right direction. We risk creating yet
> > another parallel VM with own rules/locking/accounting that opaque to
> > core-mm.
> 
> Sorry for necrobumping this thread but I've been reviewing the

No worries, I'm just stoked someone who actually knows what they're doing is
chiming in :-)

> memfd_restricted() extension that Ackerley is currently working on. I
> was pointed to this thread as this is what the extension is building
> on but I'll reply to both threads here.
> 
> From a glance at v10, memfd_restricted() is currently implemented as an
> in-kernel stacking filesystem. A call to memfd_restricted() creates a
> new restricted memfd file and a new unlinked tmpfs file and stashes the
> tmpfs file into the memfd file's private data member. It then uses the
> tmpfs file's f_ops and i_ops to perform the relevant file and inode
> operations. So it has the same callstack as a general stacking
> filesystem like overlayfs in some cases:
> 
>         memfd_restricted->getattr()
>         -> tmpfs->getattr()

...

> Since you're effectively acting like a stacking filesystem you should
> really use the device number of your memfd restricted filesystem. IOW,
> something like:
> 
>         stat->dev = memfd_restricted_dentry->d_sb->s_dev;
> 
> But then you run into trouble if you want to go forward with Ackerley's
> extension that allows to explicitly pass in tmpfs fds to
> memfd_restricted(). Afaict, two tmpfs instances might allocate the same
> inode number. So now the inode and device number pair isn't unique
> anymore.
> 
> So you might best be served by allocating and reporting your own inode
> numbers as well.
> 
> But if you want to preserve the inode number and device number of the
> relevant tmpfs instance but still report memfd restricted as your
> filesystem type

Unless I missed something along the way, reporting memfd_restricted as a distinct
filesystem is very much a non-goal.  AFAIK it's purely a side effect of the
proposed implementation.

> then I think it's reasonable to ask whether a stacking implementation really
> makes sense here.
> 
> If you extend memfd_restricted() or even consider extending it in the
> future to take tmpfs file descriptors as arguments to identify the tmpfs
> instance in which to allocate the underlying tmpfs file for the new
> restricted memfd file you should really consider a tmpfs based
> implementation.
> 
> Because at that point it just feels like a pointless wrapper to get
> custom f_ops and i_ops. Plus it's wasteful because you allocate dentries
> and inodes that you don't really care about at all.
> 
> Just off the top of my hat you might be better served:
> * by a new ioctl() on tmpfs instances that
>   yield regular tmpfs file descriptors with restricted f_ops and i_ops.
>   That's not that different from btrfs subvolumes which effectively are
>   directories but are created through an ioctl().

I think this is more or less what we want to do, except via a dedicated syscall
instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs,
e.g. so that it can be extended to other backing types in the future.

> * by a mount option to tmpfs that makes it act
>   in this restricted manner then you don't need an ioctl() and can get
>   away with regular open calls. Such a tmpfs instance would only create
>   regular, restricted memfds.

I'd prefer to not go this route, because IIUC, it would require relatively invasive
changes to shmem code, and IIUC would require similar changes to other supported
backings in the future, e.g. hugetlbfs?  And as above, I don't think any of the
potential use cases need restrictedmem to be a uniquely identifiable mount.

One of the goals (hopefully not a pipe dream) is to design restrictedmem in such a
way that extending it to support other backing types isn't terribly difficult.
In case it's not obvious, most of us working on this stuff aren't filesystems
experts, and many of us aren't mm experts either.  The more we (KVM folks for the
most part) can leverage existing code to do the heavy lifting, the better.

After giving myself a bit of a crash course in file systems, would something like
the below have any chance of (a) working, (b) getting merged, and (c) being
maintainable?

The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem
hijacks the f_ops and a_ops to create a lightweight shim around tmpfs.  There are
undoubtedly issues and edge cases; I'm just looking for a quick "yes, this might
be doable" or a "no, that's absolutely bonkers, don't try it".

Thanks!


struct restrictedmem {
	struct rw_semaphore lock;
	struct file *file;
	const struct file_operations *backing_f_ops;
	const struct address_space_operations *backing_a_ops;
	struct xarray bindings;
	bool exclusive;
};

static int restrictedmem_release(struct inode *inode, struct file *file)
{
	struct restrictedmem *rm = inode->i_mapping->private_data;

	/* The backing (tmpfs) f_ops are not expected to have a release hook. */
	WARN_ON_ONCE(rm->backing_f_ops->release);

	xa_destroy(&rm->bindings);
	kfree(rm);
	return 0;
}

static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
				     loff_t offset, loff_t len)
{
	struct restrictedmem_notifier *notifier;
	unsigned long index;
	pgoff_t start, end;
	int ret;

	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
		return -EINVAL;

	start = offset >> PAGE_SHIFT;
	end = (offset + len) >> PAGE_SHIFT;

	/*
	 * Bindings must be stable across invalidation to ensure the start+end
	 * are balanced.
	 */
	down_read(&rm->lock);

	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
		notifier->ops->invalidate_start(notifier, start, end);

	ret = rm->backing_f_ops->fallocate(rm->file, mode, offset, len);

	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
		notifier->ops->invalidate_end(notifier, start, end);

	up_read(&rm->lock);

	return ret;
}

static long restrictedmem_fallocate(struct file *file, int mode,
				    loff_t offset, loff_t len)
{
	struct restrictedmem *rm = file->f_mapping->private_data;

	if (mode & FALLOC_FL_PUNCH_HOLE)
		return restrictedmem_punch_hole(rm, mode, offset, len);

	return rm->backing_f_ops->fallocate(file, mode, offset, len);
}

static int restrictedmem_migrate_folio(struct address_space *mapping,
				       struct folio *dst, struct folio *src,
				       enum migrate_mode mode)
{
	WARN_ON_ONCE(1);
	return -EINVAL;
}

static int restrictedmem_error_page(struct address_space *mapping,
				    struct page *page)
{
	struct restrictedmem *rm = mapping->private_data;
	struct restrictedmem_notifier *notifier;
	unsigned long index;
	pgoff_t start, end;

	start = page->index;
	end = start + thp_nr_pages(page);

	down_read(&rm->lock);

	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
		notifier->ops->error(notifier, start, end);

	up_read(&rm->lock);

	return rm->backing_a_ops->error_remove_page(mapping, page);
}

static const struct file_operations restrictedmem_fops = {
	.release = restrictedmem_release,
	.fallocate = restrictedmem_fallocate,
};

static const struct address_space_operations restrictedmem_aops = {
	.dirty_folio = noop_dirty_folio,
#ifdef CONFIG_MIGRATION
	.migrate_folio	= restrictedmem_migrate_folio,
#endif
	.error_remove_page = restrictedmem_error_page,
};

static int restrictedmem_file_create(struct file *file)
{
	struct address_space *mapping = file->f_mapping;
	struct restrictedmem *rm;

	rm = kzalloc(sizeof(*rm), GFP_KERNEL);
	if (!rm)
		return -ENOMEM;

	rm->backing_f_ops = file->f_op;
	rm->backing_a_ops = mapping->a_ops;
	rm->file = file;
	init_rwsem(&rm->lock);
	xa_init(&rm->bindings);

	file->f_flags |= O_LARGEFILE;

	file->f_op = &restrictedmem_fops;
	mapping->a_ops = &restrictedmem_aops;

	mapping_set_unevictable(mapping);
	mapping_set_unmovable(mapping);
	mapping_set_gfp_mask(mapping,
			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
	return 0;
}


static int restrictedmem_create(struct vfsmount *mount)
{
	struct file *file;
	int fd, err;

	fd = get_unused_fd_flags(0);
	if (fd < 0)
		return fd;

	file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
	if (IS_ERR(file)) {
		err = PTR_ERR(file);
		goto err_fd;
	}
	if (WARN_ON_ONCE(file->private_data)) {
		err = -EEXIST;
		fput(file);
		goto err_fd;
	}

	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
	file->f_flags |= O_LARGEFILE;

	err = restrictedmem_file_create(file);
	if (err) {
		fput(file);
		goto err_fd;
	}

	fd_install(fd, file);
	return fd;
err_fd:
	put_unused_fd(fd);
	return err;
}

SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
{
	struct vfsmount *mnt;
	struct path *path;
	struct fd f;
	int ret;

	if (flags)
		return -EINVAL;

	f = fdget_raw(mount_fd);
	if (!f.file)
		return -EBADF;

	ret = -EINVAL;

	path = &f.file->f_path;
	if (path->dentry != path->mnt->mnt_root)
		goto out;


	/* Disallow bind-mounts that aren't bind-mounts of the whole filesystem. */
	mnt = path->mnt;
	if (mnt->mnt_root != mnt->mnt_sb->s_root)
		goto out;

	/*
	 * The filesystem must be mounted no-execute, executing from guest
	 * private memory in the host is nonsensical and unsafe.
	 */
	if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC))
		goto out;

	/* Currently only TMPFS is supported as underlying storage. */
	if (mnt->mnt_sb->s_magic != TMPFS_MAGIC)
		goto out;

	ret = mnt_want_write(mnt);
	if (ret)
		goto out;

	ret = restrictedmem_create(mnt);

	if (mnt)
		mnt_drop_write(mnt);
out:
	if (f.file)
		fdput(f);

	return ret;
}
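
One piece the sketch above leaves implicit is how entries get into
rm->bindings for the xa_for_each_range() walks. Purely as a thought
experiment (the helper name, the range-based xa_store_range() storage, which
needs XARRAY_MULTI, and the exclusive-binding rule are all assumptions, not
part of the sketch above), binding registration could look roughly like:

static int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
			      struct restrictedmem_notifier *notifier,
			      bool exclusive)
{
	struct restrictedmem *rm = file->f_mapping->private_data;
	int ret = -EINVAL;

	down_write(&rm->lock);

	/* Don't mix exclusive and non-exclusive bindings on one fd. */
	if (exclusive != rm->exclusive && !xa_empty(&rm->bindings))
		goto out_unlock;

	/* Record the notifier for every page offset in [start, end). */
	ret = xa_err(xa_store_range(&rm->bindings, start, end - 1, notifier,
				    GFP_KERNEL));
	if (!ret)
		rm->exclusive = exclusive;

out_unlock:
	up_write(&rm->lock);
	return ret;
}

The invalidation and error paths then take rm->lock for read, as in the
sketch, so bindings stay stable across invalidate_start/invalidate_end.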


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-04-12  9:59         ` Christian Brauner
@ 2023-04-13 22:53           ` Ackerley Tng
  2023-04-13 23:07             ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-04-13 22:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: kvm, linux-api, linux-arch, linux-doc, linux-fsdevel,
	linux-kernel, linux-mm, qemu-devel, aarcange, ak, akpm, arnd,
	bfields, bp, chao.p.peng, corbet, dave.hansen, david, ddutile,
	dhildenb, hpa, hughd, jlayton, jmattson, joro, jun.nakajima,
	kirill.shutemov, linmiaohe, luto, mail, mhocko, michael.roth,
	mingo, naoya.horiguchi, pbonzini, qperret, rppt, seanjc, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang

Christian Brauner <brauner@kernel.org> writes:

> On Wed, Apr 05, 2023 at 09:58:44PM +0000, Ackerley Tng wrote:

>> ...

>> > > Why do you even need this flag? It seems that @mount_fd being < 0 is
>> > > sufficient to indicate that a new restricted memory fd is supposed  
>> to be
>> > > created in the system instance.


>> I'm hoping to have this patch series merged after Chao's patch series
>> introduces the memfd_restricted() syscall [1].

> I'm curious, is there an LSFMM session for this?

As far as I know, there is no LSFMM session for this.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
  2023-04-13 22:53           ` Ackerley Tng
@ 2023-04-13 23:07             ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-13 23:07 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Christian Brauner, kvm, linux-api, linux-arch, linux-doc,
	linux-fsdevel, linux-kernel, linux-mm, qemu-devel, aarcange, ak,
	akpm, arnd, bfields, bp, chao.p.peng, corbet, dave.hansen, david,
	ddutile, dhildenb, hpa, hughd, jlayton, jmattson, joro,
	jun.nakajima, kirill.shutemov, linmiaohe, luto, mail, mhocko,
	michael.roth, mingo, naoya.horiguchi, pbonzini, qperret, rppt,
	shuah, steven.price, tabba, tglx, vannapurve, vbabka, vkuznets,
	wanpengli, wei.w.wang, x86, yu.c.zhang

On Thu, Apr 13, 2023, Ackerley Tng wrote:
> Christian Brauner <brauner@kernel.org> writes:
> > I'm curious, is there an LSFMM session for this?
> 
> As far as I know, there is no LSFMM session for this.

Correct, no LSFMM session.  In hindsight, that's obviously something we should
have pursued :-(

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-03-28 10:41                 ` Chao Peng
@ 2023-04-14 21:08                   ` Sean Christopherson
  2023-04-18 23:38                     ` Ackerley Tng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-04-14 21:08 UTC (permalink / raw)
  To: Chao Peng
  Cc: Xiaoyao Li, Isaku Yamahata, Ackerley Tng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-arch, linux-api, linux-doc,
	qemu-devel, pbonzini, corbet, vkuznets, wanpengli, jmattson,
	joro, tglx, mingo, bp, arnd, naoya.horiguchi, linmiaohe, x86,
	hpa, hughd, jlayton, bfields, akpm, shuah, rppt, steven.price,
	mail, vbabka, vannapurve, yu.c.zhang, kirill.shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, qperret, tabba, michael.roth, mhocko, wei.w.wang

On Tue, Mar 28, 2023, Chao Peng wrote:
> On Fri, Mar 24, 2023 at 10:29:25AM +0800, Xiaoyao Li wrote:
> > On 3/24/2023 10:10 AM, Chao Peng wrote:
> > > On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
> > > > On Wed, Mar 08, 2023 at 03:40:26PM +0800,
> > > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > > 
> > > > > On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
> > > > > > Chao Peng <chao.p.peng@linux.intel.com> writes:
> > > > > > 
> > > > > > > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > > > > > > +{
> > > > > > > +	if (!offset)
> > > > > > > +		return true;
> > > > > > > +	if (!gpa)
> > > > > > > +		return false;
> > > > > > > +
> > > > > > > +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> > > > 
> > > > This check doesn't work as expected. For example, offset = 2GB, gpa=4GB
> > > > this check fails.
> > > 
> > > This case is expected to fail as Sean initially suggested[*]:
> > >    I would rather reject memslot if the gfn has lesser alignment than
> > >    the offset. I'm totally ok with this approach _if_ there's a use case.
> > >    Until such a use case presents itself, I would rather be conservative
> > >    from a uAPI perspective.
> > > 
> > > I understand that we put tighter restriction on this but if you see such
> > > restriction is really a big issue for real usage, instead of a
> > > theoretical problem, then we can loosen the check here. But at that time
> > > below code is kind of x86 specific and may need improve.
> > > 
> > > BTW, in latest code, I replaced count_trailing_zeros() with fls64():
> > >    return !!(fls64(offset) >= fls64(gpa));
> > 
> > wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ?
> 
> As the function document explains, here we want to return true when
> ALIGNMENT(offset) >= ALIGNMENT(gpa), so '>=' is what we need.
> 
> It's worth clarifying that in Sean's original suggestion he actually
> said the opposite. He said 'reject memslot if the gfn has lesser
> alignment than the offset', but I wonder whether that was his intent,
> since if ALIGNMENT(offset) < ALIGNMENT(gpa), it wouldn't be possible to
> map the page as a largepage. Consider the config below:
> 
>   gpa=2M, offset=1M
> 
> In this case KVM tries to map gpa at 2M as 2M hugepage but the physical
> page at the offset(1M) in private_fd cannot provide the 2M page due to
> misalignment.
> 
> But as we discussed in the off-list thread, here we do find a real use
> case indicating this check is too strict, i.e. QEMU immediately fails
> when launching a guest with > 2G of memory. For this case QEMU splits
> guest memory
> space into two slots:
> 
>   Slot#1(ram_below_4G): gpa=0x0, offset=0x0, size=2G
>   Slot#2(ram_above_4G): gpa=4G,  offset=2G,  size=totalsize-2G
> 
> This strict alignment check fails for slot#2 because offset (2G) has less
> alignment than gpa (4G). To allow this, one solution is to revert to my
> previous change in kvm_alloc_memslot_metadata(), which disallows hugepages
> only when the offset/gpa are not aligned to the relevant page size.
> 
> Sean, How do you think?

I agree, a pure alignment check is too restrictive, and not really what I intended
despite past me literally saying that's what I wanted :-)  I think I may have also
inverted the "less alignment" statement, but luckily I believe that ends up being
a moot point.

The goal is to avoid having to juggle scenarios where KVM wants to create a hugepage,
but restrictedmem can't provide one because of a misaligned file offset.  I think
the rule we want is that the offset must be aligned to the largest page size allowed
by the memslot _size_.  E.g. on x86, if the memslot size is >=1GiB then the offset
must be 1GiB or better, ditto for >=2MiB and >=4KiB (ignoring that 4KiB is already a
requirement).

We could loosen that to say the largest size allowed by the memslot, but I don't
think that's worth the effort unless it's trivially easy to implement in code,
e.g. KVM could technically allow a 4KiB aligned offset if the memslot is 2MiB
sized but only 4KiB aligned on the GPA.  I doubt there's a real use case for such
a memslot, so I want to disallow that unless it's super easy to implement.

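To make that rule concrete, a check along these lines could work. This is an
illustrative sketch only, not code from the series: it reuses the helper name
from the discussion above but is keyed on the memslot size rather than the
gpa, uses x86 hugepage sizes, and takes the SZ_* constants from
include/linux/sizes.h.

static bool kvm_check_rmem_offset_alignment(u64 offset, u64 size)
{
	/*
	 * The offset must be aligned to the largest hugepage size that the
	 * memslot's byte size could possibly allow.
	 */
	if (size >= SZ_1G)
		return IS_ALIGNED(offset, SZ_1G);
	if (size >= SZ_2M)
		return IS_ALIGNED(offset, SZ_2M);

	return IS_ALIGNED(offset, SZ_4K);
}

Under that rule the QEMU layout above still works: slot#2's offset of 2G is
1GiB-aligned, so a memslot larger than 1GiB passes the check.
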
^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2023-04-13 22:28     ` Sean Christopherson
@ 2023-04-14 22:38       ` Ackerley Tng
  2023-04-14 23:26         ` Sean Christopherson
  2023-04-19  8:29       ` Christian Brauner
  1 sibling, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-04-14 22:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brauner, kirill.shutemov, chao.p.peng, hughd, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, pbonzini, corbet, vkuznets, wanpengli, jmattson,
	joro, tglx, mingo, bp, x86, hpa, jlayton, bfields, akpm, shuah,
	rppt, steven.price, mail, vbabka, vannapurve, yu.c.zhang, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, qperret, michael.roth, mhocko, songmuchun,
	pankaj.gupta, linux-arch, arnd, linmiaohe, naoya.horiguchi,
	tabba, wei.w.wang

Sean Christopherson <seanjc@google.com> writes:

> On Thu, Apr 13, 2023, Christian Brauner wrote:
>> On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
>> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>> > > Here's what I would prefer, and imagine much easier for you to  
>> maintain;
>> > > but I'm no system designer, and may be misunderstanding throughout.
>> > >
>> > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
>> > > the fallocate syscall interface itself) to allocate and free the  
>> memory,
>> > > ioctl for initializing some of it too.  KVM in control of whether  
>> that
>> > > fd can be read or written or mmap'ed or whatever, no need to prevent  
>> it
>> > > in shmem.c, no need for flags, seals, notifications to and fro  
>> because
>> > > KVM is already in control and knows the history.  If shmem actually  
>> has
>> > > value, call into it underneath - somewhat like SysV SHM, and  
>> /dev/zero
>> > > mmap, and i915/gem make use of it underneath.  If shmem has nothing  
>> to
>> > > add, just allocate and free kernel memory directly, recorded in your
>> > > own xarray.
>> >
>> > I guess shim layer on top of shmem *can* work. I don't see immediately  
>> why
>> > it would not. But I'm not sure it is right direction. We risk creating  
>> yet
>> > another parallel VM with own rules/locking/accounting that opaque to
>> > core-mm.

>> Sorry for necrobumping this thread but I've been reviewing the

> No worries, I'm just stoked someone who actually knows what they're doing  
> is
> chiming in :-)


+1, thanks Christian!

>> memfd_restricted() extension that Ackerley is currently working on. I
>> was pointed to this thread as this is what the extension is building
>> on but I'll reply to both threads here.

>>  From a glance at v10, memfd_restricted() is currently implemented as an
>> in-kernel stacking filesystem. A call to memfd_restricted() creates a
>> new restricted memfd file and a new unlinked tmpfs file and stashes the
>> tmpfs file into the memfd file's private data member. It then uses the
>> tmpfs file's f_ops and i_ops to perform the relevant file and inode
>> operations. So it has the same callstack as a general stacking
>> filesystem like overlayfs in some cases:

>>          memfd_restricted->getattr()
>>          -> tmpfs->getattr()

> ...

>> Since you're effectively acting like a stacking filesystem you should
>> really use the device number of your memfd restricted filesystem. IOW,
>> sm like:

>>          stat->dev = memfd_restricted_dentry->d_sb->s_dev;

>> But then you run into trouble if you want to go forward with Ackerley's
>> extension that allows to explicitly pass in tmpfs fds to
>> memfd_restricted(). Afaict, two tmpfs instances might allocate the same
>> inode number. So now the inode and device number pair isn't unique
>> anymore.

>> So you might best be served by allocating and reporting your own inode
>> numbers as well.

>> But if you want to preserve the inode number and device number of the
>> relevant tmpfs instance but still report memfd restricted as your
>> filesystem type

> Unless I missed something along the way, reporting memfd_restricted as a  
> distinct
> filesystem is very much a non-goal.  AFAIK it's purely a side effect of  
> the
> proposed implementation.

>> then I think it's reasonable to ask whether a stacking implementation  
>> really
>> makes sense here.

>> If you extend memfd_restricted() or even consider extending it in the
>> future to take tmpfs file descriptors as arguments to identify the tmpfs
>> instance in which to allocate the underlying tmpfs file for the new
>> restricted memfd file you should really consider a tmpfs based
>> implementation.

>> Because at that point it just feels like a pointless wrapper to get
>> custom f_ops and i_ops. Plus it's wasteful because you allocate dentries
>> and inodes that you don't really care about at all.

>> Just off the top of my hat you might be better served:
>> * by a new ioctl() on tmpfs instances that
>>    yield regular tmpfs file descriptors with restricted f_ops and i_ops.
>>    That's not that different from btrfs subvolumes which effectively are
>>    directories but are created through an ioctl().

> I think this is more or less what we want to do, except via a dedicated  
> syscall
> instead of an ioctl() so that the primary interface isn't strictly tied  
> to tmpfs,
> e.g. so that it can be extended to other backing types in the future.

>> * by a mount option to tmpfs that makes it act
>>    in this restricted manner then you don't need an ioctl() and can get
>>    away with regular open calls. Such a tmpfs instance would only create
>>    regular, restricted memfds.

> I'd prefer to not go this route, becuase IIUC, it would require  
> relatively invasive
> changes to shmem code, and IIUC would require similar changes to other  
> support
> backings in the future, e.g. hugetlbfs?  And as above, I don't think any  
> of the
> potential use cases need restrictedmem to be a uniquely identifiable
> mount.

FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and
the separation that the current implementation has is very helpful. Also
helps that hugetlbfs and tmpfs are structured similarly, I guess.


> One of the goals (hopefully not a pipe dream) is to design restrictmem in  
> such a
> way that extending it to support other backing types isn't terribly  
> difficult.
> In case it's not obvious, most of us working on this stuff aren't  
> filesystems
> experts, and many of us aren't mm experts either.  The more we (KVM folks  
> for the
> most part) can leverage existing code to do the heavy lifting, the better.

> After giving myself a bit of a crash course in file systems, would  
> something like
> the below have any chance of (a) working, (b) getting merged, and (c)  
> being
> maintainable?

> The idea is similar to a stacking filesystem, but instead of stacking,  
> restrictedmem
> hijacks a f_ops and a_ops to create a lightweight shim around tmpfs.   
> There are
> undoubtedly issues and edge cases, I'm just looking for a quick "yes,  
> this might
> be doable" or a "no, that's absolutely bonkers, don't try it".

Not an FS expert by any means, but I did think of approaching it this
way as well!

"Hijacking" perhaps gives this approach a bit of a negative
connotation. I thought this is pretty close to subclassing (as in Object
Oriented Programming). When some methods (e.g. fallocate) are called,
restrictedmem does some work, and calls the same method in the
superclass.

The existing restrictedmem code is more like instantiating a shmem
object and keeping that object as a field within the restrictedmem
object.

Some (maybe small) issues I can think of now:

(1)

One difficulty with this approach is that other functions may make
assumptions about private_data being of a certain type, or functions may
use private_data.

I checked and IIUC neither shmem nor hugetlbfs use the private_data
field in the inode's i_mapping (also file's f_mapping).

But there's fs/buffer.c which uses private_data, although those
functions seem to be used by FSes like ext4 and fat, not memory-backed
FSes.

We can probably fix this if any backing filesystem of restrictedmem,
like tmpfs or future ones, starts using private_data.

Could the solution here be to store private_data of the superclass
instance in restrictedmem, and then override every method in the
superclass that uses private_data to first restore private_data before
making the superclass call? Perhaps we can take private_lock to change
private_data.
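
As a purely hypothetical sketch (backing_private_data would be a new field
in struct restrictedmem, and neither shmem nor hugetlbfs actually needs this
today), an overridden a_ops method could look something like:

static int restrictedmem_writepage(struct page *page,
				   struct writeback_control *wbc)
{
	struct address_space *mapping = page->mapping;
	struct restrictedmem *rm = mapping->private_data;
	int ret;

	/* Temporarily expose the backing store's private_data... */
	spin_lock(&mapping->private_lock);
	mapping->private_data = rm->backing_private_data;
	spin_unlock(&mapping->private_lock);

	ret = rm->backing_a_ops->writepage(page, wbc);

	/* ...and restore restrictedmem's before returning. */
	spin_lock(&mapping->private_lock);
	mapping->private_data = rm;
	spin_unlock(&mapping->private_lock);

	return ret;
}

That said, swapping private_data back and forth like this is fragile, so it
may be simpler to just refuse (WARN and bail on) backing stores that use
private_data until a real need shows up.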

(2)

Perhaps there are other slightly hidden cases that might need cleaning up.

For example, one of the patches in this series amends the
shmem_mapping() function from

return mapping->a_ops == &shmem_aops;

to

return mapping->host->i_sb->s_magic == TMPFS_MAGIC;

The former/original is more accurate since it checks a property of the
mapping itself instead of checking a property of the mapping's host's
superblock.

The impact of changing this guard is more obvious if we now override
a_ops but keep the mapping's host's superblock's s_magic.

Specifically for this example, maybe we should handle restrictedmem in
the caller (me_pagecache_clean()) specially, in addition to shmem.
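
For instance (rough sketch, restrictedmem_mapping() is just a name I'm making
up here), the caller could key off the a_ops directly, which would also let
shmem_mapping() keep its original a_ops comparison:

static inline bool restrictedmem_mapping(struct address_space *mapping)
{
	return mapping->a_ops == &restrictedmem_aops;
}

me_pagecache_clean() could then check restrictedmem_mapping() alongside
shmem_mapping().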


> Thanks!


> struct restrictedmem {
> 	struct rw_semaphore lock;
> 	struct file *file;
> 	const struct file_operations *backing_f_ops;
> 	const struct address_space_operations *backing_a_ops;
> 	struct xarray bindings;
> 	bool exclusive;
> };

> static int restrictedmem_release(struct inode *inode, struct file *file)
> {
> 	struct restrictedmem *rm = inode->i_mapping->private_data;

> 	WARN_ON_ONCE(rm->backing_f_ops->release);

> 	xa_destroy(&rm->bindings);
> 	kfree(rm);

> 	return 0;
> }

> static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> 				     loff_t offset, loff_t len)
> {
> 	struct restrictedmem_notifier *notifier;
> 	unsigned long index;
> 	pgoff_t start, end;
> 	int ret;

> 	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> 		return -EINVAL;

> 	start = offset >> PAGE_SHIFT;
> 	end = (offset + len) >> PAGE_SHIFT;

> 	/*
> 	 * Bindings must be stable across invalidation to ensure the start+end
> 	 * are balanced.
> 	 */
> 	down_read(&rm->lock);

> 	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> 		notifier->ops->invalidate_start(notifier, start, end);

> 	ret = rm->backing_f_ops->fallocate(rm->file, mode, offset, len);

> 	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> 		notifier->ops->invalidate_end(notifier, start, end);

> 	up_read(&rm->lock);

> 	return ret;
> }

> static long restrictedmem_fallocate(struct file *file, int mode,
> 				    loff_t offset, loff_t len)
> {
> 	struct restrictedmem *rm = file->f_mapping->private_data;

> 	if (mode & FALLOC_FL_PUNCH_HOLE)
> 		return restrictedmem_punch_hole(rm, mode, offset, len);

> 	return rm->backing_f_ops->fallocate(file, mode, offset, len);
> }

> static int restrictedmem_migrate_folio(struct address_space *mapping,
> 				       struct folio *dst, struct folio *src,
> 				       enum migrate_mode)
> {
> 	WARN_ON_ONCE(1);
> 	return -EINVAL;
> }

> static int restrictedmem_error_page(struct address_space *mapping,
> 				    struct page *page)
> {
> 	struct restrictedmem *rm = mapping->private_data;
> 	struct restrictedmem_notifier *notifier;
> 	unsigned long index;
> 	pgoff_t start, end;

> 	start = page->index;
> 	end = start + thp_nr_pages(page);

> 	down_read(&rm->lock);

> 	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> 		notifier->ops->error(notifier, start, end);

> 	up_read(&rm->lock);

> 	return rm->backing_a_ops->error_remove_page(mapping, page);
> }

When I was thinking about this I got stuck on handling error_remove_page,
because the existing code looks up the superblock and iterates over its
inodes to find the right mapping. Glad to see that the solution is simply
to use the mapping given in the arguments!


> static const struct file_operations restrictedmem_fops = {
> 	.release = restrictedmem_release,
> 	.fallocate = restrictedmem_fallocate,
> };

> static const struct address_space_operations restrictedmem_aops = {
> 	.dirty_folio = noop_dirty_folio,
> #ifdef CONFIG_MIGRATION
> 	.migrate_folio	= restrictedmem_migrate_folio,
> #endif
> 	.error_remove_page = restrictedmem_error_page,
> };

> static int restrictedmem_file_create(struct file *file)
> {
> 	struct address_space *mapping = file->f_mapping;
> 	struct restrictedmem *rm;

> 	rm = kzalloc(sizeof(*rm), GFP_KERNEL);
> 	if (!rm)
> 		return -ENOMEM;

> 	rm->backing_f_ops = file->f_op;
> 	rm->backing_a_ops = mapping->a_ops;
> 	rm->file = file;

We don't really need to do this: since rm->file is already the same as
file, we could just pass the file itself when it's needed.

> 	init_rwsem(&rm->lock);
> 	xa_init(&rm->bindings);

> 	file->f_flags |= O_LARGEFILE;

> 	file->f_op = &restrictedmem_fops;
> 	mapping->a_ops = &restrictedmem_aops;

I think we probably have to override inode_operations as well, because
otherwise other methods would become available to a restrictedmem file
(like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead
of a bug.
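
A rough sketch of what I mean (whether ->setattr then needs its own wrapper
for ftruncate(), similar to the fallocate() hole-punch path, is a separate
question):

static const struct inode_operations restrictedmem_iops = {
	.getattr = simple_getattr,
};

restrictedmem_file_create() would then also do
file_inode(file)->i_op = &restrictedmem_iops;.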


> 	mapping_set_unevictable(mapping);
> 	mapping_set_unmovable(mapping);
> 	mapping_set_gfp_mask(mapping,
> 			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> 	return 0;
> }


> static int restrictedmem_create(struct vfsmount *mount)
> {
> 	struct file *file;
> 	int fd, err;

> 	fd = get_unused_fd_flags(0);
> 	if (fd < 0)
> 		return fd;

> 	file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0,
> 					 VM_NORESERVE);
> 	if (IS_ERR(file)) {
> 		err = PTR_ERR(file);
> 		goto err_fd;
> 	}
> 	if (WARN_ON_ONCE(file->private_data)) {
> 		err = -EEXIST;
> 		goto err_fd;
> 	}

Did you intend this as a check that the backing filesystem isn't using
the private_data field in the mapping?

I think you meant file->f_mapping->private_data.

On this note, we will probably have to fix things whenever any backing
filesystems need the private_data field.


> 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> 	file->f_flags |= O_LARGEFILE;

> 	err = restrictedmem_file_create(file);
> 	if (err) {
> 		fput(file);
> 		goto err_fd;
> 	}

> 	fd_install(fd, file);
> 	return fd;
> err_fd:
> 	put_unused_fd(fd);
> 	return err;
> }

> SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> {
> 	struct vfsmount *mnt;
> 	struct path *path;
> 	struct fd f;
> 	int ret;

> 	if (flags)
> 		return -EINVAL;

> 	f = fdget_raw(mount_fd);
> 	if (!f.file)
> 		return -EBADF;

> 	ret = -EINVAL;

> 	path = &f.file->f_path;
> 	if (path->dentry != path->mnt->mnt_root)
> 		goto out;


> 	/* Disallow bind-mounts that aren't bind-mounts of the whole filesystem. */
> 	mnt = path->mnt;
> 	if (mnt->mnt_root != mnt->mnt_sb->s_root)
> 		goto out;

> 	/*
> 	 * The filesystem must be mounted no-execute, executing from guest
> 	 * private memory in the host is nonsensical and unsafe.
> 	 */
> 	if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC))
> 		goto out;

> 	/* Currently only TMPFS is supported as underlying storage. */
> 	if (mnt->mnt_sb->s_magic != TMPFS_MAGIC)
> 		goto out;

> 	ret = mnt_want_write(mnt);
> 	if (ret)
> 		goto out;

> 	ret = restrictedmem_create(mnt);

> 	if (mnt)
> 		mnt_drop_write(mnt);
> out:
> 	if (f.file)
> 		fdput(f);

> 	return ret;
> }

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2023-04-14 22:38       ` Ackerley Tng
@ 2023-04-14 23:26         ` Sean Christopherson
  2023-04-15  0:06           ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-04-14 23:26 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: brauner, kirill.shutemov, chao.p.peng, hughd, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, pbonzini, corbet, vkuznets, wanpengli, jmattson,
	joro, tglx, mingo, bp, x86, hpa, jlayton, bfields, akpm, shuah,
	rppt, steven.price, mail, vbabka, vannapurve, yu.c.zhang, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, qperret, michael.roth, mhocko, songmuchun,
	pankaj.gupta, linux-arch, arnd, linmiaohe, naoya.horiguchi,
	tabba, wei.w.wang

On Fri, Apr 14, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Thu, Apr 13, 2023, Christian Brauner wrote:
> > > * by a mount option to tmpfs that makes it act
> > >    in this restricted manner then you don't need an ioctl() and can get
> > >    away with regular open calls. Such a tmpfs instance would only create
> > >    regular, restricted memfds.
> 
> > I'd prefer to not go this route, because IIUC, it would require relatively
> > invasive changes to shmem code, and IIUC would require similar changes to
> > other support backings in the future, e.g. hugetlbfs?  And as above, I
> > don't think any of the potential use cases need restrictedmem to be a
> > uniquely identifiable mount.
> 
> FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and
> the separation that the current implementation has is very helpful. Also
> helps that hugetlbfs and tmpfs are structured similarly, I guess.
> 
> > One of the goals (hopefully not a pipe dream) is to design restrictedmem in
> > such a way that extending it to support other backing types isn't terribly
> > difficult.  In case it's not obvious, most of us working on this stuff
> > aren't filesystems experts, and many of us aren't mm experts either.  The
> > more we (KVM folks for the most part) can leverage existing code to do the
> > heavy lifting, the better.
> 
> > After giving myself a bit of a crash course in file systems, would
> > something like the below have any chance of (a) working, (b) getting
> > merged, and (c) being maintainable?
> 
> > The idea is similar to a stacking filesystem, but instead of stacking,
> > restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around
> > tmpfs.  There are undoubtedly issues and edge cases, I'm just looking for a
> > quick "yes, this might be doable" or a "no, that's absolutely bonkers,
> > don't try it".
> 
> Not an FS expert by any means, but I did think of approaching it this
> way as well!
> 
> "Hijacking" perhaps gives this approach a bit of a negative connotation.

Heh, commandeer then.

> I thought this is pretty close to subclassing (as in Object
> Oriented Programming). When some methods (e.g. fallocate) are called,
> restrictedmem does some work, and calls the same method in the
> superclass.
> 
> The existing restrictedmem code is more like instantiating a shmem
> object and keeping that object as a field within the restrictedmem
> object.
> 
> Some (maybe small) issues I can think of now:
> 
> (1)
> 
> One difficulty with this approach is that other functions may make
> assumptions about private_data being of a certain type, or functions may
> use private_data.
> 
> I checked and IIUC neither shmem nor hugetlbfs use the private_data
> field in the inode's i_mapping (also file's f_mapping).
> 
> But there's fs/buffer.c which uses private_data, although those
> functions seem to be used by FSes like ext4 and fat, not memory-backed
> FSes.
> 
> We can probably fix this if any backing filesystem of restrictedmem,
> like tmpfs or future ones, starts using private_data.

Ya, if we go the route of poking into f_ops and stuff, I would want to add
WARN_ON_ONCE() hardening of everything that restrictedmem wants to "commandeer" ;-)

> > static int restrictedmem_file_create(struct file *file)
> > {
> > 	struct address_space *mapping = file->f_mapping;
> > 	struct restrictedmem *rm;
> 
> > 	rm = kzalloc(sizeof(*rm), GFP_KERNEL);
> > 	if (!rm)
> > 		return -ENOMEM;
> 
> > 	rm->backing_f_ops = file->f_op;
> > 	rm->backing_a_ops = mapping->a_ops;
> > 	rm->file = file;
> 
> We don't really need to do this: since rm->file is already the same as
> file, we could just pass the file itself when it's needed.

Aha!  I was working on getting rid of it, but forgot to go back and do another
pass.

> > 	init_rwsem(&rm->lock);
> > 	xa_init(&rm->bindings);
> 
> > 	file->f_flags |= O_LARGEFILE;
> 
> > 	file->f_op = &restrictedmem_fops;
> > 	mapping->a_ops = &restrictedmem_aops;
> 
> I think we probably have to override inode_operations as well, because
> otherwise other methods would become available to a restrictedmem file
> (like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead
> of a bug.

I think we want those?  What we want to restrict are operations that require
read/write/execute access to the file; everything else should be ok.  fallocate()
is a special case because restrictedmem needs to tell KVM to unmap the memory when
a hole is punched.  I assume ->setattr() needs similar treatment to handle
ftruncate()?
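
Completely untested, but something like the below is what I'm thinking for
->setattr(): intercept shrinking truncations, zap via the notifiers, and then
punt to the backing store.  backing_i_ops would be a new field alongside
backing_f_ops/backing_a_ops, and the exact ->setattr() signature obviously
depends on the base kernel.

static int restrictedmem_setattr(struct mnt_idmap *idmap,
				 struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = d_inode(dentry);
	struct restrictedmem *rm = inode->i_mapping->private_data;
	struct restrictedmem_notifier *notifier;
	bool shrink = (attr->ia_valid & ATTR_SIZE) &&
		      attr->ia_size < i_size_read(inode);
	pgoff_t start = attr->ia_size >> PAGE_SHIFT;
	unsigned long index;
	int ret;

	down_read(&rm->lock);

	if (shrink) {
		xa_for_each_range(&rm->bindings, index, notifier, start, ULONG_MAX)
			notifier->ops->invalidate_start(notifier, start, ULONG_MAX);
	}

	ret = rm->backing_i_ops->setattr(idmap, dentry, attr);

	if (shrink) {
		xa_for_each_range(&rm->bindings, index, notifier, start, ULONG_MAX)
			notifier->ops->invalidate_end(notifier, start, ULONG_MAX);
	}

	up_read(&rm->lock);

	return ret;
}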

I'd love to hear Christian's input on this aspect of things.

> > 	if (WARN_ON_ONCE(file->private_data)) {
> > 		err = -EEXIST;
> > 		goto err_fd;
> > 	}
> 
> Did you intend this as a check that the backing filesystem isn't using
> the private_data field in the mapping?
>
> I think you meant file->f_mapping->private_data.

Ya, sounds right.  I should have added disclaimers that (a) I wrote this quite
quickly and (b) it's compile tested only at this point.

> On this note, we will probably have to fix things whenever any backing
> filesystems need the private_data field.

Yep.

> > 	f = fdget_raw(mount_fd);
> > 	if (!f.file)
> > 		return -EBADF;

...

> > 	/*
> > 	 * The filesystem must be mounted no-execute, executing from guest
> > 	 * private memory in the host is nonsensical and unsafe.
> > 	 */
> > 	if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC))
> > 		goto out;

Looking at this more closely, I don't think we need to require NOEXEC; things
like execve() should get squashed by virtue of not providing any read/write
implementations.  And dropping my misguided NOEXEC requirement means there's no
reason to disallow using the kernel-internal mount.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2023-04-14 23:26         ` Sean Christopherson
@ 2023-04-15  0:06           ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-15  0:06 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: brauner, kirill.shutemov, chao.p.peng, hughd, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	linux-kselftest, pbonzini, corbet, vkuznets, wanpengli, jmattson,
	joro, tglx, mingo, bp, x86, hpa, jlayton, bfields, akpm, shuah,
	rppt, steven.price, mail, vbabka, vannapurve, yu.c.zhang, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, qperret, michael.roth, mhocko, songmuchun,
	pankaj.gupta, linux-arch, arnd, linmiaohe, naoya.horiguchi,
	tabba, wei.w.wang

On Fri, Apr 14, 2023, Sean Christopherson wrote:
> On Fri, Apr 14, 2023, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> > > 	if (WARN_ON_ONCE(file->private_data)) {
> > > 		err = -EEXIST;
> > > 		goto err_fd;
> > > 	}
> > 
> > Did you intend this as a check that the backing filesystem isn't using
> > the private_data field in the mapping?
> >
> > I think you meant file->f_mapping->private_data.
> 
> Ya, sounds right.  I should have added disclaimers that (a) I wrote this quite
> quickly and (b) it's compile tested only at this point.

FWIW, here's a very lightly tested version that doesn't explode on a basic selftest.

https://github.com/sean-jc/linux/tree/x86/upm_base_support

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-01-24  1:27         ` Sean Christopherson
  2023-02-08 12:24           ` Isaku Yamahata
  2023-02-13 13:01           ` Michael Roth
@ 2023-04-17 14:37           ` Chao Peng
  2023-04-17 15:01             ` Sean Christopherson
  2 siblings, 1 reply; 398+ messages in thread
From: Chao Peng @ 2023-04-17 14:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Tue, Jan 24, 2023 at 01:27:50AM +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace like
> > > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > > > reviews are always welcome.
> > > > > >   - 01: mm change, target for mm tree
> > > > > >   - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > > > > is available here:
> > > > > 
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it still works
> > > > > for TDX?
> > > > > 
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
> > > > > (and I mean that).
> > > > > 
> > > > > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > > > > (SEV and TDX).  For tests, I want to build a lists of tests that are required for
> > > > > merging so that the criteria for merging are clear, and so that if the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can be distributed.
> > > > > 
> > > > > Regarding downstream dependencies, before this lands, I want to pull in all the
> > > > > TDX and SNP series and see how everything fits together.  Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > > > > don't miss an opportunity to make things simpler.  The patches in the SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > > > > details.  Nothing remotely major, but something that needs attention since it'll
> > > > > be uAPI.
> > > > 
> > > > Although I'm still debuging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held.  the race to change
> > > > private/shared is handled by mmu_seq.  Maybe dedicated function only for
> > > > kvm_faultin_pfn().
> > > 
> > > Gah, you're not on the other thread where this was discussed[*].  Simply deleting
> > > the lockdep assertion is safe, for guest types that rely on the attributes to
> > > define shared vs. private, KVM rechecks the attributes under the protection of
> > > mmu_seq.
> > > 
> > > I'll get a fixed version pushed out today.
> > > 
> > > [*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com
> > 
> > Now I have tdx kvm working. I've uploaded at the followings.
> > It's rebased to v6.2-rc3.
> >         git@github.com:yamahata/linux.git tdx/upm
> >         git@github.com:yamahata/qemu.git tdx/upm
> 
> And I finally got a working, building version updated and pushed out (again to):
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> Took longer than expected to get the memslot restrictions sussed out.  I'm done
> working on the code for now, my plan is to come back to it+TDX+SNP in 2-3 weeks
> to resolve any remaining todos (that no one else tackles) and to do the whole
> "merge the world" exercise.

Hi Sean,

In case you have started working on the code again, I have a branch [1],
originally planned as the v11 candidate, in which I believe I addressed all
the discussions we had for v10 except the very latest one [2], and integrated
all the newly added selftests from Ackerley and myself. The branch was based
on your original upm_base_support and then rebased onto your kvm-x86/mmu
head. Feel free to take anything you find useful (most of it is trivial, but
there are also some bug fixes).

[1] https://github.com/chao-p/linux/commits/privmem-v11.6
[2] https://lore.kernel.org/all/20230413160405.h6ov2yl6l3i7mvsj@box.shutemov.name/

Chao
> 
> > kvm_mmu_do_page_fault() needs the following change.
> > kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
> > kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> > make sense. This change would belong to TDX KVM patches, though.
> 
> Yeah, SNP needs similar treatment.  Sorting that out is high up on the todo list.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
  2023-04-17 14:37           ` Chao Peng
@ 2023-04-17 15:01             ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-17 15:01 UTC (permalink / raw)
  To: Chao Peng
  Cc: Isaku Yamahata, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	wei.w.wang

On Mon, Apr 17, 2023, Chao Peng wrote:
> In case you started working on the code again, I have a branch [1]
> originally planned as v11 candidate which I believe I addressed all the
> discussions we had for v10 except the very latest one [2] and integrated
> all the newly added selftests from Ackerley and myself. The branch was
> based on your original upm_base_support and then rebased to your
> kvm-x86/mmu head. Feel free to take anything you think useful( most of
> them are trivial things but also some fixes for bugs).

Nice!  I am going to work on splicing together the various series this week, I'll
make sure to grab your work.

Thanks much! 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (10 preceding siblings ...)
  2023-02-16  5:13 ` Mike Rapoport
@ 2023-04-17 15:40 ` Sean Christopherson
  2023-04-17 15:48   ` David Hildenbrand
  2023-04-23 13:14   ` Jarkko Sakkinen
  11 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-17 15:40 UTC (permalink / raw)
  To: Chao Peng
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Chao Peng, Kirill A . Shutemov,
	dhildenb, Quentin Perret, tabba, Michael Roth, wei.w.wang,
	Mike Rapoport, Liam Merwick, Isaku Yamahata, Jarkko Sakkinen,
	Ackerley Tng, kvm, linux-kernel

What do y'all think about renaming "restrictedmem" to "guardedmem"?

I want to start referring to the code/patches by its syscall/implementation name
instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
and not just the non-KVM code, and (c) will likely be confusing for future reviewers
since there's nothing in the code that mentions "UPM" in any way.

But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
already used to refer to "reserved memory".

Renaming the syscall to "guardedmem"...

  1. Allows for a shorthand and namespace, "gmem", that isn't already in use by
     the kernel (see "reserved memory above").
 
  2. Provides a stronger hint as to its purpose.  "Restricted" conveys that the
     allocated memory is limited in some way, but doesn't capture how the memory
     is restricted, e.g. "restricted" could just as easily mean that the allocation
     can be restricted to certain types of backing stores or something.  "Guarded"
     on the other hand captures that the memory has extra defenses of some form.

  3. Is shorter to type and speak.  Work smart, not hard :-)

  4. Isn't totally wrong for the KVM use case if someone assumes the "g" means
     "guest" when reading mail and whatnot.


P.S. I trimmed the Cc/To substantially for this particular discussion to avoid
     spamming folks that don't (yet) care about this stuff with another potentially
     lengthy thread.  Feel free to add (back) any people/lists.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 15:40 ` Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM) Sean Christopherson
@ 2023-04-17 15:48   ` David Hildenbrand
  2023-04-17 16:40     ` Sean Christopherson
  2023-04-23 13:28     ` Jarkko Sakkinen
  2023-04-23 13:14   ` Jarkko Sakkinen
  1 sibling, 2 replies; 398+ messages in thread
From: David Hildenbrand @ 2023-04-17 15:48 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson, Joerg Roedel,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel

On 17.04.23 17:40, Sean Christopherson wrote:
> What do y'all think about renaming "restrictedmem" to "guardedmem"?

Yeay, let's add more confusion :D

If we're renaming anyway, I'd appreciate it if we could find a terminology 
that looks/sounds less horrible.

> 
> I want to start referring to the code/patches by its syscall/implementation name
> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> since there's nothing in the code that mentions "UPM" in any way.
> 
> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> already used to refer to "reserved memory".
> 
> Renaming the syscall to "guardedmem"...

restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 15:48   ` David Hildenbrand
@ 2023-04-17 16:40     ` Sean Christopherson
  2023-04-17 17:09       ` David Hildenbrand
                         ` (2 more replies)
  2023-04-23 13:28     ` Jarkko Sakkinen
  1 sibling, 3 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-17 16:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On Mon, Apr 17, 2023, David Hildenbrand wrote:
> On 17.04.23 17:40, Sean Christopherson wrote:
> > I want to start referring to the code/patches by its syscall/implementation name
> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > since there's nothing in the code that mentions "UPM" in any way.
> > 
> > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > already used to refer to "reserved memory".
> > 
> > Renaming the syscall to "guardedmem"...
> 
> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

I'm definitely open to other suggestions, but I suspect it's going to be difficult
to be more precise than something like "guarded".

E.g. we discussed "unmappable" at one point, but the memory can still be mapped,
just not via mmap().  And it's not just about mappings, e.g. read() and its many
variants are all disallowed too, despite the kernel direct map still being live
(modulo SNP requirements).

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 16:40     ` Sean Christopherson
@ 2023-04-17 17:09       ` David Hildenbrand
  2023-04-17 19:16         ` Sean Christopherson
  2023-04-17 17:11       ` Ackerley Tng
  2023-04-18 17:01       ` Ackerley Tng
  2 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2023-04-17 17:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On 17.04.23 18:40, Sean Christopherson wrote:
> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>>> I want to start referring to the code/patches by its syscall/implementation name
>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>> since there's nothing in the code that mentions "UPM" in any way.
>>>
>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>> already used to refer to "reserved memory".
>>>
>>> Renaming the syscall to "guardedmem"...
>>
>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> 
> I'm definitely open to other suggestions, but I suspect it's going to be difficult
> to be more precise than something like "guarded".

Guardedmem is just as bad as restrictedmem IMHO, sorry.


Restricted: what's restricted? how does the restriction manifest? 
secretmem also has its restrictions/limitations (pinning), why does 
that one not fall under the same category?

Make a stranger guess what "restrictedmem" is and I can guarantee that 
it has nothing to do with the concept we're introducing here.


Guarded: what's guarded? From whom? For which purpose? How does the 
"guarding" manifest?

Again, make a stranger guess what "guardedmem" is and I can guarantee 
that it has nothing to do with the concept we're introducing here.

If, at all, the guess might be "guarded storage" [1] on s390x, which, of 
course, has nothing to do with the concept here. (storage on s390x is 
just the dinosaur slang for memory)


Often, if we fail to find a good name, the concept is either unclear or 
not well defined.

So what are the characteristics we want to generalize under that new 
name? We want to have an fd, that

(a) cannot be mapped into user space (mmap)
(b) cannot be accessed using ordinary system calls (read/write)
(c) can still be managed like other fds (fallocate, future NUMA
     policies?)
(d) can be consumed by some special entities that are allowed to
     read/write/map.

So the fd content is inaccessible using the ordinary POSIX syscalls. 
It's only accessible by special entities (e.g., KVM).

Most probably I am forgetting something. But maybe that will help to 
find a more expressive name. Maybe :)

> 
> E.g. we discussed "unmappable" at one point, but the memory can still be mapped,
> just not via mmap().  And it's not just about mappings, e.g. read() and its many
> variants are all disallowed too, despite the kernel direct map still being live
> (modulo SNP requirements).
> 

[1] https://man.archlinux.org/man/s390_guarded_storage.2.en

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 16:40     ` Sean Christopherson
  2023-04-17 17:09       ` David Hildenbrand
@ 2023-04-17 17:11       ` Ackerley Tng
  2023-04-17 18:17         ` Sean Christopherson
  2023-04-18 17:01       ` Ackerley Tng
  2 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-04-17 17:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>> > I want to start referring to the code/patches by its  
>> syscall/implementation name
>> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the  
>> broader effort
>> > and not just the non-KVM code, and (c) will likely be confusing for  
>> future reviewers
>> > since there's nothing in the code that mentions "UPM" in any way.
>> >
>> > But typing out restrictedmem is quite tedious, and git grep shows  
>> that "rmem" is
>> > already used to refer to "reserved memory".
>> >
>> > Renaming the syscall to "guardedmem"...

>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask  
>> me ...

> I'm definitely open to other suggestions, but I suspect it's going to be  
> difficult
> to be more precise than something like "guarded".

> E.g. we discussed "unmappable" at one point, but the memory can still be  
> mapped,
> just not via mmap().  And it's not just about mappings, e.g. read() and  
> its many
> variants are all disallowed too, despite the kernel direct map still  
> being live
> (modulo SNP requirements).

I'm for renaming the concept because restrictedmem is quite a
mouthful. :)

How about "concealedmem" or "obscuredmem" to highlight the idea of this
memory being hidden/unreadable/unmappable from userspace?

Guarded is better than restricted but doesn't really highlight how/in
what way it is being guarded.


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 17:11       ` Ackerley Tng
@ 2023-04-17 18:17         ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-17 18:17 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel

On Mon, Apr 17, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its
> > > syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to
> > > the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing
> > > for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > >
> > > > But typing out restrictedmem is quite tedious, and git grep shows
> > > that "rmem" is

Your mail client appears to be wrapping too aggressively and mangling quotes.  I'm
guessing gmail is to blame?

> > > > already used to refer to "reserved memory".
> > > >
> > > > Renaming the syscall to "guardedmem"...
> 
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask
> > > me ...
> 
> > I'm definitely open to other suggestions, but I suspect it's going to be
> > difficult
> > to be more precise than something like "guarded".
> 
> > E.g. we discussed "unmappable" at one point, but the memory can still be
> > mapped,
> > just not via mmap().  And it's not just about mappings, e.g. read() and
> > its many
> > variants are all disallowed too, despite the kernel direct map still
> > being live
> > (modulo SNP requirements).
> 
> I'm for renaming the concept because restrictedmem is quite a
> mouthful. :)
> 
> How about "concealedmem" or "obscuredmem" to highlight the idea of this
> memory being hidden/unreadable/unmappable from userspace?

I'm hesitant to use something like "concealed" because it's too close to secretmem,
e.g. it might be misconstrued as concealing the memory from anything _but_ the process
that creates the file.

Obscured has similar problems, and obscure often suggests that something is unclear,
as opposed to outright unreachable.

The other aspect of hidden/concealed/etc is that the memory isn't necessarily
concealed from the user.  Long term, I hope to get to the point where even "normal"
VMs use restricted/guarded/??? memory, e.g. to guard (heh) against _unintentional_
access from userspace.  In that use case, the memory isn't truly concealed, especially
if the user is both the "admin" and the consumer.

Though by that argument, "guarded" is also a poor choice.  And now I'm remembering
how we ended up with "restricted"...

> Guarded is better than restricted but doesn't really highlight how/in
> what way it is being guarded.

Ya, though in practice I think it's infeasible for us to get a name that is both
precise and succinct.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 17:09       ` David Hildenbrand
@ 2023-04-17 19:16         ` Sean Christopherson
  2023-04-18  8:53           ` Fuad Tabba
  2023-04-18  9:10           ` David Hildenbrand
  0 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-17 19:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On Mon, Apr 17, 2023, David Hildenbrand wrote:
> On 17.04.23 18:40, Sean Christopherson wrote:
> > On Mon, Apr 17, 2023, David Hildenbrand wrote:
> > > On 17.04.23 17:40, Sean Christopherson wrote:
> > > > I want to start referring to the code/patches by its syscall/implementation name
> > > > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > > > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > > > since there's nothing in the code that mentions "UPM" in any way.
> > > > 
> > > > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > > > already used to refer to "reserved memory".
> > > > 
> > > > Renaming the syscall to "guardedmem"...
> > > 
> > > restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> > 
> > I'm definitely open to other suggestions, but I suspect it's going to be difficult
> > to be more precise than something like "guarded".
> 
> Guardedmem is just as bad as restrictedmem IMHO, sorry.
> 
> 
> Restricted: what's restricted? how does the restriction manifest? secretmem
> also has its restrictions/limitations (pinning), why does that one not fall
> under the same category?
> 
> Make a stranger guess what "restrictedmem" is and I can guarantee that it
> has nothing to do with the concept we're introducing here.
> 
> 
> Guarded: what's guarded? From whom? For which purpose? How does the
> "guarding" manifest?

I completely agree that "guarded" lacks precision, but as above, I've pretty much
given up hope of being super precise.  I actually like "restricted", I just don't
like that I can't shorten the name.

Hmm, maybe that won't be a huge problem in practice.  I can't say I've ever heard
any use "rmem" in verbale or written communication, it's primarily just "rmem" in
code that we can't use, and I don't mind having to use restrictedmem for the namespace.
So maybe we can use "rmem", just not in code?

Or, we could pretend we're pirates and call it arrrmem!, which is definitely going
to be how I refer to it in my internal dialogue if we keep "restricted" :-)

> Again, make a stranger guess what "guardedmem" is and I can guarantee that
> it has nothing to do with the concept we're introducing here.
> 
> If, at all, the guess might be "guarded storage" [1] on s390x, which, of
> course, has nothing to do with the concept here.

Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst.

> (storage on s390x is just the dinosaur slang for memory)
> 
> 
> Often, if we fail to find a good name, the concept is either unclear or not
> well defined.
> 
> So what are the characteristics we want to generalize under that new name?
> We want to have an fd, that
> 
> (a) cannot be mapped into user space (mmap)
> (b) cannot be accessed using ordinary system calls (read/write)
> (c) can still be managed like other fds (fallocate, future NUMA
>     policies?)
> (d) can be consumed by some special entities that are allowed to
>     read/write/map.
> 
> So the fd content is inaccessible using the ordinary POSIX syscalls. It's
> only accessible by special entities (e.g., KVM).
> 
> Most probably I am forgetting something. But maybe that will help to find a
> more expressive name. Maybe :)

Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
and depending on the use case, the memory may not actually be concealed from the
user that controls the VMM.

Restricted - "rmem" collides with "reserved memory" in code.

Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.

Inaccessible - Many of the same problems as "hidden".

Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
the memory is still mappable, just not via mmap().

Secured - I'm not getting anywhere near this one :-)

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 19:16         ` Sean Christopherson
@ 2023-04-18  8:53           ` Fuad Tabba
  2023-04-18  9:10           ` David Hildenbrand
  1 sibling, 0 replies; 398+ messages in thread
From: Fuad Tabba @ 2023-04-18  8:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On Mon, Apr 17, 2023 at 8:16 PM Sean Christopherson <seanjc@google.com> wrote:
 ....

> > So the fd content is inaccessible using the ordinary POSIX syscalls. It's
> > only accessible by special entities (e.g., KVM).
> >
> > Most probably I am forgetting something. But maybe that will help to find a
> > more expressive name. Maybe :)
>
> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> and depending on the use case, the memory may not actually be concealed from the
> user that controls the VMM.
>
> Restricted - "rmem" collides with "reserved memory" in code.
>
> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>
> Inaccessible - Many of the same problems as "hidden".
>
> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> the memory is still mappable, just not via mmap().
>
> Secured - I'm not getting anywhere near this one :-)

How about "protected" ;)? _ducks_

To me the name doesn't matter much, but fwiw I have developed a liking
to "restricted", more than the previous "private", since of all of the
one-word suggestions I think it captures most of what it's trying to
do.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 19:16         ` Sean Christopherson
  2023-04-18  8:53           ` Fuad Tabba
@ 2023-04-18  9:10           ` David Hildenbrand
  2023-04-19  0:47             ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2023-04-18  9:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On 17.04.23 21:16, Sean Christopherson wrote:
> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 18:40, Sean Christopherson wrote:
>>> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>>
>>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>>> already used to refer to "reserved memory".
>>>>>
>>>>> Renaming the syscall to "guardedmem"...
>>>>
>>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>>>
>>> I'm definitely open to other suggestions, but I suspect it's going to be difficult
>>> to be more precise than something like "guarded".
>>
>> Guardedmem is just as bad as restrictedmem IMHO, sorry.
>>
>>
>> Restricted: what's restricted? how does the restriction manifest? secretmem
>> also has its restrictions/limitations (pinning), why does that one not fall
>> under the same category?
>>
>> Make a stranger guess what "restrictedmem" is and I can guarantee that it
>> has nothing to do with the concept we're introducing here.
>>
>>
>> Guarded: what's guarded? From whom? For which purpose? How does the
>> "guarding" manifest?
> 
> I completely agree that "guarded" lacks precision, but as above, I've pretty much
> given up hope of being super precise.  I actually like "restricted", I just don't
> like that I can't shorten the name.
> 
> Hmm, maybe that won't be a huge problem in practice.  I can't say I've ever heard
> any use "rmem" in verbale or written communication, it's primarily just "rmem" in
> code that we can't use, and I don't mind having to use restrictedmem for the namespace.
> So maybe we can use "rmem", just not in code?
> 
> Or, we could pretend we're pirates and call it arrrmem!, which is definitely going
> to be how I refer to it in my internal dialogue if we keep "restricted" :-)

:)

> 
>> Again, make a stranger guess what "guardedmem" is and I can guarantee that
>> it has nothing to do with the concept we're introducing here.
>>
>> If, at all, the guess might be "guarded storage" [1] on s390x, which, of
>> course, has nothing to do with the concept here.
> 
> Oof, and guarded storage is even documented in Documentation/virt/kvm/api.rst.
> 
>> (storage on s390x is just the dinosaur slang for memory)
>>
>>
>> Often, if we fail to find a good name, the concept is either unclear or not
>> well defined.
>>
>> So what are the characteristics we want to generalize under that new name?
>> We want to have an fd, that
>>
>> (a) cannot be mapped into user space (mmap)
>> (b) cannot be accessed using ordinary system calls (read/write)
>> (c) can still be managed like other fds (fallocate, future NUMA
>>      policies?)
>> (d) can be consumed by some special entities that are allowed to
>>      read/write/map.
>>
>> So the fd content is inaccessible using the ordinary POSIX syscalls. It's
>> only accessible by special entities (e.g., KVM).
>>
>> Most probably I am forgetting something. But maybe that will help to find a
>> more expressive name. Maybe :)
> 
> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> and depending on the use case, the memory may not actually be concealed from the
> user that controls the VMM.
> 
> Restricted - "rmem" collides with "reserved memory" in code.
> 
> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
> 
> Inaccessible - Many of the same problems as "hidden".
> 
> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> the memory is still mappable, just not via mmap().
> 
> Secured - I'm not getting anywhere near this one :-)

The think about "secretmem" that I kind-of like (a little) is that it 
explains what it's used for: storing secrets. We don't call it 
"unmapped" memory because we unmap it from the directmap or "unpinnable" 
memory or "inaccessible" memory ... or even "restricted" because it has 
restrictions ... how the secrets are protected is kind of an 
implementation detail.

So instead of describing *why*/*how* restrictedmem is the weird kid 
(restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather 
describe what it's used for?

I know, I know, "there are other use cases where it will be used outside 
of VM context". I really don't care. "memfd_vm" / "vm_mem" would be sooo 
(feel free to add some more o's here) much easier to get. It's a special 
fd to be used to back VM memory. Depending on the VM type 
(encrypted/protected/whatever), restrictions might apply (not able to 
mmap, not able to read/write ...). For example, there really is no need 
to disallow mmap/read/write when using that memory to back a simple VM 
where all we want to do is avoid user-space page tables.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 16:40     ` Sean Christopherson
  2023-04-17 17:09       ` David Hildenbrand
  2023-04-17 17:11       ` Ackerley Tng
@ 2023-04-18 17:01       ` Ackerley Tng
  2 siblings, 0 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-04-18 17:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Apr 17, 2023, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>> > I want to start referring to the code/patches by its  
>> syscall/implementation name
>> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the  
>> broader effort
>> > and not just the non-KVM code, and (c) will likely be confusing for  
>> future reviewers
>> > since there's nothing in the code that mentions "UPM" in any way.
>> >
>> > But typing out restrictedmem is quite tedious, and git grep shows  
>> that "rmem" is
>> > already used to refer to "reserved memory".
>> >
>> > Renaming the syscall to "guardedmem"...

>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask  
>> me ...

> I'm definitely open to other suggestions, but I suspect it's going to be  
> difficult
> to be more precise than something like "guarded".

> E.g. we discussed "unmappable" at one point, but the memory can still be  
> mapped,
> just not via mmap().  And it's not just about mappings, e.g. read() and  
> its many
> variants are all disallowed too, despite the kernel direct map still  
> being live
> (modulo SNP requirements).

How about "opaque"?

I think opaque captures the idea of enforced information hiding from the
user(space), and that the contents can only be manipulated via internal
(kernel) functions.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-04-14 21:08                   ` Sean Christopherson
@ 2023-04-18 23:38                     ` Ackerley Tng
  2023-04-25 23:01                       ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-04-18 23:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: chao.p.peng, xiaoyao.li, isaku.yamahata, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-arch, linux-api, linux-doc,
	qemu-devel, pbonzini, corbet, vkuznets, wanpengli, jmattson,
	joro, tglx, mingo, bp, arnd, naoya.horiguchi, linmiaohe, x86,
	hpa, hughd, jlayton, bfields, akpm, shuah, rppt, steven.price,
	mail, vbabka, vannapurve, yu.c.zhang, kirill.shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, qperret, tabba, michael.roth, mhocko, wei.w.wang

Sean Christopherson <seanjc@google.com> writes:

> On Tue, Mar 28, 2023, Chao Peng wrote:
>> On Fri, Mar 24, 2023 at 10:29:25AM +0800, Xiaoyao Li wrote:
>> > On 3/24/2023 10:10 AM, Chao Peng wrote:
>> > > On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
>> > > > On Wed, Mar 08, 2023 at 03:40:26PM +0800,
>> > > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
>> > > >
>> > > > > On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
>> > > > > > Chao Peng <chao.p.peng@linux.intel.com> writes:
>> > > > > >
>> > > > > > > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
>> > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
>> > > > > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
>> > > > > > > +{
>> > > > > > > +	if (!offset)
>> > > > > > > +		return true;
>> > > > > > > +	if (!gpa)
>> > > > > > > +		return false;
>> > > > > > > +
>> > > > > > > +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
>> > > >
>> > > > This check doesn't work as expected. For example, with offset = 2GB and
>> > > > gpa = 4GB this check fails.
>> > >
>> > > This case is expected to fail, as Sean initially suggested[*]:
>> > >    I would rather reject memslot if the gfn has lesser alignment than
>> > >    the offset. I'm totally ok with this approach _if_ there's a use case.
>> > >    Until such a use case presents itself, I would rather be conservative
>> > >    from a uAPI perspective.
>> > >
>> > > I understand that we put a tighter restriction on this, but if you see
>> > > such a restriction being a real issue for actual usage, instead of a
>> > > theoretical problem, then we can loosen the check here. At that point,
>> > > though, the code below is kind of x86 specific and may need improvement.
>> > >
>> > > BTW, in the latest code, I replaced count_trailing_zeros() with fls64():
>> > >    return !!(fls64(offset) >= fls64(gpa));
>> >
>> > wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ?

>> As the function documentation explains, here we want to return true when
>> ALIGNMENT(offset) >= ALIGNMENT(gpa), so '>=' is what we need.

>> It's worth clarifying that in Sean's original suggestion he actually
>> said the opposite: 'reject memslot if the gfn has lesser alignment than
>> the offset'. I wonder if that is really what he meant, since if
>> ALIGNMENT(offset) < ALIGNMENT(gpa), it wouldn't be possible to map
>> the page as a largepage. Consider the following config:

>>    gpa=2M, offset=1M

>> In this case KVM tries to map the gpa at 2M as a 2M hugepage, but the
>> physical page at offset 1M in private_fd cannot provide the 2M page due
>> to misalignment.

>> But as we discussed in the off-list thread, we did find a real use case
>> showing this check is too strict: QEMU immediately fails when launching
>> a guest with > 2G memory. In that case QEMU splits the guest memory
>> space into two slots:

>>    Slot#1(ram_below_4G): gpa=0x0, offset=0x0, size=2G
>>    Slot#2(ram_above_4G): gpa=4G,  offset=2G,  size=totalsize-2G

>> This strict alignment check fails for slot#2 because offset (2G) has less
>> alignment than gpa (4G). To allow this, one solution is to revert to my
>> previous change in kvm_alloc_memslot_metadata() and disallow hugepages
>> only when the offset/gpa are not aligned to the relevant page size.

>> Sean, what do you think?

> I agree, a pure alignment check is too restrictive, and not really what I
> intended despite past me literally saying that's what I wanted :-)  I think
> I may have also inverted the "less alignment" statement, but luckily I
> believe that ends up being a moot point.

> The goal is to avoid having to juggle scenarios where KVM wants to create a
> hugepage, but restrictedmem can't provide one because of a misaligned file
> offset.  I think the rule we want is that the offset must be aligned to the
> largest page size allowed by the memslot _size_.  E.g. on x86, if the
> memslot size is >=1GiB then the offset must be 1GiB or better, ditto
> for >=2MiB and >=4KiB (ignoring that 4KiB is already a requirement).

> We could loosen that to say the largest size allowed by the memslot, but I
> don't think that's worth the effort unless it's trivially easy to implement
> in code, e.g. KVM could technically allow a 4KiB aligned offset if the
> memslot is 2MiB sized but only 4KiB aligned on the GPA.  I doubt there's a
> real use case for such a memslot, so I want to disallow that unless it's
> super easy to implement.

Checking my understanding here about why we need this alignment check:

When KVM requests a page from restrictedmem, KVM will provide an offset
into the file in terms of 4K pages.

When shmem is configured to use hugepages, shmem_get_folio() will round
the requested offset down to the nearest hugepage-aligned boundary in
shmem_alloc_hugefolio().

Example of problematic configuration provided to
KVM_SET_USER_MEMORY_REGION2:

+ shmem configured to use 1GB pages
+ restrictedmem_offset provided to KVM_SET_USER_MEMORY_REGION2: 0x4000
+ memory_size provided in KVM_SET_USER_MEMORY_REGION2: 1GB
+ KVM requests offset (pgoff_t) 0x8, which translates to offset 0x8000

restrictedmem_get_page() and shmem_get_folio() return the page for
offset 0x0 in the file, since rounding 0x8000 down to the nearest 1GB gives
0x0. This is allocating outside the range that KVM is supposed to use,
since the parameters provided in KVM_SET_USER_MEMORY_REGION2 are only
supposed to cover offsets 0x4000 to (0x4000 + 1GB = 0x40004000) in the file.

IIUC shmem will actually just round down (0x4000 rounded down to nearest
1GB will be 0x0) and allocate without checking bounds, so if offset 0x0
to 0x4000 in the file were supposed to be used by something else, there
might be issues.

Hence, this alignment check ensures that rounding down of any offsets
provided by KVM (based on page size configured in the backing file
provided) to restrictedmem_get_page() must not go below the offset
provided to KVM_SET_USER_MEMORY_REGION2.

Enforcing alignment of restrictedmem_offset based on the currently-set
page size in the backing file (i.e. shmem) may not be effective, since
the size of the pages in the backing file can be adjusted to a larger
size after KVM_SET_USER_MEMORY_REGION2 succeeds. With that, we may still
end up allocating outside the range that KVM was provided with.

Hence, to be safe, we should check alignment to the max page size across
all backing filesystems, so the constraint is

     rounding down restrictedmem_offset to
     min(max page size across all backing filesystems,
         max page size that fits in memory_size) == restrictedmem_offset

which is the same check as

     restrictedmem_offset must be aligned to min(max page size across all
     backing filesystems, max page size that fits in memory_size)

which can safely reduce to

     restrictedmem_offset must be aligned to max page size that fits in
     memory_size

since "max page size that fits in memory_size" is probably <= to "max
page size across all backing filesystems", and if it's larger, it'll
just be a tighter constraint.

If the above understanding is correct:

+ We must enforce this in the KVM_SET_USER_MEMORY_REGION2 handler, since
   IIUC shmem will just round down and allocate without checking bounds.

     + I think this is okay because holes in the restrictedmem file (in
       terms of offset) made to accommodate this constraint don't cost us
       anything anyway(?) Are they just arbitrary offsets in a file? In
       our case, this file is usually a new and empty file.

     + In the case of migration of a restrictedmem file between two KVM
       VMs, this constraint would cause a problem if the largest
       possible page size on the destination machine is larger than that
       of the source machine. In that case, we might have to move the
       data in the file to a different offset (a separate problem).

+ On this note, it seems like there is no check for when the range is
   smaller than the allocated page? Like if the range provided is 4KB in
   size, but shmem is then configured to use a 1GB page, will we end up
   allocating past the end of the range?
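
For concreteness, a minimal sketch of the rule discussed above: align the
file offset to the largest x86 page size that fits in the memslot size.
Names here are purely illustrative (this is not actual KVM code); SZ_* and
IS_ALIGNED() are the usual kernel helpers:

static bool rmem_offset_ok(u64 offset, u64 memslot_size)
{
	u64 align;

	/* Largest page size that fits in the memslot size. */
	if (memslot_size >= SZ_1G)
		align = SZ_1G;
	else if (memslot_size >= SZ_2M)
		align = SZ_2M;
	else
		align = SZ_4K;

	/* The offset must be at least as aligned as that page size. */
	return IS_ALIGNED(offset, align);
}

With this rule, QEMU's slot#2 above (offset=2G) passes, since a 2 GiB offset
is aligned to both 2 MiB and 1 GiB, while e.g. offset=1M with a >=2M slot
would still be rejected.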

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-18  9:10           ` David Hildenbrand
@ 2023-04-19  0:47             ` Sean Christopherson
  2023-04-19  7:21               ` David Hildenbrand
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-04-19  0:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On Tue, Apr 18, 2023, David Hildenbrand wrote:
> On 17.04.23 21:16, Sean Christopherson wrote:
> > Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
> > and depending on the use case, the memory may not actually be concealed from the
> > user that controls the VMM.
> > 
> > Restricted - "rmem" collides with "reserved memory" in code.
> > 
> > Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
> > 
> > Inaccessible - Many of the same problems as "hidden".
> > 
> > Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
> > the memory is still mappable, just not via mmap().
> > 
> > Secured - I'm not getting anywhere near this one :-)
> 
> The thing about "secretmem" that I kind-of like (a little) is that it
> explains what it's used for: storing secrets. We don't call it "unmapped"
> memory because we unmap it from the directmap or "unpinnable" memory or
> "inaccessible" memory ... or even "restricted" because it has restrictions
> ... how the secrets are protected is kind of an implementation detail.
> 
> So instead of describing *why*/*how* restrictedmem is the weird kid
> (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
> describe what it's used for?
> 
> I know, I know, "there are other use cases where it will be used outside of
> VM context". I really don't care.

Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-)

> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
> much easier to get. It's a special fd to be used to back VM memory. Depending
> on the VM type (encrypted/protected/whatever), restrictions might apply (not
> able to mmap, not able to read/write ...). For example, there really is no
> need to disallow mmap/read/write when using that memory to back a simple VM
> where all we want to do is avoid user-space page tables.

In seriousness, I do agree with Jason's very explicit objection[2] against naming
a non-KVM uAPI "guest", or any variation thereof.

An alternative that we haven't considered since the very early days is making the
uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall.  Looking at the
code for "pure shim" implementation[3], that's actually not that crazy of an idea.

I don't know that I love the idea of burying this in KVM, but there are benefits
to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
that I created).

The big benefit is that the layer of indirection goes away.  That simplifies things
like enhancing restrictedmem to allow host userspace access for debug purposes,
batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
access, likely the whole "share with a device" story if/when we get there, etc.

The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
that's likely to be true in practice regardless of where the code lands, and
(b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
someone reinventing a similar solution.

If we can get Gunyah on board and they don't need substantial changes to the
restrictedmem implementation, then I'm all for continuing on the path we're on.
But if Gunyah wants to do their own thing, and the lightweight shim approach is
viable, then it's awfully tempting to put this all behind a KVM ioctl().

[1] https://lore.kernel.org/all/df11d753-6242-8f7c-cb04-c095f68b41fa@redhat.com
[2] https://lore.kernel.org/all/20211123171723.GD5112@ziepe.ca
[3] https://lore.kernel.org/all/ZDiCG%2F7OgDI0SwMR@google.com
[4] https://lore.kernel.org/all/Y%2FkI66qQFJJ6bkTq@google.com
[5] https://lore.kernel.org/all/20230304010632.2127470-13-quic_eberman@quicinc.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-19  0:47             ` Sean Christopherson
@ 2023-04-19  7:21               ` David Hildenbrand
  2023-04-19 15:17                 ` Sean Christopherson
  2023-04-22  1:33                 ` Sean Christopherson
  0 siblings, 2 replies; 398+ messages in thread
From: David Hildenbrand @ 2023-04-19  7:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On 19.04.23 02:47, Sean Christopherson wrote:
> On Tue, Apr 18, 2023, David Hildenbrand wrote:
>> On 17.04.23 21:16, Sean Christopherson wrote:
>>> Hidden/Concealed/etc - Too close to secretmem, suffers the "hidden from whom" problem,
>>> and depending on the use case, the memory may not actually be concealed from the
>>> user that controls the VMM.
>>>
>>> Restricted - "rmem" collides with "reserved memory" in code.
>>>
>>> Guarded - Conflicts with s390's "guarded storage", has the "from whom" problem.
>>>
>>> Inaccessible - Many of the same problems as "hidden".
>>>
>>> Unmappable - Doesn't cover things like read/write, and is wrong in the sense that
>>> the memory is still mappable, just not via mmap().
>>>
>>> Secured - I'm not getting anywhere near this one :-)
>>
>> The thing about "secretmem" that I kind-of like (a little) is that it
>> explains what it's used for: storing secrets. We don't call it "unmapped"
>> memory because we unmap it from the directmap or "unpinnable" memory or
>> "inaccessible" memory ... or even "restricted" because it has restrictions
>> ... how the secrets are protected is kind of an implementation detail.
>>
>> So instead of describing *why*/*how* restrictedmem is the weird kid
>> (restricted/guarded/hidden/restricted/inaccessible/ ...), maybe rather
>> describe what it's used for?
>>
>> I know, I know, "there are other use cases where it will be used outside of
>> VM context". I really don't care.
> 
> Heh, we originally proposed F_SEAL_GUEST, but that was also sub-optimal[1] ;-)
> 
>> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
>> much easier to get. It's a special fd to be used to back VM memory. Depending
>> on the VM type (encrypted/protected/whatever), restrictions might apply (not
>> able to mmap, not able to read/write ...). For example, there really is no
>> need to disallow mmap/read/write when using that memory to back a simple VM
>> where all we want to do is avoid user-space page tables.
> 
> In seriousness, I do agree with Jason's very explicit objection[2] against naming
> a non-KVM uAPI "guest", or any variation thereof.

While I agree, it's all better than the naming we use right now ...


Let me throw "tee_mem" / "memfd_tee" into the picture. That could 
eventually catch what we want to have.

Or "coco_mem" / "memfd_coco".

Of course, both expect that people know the terminology (just like what 
"vm" stands for), but it's IMHO significantly better than 
restricted/guarded/opaque/whatsoever.

Again, expresses what it's used for, not why it behaves in weird ways.


> 
> An alternative that we haven't considered since the very early days is making the
> uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall.  Looking at the
> code for "pure shim" implementation[3], that's actually not that crazy of an idea.

Yes.

> 
> I don't know that I love the idea of burying this in KVM, but there are benefits
> to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
> that I created).

Yes, it's all better than jumping through hoops to come up with a bad 
name like "restrictedmem".

> 
> The big benefit is that the layer of indirection goes away.  That simplifies things
> like enhancing restrictedmem to allow host userspace access for debug purposes,
> batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
> access, likely the whole "share with a device" story if/when we get there, etc.
> 
> The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
> that's likely to be true in practice regardless of where the code lands, and

Yes.

> (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> someone reinventing a similar solution.

I agree. But if it's as simple as providing an ioctl for that hypervisor 
that simply wires up the existing implementation, it's not too bad.

> 
> If we can get Gunyah on board and they don't need substantial changes to the
> restrictedmem implementation, then I'm all for continuing on the path we're on.
> But if Gunyah wants to do their own thing, and the lightweight shim approach is
> viable, then it's awfully tempting to put this all behind a KVM ioctl().

Right. Or we still succeed in finding a name that's not as bad as what 
we had so far.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2023-04-13 22:28     ` Sean Christopherson
  2023-04-14 22:38       ` Ackerley Tng
@ 2023-04-19  8:29       ` Christian Brauner
  2023-04-20  0:49         ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: Christian Brauner @ 2023-04-19  8:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A . Shutemov, Ackerley Tng, Chao Peng, Hugh Dickins, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Pankaj Gupta,
	linux-arch, arnd, linmiaohe, naoya.horiguchi, tabba, wei.w.wang

On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
> On Thu, Apr 13, 2023, Christian Brauner wrote:
> > On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
> > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > > Here's what I would prefer, and imagine much easier for you to maintain;
> > > > but I'm no system designer, and may be misunderstanding throughout.
> > > > 
> > > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > > > the fallocate syscall interface itself) to allocate and free the memory,
> > > > ioctl for initializing some of it too.  KVM in control of whether that
> > > > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > > > in shmem.c, no need for flags, seals, notifications to and fro because
> > > > KVM is already in control and knows the history.  If shmem actually has
> > > > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > > > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > > > add, just allocate and free kernel memory directly, recorded in your
> > > > own xarray.
> > > 
> > > I guess a shim layer on top of shmem *can* work. I don't see immediately why
> > > it would not. But I'm not sure it is the right direction. We risk creating yet
> > > another parallel VM with its own rules/locking/accounting that is opaque to
> > > core-mm.
> > 
> > Sorry for necrobumping this thread but I've been reviewing the
> 
> No worries, I'm just stoked someone who actually knows what they're doing is
> chiming in :-)

It's a dangerous business, going out of your subsystem. You step into
code, and if you don't watch your hands, there is no knowing where you
might be swept off to.

That saying goes for me here specifically...

> 
> > memfd_restricted() extension that Ackerley is currently working on. I
> > was pointed to this thread as this is what the extension is building
> > on but I'll reply to both threads here.
> > 
> > From a glance at v10, memfd_restricted() is currently implemented as an
> > in-kernel stacking filesystem. A call to memfd_restricted() creates a
> > new restricted memfd file and a new unlinked tmpfs file and stashes the
> > tmpfs file into the memfd file's private data member. It then uses the
> > tmpfs file's f_ops and i_ops to perform the relevant file and inode
> > operations. So it has the same callstack as a general stacking
> > filesystem like overlayfs in some cases:
> > 
> >         memfd_restricted->getattr()
> >         -> tmpfs->getattr()
> 
> ...
> 
> > Since you're effectively acting like a stacking filesystem you should
> > really use the device number of your memfd restricted filesystem. IOW,
> > something like:
> > 
> >         stat->dev = memfd_restricted_dentry->d_sb->s_dev;
> > 
> > But then you run into trouble if you want to go forward with Ackerley's
> > extension that allows to explicitly pass in tmpfs fds to
> > memfd_restricted(). Afaict, two tmpfs instances might allocate the same
> > inode number. So now the inode and device number pair isn't unique
> > anymore.
> > 
> > So you might best be served by allocating and reporting your own inode
> > numbers as well.
> > 
> > But if you want to preserve the inode number and device number of the
> > relevant tmpfs instance but still report memfd restricted as your
> > filesystem type
> 
> Unless I missed something along the way, reporting memfd_restricted as a distinct
> filesystem is very much a non-goal.  AFAIK it's purely a side effect of the
> proposed implementation.

In the current implementation you would have to put in effort to fake
this. For example, you would also need to implement the ->statfs
super_operation, where you'd fill in the details of the tmpfs
instance. At that point, all that memfd_restricted fs code that you've
written is nothing but deadweight, I would reckon.
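
As a rough sketch of what that faking would involve (the struct and field
names below are hypothetical, assuming the stashed tmpfs file is reachable
from the superblock's s_fs_info):

static int restrictedmem_statfs(struct dentry *dentry, struct kstatfs *buf)
{
	/* 'data->memfd' is the stashed tmpfs file backing this instance. */
	struct restrictedmem_data *data = dentry->d_sb->s_fs_info;

	/* Forward to the underlying tmpfs so userspace sees its details. */
	return vfs_statfs(&data->memfd->f_path, buf);
}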

> 
> > then I think it's reasonable to ask whether a stacking implementation really
> > makes sense here.
> > 
> > If you extend memfd_restricted() or even consider extending it in the
> > future to take tmpfs file descriptors as arguments to identify the tmpfs
> > instance in which to allocate the underlying tmpfs file for the new
> > restricted memfd file you should really consider a tmpfs based
> > implementation.
> > 
> > Because at that point it just feels like a pointless wrapper to get
> > custom f_ops and i_ops. Plus it's wasteful because you allocate dentries
> > and inodes that you don't really care about at all.
> > 
> > Just off the top of my hat you might be better served:
> > * by a new ioctl() on tmpfs instances that
> >   yield regular tmpfs file descriptors with restricted f_ops and i_ops.
> >   That's not that different from btrfs subvolumes which effectively are
> >   directories but are created through an ioctl().
> 
> I think this is more or less what we want to do, except via a dedicated syscall
> instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs,
> e.g. so that it can be extended to other backing types in the future.

Ok. But just to point this out, this would make memfd_restricted()
a multiplexer on types of memory. And my wild guess is that not all
memory types you might reasonably want to use will have a filesystem-like
interface. So in the future you might end up with multiple
ways of specifying the type of memory:

// use tmpfs backing
memfd_restricted(fd_tmpfs, 0);

// use hugetlbfs backing
memfd_restricted(fd_hugetlbfs, 0);

// use non-fs type memory backing
memfd_restricted(-EBADF, MEMFD_SUPER_FANCY_MEMORY_TYPE);

Interface-wise I find that an unpleasant design. But that multi-memory-type
goal also makes it a bit hard to come up with a clean design. (One
possibility would be to use an extensible struct - versioned by size -
similar to openat2() and clone3() - such that you can specify all types of
options on the memory in the future.)
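
A sketch of that extensible-struct option, in the style of openat2()'s
struct open_how; every name and field below is hypothetical:

struct memfd_restricted_args {
	__u64 flags;		/* e.g. a hypothetical MEMFD_RSTD_* memory-type selector */
	__s64 backing_fd;	/* optional tmpfs/hugetlbfs fd, or -1 for none */
	__u64 size;		/* initial (possibly immutable) size of the file */
};

/* Userspace passes sizeof(args), so new fields can be appended later: */
fd = memfd_restricted2(&args, sizeof(args));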

> 
> > * by a mount option to tmpfs that makes it act
> >   in this restricted manner; then you don't need an ioctl() and can get
> >   away with regular open calls. Such a tmpfs instance would only create
> >   regular, restricted memfds.
> 
> I'd prefer to not go this route, because IIUC, it would require relatively invasive
> changes to shmem code, and IIUC would require similar changes to other supported
> backings in the future, e.g. hugetlbfs?  And as above, I don't think any of the
> potential use cases need restrictedmem to be a uniquely identifiable mount.

Ok, see my comment above then.

> 
> One of the goals (hopefully not a pipe dream) is to design restrictedmem in such a
> way that extending it to support other backing types isn't terribly difficult.

Not necessarily difficult, just difficult to do tastefully imho. But
it's not like that has traditionally held people back. ;)

> In case it's not obvious, most of us working on this stuff aren't filesystems
> experts, and many of us aren't mm experts either.  The more we (KVM folks for the
> most part) can leverage existing code to do the heavy lifting, the better.

Well, hopefully we can complement each other's knowledge here.

> 
> After giving myself a bit of a crash course in file systems, would something like
> the below have any chance of (a) working, (b) getting merged, and (c) being
> maintainable?
> 
> The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem
> hijacks a f_ops and a_ops to create a lightweight shim around tmpfs.  There are
> undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might
> be doable" or a "no, that's absolutely bonkers, don't try it".

Maybe, but I think it's weird. _Replacing_ f_ops isn't something that's
unprecedented. It happens every time a character device is opened (see
fs/char_dev.c:chrdev_open()). And debugfs does a similar (much more
involved) thing where it replaces its proxy f_ops with the relevant
subsystem's f_ops. The difference is that in both cases the replacement
happens at ->open() time, and it is done only once. Afterwards only
the newly added f_ops are relevant.

In your case you'd be keeping two sets of {f,a}_ops; one usable by
userspace and another only usable by in-kernel consumers. And there are
some concerns (non-exhaustive list), I think:

* {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is
  authoritative per @file and it is left to the individual subsystems to
  maintain driver specific ops (see the sunrpc stuff or sockets).
* lifetime management for the two sets of {f,a}_ops: If the ops belong
  to a module then you need to make sure that the module can't get
  unloaded while you're using the fops. Might not be a concern in this
  case.
* brittleness: Not all f_ops, for example, deal with userspace
  functionality; some deal with cleanup when the file is closed, like
  ->release(). So it's delicate to override that functionality with
  custom f_ops. Restricted memfds could easily forget to clean up
  resources.
* Potential for confusion why there's two sets of {f,a}_ops.
* f_ops specifically are generic across a vast amount of consumers and
  are subject to change. If memfd_restricted() has specific requirements
  because of this weird double-use they won't be taken into account.

I find this hard to navigate tbh and it feels like taking a shortcut to
avoid building a proper api. If you only care about a specific set of
operations, specific to memfd_restricted(), that needs to be available to
in-kernel consumers, I wonder if you shouldn't just go one step further
than your proposal below and build a dedicated minimal ops api. Idk,
sketching like a madman on a drawing board here with no claim to
feasibility from a mm perspective whatsoever:

struct restrictedmem_ops {
	// only contains very limited stuff you need or special stuff
	// you need, similar to struct proto_ops (sockets) and so on
};

struct restrictedmem {
	const struct restrictedmem_ops ops;
};

This would avoid futzing with two different sets of {f,a}_ops in this
brittle way. It would force you to clarify the semantics that you need
and the operations that you need or don't need implemented. And it would
get rid of the ambiguity inherent in using two sets of {f,a}_ops.
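
To illustrate what such a minimal api could look like from the consumer
side, filling in the placeholder struct above with hypothetical members
(none of these names exist, it's just a sketch):

struct restrictedmem_ops {
	/* hand back a pinned page for a page-sized offset into the fd */
	int (*get_page)(struct restrictedmem *rm, pgoff_t index,
			struct page **pagep, int *order);
	/* free the pages backing [start, start + nr) and notify consumers */
	int (*punch_hole)(struct restrictedmem *rm, pgoff_t start, pgoff_t nr);
};

/* KVM side, resolving a private gfn to a pfn without any f_ops involved: */
static int kvm_restrictedmem_get_pfn(struct restrictedmem *rm, pgoff_t index,
				     kvm_pfn_t *pfn, int *order)
{
	struct page *page;
	int ret;

	ret = rm->ops.get_page(rm, index, &page, order);
	if (ret)
		return ret;

	*pfn = page_to_pfn(page);
	return 0;
}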

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-19  7:21               ` David Hildenbrand
@ 2023-04-19 15:17                 ` Sean Christopherson
  2023-04-19 15:27                   ` David Hildenbrand
  2023-04-22  1:33                 ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-04-19 15:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > On Tue, Apr 18, 2023, David Hildenbrand wrote:
> > > "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
> > > much easier to get. It's a special fd to be used to back VM memory. Depending
> > > on the VM type (encrypted/protected/whatever), restrictions might apply (not
> > > able to mmap, not able to read/write ...). For example, there really is no
> > > need to disallow mmap/read/write when using that memory to back a simple VM
> > > where all we want to do is avoid user-space page tables.
> > 
> > In seriousness, I do agree with Jason's very explicit objection[2] against naming
> > a non-KVM uAPI "guest", or any variation thereof.
> 
> While I agree, it's all better than the naming we use right now ...
> 
> 
> Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually
> catch what we want to have.
> 
> Or "coco_mem" / "memfd_coco".
> 
> Of course, both expect that people know the terminology (just like what "vm"
> stands for), but it's IMHO significantly better than
> restricted/guarded/opaque/whatsoever.
> 
> Again, expresses what it's used for, not why it behaves in weird ways.

I don't want to explicitly tie this to trusted execution or confidential compute,
as there is value in backing "normal" guests with memory that cannot be accessed
by the host userspace without jumping through a few extra hoops, e.g. to add a
layer of protection against data corruption due to host userspace bugs.

> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> > someone reinventing a similar solution.
> 
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.

Yeah, my mind was wandering in this direction too.  The absolute worst case
scenario seems to be that we do end up creating a generic syscall that is a
superset of KVM's functionality, in which case KVM would end up with an ioctl()
that is just a redirect/wrapper.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-19 15:17                 ` Sean Christopherson
@ 2023-04-19 15:27                   ` David Hildenbrand
  0 siblings, 0 replies; 398+ messages in thread
From: David Hildenbrand @ 2023-04-19 15:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel

On 19.04.23 17:17, Sean Christopherson wrote:
> On Wed, Apr 19, 2023, David Hildenbrand wrote:
>> On 19.04.23 02:47, Sean Christopherson wrote:
>>> On Tue, Apr 18, 2023, David Hildenbrand wrote:
>>>> "memfd_vm" / "vm_mem" would be sooo (feel free to add some more o's here)
>>>> much easier to get. It's a special fd to be used to back VM memory. Depending
>>>> on the VM type (encrypted/protected/whatever), restrictions might apply (not
>>>> able to mmap, not able to read/write ...). For example, there really is no
>>>> need to disallow mmap/read/write when using that memory to back a simple VM
>>>> where all we want to do is avoid user-space page tables.
>>>
>>> In seriousness, I do agree with Jason's very explicit objection[2] against naming
>>> a non-KVM uAPI "guest", or any variation thereof.
>>
>> While I agree, it's all better than the naming we use right now ...
>>
>>
>> Let me throw "tee_mem" / "memfd_tee" into the picture. That could eventually
>> catch what we want to have.
>>
>> Or "coco_mem" / "memfd_coco".
>>
>> Of course, both expect that people know the terminology (just like what "vm"
>> stands for), but it's IMHO significantly better than
>> restricted/guarded/opaque/whatsoever.
>>
>> Again, expresses what it's used for, not why it behaves in weird ways.
> 
> I don't want to explicitly tie this to trusted execution or confidential compute,
> as there is value in backing "normal" guests with memory that cannot be accessed
> by the host userspace without jumping through a few extra hoops, e.g. to add a
> layer of protection against data corruption due to host userspace bugs.

Nothing speaks against using tee_mem for the same purpose I guess. I 
like the sound of it after all. :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2023-04-19  8:29       ` Christian Brauner
@ 2023-04-20  0:49         ` Sean Christopherson
  2023-04-20  8:35           ` Christian Brauner
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-04-20  0:49 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill A . Shutemov, Ackerley Tng, Chao Peng, Hugh Dickins, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Pankaj Gupta,
	linux-arch, arnd, linmiaohe, naoya.horiguchi, tabba, wei.w.wang

On Wed, Apr 19, 2023, Christian Brauner wrote:
> On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
> > > But if you want to preserve the inode number and device number of the
> > > relevant tmpfs instance but still report memfd restricted as your
> > > filesystem type
> > 
> > Unless I missed something along the way, reporting memfd_restricted as a distinct
> > filesystem is very much a non-goal.  AFAIK it's purely a side effect of the
> > proposed implementation.
> 
> In the current implementation you would have to put in effort to fake
> this. For example, you would need to also implement ->statfs
> super_operation where you'd need to fill in the details of the tmpfs
> instance. At that point all that memfd_restricted fs code that you've
> written is nothing but deadweight, I would reckon.

After digging a bit, I suspect the main reason Kirill implemented an overlay to
inode_operations was to prevent modifying the file size via ->setattr().  Relying
on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design,
the memory can't be mmap()'d into host userspace. 

	if (attr->ia_valid & ATTR_SIZE) {
		if (memfd->f_inode->i_size)
			return -EPERM;

		if (!PAGE_ALIGNED(attr->ia_size))
			return -EINVAL;	
	}

But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or
SHMEM_LONGPIN.  For a variety of reasons, I'm leaning more and more toward making
this a KVM ioctl() instead of a dedicated syscall, at which point we can be both
more flexible and more draconian, e.g. let userspace provide the file size at the
time of creation, but make the size immutable, at least by default.
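
For reference, the size-sealing alternative is already expressible from
userspace with the existing memfd sealing API. A minimal sketch, assuming
the fd was created with MFD_ALLOW_SEALING (how this maps onto restricted
memfds is still an open design question):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int fix_memfd_size(int memfd, off_t size)
{
	/* Set the size once... */
	if (ftruncate(memfd, size))
		return -1;
	/* ...then forbid both growing and shrinking it. */
	return fcntl(memfd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK);
}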

> > After giving myself a bit of a crash course in file systems, would something like
> > the below have any chance of (a) working, (b) getting merged, and (c) being
> > maintainable?
> > 
> > The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem
> > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs.  There are
> > undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might
> > be doable" or a "no, that's absolutely bonkers, don't try it".
> 
> Maybe, but I think it's weird.

Yeah, agreed.

> _Replacing_ f_ops isn't something that's unprecedented. It happens every time
> a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs
> does a similar (much more involved) thing where it replaces its proxy f_ops
> with the relevant subsystem's f_ops. The difference is that in both cases the
> replacement happens at ->open() time, and it is done only once. Afterwards
> only the newly added f_ops are relevant.
> 
> In your case you'd be keeping two sets of {f,a}_ops; one usable by
> userspace and another only usable by in-kernel consumers. And there are
> some concerns (non-exhaustive list), I think:
> 
> * {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is
>   authoritative per @file and it is left to the individual subsystems to
>   maintain driver specific ops (see the sunrpc stuff or sockets).
> * lifetime management for the two sets of {f,a}_ops: If the ops belong
>   to a module then you need to make sure that the module can't get
>   unloaded while you're using the fops. Might not be a concern in this
>   case.

Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?)
holding a reference to the inode?

> * brittleness: Not all f_ops, for example, deal with userspace
>   functionality; some deal with cleanup when the file is closed, like
>   ->release(). So it's delicate to override that functionality with
>   custom f_ops. Restricted memfds could easily forget to clean up
>   resources.
> * Potential for confusion why there's two sets of {f,a}_ops.
> * f_ops specifically are generic across a vast amount of consumers and
>   are subject to change. If memfd_restricted() has specific requirements
>   because of this weird double-use they won't be taken into account.
> 
> I find this hard to navigate tbh and it feels like taking a shortcut to
> avoid building a proper api.

Agreed.  At the very least, it would be better to take an explicit dependency on
whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate().
I think that gives us a clearer path to getting something merged too, as we'll
need Acks on making specific functions visible, i.e. will give MM maintainers
something concrete to react to.

> If you only care about a specific set of operations, specific to
> memfd_restricted(), that needs to be available to in-kernel consumers, I
> wonder if you shouldn't just go one step further than your proposal below
> and build a dedicated minimal ops api.

This is actually very doable for shmem.  Unless I'm missing something, because
our use case doesn't allow mmap(), swap, or migration, a good chunk of
shmem_fallocate() is simply irrelevant.  The result is only ~100 lines of code,
and quite straightforward.

My biggest concern, outside of missing a detail in shmem, is adding support for
HugeTLBFS, which is likely going to be requested/needed sooner than later.  At a
glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm
keen to duplicate.  But that's also a future problem to some extent, as it's
purely kernel internals; the uAPI side of things doesn't seem like it'll be messy
at all.

Thanks again!

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
  2023-04-20  0:49         ` Sean Christopherson
@ 2023-04-20  8:35           ` Christian Brauner
  0 siblings, 0 replies; 398+ messages in thread
From: Christian Brauner @ 2023-04-20  8:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A . Shutemov, Ackerley Tng, Chao Peng, Hugh Dickins, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, Pankaj Gupta,
	linux-arch, arnd, linmiaohe, naoya.horiguchi, tabba, wei.w.wang

On Wed, Apr 19, 2023 at 05:49:55PM -0700, Sean Christopherson wrote:
> On Wed, Apr 19, 2023, Christian Brauner wrote:
> > On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
> > > > But if you want to preserve the inode number and device number of the
> > > > relevant tmpfs instance but still report memfd restricted as your
> > > > filesystem type
> > > 
> > > Unless I missed something along the way, reporting memfd_restricted as a distinct
> > > filesystem is very much a non-goal.  AFAIK it's purely a side effect of the
> > > proposed implementation.
> > 
> > In the current implementation you would have to put in effort to fake
> > this. For example, you would need to also implement ->statfs
> > super_operation where you'd need to fill in the details of the tmpfs
> > instance. At that point all that memfd_restricted fs code that you've
> > written is nothing but deadweight, I would reckon.
> 
> After digging a bit, I suspect the main reason Kirill implemented an overlay to
> inode_operations was to prevent modifying the file size via ->setattr().  Relying
> on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design,
> the memory can't be mmap()'d into host userspace. 
> 
> 	if (attr->ia_valid & ATTR_SIZE) {
> 		if (memfd->f_inode->i_size)
> 			return -EPERM;
> 
> 		if (!PAGE_ALIGNED(attr->ia_size))
> 			return -EINVAL;	
> 	}
> 
> But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or
> SHMEM_LONGPIN.  For a variety of reasons, I'm leaning more and more toward making
> this a KVM ioctl() instead of a dedicated syscall, at which point we can be both
> more flexible and more draconian, e.g. let userspace provide the file size at the
> time of creation, but make the size immutable, at least by default.
> 
> > > After giving myself a bit of a crash course in file systems, would something like
> > > the below have any chance of (a) working, (b) getting merged, and (c) being
> > > maintainable?
> > > 
> > > The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem
> > > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs.  There are
> > > undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might
> > > be doable" or a "no, that's absolutely bonkers, don't try it".
> > 
> > Maybe, but I think it's weird.
> 
> Yeah, agreed.
> 
> > _Replacing_ f_ops isn't something that's unprecedented. It happens every time
> > a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs
> > does a similar (much more involved) thing where it replaces its proxy f_ops
> > with the relevant subsystem's f_ops. The difference is that in both cases the
> > replacement happens at ->open() time, and it is done only once. Afterwards
> > only the newly added f_ops are relevant.
> > 
> > In your case you'd be keeping two sets of {f,a}_ops; one usable by
> > userspace and another only usable by in-kernel consumers. And there are
> > some concerns (non-exhaustive list), I think:
> > 
> > * {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is
> >   authoritative per @file and it is left to the individual subsystems to
> >   maintain driver specific ops (see the sunrpc stuff or sockets).
> > * lifetime management for the two sets of {f,a}_ops: If the ops belong
> >   to a module then you need to make sure that the module can't get
> >   unloaded while you're using the fops. Might not be a concern in this
> >   case.
> 
> Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?)
> holding a reference to the inode?

I don't think it would be possible to safely replace inode_operations
after the inode's been made visible in caches.

It works with file_operations because when a file is opened a new struct
file is allocated which isn't reachable anywhere before fd_install() is
called. So it is possible to replace f_ops in the default
f->f_op->open() method (which is what devices do as the inode is located
on e.g., ext4/xfs/tmpfs but the functionality of the device is usually
provided by some driver/module through its file_operations). The default
f_ops are taken from i_fop of the inode.

The lifetime of the file_/inode_operations will be aligned with the
lifetime of the module they're originating from. If only
file_/inode_operations are used from within the same module then there
should never be any lifetime concerns.

So an inode doesn't explicitly pin file_/inode_operations because there's
usually no need to do that, and it would be weird if each new inode took
a reference on the f_ops/i_ops on the off-chance that someone _might_
open the file. Let alone the overhead of calling try_module_get()
every time a new inode is added to the cache. There are various fs
objects - e.g. the superblock, which is what pins the filesystem/module -
that exceed the lifetime of inodes and dentries. Both inodes and dentries
may also be dropped from their respective caches and re-added later.

Pinning of the module for f_ops is done because it is possible that some
filesystem/driver might want to use the file_operations of some other
filesystem/driver by default and they are in separate modules. So the
fops_get() in do_dentry_open is there because it's not guaranteed that
file_/inode_operations originate from the same module as the inode
that's opened. If the module is still alive during the open then a
reference to its f_ops is taken; if not, the open will fail with
ENODEV.

That's to the best of my knowledge.
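
A sketch of the pattern being described, modeled loosely on chrdev_open();
lookup_real_fops() is a stand-in for however the real ops are found:

static int proxy_open(struct inode *inode, struct file *filp)
{
	const struct file_operations *real_fops;
	int ret = 0;

	/* Take a module reference so the real ops can't be unloaded while
	 * this file is open; fail with ENODEV if the module is gone. */
	real_fops = fops_get(lookup_real_fops(inode));
	if (!real_fops)
		return -ENODEV;

	/* The struct file isn't visible to userspace yet (no fd_install()),
	 * so the f_op swap is safe and happens exactly once. */
	replace_fops(filp, real_fops);
	if (filp->f_op->open)
		ret = filp->f_op->open(inode, filp);
	return ret;
}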

> 
> > * brittleness: Not all f_ops, for example, deal with userspace
> >   functionality; some deal with cleanup when the file is closed, like
> >   ->release(). So it's delicate to override that functionality with
> >   custom f_ops. Restricted memfds could easily forget to clean up
> >   resources.
> > * Potential for confusion why there's two sets of {f,a}_ops.
> > * f_ops specifically are generic across a vast amount of consumers and
> >   are subject to change. If memfd_restricted() has specific requirements
> >   because of this weird double-use they won't be taken into account.
> > 
> > I find this hard to navigate tbh and it feels like taking a shortcut to
> > avoid building a proper api.
> 
> Agreed.  At the very least, it would be better to take an explicit dependency on
> whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate().
> I think that gives us a clearer path to getting something merged too, as we'll
> need Acks on making specific functions visible, i.e. will give MM maintainers
> something concrete to react to.
> 
> > If you only care about a specific set of operations, specific to
> > memfd_restricted(), that needs to be available to in-kernel consumers, I
> > wonder if you shouldn't just go one step further than your proposal below
> > and build a dedicated minimal ops api.
> 
> This is actually very doable for shmem.  Unless I'm missing something, because
> our use case doesn't allow mmap(), swap, or migration, a good chunk of
> shmem_fallocate() is simply irrelevant.  The result is only ~100 lines of code,
> and quite straightforward.
> 
> My biggest concern, outside of missing a detail in shmem, is adding support for
> HugeTLBFS, which is likely going to be requested/needed sooner than later.  At a
> glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm
> keen to duplicate.  But that's also a future problem to some extent, as it's
> purely kernel internals; the uAPI side of things doesn't seem like it'll be messy
> at all.
> 
> Thanks again!

Sure thing.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-19  7:21               ` David Hildenbrand
  2023-04-19 15:17                 ` Sean Christopherson
@ 2023-04-22  1:33                 ` Sean Christopherson
  2023-05-05 19:39                   ` Ackerley Tng
                                     ` (3 more replies)
  1 sibling, 4 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-22  1:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson,
	Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, Michael Roth, wei.w.wang, Mike Rapoport,
	Liam Merwick, Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm,
	linux-kernel, Hugh Dickins, Christian Brauner

+Christian and Hugh

On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > An alternative that we haven't considered since the very early days is making the
> > uAPI a KVM ioctl() instead of a memfd() flag or dedicated syscall.  Looking at the
> > code for "pure shim" implementation[3], that's actually not that crazy of an idea.
> 
> Yes.
> 
> > 
> > I don't know that I love the idea of burying this in KVM, but there are benefits
> > to coupling restrictedmem to KVM (aside from getting out from behind this bikeshed
> > that I created).
> 
> Yes, it's all better than jumping through hoops to come up with a bad name
> like "restrictedmem".
> 
> > 
> > The big benefit is that the layer of indirection goes away.  That simplifies things
> > like enhancing restrictedmem to allow host userspace access for debug purposes,
> > batching TLB flushes if a PUNCH_HOLE spans multiple memslots, enforcing exclusive
> > access, likely the whole "share with a device" story if/when we get there, etc.
> > 
> > The obvious downsides are that (a) maintenance falls under the KVM umbrella, but
> > that's likely to be true in practice regardless of where the code lands, and
> 
> Yes.
> 
> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5], we risk
> > someone reinventing a similar solution.
> 
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.
> 
> > 
> > If we can get Gunyah on board and they don't need substantial changes to the
> > restrictedmem implementation, then I'm all for continuing on the path we're on.
> > But if Gunyah wants to do their own thing, and the lightweight shim approach is
> > viable, then it's awfully tempting to put this all behind a KVM ioctl().
> 
> Right. Or we still succeed in finding a name that's not as bad as what we
> had so far.

Okie dokie.  So as not to bury the lede:

I think we should provide a KVM ioctl() and have KVM provide its own low-level
implementation, i.e. not wrap shmem.  As much as I don't want to have KVM serving
up memory to userspace, by trying to keep KVM out of the memory management business,
we're actually making things *more* complex and harder to maintain (and merge).

Hugh said something to this effect quite early on[1], it just unfortunately took
me a long time to understand the benefits.

 : QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
 : the fallocate syscall interface itself) to allocate and free the memory,
 : ioctl for initializing some of it too.  KVM in control of whether that
 : fd can be read or written or mmap'ed or whatever, no need to prevent it
 : in shmem.c, no need for flags, seals, notifications to and fro because
 : KVM is already in control and knows the history.  If shmem actually has
 : value, call into it underneath - somewhat like SysV SHM, and /dev/zero
 : mmap, and i915/gem make use of it underneath.  If shmem has nothing to
 : add, just allocate and free kernel memory directly, recorded in your
 : own xarray.

Christian also suggested that we stop trying to be lazy and create a proper API[2].

 : I find this hard to navigate tbh and it feels like taking a shortcut to
 : avoid building a proper api. If you only care about a specific set of
 : operations, specific to memfd_restricted(), that needs to be available to
 : in-kernel consumers, I wonder if you shouldn't just go one step further
 : than your proposal below and build a dedicated minimal ops api. Idk,
 : sketching like a madman on a drawing board here with no claim to
 : feasibility from a mm perspective whatsoever

The key point from Hugh is that, except for a few minor things that are trivially
easy to replicate, the things that make shmem "shmem" don't provide any value for
the KVM use case:

  - We have no plans to support swap, and migration support is dubious at best.
    Swap in particular brings in a lot of complexity for no benefit (to us).  That
    complexity doesn't mean depending on shmem is inherently bad, but it does mean
    rolling our own implementation is highly unlikely to result in reinventing
    shmem's wheel.

  - Sharing a backing file between arbitrary processes is _unwanted_ for the initial
    use cases.  There may come a time when mutually trusted VMs can share "private"
    data, but (a) that's a distant future problem and (b) it'll likely require even
    more KVM control over the memory.

  - All of the interfaces for read/write/mmap/etc. are dead weight from our
    perspective.  Even worse, we have to actively avoid those interfaces.  That
    can kinda sorta be done without polluting shmem code by using a shim, but
    that has problems of its own (see below).

And Christian pointed out several flaws with wrapping shmem:

  - Implementing a partial overlay filesystem leads to inconsistencies because
    only some of the ops are changed, e.g. poking at the inode_operations or
    super_operations will show shmem stuff, whereas address_space_operations and
    file_operations will show restrictedmem stuff.


  - Usurping just f_ops and a_ops without creating a full blown filesystem
    avoids the partial overlay issues, but partially overriding ops isn't really
    supported (because it's weird and brittle), e.g. blindly forwarding a
    fallocate() after restrictedmem does its thing "works", but only because
    we very carefully and deliberately designed restrictedmem on top of shmem.

On top of the points raised by Hugh and Christian, wrapping shmem isn't really
any simpler, just different.  E.g. there's a very subtle bug in the shim variant:
by passing SGP_WRITE, shmem skips zeroing the page because restrictedmem is telling
shmem that it (restrictedmem) will immediately write the page.  For TDX and SNP,
that's a "feature" because in those cases trusted firmware will zero (or init)
memory when it's assigned to the guest, but it's a nasty flaw for other use cases.

I'm not saying that we'll magically avoid such bugs by avoiding shmem, just pointing
out that using shmem requires understanding exactly how shmem works, i.e. using
shmem isn't necessarily any easier than building directly on filemap and/or folio
APIs.  And I gotta imagine it will be a similar story if/when we want to add
hugetlbfs support.  Building on filemap/folio will definitely have its own challenges,
but after prototyping an implementation I can confidently say that none of the
problems will be insurmountable.  KVM has _way_ more complex code in its memslot
and MMU implementations.
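
As a rough sketch of what the filemap-based approach can look like (this is
not the actual POC, and the exact __filemap_get_folio() calling/return
conventions vary a bit across kernel versions):

static struct folio *rmem_get_folio(struct address_space *mapping, pgoff_t index)
{
	struct folio *folio;

	/* Find the folio at @index, allocating and inserting one if needed. */
	folio = __filemap_get_folio(mapping, index, FGP_LOCK | FGP_CREAT,
				    mapping_gfp_mask(mapping));
	if (IS_ERR_OR_NULL(folio))
		return NULL;

	if (!folio_test_uptodate(folio)) {
		/*
		 * Zero explicitly, unlike the SGP_WRITE shortcut described
		 * above, so non-TDX/SNP users never see stale data.
		 */
		folio_zero_range(folio, 0, folio_size(folio));
		folio_mark_uptodate(folio);
	}

	folio_unlock(folio);
	return folio;
}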

And another benefit of building on filemap and/or folio APIs is that, because we
would be reinventing the wheel to some extent, when we do inevitably run into
problems, it will be easier to get help solving those problems because (a) we won't
be doing weird things no one wants to deal with and (b) because the problems will
likely be things others have already dealt with.

The other angle I've been looking at is whether or not having KVM provide its
own implementation will lead to maintenance problems in the future, specifically
if we get to the point where we want to support "fancy" things like swap and
migration.  For migration, I'm quite confident that a dedicated KVM ioctl() versus
wrapping shmem would be at worst a wash, and more than likely simpler if KVM owns
everything.  E.g. migrating pages under TDX and SNP requires invoking magic
instructions, and so we'd be overriding ->migrate_folio() no matter what.
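
E.g. something like the below ends up being needed whether gmem wraps shmem or
is built from scratch (names invented; the point is only that the
->migrate_folio() hook has to be ours either way):

  /*
   * Migrating an encrypted guest page requires vendor-specific magic
   * (TDX SEAMCALLs, SNP firmware calls, ...), so gmem has to own the
   * migration hook regardless of what sits underneath it.
   */
  static int kvm_gmem_migrate_folio(struct address_space *mapping,
                                    struct folio *dst, struct folio *src,
                                    enum migrate_mode mode)
  {
          /* Until the magic is wired up, simply refuse to migrate. */
          return -EBUSY;
  }

  static const struct address_space_operations kvm_gmem_aops = {
          .dirty_folio    = noop_dirty_folio,
          .migrate_folio  = kvm_gmem_migrate_folio,
  };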

As for swap, I think we should put a stake in the ground and say that KVM will
never support swap for KVM's ioctl().  Sooo much of the infrastructure around
swap/reclaim is tied to userspace mappings, e.g. knowing which pages are LRU and/or
cold.  I poked around a bit to see how we could avoid reinventing all of that
infrastructure for fd-only memory, and the best idea I could come up with is
basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
allow "mapping" fd-only memory, but ensure that memory is never actually present
from hardware's perspective.

But on top of the various problems with that approach, the only use cases I can
think of for using fd-only to back non-confidential VMs are to guard against spurious
writes/reads to guest memory and/or to avoid memory overhead for mapping guest
memory into the user page tables.  Avoiding memory overhead is completely defeated
if the page tables are populated PROT_NONE, which just leaves the "harden against
guest data corruption" use case.  And for that specific use case, _if_ swap is
desirable, it would be far, far easier to teach the kernel to allow KVM to follow
PROT_NONE (as Kirill's series did), as opposed to trying to teach the kernel and/or
KVM how to swap fd-only memory.

In other words, fd-only memory is purely for slice-of-hardware VMs.  If someone
wants to overcommit VMs, then they use KVM's traditional API for mapping memory
into the guest.

Regarding putting this in KVM, as mentioned (way) above in the quote, the worst
case scenario of making this a KVM ioctl() seems to be that KVM will end up with
an ioctl() that redirects to common kernel code.  On the flip side, implementing
this in KVM gives us a much clearer path to getting all of this merged.  There will
be a few small non-KVM patches, but they will either be trivial, e.g. exporting
APIs (that aren't contentious), or not strictly required, e.g. adding a flag to
mark an address space as completely unmigratable (for improved performance).  I.e.
the non-KVM patches will be small and very actionable for their maintainers,
and other than that we control our own destiny.

And on the topic of control, making this a KVM ioctl() and implementation gives
KVM a _lot_ of control.  E.g. we can make the file size immutable to simplify the
implementation, bind the fd to a single VM at creation time, add KVM-defined flags
for controlling hugepage behavior, etc.
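
As a strawman, from userspace that could look something like this (purely
illustrative; not the POC's actual names, numbers, or ABI):

  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Hypothetical uapi sketch. */
  struct kvm_create_guest_memfd {
          uint64_t size;          /* file size, immutable after creation */
          uint64_t flags;         /* KVM-defined, e.g. hugepage policy bits */
          uint64_t reserved[6];
  };
  /* ioctl number picked arbitrarily for the sketch. */
  #define KVM_CREATE_GUEST_MEMFD  _IOWR(0xAE, 0xd0, struct kvm_create_guest_memfd)

  static int create_gmem(int vm_fd, uint64_t size)
  {
          struct kvm_create_guest_memfd args = {
                  .size = size,   /* e.g. 1ull << 30 for a 1GiB guest */
          };

          /* VM-scoped ioctl: the returned fd is tied to this VM. */
          return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
  }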

Last but not least, I think forcing us KVM folks to get our hands at least a little
dirty with MM and FS stuff would be a *good* thing.  KVM has had more than a few
bugs that we missed in no small part because most of the people that work on KVM
have almost zero knowledge of MM and FS, and especially at the boundaries between
those two.

Note, I implemented the POC on top of the filemap APIs because that was much faster
and less error prone than re-implementing xarray management myself.  We may or may
not want to actually make the kernel at large aware of these allocations, i.e. it
may be better to follow Hugh's suggestion and use the folio APIs directly instead
of piggybacking filemap+folios.  E.g. I _think_ migration becomes a complete non-issue
if the pages are "raw" allocations and never get marked as LRU-friendly.  What I'm
not so sure about is if there is anything substantial that we'd lose out on by not
reusing the filemap stuff.
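
FWIW, the "raw allocations" idea is roughly the following (sketch, not the POC
code): track folios only in gmem's own xarray and never insert them into the
page cache or LRU, so reclaim and migration simply never see them.

  static struct folio *kvm_gmem_alloc_folio(struct xarray *pages,
                                            pgoff_t index, unsigned int order)
  {
          struct folio *folio;
          void *old;

          folio = folio_alloc(GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);
          if (!folio)
                  return ERR_PTR(-ENOMEM);

          /* Private bookkeeping instead of filemap; no folio_add_lru(). */
          old = xa_store(pages, index, folio, GFP_KERNEL_ACCOUNT);
          if (xa_is_err(old)) {
                  folio_put(folio);
                  return ERR_PTR(xa_err(old));
          }
          return folio;
  }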

The POC also doesn't play nice with Ackerley's patches to allocate files on a
user-defined mount.  AIUI, that's largely a non-issue as something like fbind()
would provide a superset of that functionality and more than one maintainer has
expressed (handwavy) support for a generic fbind().

Code is available here if folks want to take a look before any kind of formal
posting:

	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo

[1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
[2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
[3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 15:40 ` Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM) Sean Christopherson
  2023-04-17 15:48   ` David Hildenbrand
@ 2023-04-23 13:14   ` Jarkko Sakkinen
  1 sibling, 0 replies; 398+ messages in thread
From: Jarkko Sakkinen @ 2023-04-23 13:14 UTC (permalink / raw)
  To: Sean Christopherson, Chao Peng
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson, Joerg Roedel,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Ackerley Tng, kvm, linux-kernel

On Mon Apr 17, 2023 at 6:40 PM EEST, Sean Christopherson wrote:
> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>
> I want to start referring to the code/patches by its syscall/implementation name
> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> since there's nothing in the code that mentions "UPM" in any way.
>
> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> already used to refer to "reserved memory".
>
> Renaming the syscall to "guardedmem"...
>
>   1. Allows for a shorthand and namespace, "gmem", that isn't already in use by
>      the kernel (see "reserved memory" above).
>  
>   2. Provides a stronger hint as to its purpose.  "Restricted" conveys that the
>      allocated memory is limited in some way, but doesn't capture how the memory
>      is restricted, e.g. "restricted" could just as easily mean that the allocation
>      can be restricted to certain types of backing stores or something.  "Guarded"
>      on the other hand captures that the memory has extra defenses of some form.
>
>   3. Is shorter to type and speak.  Work smart, not hard :-)
>
>   4. Isn't totally wrong for the KVM use case if someone assumes the "g" means
>      "guest" when reading mail and whatnot.
>
>
> P.S. I trimmed the Cc/To substantially for this particular discussion to avoid
>      spamming folks that don't (yet) care about this stuff with another potentially
>      lengthy thread.  Feel free to add (back) any people/lists.

I guess 'guarded' could be a good noun in the sense that it does not
get easily mixed up with anything pre-existing, and it does give the idea
of the purpose.

BR, Jarkko

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-17 15:48   ` David Hildenbrand
  2023-04-17 16:40     ` Sean Christopherson
@ 2023-04-23 13:28     ` Jarkko Sakkinen
  2023-05-05 20:00       ` David Hildenbrand
  1 sibling, 1 reply; 398+ messages in thread
From: Jarkko Sakkinen @ 2023-04-23 13:28 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson, Chao Peng
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson, Joerg Roedel,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Ackerley Tng, kvm, linux-kernel

On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
> On 17.04.23 17:40, Sean Christopherson wrote:
> > What do y'all think about renaming "restrictedmem" to "guardedmem"?
>
> Yeay, let's add more confusion :D
>
> If we're at renaming, I'd appreciate if we could find a terminology that 
> does look/sound less horrible.
>
> > 
> > I want to start referring to the code/patches by its syscall/implementation name
> > instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
> > and not just the non-KVM code, and (c) will likely be confusing for future reviewers
> > since there's nothing in the code that mentions "UPM" in any way.
> > 
> > But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
> > already used to refer to "reserved memory".
> > 
> > Renaming the syscall to "guardedmem"...
>
> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...

In the world of TEE's and confidential computing it is fairly common to
call memory areas enclaves, even outside SGX context. So in that sense
enclave memory would be the most correct terminology.

BR, Jarkko

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
  2023-04-18 23:38                     ` Ackerley Tng
@ 2023-04-25 23:01                       ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-04-25 23:01 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: chao.p.peng, xiaoyao.li, isaku.yamahata, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-arch, linux-api, linux-doc,
	qemu-devel, pbonzini, corbet, vkuznets, wanpengli, jmattson,
	joro, tglx, mingo, bp, arnd, naoya.horiguchi, linmiaohe, x86,
	hpa, hughd, jlayton, bfields, akpm, shuah, rppt, steven.price,
	mail, vbabka, vannapurve, yu.c.zhang, kirill.shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, qperret, tabba, michael.roth, mhocko, wei.w.wang

On Tue, Apr 18, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > I agree, a pure alignment check is too restrictive, and not really what I
> > intended despite past me literally saying that's what I wanted :-)  I think
> > I may have also inverted the "less alignment" statement, but luckily I
> > believe that ends up being a moot point.
> 
> > The goal is to avoid having to juggle scenarios where KVM wants to create a
> > hugepage, but restrictedmem can't provide one because of a misaligned file
> > offset.  I think the rule we want is that the offset must be aligned to the
> > largest page size allowed by the memslot _size_.  E.g. on x86, if the
> > memslot size is >=1GiB then the offset must be 1GiB or better, ditto for
> > >=2MiB and >=4KiB (ignoring that 4KiB is already a requirement).
> 
> > We could loosen that to say the largest size allowed by the memslot, but I
> > don't think that's worth the effort unless it's trivially easy to implement
> > in code, e.g. KVM could technically allow a 4KiB aligned offset if the
> > memslot is 2MiB sized but only 4KiB aligned on the GPA.  I doubt there's a
> > real use case for such a memslot, so I want to disallow that unless it's
> > super easy to implement.
> 
> Checking my understanding here about why we need this alignment check:
> 
> When KVM requests a page from restrictedmem, KVM will provide an offset
> into the file in terms of 4K pages.
> 
> When shmem is configured to use hugepages, shmem_get_folio() will round
> the requested offset down to the nearest hugepage-aligned boundary in
> shmem_alloc_hugefolio().
> 
> Example of problematic configuration provided to
> KVM_SET_USER_MEMORY_REGION2:
> 
> + shmem configured to use 1GB pages
> + restrictedmem_offset provided to KVM_SET_USER_MEMORY_REGION2: 0x4000
> + memory_size provided in KVM_SET_USER_MEMORY_REGION2: 1GB
> + KVM requests offset (pgoff_t) 0x8, which translates to offset 0x8000
> 
> restrictedmem_get_page() and shmem_get_folio() returns the page for
> offset 0x0 in the file, since rounding down 0x8000 to the nearest 1GB is
> 0x0. This is allocating outside the range that KVM is supposed to use,
> since the range defined by the parameters in KVM_SET_USER_MEMORY_REGION2 is only
> supposed to be offset 0x4000 to (0x4000 + 1GB = 0x40004000) in the file.
> 
> IIUC shmem will actually just round down (0x4000 rounded down to nearest
> 1GB will be 0x0) and allocate without checking bounds, so if offset 0x0
> to 0x4000 in the file were supposed to be used by something else, there
> might be issues.
> 
> Hence, this alignment check ensures that rounding down of any offsets
> provided by KVM (based on page size configured in the backing file
> provided) to restrictedmem_get_page() must not go below the offset
> provided to KVM_SET_USER_MEMORY_REGION2.
> 
> Enforcing alignment of restrictedmem_offset based on the currently-set
> page size in the backing file (i.e. shmem) may not be effective, since
> the size of the pages in the backing file can be adjusted to a larger
> size after KVM_SET_USER_MEMORY_REGION2 succeeds. With that, we may still
> end up allocating outside the range that KVM was provided with.
> 
> Hence, to be safe, we should check alignment to the max page size across
> all backing filesystems, so the constraint is
> 
>     rounding down restrictedmem_offset to
>     min(max page size across all backing filesystems,
>         max page size that fits in memory_size) == restrictedmem_offset
> 
> which is the same check as
> 
>     restrictedmem_offset must be aligned to min(max page size across all
>     backing filesystems, max page size that fits in memory_size)
> 
> which can safely reduce to
> 
>     restrictedmem_offset must be aligned to max page size that fits in
>     memory_size
> 
> since "max page size that fits in memory_size" is probably <= to "max
> page size across all backing filesystems", and if it's larger, it'll
> just be a tighter constraint.

Yes?  The alignment check isn't strictly required, KVM _could_ deal with the above
scenario, it's just a lot simpler and safer for KVM if the file offset needs to
be sanely aligned.

> If the above understanding is correct:
> 
> + We must enforce this in the KVM_SET_USER_MEMORY_REGION2 handler, since
>   IIUC shmem will just round down and allocate without checking bounds.
> 
>     + I think this is okay because holes in the restrictedmem file (in
>       terms of offset) made to accommodate this constraint don't cost us
>       anything anyway(?) Are they just arbitrary offsets in a file? In
>       our case, this file is usually a new and empty file.
> 
>     + In the case of migration of a restrictedmem file between two KVM
>       VMs, this constraint would cause a problem if the largest
>       possible page size on the destination machine is larger than that
>       of the source machine. In that case, we might have to move the
>       data in the file to a different offset (a separate problem).

Hmm, I was thinking this would be a non-issue because the check would be tied to
the max _possible_ page size irrespective of hardware support, but that would
be problematic if KVM ever supports 512GiB pages.  I'm not sure that speculatively
requiring super huge memslots to be 512GiB aligned is sensible.

Aha!  If we go with a KVM ioctl(), a clean way around this is to tie the alignment
requirement to the memfd flags, e.g. if userspace requests the memfd to be backed
by PMD hugepages, then the memslot offset needs to be 2MiB aligned on x86.  That
will continue to work if (big if) KVM supports 512GiB pages because the "legacy"
memfd would still be capped at 2MiB pages.

Architectures that support variable hugepage sizes might need to do something
else, but I don't think that possibility affects what x86 can/can't do.

> + On this note, it seems like there is no check for when the range is
>   smaller than the allocated page? Like if the range provided is 4KB in
>   size, but shmem is then configured to use a 1GB page, will we end up
>   allocating past the end of the range?

No, KVM already gracefully handles situations like this.  Well, x86 does, I assume
other architectures do too :-)

As above, the intent of the extra restriction is so that KVM doesn't need even more
weird code (read: math) to gracefully handle the new edge cases that would come with
fd-only memslots.
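
For completeness, the alignment rule being discussed boils down to something
like this in the memslot binding path (hand-wavy sketch, names invented):

  /*
   * Require the gmem offset to be aligned to the hugepage size userspace
   * requested at gmem creation, so KVM never has to deal with a hugepage
   * that straddles the start of the binding.
   */
  static bool kvm_gmem_offset_is_aligned(struct kvm_gmem *gmem, loff_t offset)
  {
          unsigned long align = PAGE_SIZE;

          if (gmem->flags & KVM_GUEST_MEMFD_HUGE_PMD)     /* hypothetical flag */
                  align = PMD_SIZE;                       /* 2MiB on x86 */

          return IS_ALIGNED(offset, align);
  }

KVM_SET_USER_MEMORY_REGION2 would then reject any binding whose offset fails
the check.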

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-22  1:33                 ` Sean Christopherson
@ 2023-05-05 19:39                   ` Ackerley Tng
  2023-05-06  0:55                     ` Sean Christopherson
  2023-05-09 12:44                     ` Chao Peng
  2023-05-10 17:26                   ` Vishal Annapurve
                                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-05-05 19:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner


Hi Sean,

Thanks for implementing this POC!

I’ve started porting the selftests (both Chao’s and those I added [1]).

guest mem seems to cover the use cases that have been discussed and
proposed so far, but I still need to figure out how gmem can work with

+ hugetlbfs
+ specification of/storing memory policy (for NUMA node bindings)
+ memory accounting - we may need to account for memory used separately,
   so that guest mem shows up separately on /proc/meminfo and similar
   places.

One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
not cleaned up, and hence it is possible to crash the host kernel in the
following way

1. Create a KVM VM
2. Create a guest mem fd on that VM
3. Create a memslot with the guest mem fd (hence binding the fd to the
    VM)
4. Close/destroy the KVM VM
5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
    when it tries to do invalidation.

I then tried to clean up the gmem->kvm pointer during unbinding when the
KVM VM is destroyed.

That works, but then I realized there’s a simpler way to use the pointer
after freeing:

1. Create a KVM VM
2. Create a guest mem fd on that VM
3. Close/destroy the KVM VM
4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
    when it tries to do invalidation.

Perhaps binding should mean setting the gmem->kvm pointer in addition to
gmem->bindings. This makes binding and unbinding symmetric and avoids
the use-after-frees described above.
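
Roughly what I have in mind (sketch based on my reading of the POC, with
locking and error paths mostly elided, so field/function names may not match
the actual code):

  static int kvm_gmem_bind(struct kvm *kvm, struct kvm_gmem *gmem,
                           struct kvm_memory_slot *slot)
  {
          void *old = xa_store(&gmem->bindings, slot->id, slot, GFP_KERNEL);

          if (xa_is_err(old))
                  return xa_err(old);

          gmem->kvm = kvm;        /* set on bind ... */
          return 0;
  }

  static void kvm_gmem_unbind(struct kvm_gmem *gmem, struct kvm_memory_slot *slot)
  {
          xa_erase(&gmem->bindings, slot->id);
          if (xa_empty(&gmem->bindings))
                  gmem->kvm = NULL;       /* ... clear on the last unbind */
  }

With that, fallocate(PUNCH_HOLE) can simply skip the invalidation when
gmem->kvm is NULL instead of chasing a dangling pointer.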

This also means that creating a guest mem fd is no longer dependent on
the VM. Perhaps we can make creating a gmem fd a system ioctl (like
KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?

[1]  
https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/

Ackerley

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-23 13:28     ` Jarkko Sakkinen
@ 2023-05-05 20:00       ` David Hildenbrand
  2023-05-06  7:44         ` Vlastimil Babka
  0 siblings, 1 reply; 398+ messages in thread
From: David Hildenbrand @ 2023-05-05 20:00 UTC (permalink / raw)
  To: Jarkko Sakkinen, Sean Christopherson, Chao Peng
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson, Joerg Roedel,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Ackerley Tng, kvm, linux-kernel

On 23.04.23 15:28, Jarkko Sakkinen wrote:
> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>> On 17.04.23 17:40, Sean Christopherson wrote:
>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>
>> Yeay, let's add more confusion :D
>>
>> If we're at renaming, I'd appreciate if we could find a terminology that
>> does look/sound less horrible.
>>
>>>
>>> I want to start referring to the code/patches by its syscall/implementation name
>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>> since there's nothing in the code that mentions "UPM" in any way.
>>>
>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>> already used to refer to "reserved memory".
>>>
>>> Renaming the syscall to "guardedmem"...
>>
>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
> 
> In the world of TEE's and confidential computing it is fairly common to
> call memory areas enclaves, even outside SGX context. So in that sense
> enclave memory would be the most correct terminology.

I was also thinking along the lines of isolated_mem or imem ... 
essentially, isolated from (unprivileged) user space.

... if we still want to have a common syscall for it.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-05 19:39                   ` Ackerley Tng
@ 2023-05-06  0:55                     ` Sean Christopherson
  2023-05-06  1:17                       ` Vishal Annapurve
                                         ` (2 more replies)
  2023-05-09 12:44                     ` Chao Peng
  1 sibling, 3 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-05-06  0:55 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Fri, May 05, 2023, Ackerley Tng wrote:
> 
> Hi Sean,
> 
> Thanks for implementing this POC!
> 
> I’ve started porting the selftests (both Chao’s and those I added [1]).
> 
> guest mem seems to cover the use cases that have been discussed and
> proposed so far, but I still need to figure out how gmem can work with
> 
> + hugetlbfs
> + specification of/storing memory policy (for NUMA node bindings)
> + memory accounting - we may need to account for memory used separately,
>   so that guest mem shows up separately on /proc/meminfo and similar
>   places.
> 
> One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> not cleaned up, and hence it is possible to crash the host kernel in the
> following way
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Create a memslot with the guest mem fd (hence binding the fd to the
>    VM)
> 4. Close/destroy the KVM VM
> 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> I then tried to clean up the gmem->kvm pointer during unbinding when the
> KVM VM is destroyed.
> 
> That works, but then I realized there’s a simpler way to use the pointer
> after freeing:
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Close/destroy the KVM VM
> 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> Perhaps binding should mean setting the gmem->kvm pointer in addition to
> gmem->bindings. This makes binding and unbinding symmetric and avoids
> the use-after-frees described above.

Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
when the last binding is being removed.  A similar (also ugly) solution would be
to nullify gmem->kvm when KVM dies.

I don't love either approach because it means a file created in the context
of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
that it can't do anything with except close().  I doubt that matters in practice
though, e.g. when the VM dies, all memory can be freed so that the file ends up
being little more than a shell.  And if we go that route, there's no need to grab
a reference to the file during bind, KVM can just grab a longterm reference when
the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).

Blech, another wart is that I believe gmem would need to do __module_get() during
file creation to prevent kvm.ko from being unloaded after the last VM dies.  Ah,
but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
it's not _that_ ugly.
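
I.e. roughly (sketch):

  static int kvm_gmem_create(struct kvm *kvm)
  {
          /*
           * The file can outlive the VM, so pin kvm.ko until the file's
           * ->release() runs, otherwise the module could be unloaded with
           * a live gmem file still pointing at its code.
           */
          __module_get(THIS_MODULE);

          /* ... anon inode / file setup elided ... */
          return 0;
  }

  static int kvm_gmem_release(struct inode *inode, struct file *file)
  {
          /* ... gmem teardown elided ... */
          module_put(THIS_MODULE);
          return 0;
  }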

Exchanging references (at binding or at creation) doesn't work, because that
creates a circular dependency, i.e. gmem and KVM would pin each other. 

A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
nasty because of how KVM's memslots work.  The least awful approach I can think of
would be to delete the associated memslot(s) when the file is released, possibly
via deferred work to avoid deadlock issues.  Not the prettiest thing ever and in
some ways that'd yield an even worse ABI.

Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away.  The easy
solution for that one is to add another synchronize_srcu_expedited() after unbinding.

> This also means that creating a guest mem fd is no longer dependent on
> the VM. Perhaps we can make creating a gmem fd a system ioctl (like
> KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?

My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
and not a common syscall.  If the file isn't tightly coupled to a single VM, then
punching a hole is further complicated by needing to deal with invalidating multiple
regions that are bound to different @kvm instances.  It's not super complex, but
AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
having one VM own the memory will complicate things even if/when we get to the point where
VMs can share "private" memory, and the gmem code would still need to deal with
grabbing a module reference.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-06  0:55                     ` Sean Christopherson
@ 2023-05-06  1:17                       ` Vishal Annapurve
  2023-05-15 23:46                       ` Sean Christopherson
  2023-07-13 22:46                       ` Ackerley Tng
  2 siblings, 0 replies; 398+ messages in thread
From: Vishal Annapurve @ 2023-05-06  1:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, david, chao.p.peng, pbonzini, vkuznets, jmattson,
	joro, mail, vbabka, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Fri, May 5, 2023 at 5:55 PM Sean Christopherson <seanjc@google.com> wrote:
>
> ...
> My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> and not a common syscall.  If the file isn't tightly coupled to a single VM, then
> punching a hole is further complicated by needing to deal with invalidating multiple
> regions that are bound to different @kvm instances.  It's not super complex, but
> AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> having one VM own the memory will complicate even if/when we get to the point where
> VMs can share "private" memory, and the gmem code would still need to deal with
> grabbing a module reference.

Copyless migration would be a scenario where "private" memory may need
to be shared between source and target VMs depending on how migration
support is implemented.

Regards,
Vishal

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-05 20:00       ` David Hildenbrand
@ 2023-05-06  7:44         ` Vlastimil Babka
  2023-05-06  9:16           ` David Hildenbrand
  0 siblings, 1 reply; 398+ messages in thread
From: Vlastimil Babka @ 2023-05-06  7:44 UTC (permalink / raw)
  To: David Hildenbrand, Jarkko Sakkinen, Sean Christopherson, Chao Peng
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson, Joerg Roedel,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang,
	Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Ackerley Tng, kvm, linux-kernel

On 5/5/23 22:00, David Hildenbrand wrote:
> On 23.04.23 15:28, Jarkko Sakkinen wrote:
>> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>>
>>> Yeay, let's add more confusion :D
>>>
>>> If we're at renaming, I'd appreciate if we could find a terminology that
>>> does look/sound less horrible.
>>>
>>>>
>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>
>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>> already used to refer to "reserved memory".
>>>>
>>>> Renaming the syscall to "guardedmem"...
>>>
>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>> 
>> In the world of TEE's and confidential computing it is fairly common to
>> call memory areas enclaves, even outside SGX context. So in that sense
>> enclave memory would be the most correct terminology.
> 
> I was also thinking along the lines of isolated_mem or imem ... 
> essentially, isolated from (unprivileged) user space.
> 
> ... if we still want to have a common syscall for it.

I'm a fan of the ioctl, if it has a chance of working out.


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-06  7:44         ` Vlastimil Babka
@ 2023-05-06  9:16           ` David Hildenbrand
  0 siblings, 0 replies; 398+ messages in thread
From: David Hildenbrand @ 2023-05-06  9:16 UTC (permalink / raw)
  To: Vlastimil Babka, Jarkko Sakkinen, Sean Christopherson, Chao Peng
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Jim Mattson, Joerg Roedel,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang,
	Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Ackerley Tng, kvm, linux-kernel

On 06.05.23 09:44, Vlastimil Babka wrote:
> On 5/5/23 22:00, David Hildenbrand wrote:
>> On 23.04.23 15:28, Jarkko Sakkinen wrote:
>>> On Mon Apr 17, 2023 at 6:48 PM EEST, David Hildenbrand wrote:
>>>> On 17.04.23 17:40, Sean Christopherson wrote:
>>>>> What do y'all think about renaming "restrictedmem" to "guardedmem"?
>>>>
>>>> Yeay, let's add more confusion :D
>>>>
>>>> If we're at renaming, I'd appreciate if we could find a terminology that
>>>> does look/sound less horrible.
>>>>
>>>>>
>>>>> I want to start referring to the code/patches by its syscall/implementation name
>>>>> instead of "UPM", as "UPM" is (a) very KVM centric, (b) refers to the broader effort
>>>>> and not just the non-KVM code, and (c) will likely be confusing for future reviewers
>>>>> since there's nothing in the code that mentions "UPM" in any way.
>>>>>
>>>>> But typing out restrictedmem is quite tedious, and git grep shows that "rmem" is
>>>>> already used to refer to "reserved memory".
>>>>>
>>>>> Renaming the syscall to "guardedmem"...
>>>>
>>>> restrictedmem, guardedmem, ... all fairly "suboptimal" if you'd ask me ...
>>>
>>> In the world of TEE's and confidential computing it is fairly common to
>>> call memory areas enclaves, even outside SGX context. So in that sense
>>> enclave memory would be the most correct terminology.
>>
>> I was also thinking along the lines of isolated_mem or imem ...
>> essentially, isolated from (unprivileged) user space.
>>
>> ... if we still want to have a common syscall for it.
> 
> I'm fan of the ioctl, if it has a chance of working out.
Yes, me too.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-05 19:39                   ` Ackerley Tng
  2023-05-06  0:55                     ` Sean Christopherson
@ 2023-05-09 12:44                     ` Chao Peng
  1 sibling, 0 replies; 398+ messages in thread
From: Chao Peng @ 2023-05-09 12:44 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, david, pbonzini, vkuznets, jmattson, joro,
	mail, vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Fri, May 05, 2023 at 07:39:36PM +0000, Ackerley Tng wrote:
> 
> Hi Sean,
> 
> Thanks for implementing this POC!
> 
> I’ve started porting the selftests (both Chao’s and those I added [1]).

Hi Sean/Ackerley,

Thanks for doing that. Overall making gmem a KVM ioctl() looks good to
me and it should also play nice with Intel TDX. Besides what Ackerley
mentioned below, I think we haven't discussed device assignment, which
will be supported in the not too distant future. The current VFIO_IOMMU_MAP_DMA
consumes a virtual address, so that needs to be fixed for fd-based memory
anyway, and the fix doesn't look related to whether this ends up being a
syscall() or a KVM ioctl(). There will be some initialization sequence
dependency, e.g. if gmem is finally a VM-scoped ioctl() then we need the VM
created first before we can map fd-based memory in VFIO, but that doesn't
sound like an issue at all.
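
For reference, the existing uapi is HVA-based, roughly (from
include/uapi/linux/vfio.h, trimmed):

  struct vfio_iommu_type1_dma_map {
          __u32   argsz;
          __u32   flags;
          __u64   vaddr;          /* process virtual address (HVA) */
          __u64   iova;           /* IO virtual address */
          __u64   size;           /* size of mapping (bytes) */
  };

so mapping fd-based memory would need either a new flag plus an fd+offset pair
in place of vaddr, or a separate ioctl altogether; either way that work looks
orthogonal to the gmem syscall-vs-ioctl question.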

I also see Vlastimil/David expressed their preference for the ioctl approach.
So maybe we can move forward with your current PoC. Do you already have a plan to
post a formal version?

Chao

> 
> guest mem seems to cover the use cases that have been discussed and
> proposed so far, but I still need to figure out how gmem can work with
> 
> + hugetlbfs
> + specification of/storing memory policy (for NUMA node bindings)
> + memory accounting - we may need to account for memory used separately,
>   so that guest mem shows up separately on /proc/meminfo and similar
>   places.
> 
> One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> not cleaned up, and hence it is possible to crash the host kernel in the
> following way
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Create a memslot with the guest mem fd (hence binding the fd to the
>    VM)
> 4. Close/destroy the KVM VM
> 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> I then tried to clean up the gmem->kvm pointer during unbinding when the
> KVM VM is destroyed.
> 
> That works, but then I realized there’s a simpler way to use the pointer
> after freeing:
> 
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Close/destroy the KVM VM
> 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
>    when it tries to do invalidation.
> 
> Perhaps binding should mean setting the gmem->kvm pointer in addition to
> gmem->bindings. This makes binding and unbinding symmetric and avoids
> the use-after-frees described above.
> 
> This also means that creating a guest mem fd is no longer dependent on
> the VM. Perhaps we can make creating a gmem fd a system ioctl (like
> KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?
> 
> [1]
> https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/
> 
> Ackerley

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-22  1:33                 ` Sean Christopherson
  2023-05-05 19:39                   ` Ackerley Tng
@ 2023-05-10 17:26                   ` Vishal Annapurve
  2023-05-10 20:23                     ` Vishal Annapurve
  2023-05-10 21:39                     ` Sean Christopherson
  2023-05-12  0:21                   ` Michael Roth
  2023-06-06 19:14                   ` Ackerley Tng
  3 siblings, 2 replies; 398+ messages in thread
From: Vishal Annapurve @ 2023-05-10 17:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
>
> ...
> cold.  I poked around a bit to see how we could avoid reinventing all of that
> infrastructure for fd-only memory, and the best idea I could come up with is
> basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> allow "mapping" fd-only memory, but ensure that memory is never actually present
> from hardware's perspective.
>

I am most likely missing a lot of context here and possibly venturing
into an infeasible/already shot down direction here. But I would still
like to get this discussed here before we move on.

I am wondering if it would make sense to implement
restricted_mem/guest_mem file to expose both private and shared memory
regions, inline with Kirill's original proposal now that the file
implementation is controlled by KVM.

Thinking from userspace perspective:
1) Userspace creates guest mem files and is able to mmap them but all
accesses to these files result in faults as no memory is allowed to
be mapped into userspace VMM pagetables.
2) Userspace registers mmaped HVA ranges with KVM with additional
KVM_MEM_PRIVATE flag
3) Userspace converts memory attributes and this memory conversion
allows userspace to access shared ranges of the file because those are
allowed to be faulted in from guest_mem. Shared to private conversion
unmaps the file ranges from userspace VMM pagetables.
4) Granularity of userspace pagetable mappings for shared ranges will
have to be dictated by KVM guest_mem file implementation.

Caveat here is that once private pages are mapped into userspace view.

Benefits here:
1) Userspace view remains consistent while still being able to use HVA ranges
2) It would be possible to use HVA based APIs from userspace to do
things like binding.
3) Double allocation wouldn't be a concern since hva ranges and gpa
ranges possibly map to the same HPA ranges.

>
> Code is available here if folks want to take a look before any kind of formal
> posting:
>
>         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
>
> [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-10 17:26                   ` Vishal Annapurve
@ 2023-05-10 20:23                     ` Vishal Annapurve
  2023-05-10 21:39                     ` Sean Christopherson
  1 sibling, 0 replies; 398+ messages in thread
From: Vishal Annapurve @ 2023-05-10 20:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Wed, May 10, 2023 at 10:26 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > ...
> > cold.  I poked around a bit to see how we could avoid reinventing all of that
> > infrastructure for fd-only memory, and the best idea I could come up with is
> > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > from hardware's perspective.
> >
>
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already shot down direction here. But I would still
> like to get this discussed here before we move on.
>
> I am wondering if it would make sense to implement
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, inline with Kirill's original proposal now that the file
> implementation is controlled by KVM.
>
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them but all
> accesses to these files result into faults as no memory is allowed to
> be mapped into userspace VMM pagetables.
> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
>
> Caveat here is that once private pages are mapped into userspace view.
>
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.
>

I realize now that VFIO IOMMU mappings are not associated with
userspace MMU in any way. So this approach does leave the gap of not
being able to handle modifications of IOMMU mappings when HVA mappings
are modified in userspace page tables. I guess this could be a good
enough reason to give up on this option.


> >
> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> >         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-10 17:26                   ` Vishal Annapurve
  2023-05-10 20:23                     ` Vishal Annapurve
@ 2023-05-10 21:39                     ` Sean Christopherson
  2023-05-10 23:03                       ` Vishal Annapurve
  1 sibling, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-05-10 21:39 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > ...
> > cold.  I poked around a bit to see how we could avoid reinventing all of that
> > infrastructure for fd-only memory, and the best idea I could come up with is
> > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > from hardware's perspective.
> >
> 
> I am most likely missing a lot of context here and possibly venturing
> into an infeasible/already shot down direction here.

Both :-)

> But I would still like to get this discussed here before we move on.
> 
> I am wondering if it would make sense to implement
> restricted_mem/guest_mem file to expose both private and shared memory
> regions, inline with Kirill's original proposal now that the file
> implementation is controlled by KVM.
> 
> Thinking from userspace perspective:
> 1) Userspace creates guest mem files and is able to mmap them but all
> accesses to these files result into faults as no memory is allowed to
> be mapped into userspace VMM pagetables.

Never mapping anything into the userspace page table is infeasible.  Technically
it's doable, but it'd effectively require all of the work of an fd-based approach
(and probably significantly more), _and_ it'd require touching core mm code.

VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
the abstraction provided to userspace by mmap(), mprotect() etc.  Among many other
things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
RWX protections, etc.  But a VMA doesn't track the physical address, that info
is all managed through the userspace page tables.

To make it possible to allow userspace to mmap() but not access memory (without
redoing how the kernel fundamentally manages virtual=>physical mappings), the
simplest approach is to install PTEs into userspace page tables, but never mark
them Present in hardware, i.e. prevent actually accessing the backing memory.
This is exactly what Kirill's series in link [3] below implemented.

Issues that led to us abandoning the "map with special !Present PTEs" approach:

 - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
   is inefficient and inflexible compared to software defined structures, especially
   for the expected use cases for CoCo guests.

 - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
   let alone a 1:1 pfn:gfn mapping.
 
 - Does not work for memory that isn't backed by 'struct page', e.g. if devices
   gain support for exposing encrypted memory regions to guests.

 - Poking into the VMAs to convert memory would likely be less performant due
   to using infrastructure that is much "heavier", e.g. would require taking
   mmap_lock for write.

In short, shoehorning this into mmap() requires fighting how the kernel works at
pretty much every step, and in the end, adding e.g. fbind() is a lot easier.

> 2) Userspace registers mmaped HVA ranges with KVM with additional
> KVM_MEM_PRIVATE flag
> 3) Userspace converts memory attributes and this memory conversion
> allows userspace to access shared ranges of the file because those are
> allowed to be faulted in from guest_mem. Shared to private conversion
> unmaps the file ranges from userspace VMM pagetables.
> 4) Granularity of userspace pagetable mappings for shared ranges will
> have to be dictated by KVM guest_mem file implementation.
> 
> Caveat here is that once private pages are mapped into userspace view.
> 
> Benefits here:
> 1) Userspace view remains consistent while still being able to use HVA ranges
> 2) It would be possible to use HVA based APIs from userspace to do
> things like binding.
> 3) Double allocation wouldn't be a concern since hva ranges and gpa
> ranges possibly map to the same HPA ranges.

#3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
and then the guest converts that memory from shared to private, the backing pages
for the previously shared mapping will still be mapped by process B unless userspace
also ensures process B also unmaps on conversion.

#3 is also a limiter.  E.g. if a guest is primarily backed by 1GiB pages, keeping
the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
and possibly even if the guest converts a few MiB of memory.

> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> >
> >         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> >
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-10 21:39                     ` Sean Christopherson
@ 2023-05-10 23:03                       ` Vishal Annapurve
  2023-05-11 20:22                         ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Vishal Annapurve @ 2023-05-10 23:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 10, 2023, Vishal Annapurve wrote:
> > On Fri, Apr 21, 2023 at 6:33 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > ...
> > > cold.  I poked around a bit to see how we could avoid reinventing all of that
> > > infrastructure for fd-only memory, and the best idea I could come up with is
> > > basically a rehash of Kirill's very original "KVM protected memory" RFC[3], i.e.
> > > allow "mapping" fd-only memory, but ensure that memory is never actually present
> > > from hardware's perspective.
> > >
> >
> > I am most likely missing a lot of context here and possibly venturing
> > into an infeasible/already shot down direction here.
>
> Both :-)
>
> > But I would still like to get this discussed here before we move on.
> >
> > I am wondering if it would make sense to implement
> > restricted_mem/guest_mem file to expose both private and shared memory
> > regions, inline with Kirill's original proposal now that the file
> > implementation is controlled by KVM.
> >
> > Thinking from userspace perspective:
> > 1) Userspace creates guest mem files and is able to mmap them but all
> > accesses to these files result into faults as no memory is allowed to
> > be mapped into userspace VMM pagetables.
>
> Never mapping anything into the userspace page table is infeasible.  Technically
> it's doable, but it'd effectively require all of the work of an fd-based approach
> (and probably significantly more), _and_ it'd require touching core mm code.
>
> VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> the abstraction provided to userspace by mmap(), mprotect() etc.  Among many other
> things, a VMA describes properties of what is mapped, e.g. hugetblfs versus
> anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> RWX protections, etc.  But a VMA doesn't track the physical address, that info
> is all managed through the userspace page tables.
>
> To make it possible to allow userspace to mmap() but not access memory (without
> redoing how the kernel fundamentally manages virtual=>physical mappings), the
> simplest approach is to install PTEs into userspace page tables, but never mark
> them Present in hardware, i.e. prevent actually accessing the backing memory.
> This is is exactly what Kirill's series in link [3] below implemented.
>

Maybe it's simpler to do when mmaped regions are backed with files.

I see that shmem has fault handlers for accesses to VMA regions
associated with the files. In theory a file implementation can always
choose to not allocate physical pages for such faults (similar to the
F_SEAL_AUTO_ALLOCATE idea that was discussed earlier).
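
Conceptually something like this (sketch only, made-up names):

  /* Back a VMA with the file but never populate userspace PTEs. */
  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
  {
          return VM_FAULT_SIGBUS;
  }

  static const struct vm_operations_struct kvm_gmem_vm_ops = {
          .fault = kvm_gmem_fault,
  };

  static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
  {
          vma->vm_ops = &kvm_gmem_vm_ops;
          return 0;
  }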

> Issues that led to us abandoning the "map with special !Present PTEs" approach:
>
>  - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
>    is inefficient and inflexible compared to software defined structures, especially
>    for the expected use cases for CoCo guests.
>
>  - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
>    let alone a 1:1 pfn:gfn mapping.

Maybe KVM can ensure that each page of the guest_mem file is
associated with a single memslot. HVAs when they are registered can be
associated with offsets into guest_mem files.

>
>  - Does not work for memory that isn't backed by 'struct page', e.g. if devices
>    gain support for exposing encrypted memory regions to guests.
>
>  - Poking into the VMAs to convert memory would be likely be less performant due
>    to using infrastructure that is much "heavier", e.g. would require taking
>    mmap_lock for write.

Converting memory doesn't necessarily need to poke holes into the VMA, but
rather just unmap page tables, just like what would happen when mmapped
files are hole-punched to free the backing file offsets.

>
> In short, shoehorning this into mmap() requires fighting how the kernel works at
> pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
>
> > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > KVM_MEM_PRIVATE flag
> > 3) Userspace converts memory attributes and this memory conversion
> > allows userspace to access shared ranges of the file because those are
> > allowed to be faulted in from guest_mem. Shared to private conversion
> > unmaps the file ranges from userspace VMM pagetables.
> > 4) Granularity of userspace pagetable mappings for shared ranges will
> > have to be dictated by KVM guest_mem file implementation.
> >
> > Caveat here is that once private pages are mapped into userspace view.
> >
> > Benefits here:
> > 1) Userspace view remains consistent while still being able to use HVA ranges
> > 2) It would be possible to use HVA based APIs from userspace to do
> > things like binding.
> > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > ranges possibly map to the same HPA ranges.
>
> #3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
> and then the guest converts that memory from shared to private, the backing pages
> for the previously shared mapping will still be mapped by process B unless userspace
> also ensures process B also unmaps on conversion.
>

This should ideally be handled by something like: unmap_mapping_range()
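
i.e. roughly (sketch, made-up wrapper name):

  static void gmem_zap_shared_range(struct file *file, loff_t start, loff_t len)
  {
          /*
           * Zap the converted range from every userspace mapping of the
           * file (any process that mmap()ed it), not just the VMM's.
           */
          unmap_mapping_range(file->f_mapping, start, len, /* even_cows */ 1);
  }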

> #3 is also a limiter.  E.g. if a guest is primarly backed by 1GiB pages, keeping
> the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> and possibly even if the guest converts a few MiB of memory.

This caveat can maybe be lived with, as shared ranges most likely will
not be backed by 1G pages anyway, possibly causing IO performance to
take a hit. This possibly needs more discussion about the conversion
granularity used by guests.

>
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > >
> > >         https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> > >
> > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-10 23:03                       ` Vishal Annapurve
@ 2023-05-11 20:22                         ` Sean Christopherson
  2023-05-19  1:07                           ` Vishal Annapurve
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-05-11 20:22 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > > But I would still like to get this discussed here before we move on.
> > >
> > > I am wondering if it would make sense to implement
> > > restricted_mem/guest_mem file to expose both private and shared memory
> > > regions, inline with Kirill's original proposal now that the file
> > > implementation is controlled by KVM.
> > >
> > > Thinking from userspace perspective:
> > > 1) Userspace creates guest mem files and is able to mmap them but all
> > > accesses to these files result into faults as no memory is allowed to
> > > be mapped into userspace VMM pagetables.
> >
> > Never mapping anything into the userspace page table is infeasible.  Technically
> > it's doable, but it'd effectively require all of the work of an fd-based approach
> > (and probably significantly more), _and_ it'd require touching core mm code.
> >
> > VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> > the abstraction provided to userspace by mmap(), mprotect() etc.  Among many other
> > things, a VMA describes properties of what is mapped, e.g. hugetblfs versus
> > anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> > RWX protections, etc.  But a VMA doesn't track the physical address, that info
> > is all managed through the userspace page tables.
> >
> > To make it possible to allow userspace to mmap() but not access memory (without
> > redoing how the kernel fundamentally manages virtual=>physical mappings), the
> > simplest approach is to install PTEs into userspace page tables, but never mark
> > them Present in hardware, i.e. prevent actually accessing the backing memory.
> > This is is exactly what Kirill's series in link [3] below implemented.
> >
> 
> Maybe it's simpler to do when mmaped regions are backed with files.
> 
> I see that shmem has fault handlers for accesses to VMA regions
> associated with the files, In theory a file implementation can always
> choose to not allocate physical pages for such faults (similar to
> F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier).

Ah, you're effectively suggesting a hybrid model where the file is the single
source of truth for what's private versus shared, and KVM gets pfns through
direct communication with the backing store via the file descriptor, but userspace
can still control things via mmap() and friends.

If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
then we're back at Kirill's series, because otherwise there's no easy way for KVM
to retrieve the pfn.

A form of this was also discussed, though I don't know how much of the discussion
happened on-list.
 
KVM actually does something like this for s390's Ultravisor (UV), which is quite
a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
much more gracefully.  Specifically, when the untrusted host attempts to access a
secure page, a fault occurs and the kernel responds by telling UV to export the
page.  The fault is gracefully handled even for kernel accesses
(see do_secure_storage_access()).  The kernel does BUG() if the export fails when
handling fault from kernel context, but my understanding is that export can fail
if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
success, and goes straight to BUG()/panic() if something goes wrong.

On the guest side, accesses to exported (swapped) secure pages generate intercepts
and KVM faults in the page.  To do so, KVM freezes the page/folio refcount, tells
the UV to import the page, and then unfreezes the page/folio.  But very crucially,
when _anything_ in the untrusted host attempts to access the secure page, the
above fault handling for untrusted host accesses kicks in.  In other words, the
guest can cause thrash, but can't bring down the host.

TDX on the other hand silently poisons memory, i.e. doesn't even generate a
synchronous fault.  Thus the kernel needs to be 100% perfect on preventing _any_
accesses to private memory from the host, and doing that is non-trivial and
invasive.

SNP does synchronously fault, but the automatic conversion in the #PF handler
got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
guest can trigger concurrent RMP modifications.  So the end result ends up being
the same as TDX, host accesses need to be completely prevented.

Again, this is all doable, but costly.  And IMO, provides very little value.

Allowing things like mbind() is nice-to-have at best; implementing fbind()
isn't straightforward, but it's arguably valuable to have irrespective of this
discussion, e.g. to allow userspace to say "use this policy regardless of what
process maps the file".

Using a common memory pool (same physical page is used for both shared and private)
is a similar story.  There are plenty of existing controls to limit userspace/guest
memory usage and to deal with OOM scenarios, so barring egregious host accounting
and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
is that a VM is terminated because host userspace is buggy.  On the flip side, using
a common pool brings complexity into the kernel, as backing stores would need to
be taught to deny access to a subset of pages in their mappings, and in multiple
paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

[*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com

> > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> >
> >  - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> >    is inefficient and inflexible compared to software defined structures, especially
> >    for the expected use cases for CoCo guests.
> >
> >  - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> >    let alone a 1:1 pfn:gfn mapping.
> 
> Maybe KVM can ensure that each page of the guest_mem file is
> associated with a single memslot.

This is a hard NAK.  Guest physical address space is guaranteed to have holes
and/or be discontiguous, e.g. for the PCI hole at the top of lower memory.  Allowing
only a single binding would prevent userspace from backing all (or large chunks)
of guest memory with a single file.

> HVAs when they are registered can be associated with offsets into guest_mem files.

Enforcing 1:1 associations is doable if KVM inserts a shim/interposer, e.g. essentially
implements the exclusivity bits of restrictedmem.  But that's adding even more
complexity.

> >  - Does not work for memory that isn't backed by 'struct page', e.g. if devices
> >    gain support for exposing encrypted memory regions to guests.
> >
> >  - Poking into the VMAs to convert memory would likely be less performant due
> >    to using infrastructure that is much "heavier", e.g. would require taking
> >    mmap_lock for write.
> 
> Converting memory doesn't necessarily need to poke holes into VMAs, but
> rather just unmap page tables, much like what happens when mmapped files
> are hole-punched to free the backing file offsets.

Sorry, bad choice of word on my part.  I didn't intend to imply poking holes, in
this case I used "poking" to mean "modifying".  munmap(), mprotect(), etc. all
require modifying VMAs, which means taking mmap_lock for write.

> > In short, shoehorning this into mmap() requires fighting how the kernel works at
> > pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
> >
> > > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > > KVM_MEM_PRIVATE flag
> > > 3) Userspace converts memory attributes and this memory conversion
> > > allows userspace to access shared ranges of the file because those are
> > > allowed to be faulted in from guest_mem. Shared to private conversion
> > > unmaps the file ranges from userspace VMM pagetables.
> > > 4) Granularity of userspace pagetable mappings for shared ranges will
> > > have to be dictated by KVM guest_mem file implementation.
> > >
> > > Caveat here is that once private pages are mapped into userspace view.
> > >
> > > Benefits here:
> > > 1) Userspace view remains consistent while still being able to use HVA ranges
> > > 2) It would be possible to use HVA based APIs from userspace to do
> > > things like binding.
> > > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > > ranges possibly map to the same HPA ranges.
> >
> > #3 isn't entirely correct.  If a different process (call it "B") maps shared memory,
> > and then the guest converts that memory from shared to private, the backing pages
> > for the previously shared mapping will still be mapped by process B unless userspace
> > also ensures process B also unmaps on conversion.
> >
> 
> This should ideally be handled by something like unmap_mapping_range().

That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
not for a generic VMA-based implementation.  If the file isn't the single source
of truth, then forcing all mappings to go away simply can't work.
 
> > #3 is also a limiter.  E.g. if a guest is primarly backed by 1GiB pages, keeping
> > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > and possibly even if the guest converts a few MiB of memory.
> 
> This caveat can maybe be lived with, as shared ranges most likely won't
> be backed by 1G pages anyway, possibly causing IO performance to take a
> hit. This probably needs more discussion about the conversion
> granularity used by guests.

Yes, it's not the end of the world.  My point is that separating shared and private
memory provides more flexibility.  Maybe that flexibility never ends up being
super important, but at the same time we shouldn't willingly paint ourselves into
a corner.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-22  1:33                 ` Sean Christopherson
  2023-05-05 19:39                   ` Ackerley Tng
  2023-05-10 17:26                   ` Vishal Annapurve
@ 2023-05-12  0:21                   ` Michael Roth
  2023-05-12 18:01                     ` Sean Christopherson
  2023-06-06 19:14                   ` Ackerley Tng
  3 siblings, 1 reply; 398+ messages in thread
From: Michael Roth @ 2023-05-12  0:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> 
> Code is available here if folks want to take a look before any kind of formal
> posting:
> 
> 	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo

Hi Sean,

I've been working on getting the SNP patches ported to this but I'm having
some trouble working out a reasonable scheme for how to work the
RMPUPDATE hooks into the proposed design.

One of the main things is kvm_gmem_punch_hole(): this can free pages
back to the host whenever userspace feels like it. Pages that are still
marked private in the RMP table will blow up the host if they aren't returned
to the normal state before handing them back to the kernel. So I'm trying to
add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
e.g.:

  static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
                                  loff_t len)
  {
          struct kvm_gmem *gmem = file->private_data;
          pgoff_t start = offset >> PAGE_SHIFT;
          pgoff_t end = (offset + len) >> PAGE_SHIFT;
          struct kvm *kvm = gmem->kvm;
  
          /*
            * Bindings must be stable across invalidation to ensure the start+end
           * are balanced.
           */
          filemap_invalidate_lock(file->f_mapping);
          kvm_gmem_invalidate_begin(kvm, gmem, start, end);
  
          /* Handle arch-specific cleanups before releasing pages */
          kvm_arch_gmem_invalidate(kvm, gmem, start, end);
          truncate_inode_pages_range(file->f_mapping, offset, offset + len);
  
          kvm_gmem_invalidate_end(kvm, gmem, start, end);
          filemap_invalidate_unlock(file->f_mapping);
  
          return 0;
  }

But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
the page in its intended state in the RMP table prior to mapping it into the
guest's NPT. Currently I'm calling that hook via
kvm_vm_ioctl_set_mem_attributes(), just after kvm->mem_attr_array is updated
based on the ioctl. The reasoning there is that KVM MMU can then rely on the
existing mmu_invalidate_seq logic to ensure both the state in the
mem_attr_array and the RMP table are in sync and up-to-date once MMU lock is
acquired and MMU is ready to map it, or retry #NPF otherwise.
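
Roughly the retry pattern being described, using the standard
mmu_invalidate_seq dance; the gmem pfn lookup helper below is a placeholder,
not the real hook name:

  static int gmem_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
  {
	struct kvm *kvm = vcpu->kvm;
	unsigned long mmu_seq;
	kvm_pfn_t pfn;
	int r;

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* Resolve attributes + pfn outside mmu_lock (placeholder helper). */
	r = kvm_gmem_get_pfn_for_gfn(kvm, gfn, &pfn);
	if (r)
		return r;

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		/* Attributes/RMP changed underneath us, let the #NPF retry. */
		write_unlock(&kvm->mmu_lock);
		return -EAGAIN;
	}
	/* ...install the NPT mapping here... */
	write_unlock(&kvm->mmu_lock);
	return 0;
  }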

But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially
result in something like the following sequence if I implement things as above:

  CPU0: kvm_gmem_punch_hole():
          kvm_gmem_invalidate_begin()
          kvm_arch_gmem_invalidate()         // set pages to default/shared state in RMP table before free'ing
  CPU1: kvm_vm_ioctl_set_mem_attributes():
          kvm_arch_gmem_set_mem_attributes() // maliciously set pages to private in RMP table
  CPU0:   truncate_inode_pages_range()       // HOST BLOWS UP TOUCHING PRIVATE PAGES
          kvm_arch_gmem_invalidate_end()

One quick and lazy solution is to rely on the fact that
kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the
entire begin()/end() portion of the invalidation sequence, and to similarly
hold the kvm->slots_lock throughout the begin()/end() sequence in
kvm_gmem_punch_hole() to prevent any interleaving.

But I'd imagine overloading kvm->slots_lock is not the proper approach. But
would introducing a similar mutex to keep these operations grouped/atomic be
a reasonable approach to you, or should we be doing something else entirely
here?
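
For illustration, the "similar mutex" variant would look roughly like this;
kvm->gmem_lock is a made-up field, and kvm_vm_ioctl_set_mem_attributes() would
take the same lock across its begin()/end() window:

  static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
                                  loff_t len)
  {
	struct kvm_gmem *gmem = file->private_data;
	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + len) >> PAGE_SHIFT;
	struct kvm *kvm = gmem->kvm;

	mutex_lock(&kvm->gmem_lock);		/* hypothetical */
	filemap_invalidate_lock(file->f_mapping);

	kvm_gmem_invalidate_begin(kvm, gmem, start, end);
	kvm_arch_gmem_invalidate(kvm, gmem, start, end);
	truncate_inode_pages_range(file->f_mapping, offset, offset + len);
	kvm_gmem_invalidate_end(kvm, gmem, start, end);

	filemap_invalidate_unlock(file->f_mapping);
	mutex_unlock(&kvm->gmem_lock);		/* hypothetical */

	return 0;
  }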

Keep in mind that RMP updates can't be done while holding KVM->mmu_lock
spinlock, because we also need to unmap pages from the directmap, which can
lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
need to work around.

Thanks!

-Mike

[1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a

> 
> [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-12  0:21                   ` Michael Roth
@ 2023-05-12 18:01                     ` Sean Christopherson
  2023-05-22 13:50                       ` Michael Roth
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-05-12 18:01 UTC (permalink / raw)
  To: Michael Roth
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Thu, May 11, 2023, Michael Roth wrote:
> On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > 
> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> > 
> > 	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> 
> Hi Sean,
> 
> I've been working on getting the SNP patches ported to this but I'm having
> some trouble working out a reasonable scheme for how to work the
> RMPUPDATE hooks into the proposed design.
> 
> One of the main things is kvm_gmem_punch_hole(): this can free pages
> back to the host whenever userspace feels like it. Pages that are still
> marked private in the RMP table will blow up the host if they aren't returned
> to the normal state before handing them back to the kernel. So I'm trying to
> add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> e.g.:
> 
>   static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
>                                   loff_t len)
>   {
>           struct kvm_gmem *gmem = file->private_data;
>           pgoff_t start = offset >> PAGE_SHIFT;
>           pgoff_t end = (offset + len) >> PAGE_SHIFT;
>           struct kvm *kvm = gmem->kvm;
>   
>           /*
>            * Bindings must be stable across invalidation to ensure the start+end
>            * are balanced.
>            */
>           filemap_invalidate_lock(file->f_mapping);
>           kvm_gmem_invalidate_begin(kvm, gmem, start, end);
>   
>           /* Handle arch-specific cleanups before releasing pages */
>           kvm_arch_gmem_invalidate(kvm, gmem, start, end);
>           truncate_inode_pages_range(file->f_mapping, offset, offset + len);
>   
>           kvm_gmem_invalidate_end(kvm, gmem, start, end);
>           filemap_invalidate_unlock(file->f_mapping);
>   
>           return 0;
>   }
> 
> But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> the page in its intended state in the RMP table prior to mapping it into the
> guest's NPT.

IMO, this approach is wrong.  kvm->mem_attr_array is the source of truth for whether
userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
array has zero insight into the _host_ physical pages.  I.e. SNP shouldn't hook
kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
wrong.

A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
memslots, but it's a good analogy for KVM internals).  KVM needs to zap all mappings
for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
Same for changing attributes; unmap, but never map.

As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
establishing a new NPT mapping.  And mmu_invalidate_in_progress will reach '0' only
after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
can create new mappings only when both kvm->mem_attr_array and any relevant
guest_mem bindings have reached steady state.

That leaves the question of when/where to do RMP updates.  Off the cuff, I think
RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
the context of either (a) file truncation (make host owned, a.k.a. TDX reclaim)
or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
installing a shared mapping should _never_ need to touch the RMP because pages
coming from the normal (non-guest_mem) pool must already be host owned.
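
To make (b) concrete, a loose sketch of hooking the allocation path; the
kvm_arch_gmem_prepare() hook and the body are illustrative, not the branch's
actual code:

  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
  {
	struct folio *folio;

	/* Allocates (and locks) the folio if it isn't already present. */
	folio = filemap_grab_folio(inode->i_mapping, index);
	if (IS_ERR_OR_NULL(folio))	/* NULL or ERR_PTR, depending on kernel */
		return NULL;

	if (!folio_test_uptodate(folio)) {
		/*
		 * Freshly allocated backing page: this is where the pfn can
		 * be transitioned to guest-owned in the RMP (or the TDX
		 * equivalent), before the gfn is ever mapped.
		 */
		if (kvm_arch_gmem_prepare(inode, index, folio)) {	/* placeholder */
			folio_unlock(folio);
			folio_put(folio);
			return NULL;
		}
		folio_mark_uptodate(folio);
	}

	folio_unlock(folio);
	return folio;
  }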

> Currently I'm calling that hook via kvm_vm_ioctl_set_mem_attributes(), just
> after kvm->mem_attr_array is updated based on the ioctl. The reasoning there
> is that KVM MMU can then rely on the existing mmu_invalidate_seq logic to
> ensure both the state in the mem_attr_array and the RMP table are in sync and
> up-to-date once MMU lock is acquired and MMU is ready to map it, or retry
> #NPF otherwise.
> 
> But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially
> result in something like the following sequence if I implement things as above:
> 
>   CPU0: kvm_gmem_punch_hole():
>           kvm_gmem_invalidate_begin()
>           kvm_arch_gmem_invalidate()         // set pages to default/shared state in RMP table before free'ing
>   CPU1: kvm_vm_ioctl_set_mem_attributes():
>           kvm_arch_gmem_set_mem_attributes() // maliciously set pages to private in RMP table
>   CPU0:   truncate_inode_pages_range()       // HOST BLOWS UP TOUCHING PRIVATE PAGES
>           kvm_arch_gmem_invalidate_end()
> 
> One quick and lazy solution is to rely on the fact that
> kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the
> entire begin()/end() portion of the invalidation sequence, and to similarly
> hold the kvm->slots_lock throughout the begin()/end() sequence in
> kvm_gmem_punch_hole() to prevent any interleaving.
> 
> But I'd imagine overloading kvm->slots_lock is not the proper approach. But
> would introducing a similar mutex to keep these operations grouped/atomic be
> a reasonable approach to you, or should we be doing something else entirely
> here?
> 
> Keep in mind that RMP updates can't be done while holding KVM->mmu_lock
> spinlock, because we also need to unmap pages from the directmap, which can
> lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> need to work around.
> 
> Thanks!
> 
> -Mike
> 
> [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
> 
> > 
> > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-06  0:55                     ` Sean Christopherson
  2023-05-06  1:17                       ` Vishal Annapurve
@ 2023-05-15 23:46                       ` Sean Christopherson
  2023-07-13 22:46                       ` Ackerley Tng
  2 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-05-15 23:46 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Fri, May 05, 2023, Sean Christopherson wrote:
> On Fri, May 05, 2023, Ackerley Tng wrote:
> > One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> > not cleaned up, and hence it is possible to crash the host kernel in the
> > following way
> > 
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Create a memslot with the guest mem fd (hence binding the fd to the
> >    VM)
> > 4. Close/destroy the KVM VM
> > 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> >    when it tries to do invalidation.
> > 
> > I then tried to clean up the gmem->kvm pointer during unbinding when the
> > KVM VM is destroyed.
> > 
> > That works, but then I realized there’s a simpler way to use the pointer
> > after freeing:
> > 
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Close/destroy the KVM VM
> > 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> >    when it tries to do invalidation.
> > 
> > Perhaps binding should mean setting the gmem->kvm pointer in addition to
> > gmem->bindings. This makes binding and unbinding symmetric and avoids
> > the use-after-frees described above.
> 
> Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
> when the last binding is being removed.  A similar (also ugly) solution would be
> to nullify gmem->kvm when KVM dies.
> 
> I don't love either approach idea because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close().  I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell.  And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
> 
> Blech, another wart is that I believe gmem would need to do __module_get() during
> file creation to prevent kvm.ko from being unloaded after the last VM dies.  Ah,
> but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
> it's not _that_ ugly.
> 
> Exchanging references (at binding or at creation) doesn't work, because that
> creates a circular dependency, i.e. gmem and KVM would pin each other. 
> 
> A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
> nasty because of how KVM's memslots work.  The least awful approach I can think of
> would be to delete the associated memslot(s) when the file is released, possibly
> via deferred work to avoid deadlock issues.  Not the prettiest thing ever and in
> some ways that'd yield an even worse ABI.

Circling back to this.  Pending testing, the "proper" refcounting approach seems
to be the way to go.  KVM's existing memslots actually work this way, e.g. if
userspace does munmap() on an active memslot, KVM zaps any PTEs but the memslot
stays alive.  A similar approach can be taken here, the only wrinkle is that the
association between gmem and the memslot is stronger than between VMAs and memslots,
specifically KVM binds the file and not simply the file descriptor.  This is
necessary because not binding to an exact file would let userspace install a
different file at the file descriptor.

That's solvable without having to delete memslots though, e.g. by protecting the
file pointer in the memslot with RCU and directly bumping the refcount in the two
places where KVM needs to get at gmem (the file) via the memslot (unbind and
faulting in a page).  E.g.

  static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
  {
	struct file *file;

	rcu_read_lock();

	file = rcu_dereference(slot->gmem.file);
	if (file && !get_file_rcu(file))
		file = NULL;
	rcu_read_unlock();

	return file;
  }

The gotcha is that ->release could race with memslot deletion, as kvm_gmem_unbind()
won't be able to differentiate between "file was deleted" and "file is currently
being deleted".  That's easy enough to deal with though, kvm_gmem_release() can
take slots_lock to prevent the memslot from going away when nullifying and
invalidating ranges for the memslot.
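
A hedged sketch of that ->release side; the gmem->bindings xarray and the
slot->gmem.file field are assumptions based on the kvm_gmem_solo branch, not
a definitive implementation:

  static int kvm_gmem_release(struct inode *inode, struct file *file)
  {
	struct kvm_gmem *gmem = file->private_data;
	struct kvm *kvm = gmem->kvm;
	struct kvm_memory_slot *slot;
	unsigned long index;

	/* Prevent concurrent memslot deletion/unbind from racing with us. */
	mutex_lock(&kvm->slots_lock);
	filemap_invalidate_lock(file->f_mapping);

	xa_for_each(&gmem->bindings, index, slot)
		rcu_assign_pointer(slot->gmem.file, NULL);
	synchronize_rcu();

	/* Zap SPTEs and free the backing pages for the entire file. */
	kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
	truncate_inode_pages_final(file->f_mapping);
	kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);

	filemap_invalidate_unlock(file->f_mapping);
	mutex_unlock(&kvm->slots_lock);

	kfree(gmem);
	return 0;
  }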

> Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
> call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
> i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away.  The easy
> solution for that one is to add another synchronize_srcu_expedited() after unbinding.

There's a bug here, but not the one I pointed out.  Acquiring kvm->srcu doesn't
provide any protection, the binding already has a pointer to the memslot, i.e.
isn't doing an SRCU-protected lookup in the memslots.  The actual protection is
provided by the filemap invalidate lock, which prevents unbinding a memslot until
all invalidations complete, i.e. acquiring kvm->srcu in the punch hole path is
completely unnecessary.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-11 20:22                         ` Sean Christopherson
@ 2023-05-19  1:07                           ` Vishal Annapurve
  0 siblings, 0 replies; 398+ messages in thread
From: Vishal Annapurve @ 2023-05-19  1:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Yu Zhang, Kirill A . Shutemov, dhildenb, Quentin Perret, tabba,
	Michael Roth, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Thu, May 11, 2023 at 1:22 PM Sean Christopherson <seanjc@google.com> wrote:
> ...
> Ah, you're effectively suggesting a hybrid model where the file is the single
> source of truth for what's private versus shared, and KVM gets pfns through
> direct communication with the backing store via the file descriptor, but userspace
> can still control things via mmap() and friends.
>
> If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
> then we're back at Kirill's series, because otherwise there's no easy way for KVM
> to retrieve the pfn.
>

Yeah, I was hinting towards using the backdoor, where KVM still gets
private pfns via HVAs.

> A form of this was also discussed, though I don't know how much of the discussion
> happened on-list.
>
> KVM actually does something like this for s390's Ultravisor (UV), which is quite
> a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
> much more gracefully.  Specifically, when the untrusted host attempts to access a
> secure page, a fault occurs and the kernel responds by telling UV to export the
> page.  The fault is gracefully handled even for kernel accesses
> (see do_secure_storage_access()).  The kernel does BUG() if the export fails when
> handling fault from kernel context, but my understanding is that export can fail
> if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
> success, and goes straight to BUG()/panic() if something goes wrong.
>
> On the guest side, accesses to exported (swapped) secure pages generate intercepts
> and KVM faults in the page.  To do so, KVM freezes the page/folio refcount, tells
> the UV to import the page, and then unfreezes the page/folio.  But very crucially,
> when _anything_ in the untrusted host attempts to access the secure page, the
> above fault handling for untrusted host accesses kicks in.  In other words, the
> guest can cause thrash, but can't bring down the host.
>

Yeah, this is very similar to what I was trying to propose. Except in
this case, the backing store, i.e. guest_mem, will have to leave the
fault unhandled for untrusted host accesses to private ranges of the
guest_mem file.

> TDX on the other hand silently poisons memory, i.e. doesn't even generate a
> synchronous fault.  Thus the kernel needs to be 100% perfect on preventing _any_
> accesses to private memory from the host, and doing that is non-trivial and
> invasive.
>
> SNP does synchronously fault, but the automatic conversion in the #PF handler
> got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
> guest can trigger concurrent RMP modifications.  So the end result ends up being
> the same as TDX, host accesses need to be completely prevented.
>
> Again, this is all doable, but costly.  And IMO, provides very little value.

With this hybrid approach, where KVM has backdoor access to pfns, do
we see a scenario where the host can bypass the guest_mem restrictions
and still be able to access the private ranges via HVA ranges? One
possibility is that these pages are mapped in the IOMMU (when they are
shared) and then get converted to private without getting unmapped
from the IOMMU. Maybe KVM can disallow converting ranges which are
pinned for DMA (not sure if there is a way to do that).
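
As a thought experiment (not a claim that this is workable), such a check
could walk the instantiated folios of the range being converted and refuse
the conversion if anything looks pinned:

  static bool gmem_range_maybe_dma_pinned(struct address_space *mapping,
					  pgoff_t start, pgoff_t end)
  {
	pgoff_t index = start;

	while (index < end) {
		struct folio *folio = filemap_get_folio(mapping, index);

		if (IS_ERR_OR_NULL(folio)) {	/* offset not instantiated */
			index++;
			continue;
		}

		if (folio_maybe_dma_pinned(folio)) {
			folio_put(folio);
			return true;
		}

		index = folio_next_index(folio);
		folio_put(folio);
	}

	return false;
  }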

Few additional benefits here:
1) Possibly handle the pkvm usecase in this series without the need
for additional modifications.
2) Handling UPM for normal VMs possibly could get simpler as this
hybrid approach can allow preserving the contents across conversions.

>
> Allowing things like mbind() is nice-to-have at best; implementing fbind()
> isn't straightforward, but it's arguably valuable to have irrespective of this
> discussion, e.g. to allow userspace to say "use this policy regardless of what
> process maps the file".
>

Agreed, having mbind supported is not a significant gain given the cost here.

> Using a common memory pool (same physical page is used for both shared and private)
> is a similar story.  There are plenty of existing controls to limit userspace/guest
> memory usage and to deal with OOM scenarios, so barring egregious host accounting
> and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
> is that a VM is terminated because host userspace is buggy.  On the flip side, using
> a common pool brings complexity into the kernel, as backing stores would need to
> be taught to deny access to a subset of pages in their mappings, and in multiple
> paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

In this case the backing store that needs to be modified would just be
guest_mem though.

>
> [*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@www.fastmail.com
>
> > > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> > >
> > >  - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> > >    is inefficient and inflexible compared to software defined structures, especially
> > >    for the expected use cases for CoCo guests.
> > >
> > >  - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> > >    let alone a 1:1 pfn:gfn mapping.
> >
> > Maybe KVM can ensure that each page of the guest_mem file is
> > associated with a single memslot.
>
> This is a hard NAK.  Guest physical address space is guaranteed to have holes
> and/or be discontiguous, e.g. for the PCI hole at the top of lower memory.  Allowing
> only a single binding would prevent userspace from backing all (or large chunks)
> of guest memory with a single file.
>

Poor choice of words from my side. I meant to suggest that KVM can
ensure that ANY page of the guest_mem file is associated with at most
one memslot.

> ...
> That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
> not for a generic VMA-based implementation.  If the file isn't the single source
> of truth, then forcing all mappings to go away simply can't work.
>
> > > #3 is also a limiter.  E.g. if a guest is primarly backed by 1GiB pages, keeping
> > > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > > and possibly even if the guest converts a few MiB of memory.
> >
> > This caveat can maybe be lived with, as shared ranges most likely won't
> > be backed by 1G pages anyway, possibly causing IO performance to take a
> > hit. This probably needs more discussion about the conversion
> > granularity used by guests.
>
> Yes, it's not the end of the world.  My point is that separating shared and private
> memory provides more flexibility.  Maybe that flexibility never ends up being
> super important, but at the same time we shouldn't willingly paint ourselves into
> a corner.

There are some performance implications here with the split approach.

This flexibility actually comes at the cost of having to manage double
allocation effectively. As the granularity of mappings for shared memory
increases, it gets difficult to cap the amount of double allocation. So
effectively it comes down to always using a smaller granularity for
shared memory, and also for the private memory of converted ranges. In
general, performance requirements will always push for higher mapping
granularities depending on the scale of usage.

In general private memory (and also the shared memory on respective
conversions) will always need to be hole punched to ensure that the
double allocation won't happen. And so, even if this is something for
the future, using hugetlbfs pages for backing private memory with the
split model effectively makes it impossible to cap the double
allocation. I am not sure if the 1G pages can be handled better with
the hybrid model but maybe it's worth checking.

The split shared/private memory approach also increases the
uncertainties around memory management in general, since the same
amount of memory that was available earlier is first freed to the
system and then allocated back from it. E.g. even if hugepages were
around when private memory was initially allocated, further allocations
keep increasing the chances of not being able to use a huge page to
back the memory, even if the whole huge page is private/shared.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
                     ` (6 preceding siblings ...)
  2023-02-09  7:25   ` Isaku Yamahata
@ 2023-05-19 17:32   ` Nicolas Saenz Julienne
  2023-05-19 18:23     ` Sean Christopherson
  7 siblings, 1 reply; 398+ messages in thread
From: Nicolas Saenz Julienne @ 2023-05-19 17:32 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, graf, seanjc
  Cc: Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang, anelkz

Hi,

On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:

[...]

> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range that have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.

We have been looking into adding support for the Hyper-V VSM extensions
which Windows uses to implement Credential Guard. This interface seems
like a good fit for one of its underlying features. I just wanted to
share a bit about it, and see if we can expand it to fit this use-case.
Note that this was already briefly discussed between Sean and Alex some
time ago[1].

VSM introduces isolated guest execution contexts called Virtual Trust
Levels (VTL) [2]. Each VTL has its own memory access protections,
virtual processors states, interrupt controllers and overlay pages. VTLs
are hierarchical and might enforce memory protections on less privileged
VTLs. Memory protections are enforced on a per-GPA granularity.

The list of possible protections is:
- No access -- This needs a new memory attribute, I think.
- Read-only, no execute
- Read-only, execute
- Read/write, no execute
- Read/write, execute

We implemented this in the past by using a separate address space per
VTL and updating memory regions on protection changes. But having to
update the memory slot layout for every permission change scales poorly,
especially as we have to perform 100.000s of these operations at boot
(see [1] for a little more context).

I believe the biggest barrier for us to use memory attributes is not
having the ability to target specific address spaces, or to the very
least having some mechanism to maintain multiple independent layers of
attributes.
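
Not proposing this, but to make "multiple independent layers" concrete, one
purely hypothetical shape would be a per-VTL attribute xarray with the
effective protection being the intersection of all layers:

  #define NUM_VTLS	3	/* hypothetical count */

  struct vtl_mem_attrs {
	struct xarray layer[NUM_VTLS];	/* per-gfn attributes, one xarray per VTL */
  };

  static u64 vtl_effective_attrs(struct vtl_mem_attrs *va, gfn_t gfn)
  {
	u64 attrs = KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_WRITE |
		    KVM_MEMORY_ATTRIBUTE_EXECUTE;
	void *entry;
	int vtl;

	/* A VTL that never set anything for a gfn imposes no restriction. */
	for (vtl = 0; vtl < NUM_VTLS; vtl++) {
		entry = xa_load(&va->layer[vtl], gfn);
		if (entry)
			attrs &= xa_to_value(entry);
	}

	return attrs;
  }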

Also sorry for not posting our VSM patches. They are not ready for
upstream review yet, but we're working on it.

Nicolas

[1] https://patchwork.kernel.org/comment/25054908/
[2] See Chapter 15 of Microsoft's 'Hypervisor Top Level Functional Specification':
    https://raw.githubusercontent.com/MicrosoftDocs/Virtualization-Documentation/main/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-05-19 17:32   ` Nicolas Saenz Julienne
@ 2023-05-19 18:23     ` Sean Christopherson
  2023-05-19 19:49       ` Nicolas Saenz Julienne
  2023-05-23 18:59       ` Nicolas Saenz Julienne
  0 siblings, 2 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-05-19 18:23 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, graf,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang, anelkz

On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> Hi,
> 
> On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> 
> [...]
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range that have been successfully set to the attributes.
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> 
> We have been looking into adding support for the Hyper-V VSM extensions
> which Windows uses to implement Credential Guard. This interface seems
> like a good fit for one of its underlying features. I just wanted to
> share a bit about it, and see if we can expand it to fit this use-case.
> Note that this was already briefly discussed between Sean and Alex some
> time ago[1].
> 
> VSM introduces isolated guest execution contexts called Virtual Trust
> Levels (VTL) [2]. Each VTL has its own memory access protections,
> virtual processors states, interrupt controllers and overlay pages. VTLs
> are hierarchical and might enforce memory protections on less privileged
> VTLs. Memory protections are enforced on a per-GPA granularity.
> 
> The list of possible protections is:
> - No access -- This needs a new memory attribute, I think.

No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
get all the possible combinations.  E.g. this is RWX=000b

> - Read-only, no execute

RWX=100b (using my completely arbitrary ordering of RWX bits :-) )

> - Read-only, execute

RWX=101b

> - Read/write, no execute

RWX=110b

> - Read/write, execute

RWX=111b
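
Spelled out as a table (attribute names from the proposed uapi above; the VTL
protection enum is illustrative only):

  enum vtl_prot { VTL_NO_ACCESS, VTL_RO_NX, VTL_RO_X, VTL_RW_NX, VTL_RW_X };

  static const u64 vtl_prot_to_attrs[] = {
	[VTL_NO_ACCESS]	= 0,
	[VTL_RO_NX]	= KVM_MEMORY_ATTRIBUTE_READ,
	[VTL_RO_X]	= KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_EXECUTE,
	[VTL_RW_NX]	= KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_WRITE,
	[VTL_RW_X]	= KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_WRITE |
			  KVM_MEMORY_ATTRIBUTE_EXECUTE,
  };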

> We implemented this in the past by using a separate address space per
> VTL and updating memory regions on protection changes. But having to
> update the memory slot layout for every permission change scales poorly,
> especially as we have to perform 100.000s of these operations at boot
> (see [1] for a little more context).
> 
> I believe the biggest barrier for us to use memory attributes is not
> having the ability to target specific address spaces, or to the very
> least having some mechanism to maintain multiple independent layers of
> attributes.

Can you elaborate on "specific address spaces"?  In KVM, that usually means SMM,
but the VTL comment above makes me think you're talking about something entirely
different.  E.g. can you provide a brief summary of the requirements/expectations?

> Also sorry for not posting our VSM patches. They are not ready for
> upstream review yet, but we're working on it.
> 
> Nicolas
> 
> [1] https://patchwork.kernel.org/comment/25054908/
> [2] See Chapter 15 of Microsoft's 'Hypervisor Top Level Functional Specification':
>     https://raw.githubusercontent.com/MicrosoftDocs/Virtualization-Documentation/main/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-05-19 18:23     ` Sean Christopherson
@ 2023-05-19 19:49       ` Nicolas Saenz Julienne
  2023-05-19 19:57         ` Sean Christopherson
  2023-05-23 18:59       ` Nicolas Saenz Julienne
  1 sibling, 1 reply; 398+ messages in thread
From: Nicolas Saenz Julienne @ 2023-05-19 19:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, graf,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang, anelkz

Hi Sean,

On Fri May 19, 2023 at 6:23 PM UTC, Sean Christopherson wrote:
> On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> > Hi,
> >
> > On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> >
> > [...]
> > > +The user sets the per-page memory attributes to a guest memory range indicated
> > > +by address/size, and in return KVM adjusts address and size to reflect the
> > > +actual pages of the memory range that have been successfully set to the attributes.
> > > +If the call returns 0, "address" is updated to the last successful address + 1
> > > +and "size" is updated to the remaining address size that has not been set
> > > +successfully. The user should check the return value as well as the size to
> > > +decide if the operation succeeded for the whole range or not. The user may want
> > > +to retry the operation with the returned address/size if the previous range was
> > > +partially successful.
> > > +
> > > +Both address and size should be page aligned and the supported attributes can be
> > > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > > +
> > > +The "flags" field may be used for future extensions and should be set to 0s.
> >
> > We have been looking into adding support for the Hyper-V VSM extensions
> > which Windows uses to implement Credential Guard. This interface seems
> > like a good fit for one of its underlying features. I just wanted to
> > share a bit about it, and see if we can expand it to fit this use-case.
> > Note that this was already briefly discussed between Sean and Alex some
> > time ago[1].
> >
> > VSM introduces isolated guest execution contexts called Virtual Trust
> > Levels (VTL) [2]. Each VTL has its own memory access protections,
> > virtual processors states, interrupt controllers and overlay pages. VTLs
> > are hierarchical and might enforce memory protections on less privileged
> > VTLs. Memory protections are enforced on a per-GPA granularity.
> >
> > The list of possible protections is:
> > - No access -- This needs a new memory attribute, I think.
>
> No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
> get all the possible combinations.  E.g. this is RWX=000b

That's not what the current implementation does; when attributes is
equal to 0, it clears the entries from the xarray:

static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
					   struct kvm_memory_attributes *attrs)
{

    entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
[...]
    for (i = start; i < end; i++)
    	if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
    			    GFP_KERNEL_ACCOUNT)))
        		break;
}

From Documentation/core-api/xarray.rst:

"There is no difference between an entry that has never
been stored to, one that has been erased and one that has most recently
had ``NULL`` stored to it."

The way I understood the series, there needs to be a differentiation
between no attributes (regular page fault) and no-access.

> > We implemented this in the past by using a separate address space per
> > VTL and updating memory regions on protection changes. But having to
> > update the memory slot layout for every permission change scales poorly,
> > especially as we have to perform 100.000s of these operations at boot
> > (see [1] for a little more context).
> >
> > I believe the biggest barrier for us to use memory attributes is not
> > having the ability to target specific address spaces, or to the very
> > least having some mechanism to maintain multiple independent layers of
> > attributes.
>
> Can you elaborate on "specific address spaces"?  In KVM, that usually means SMM,
> but the VTL comment above makes me think you're talking about something entirely
> different.  E.g. can you provide a brief summary of the requirements/expectations?

I'll do so with a clear head on Monday. :)

Thanks!
Nicolas

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-05-19 19:49       ` Nicolas Saenz Julienne
@ 2023-05-19 19:57         ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-05-19 19:57 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, graf,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang, anelkz

On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> Hi Sean,
> 
> On Fri May 19, 2023 at 6:23 PM UTC, Sean Christopherson wrote:
> > On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> > > Hi,
> > >
> > > On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> > >
> > > [...]
> > > > +The user sets the per-page memory attributes to a guest memory range indicated
> > > > +by address/size, and in return KVM adjusts address and size to reflect the
> > > > +actual pages of the memory range that have been successfully set to the attributes.
> > > > +If the call returns 0, "address" is updated to the last successful address + 1
> > > > +and "size" is updated to the remaining address size that has not been set
> > > > +successfully. The user should check the return value as well as the size to
> > > > +decide if the operation succeeded for the whole range or not. The user may want
> > > > +to retry the operation with the returned address/size if the previous range was
> > > > +partially successful.
> > > > +
> > > > +Both address and size should be page aligned and the supported attributes can be
> > > > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > > > +
> > > > +The "flags" field may be used for future extensions and should be set to 0s.
> > >
> > > We have been looking into adding support for the Hyper-V VSM extensions
> > > which Windows uses to implement Credential Guard. This interface seems
> > > like a good fit for one of its underlying features. I just wanted to
> > > share a bit about it, and see if we can expand it to fit this use-case.
> > > Note that this was already briefly discussed between Sean and Alex some
> > > time ago[1].
> > >
> > > VSM introduces isolated guest execution contexts called Virtual Trust
> > > Levels (VTL) [2]. Each VTL has its own memory access protections,
> > > virtual processors states, interrupt controllers and overlay pages. VTLs
> > > are hierarchical and might enforce memory protections on less privileged
> > > VTLs. Memory protections are enforced on a per-GPA granularity.
> > >
> > > The list of possible protections is:
> > > - No access -- This needs a new memory attribute, I think.
> >
> > No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
> > get all the possible combinations.  E.g. this is RWX=000b
> 
> That's not what the current implementation does, when attributes is
> equal 0 it clears the entries from the xarray:
> 
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> 					   struct kvm_memory_attributes *attrs)
> {
> 
>     entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> [...]
>     for (i = start; i < end; i++)
>     	if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
>     			    GFP_KERNEL_ACCOUNT)))
>         		break;
> }
> 
> >From Documentation/core-api/xarray.rst:
> 
> "There is no difference between an entry that has never
> been stored to, one that has been erased and one that has most recently
> had ``NULL`` stored to it."
> 
> The way I understood the series, there needs to be a differentiation
> between no attributes (regular page fault) and no-access.

Ah, I see what you're saying.  There are multiple ways to solve things without a
"no access" flag while still maintaining an empty xarray for the default case.
E.g. invert the flags to be DENY flags[*], have an internal-only "entry valid" flag,
etc.

[*] I vaguely recall suggesting a "deny" approach somewhere, but I may just be
    making things up to make it look like I thought deeply about this ;-)
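
A sketch of the "internal-only valid flag" variant (the names and bit choice
are invented): userspace can still pass RWX=000 for no-access, and KVM stores
it with a private bit set so the xarray entry is never NULL/absent:

  #define KVM_MEMORY_ATTRIBUTE_UAPI_MASK	GENMASK_ULL(3, 0)	/* R/W/X/PRIVATE */
  #define KVM_MEMORY_ATTRIBUTE_VALID		BIT_ULL(16)		/* internal only */

  static void *kvm_attrs_to_xa_entry(u64 attrs)
  {
	return xa_mk_value(attrs | KVM_MEMORY_ATTRIBUTE_VALID);
  }

  static bool kvm_xa_entry_to_attrs(void *entry, u64 *attrs)
  {
	if (!entry)
		return false;	/* never set, fall back to default behavior */

	*attrs = xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_UAPI_MASK;
	return true;		/* explicitly set, possibly to RWX=000 */
  }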

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-12 18:01                     ` Sean Christopherson
@ 2023-05-22 13:50                       ` Michael Roth
  2023-05-22 17:09                         ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Michael Roth @ 2023-05-22 13:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> On Thu, May 11, 2023, Michael Roth wrote:
> > On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > > 
> > > Code is available here if folks want to take a look before any kind of formal
> > > posting:
> > > 
> > > 	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> > 
> > Hi Sean,
> > 
> > I've been working on getting the SNP patches ported to this but I'm having
> > some trouble working out a reasonable scheme for how to work the
> > RMPUPDATE hooks into the proposed design.
> > 
> > One of the main things is kvm_gmem_punch_hole(): this can free pages
> > back to the host whenever userspace feels like it. Pages that are still
> > marked private in the RMP table will blow up the host if they aren't returned
> > to the normal state before handing them back to the kernel. So I'm trying to
> > add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> > e.g.:
> > 
> >   static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
> >                                   loff_t len)
> >   {
> >           struct kvm_gmem *gmem = file->private_data;
> >           pgoff_t start = offset >> PAGE_SHIFT;
> >           pgoff_t end = (offset + len) >> PAGE_SHIFT;
> >           struct kvm *kvm = gmem->kvm;
> >   
> >           /*
> >            * Bindings must be stable across invalidation to ensure the start+end
> >            * are balanced.
> >            */
> >           filemap_invalidate_lock(file->f_mapping);
> >           kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> >   
> >           /* Handle arch-specific cleanups before releasing pages */
> >           kvm_arch_gmem_invalidate(kvm, gmem, start, end);
> >           truncate_inode_pages_range(file->f_mapping, offset, offset + len);
> >   
> >           kvm_gmem_invalidate_end(kvm, gmem, start, end);
> >           filemap_invalidate_unlock(file->f_mapping);
> >   
> >           return 0;
> >   }
> > 
> > But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> > the page in its intended state in the RMP table prior to mapping it into the
> > guest's NPT.
> 
> IMO, this approach is wrong.  kvm->mem_attr_array is the source of truth for whether
> userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
> array has zero insight into the _host_ physical pages.  I.e. SNP shouldn't hook
> kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
> wrong.
> 
> A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
> memslots, but it's a good analogy for KVM internals).  KVM needs to zap all mappings
> for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
> Same for changing attributes; unmap, but never map.
> 
> As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
> NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
> establishing a new NPT mapping.  And mmu_invalidate_in_progress will reach '0' only
> after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
> can create new mappings only when both kvm->mem_attr_array and any relevant
> guest_mem bindings have reached steady state.
> 
> That leaves the question of when/where to do RMP updates.  Off the cuff, I think
> RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
> the context of either (a) file truncation (make the page host owned, a.k.a. TDX reclaim)
> or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
> Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
> installing a shared mapping should _never_ need to touch the RMP because pages
> coming from the normal (non-guest_mem) pool must already be host owned.

Hi Sean, thanks for the suggestions.

I reworked things based on this approach and things seem to work out
pretty nicely for SNP.

I needed to add the hook to kvm_gmem_get_pfn() instead of
kvm_gmem_get_folio() because SNP needs to know the GFN in order to mark
the page as private in the RMP table, but otherwise I think things are
the same as what you had in mind. One downside to this approach is that, since
the hook always gets called during kvm_gmem_get_pfn(), we need to do an
extra RMP lookup to determine whether or not that page has already been
set to private state, vs. being able to assume it's already been put in
the expected state, but it's only a memory access so not a huge
overhead. Not sure if that would be a concern or not on the TDX side
though.

I put together a tree with some fixups that are needed against the
kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
preparing the initial private state as suggested above, and a
platform-configurable mask that the x86 MMU code can use for determining
whether a fault is for private vs. shared pages.

  KVM: x86: Determine shared/private faults using a configurable mask
  ^ for TDX we could trivially add an inverted analogue of the mask/logic
  KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
  KVM: x86: Add platform hooks for private memory invalidations
  KVM: x86: Add platform hook for initializing private memory
  *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
  *fixup (kvm_gmem_solo): KVM: selftests: update kvm_create_guest_memfd struct usage

  https://github.com/mdroth/linux/commits/kvm_gmem_solo_x86

I'm hoping these are similarly usable for TDX, but could use some input
from TDX folks on that aspect.
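
For reference, the rough shape of the kvm_gmem_get_pfn() hook in the tree above
is something like this sketch (simplified; rmp_lookup()/rmp_make_private() and
snp_asid() are placeholders standing in for the real RMP/ASID accessors, not
their actual names or signatures):

  static int snp_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int level)
  {
          bool assigned;
          int rc;

          /* Extra lookup: the pfn may already be private from an earlier fault. */
          rc = rmp_lookup(pfn, &assigned);
          if (rc)
                  return rc;
          if (assigned)
                  return 0;

          /* Transition the backing pfn to guest-owned/private for this gfn. */
          return rmp_make_private(pfn, gfn_to_gpa(gfn), level, snp_asid(kvm));
  }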

> > 
> > Keep in mind that RMP updates can't be done while holding KVM->mmu_lock
> > spinlock, because we also need to unmap pages from the directmap, which can
> > lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> > need to work around.

This concern also ends up going away since GFP_RECLAIM also has similar
issues when called under kvm->mmu_lock, so having the hook in
kvm_gmem_get_pfn() sort of guarantees we wouldn't hit issues with this.

-Mike

> > 
> > Thanks!
> > 
> > -Mike
> > 
> > [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a
> > 
> > > 
> > > [1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
> > > [2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
> > > [3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@linux.intel.com

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-22 13:50                       ` Michael Roth
@ 2023-05-22 17:09                         ` Sean Christopherson
  2023-05-22 23:58                           ` Michael Roth
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-05-22 17:09 UTC (permalink / raw)
  To: Michael Roth
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Mon, May 22, 2023, Michael Roth wrote:
> On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > On Thu, May 11, 2023, Michael Roth wrote:
> I put together a tree with some fixups that are needed for against the
> kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> preparing the initial private state as suggested above, and a
> platform-configurable mask that the x86 MMU code can use for determining
> whether a fault is for private vs. shared pages.
> 
>   KVM: x86: Determine shared/private faults using a configurable mask
>   ^ for TDX we could trivially add an inverted analogue of the mask/logic
>   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
>   KVM: x86: Add platform hooks for private memory invalidations

Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
being zapped is private.

That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
there's a way to cleanly handle that case without another hook.  I think it's worth
mapping out exactly what we want unbind() to look like anyways, e.g. right now the
code subtly relies on private memslots being immutable.
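
For reference, the suggestion amounts to something like the below (sketch only,
exact field placement/naming TBD):

  struct kvm_gfn_range {
          struct kvm_memory_slot *slot;
          gfn_t start;
          gfn_t end;
          pte_t pte;
          bool may_block;
          bool is_private;        /* tells arch/vendor code the zapped range is private */
  };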

>   KVM: x86: Add platform hook for initializing private memory

This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

The only reason to add another arch hook would be if we wanted to convert the
RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
userspace to effectively prefault guest memory, e.g. to setup the RMP before
running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
e.g. on the dest during live migration to reduce jitter.
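
E.g. the shape of such an ioctl might be something like this (purely
illustrative; the name, ioctl number and layout are made up for discussion, not
a proposed ABI):

  struct kvm_prefault_memory {
          __u64 gpa;              /* start of the range to prefault */
          __u64 size;             /* bytes, must be page aligned */
          __u64 flags;            /* e.g. "map as private" */
          __u64 reserved[5];
  };

  /* 0xd0 is an arbitrary example number. */
  #define KVM_PREFAULT_MEMORY     _IOWR(KVMIO, 0xd0, struct kvm_prefault_memory)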

>   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations

There was another bug in this path.  The math for handling a non-zero offset into
the file was wrong.  The code now looks like:

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		struct kvm_gfn_range gfn_range = {
			.start = slot->base_gfn + start - slot->gmem.index,
			.end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
			.slot = slot,
			.pte = __pte(0),
			.may_block = true,
		};

		if (WARN_ON_ONCE(start < slot->gmem.index ||
				 end > slot->gmem.index + slot->npages))
			continue;

		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);

		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
	}

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-22 17:09                         ` Sean Christopherson
@ 2023-05-22 23:58                           ` Michael Roth
  2023-05-23  0:21                             ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Michael Roth @ 2023-05-22 23:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> On Mon, May 22, 2023, Michael Roth wrote:
> > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > On Thu, May 11, 2023, Michael Roth wrote:
> > I put together a tree with some fixups that are needed for against the
> > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > preparing the initial private state as suggested above, and a
> > platform-configurable mask that the x86 MMU code can use for determining
> > whether a fault is for private vs. shared pages.
> > 
> >   KVM: x86: Determine shared/private faults using a configurable mask
> >   ^ for TDX we could trivially add an inverted analogue of the mask/logic
> >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> >   KVM: x86: Add platform hooks for private memory invalidations
> 
> Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
> kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> being zapped is private.

kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so
it might be tricky to tie RMP updates into that path.

> 
> That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
> there's a way to cleanly handle that case without another hook.  I think it's worth
> mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> code subtly relies on private memslots being immutable.

I thought the direction you were sort of driving at was to completely decouple
RMP updates for physical pages from the KVM MMU map/unmap paths since the
life-cycles of those backing pages and associated RMP state are somewhat
separate from the state of the GFNs and kvm->mem_attr_array. It seems to
make sense when dealing with things like this unbind() case.

There's also cases like userspaces that opt to not discard memory after
conversions because they highly favor performance over memory usage. In
those cases it would make sense to defer marking the pages as shared in
the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
KVM MMU invalidation path after a conversion.

> 
> >   KVM: x86: Add platform hook for initializing private memory
> 
> This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
> code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().

I think that approach would work fine. The way I was thinking of things
is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
page before mapping it into the guest, so moving it out into an explicit
call should work just as well. That would also drop the need for the
__kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
need to access the PFN prior to making it private.

> 
> The only reason to add another arch hook would be if we wanted to converted the
> RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
> userspace to effectively prefault guest memory, e.g. to setup the RMP before
> running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
> e.g. on the dest during live migration to reduce jitter.

Agreed, deferring the RMPUPDATE until it's actually needed would give us
more flexibility on optimizing for things like lazy-acceptance.

For less-common scenarios like preallocation it makes sense to make that
an opt-in sort of thing for userspace to configure explicitly.

> 
> >   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> 
> There was another bug in this path.  The math for handling a non-zero offsets into
> the file was wrong.  The code now looks like:
> 
> 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> 		struct kvm_gfn_range gfn_range = {
> 			.start = slot->base_gfn + start - slot->gmem.index,

Sorry if I'm missing something here, but isn't there a risk that:

  start - slot->gmem.index

would be less than zero? E.g. starting GFN was 0, but current slot is bound
at some non-zero offset in the same gmem instance. I guess the warning below
should catch that, but it seems like a real scenario.

Since 'index' corresponds to the gmem offset of the current slot, is there any
reason not to do something like this?:

  .start = slot->base_gfn + index - slot->gmem.index,

But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
we can just simplify to this?:

  .start = slot->base_gfn,

-Mike

> 			.end = slot->base_gfn + min(end - slot->gmem.index, slot->npages),
> 			.slot = slot,
> 			.pte = __pte(0),
> 			.may_block = true,
> 		};
> 
> 		if (WARN_ON_ONCE(start < slot->gmem.index ||
> 				 end > slot->gmem.index + slot->npages))
> 			continue;
> 
> 		kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
> 
> 		flush |= kvm_unmap_gfn_range(kvm, &gfn_range);
> 	}

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-22 23:58                           ` Michael Roth
@ 2023-05-23  0:21                             ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-05-23  0:21 UTC (permalink / raw)
  To: Michael Roth
  Cc: David Hildenbrand, Chao Peng, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson, Joerg Roedel, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, dhildenb,
	Quentin Perret, tabba, wei.w.wang, Mike Rapoport, Liam Merwick,
	Isaku Yamahata, Jarkko Sakkinen, Ackerley Tng, kvm, linux-kernel,
	Hugh Dickins, Christian Brauner

On Mon, May 22, 2023, Michael Roth wrote:
> On Mon, May 22, 2023 at 10:09:40AM -0700, Sean Christopherson wrote:
> > On Mon, May 22, 2023, Michael Roth wrote:
> > > On Fri, May 12, 2023 at 11:01:10AM -0700, Sean Christopherson wrote:
> > > > On Thu, May 11, 2023, Michael Roth wrote:
> > > I put together a tree with some fixups that are needed for against the
> > > kvm_gmem_solo base tree, and a set of hooks to handle invalidations,
> > > preparing the initial private state as suggested above, and a
> > > platform-configurable mask that the x86 MMU code can use for determining
> > > whether a fault is for private vs. shared pages.
> > > 
> > >   KVM: x86: Determine shared/private faults using a configurable mask
> > >   ^ for TDX we could trivially add an inverted analogue of the mask/logic
> > >   KVM: x86: Use full 64-bit error code for kvm_mmu_do_page_fault
> > >   KVM: x86: Add platform hooks for private memory invalidations
> > 
> > Hrm, I'd prefer to avoid adding another hook for this case, arch code already has
> > a "hook" in the form of kvm_unmap_gfn_range().  We'd probably just need a
> > kvm_gfn_range.is_private flag to communicate to arch/vendor code that the memory
> > being zapped is private.
> 
> kvm_unmap_gfn_range() does however get called with kvm->mmu_lock held so
> it might be tricky to tie RMP updates into that path.

Gah, I caught the mmu_lock issue before the end of my email, but forgot to go back
and rethink the first half.

> > That'd leave a gap for the unbind() case because kvm_unmap_gfn_range() is invoked
> > if and only if there's an overlapping memslot.  I'll chew on that a bit to see if
> > there's a way to cleanly handle that case without another hook.  I think it's worth
> > mapping out exactly what we want unbind() to look like anyways, e.g. right now the
> > code subtly relies on private memslots being immutable.
> I thought the direction you sort of driving at was to completely decouple
> RMP updates for physical pages from the KVM MMU map/unmap paths since the
> life-cycles of those backing pages and associated RMP state are somewhat
> separate from the state of the GFNs and kvm->mem_attr_array. It seems to
> make sense when dealing with things like this unbind() case.
> 
> There's also cases like userspaces that opt to not discard memory after
> conversions because they highly favor performance over memory usage. In
> those cases it would make sense to defer marking the pages as shared in
> the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
> KVM MMU invalidation path after a conversion.

Hmm, right.  I got overzealous in my desire to avoid new hooks.

> > >   KVM: x86: Add platform hook for initializing private memory
> > 
> > This should also be unnecessary.  The call to kvm_gmem_get_pfn() is from arch
> > code, KVM just needs to ensure the RMP is converted before acquiring mmu_lock,
> > e.g. KVM has all the necessary info in kvm_tdp_mmu_page_fault().
> 
> I think that approach would work fine. The way I was thinking of things
> is that KVM MMU would necessarily call kvm_gmem_get_pfn() to grab the
> page before mapping it into the guest, so moving it out into an explicit
> call should work just as well. That would also drop the need for the
> __kvm_gmem_get_pfn() stuff I needed to add for the initial case where we
> need to access the PFN prior to making it private.
> 
> > 
> > The only reason to add another arch hook would be if we wanted to converted the
> > RMP when _allocating_, e.g. to preconvert in response to fallocate() instead of
> > waiting until #NPF.  But I think I would rather add a generic ioctl() to allow
> > userspace to effectively prefault guest memory, e.g. to setup the RMP before
> > running a vCPU.  Such an ioctl() would potentially be useful in other scenarios,
> > e.g. on the dest during live migration to reduce jitter.
> 
> Agreed, deferring the RMPUPDATE until it's actually needed would give us
> more flexibility on optimizing for things like lazy-acceptance.
> 
> For less-common scenarios like preallocation it makes sense to make that
> an opt-in sort of thing for userspace to configure explicitly.
> 
> > 
> > >   *fixup (kvm_gmem_solo): KVM: Fix end range calculation for MMU invalidations
> > 
> > There was another bug in this path.  The math for handling a non-zero offsets into
> > the file was wrong.  The code now looks like:
> > 
> > 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> > 		struct kvm_gfn_range gfn_range = {
> > 			.start = slot->base_gfn + start - slot->gmem.index,
> 
> Sorry if I'm missing something here, but isn't there a risk that:
> 
>   start - slot->gmem.index
> 
> would be less than zero? E.g. starting GFN was 0, but current slot is bound
> at some non-zero offset in the same gmem instance. I guess the warning below
> shouldn't caught that, but it seems like a real scenario.

Heh, only if there's a testcase for it.  Assuming start >= the slot offset does
seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots
will have index==slot->gmem.index > start.

> Since 'index' corresponds to the gmem offset of the current slot, is there any
> reason not to do something like this?:
> 
>   .start = slot->base_gfn + index - slot->gmem.index,
> 
> But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
> we case just simplify to this?:
> 
>   .start = slot->base_gfn,

No, e.g. if start is partway through a memslot, there's no need to invalidate
the entire memslot.  I'll stare at this tomorrow when my brain is hopefully a
bit more functional, I suspect there is a min() and/or max() needed somewhere.
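
(One possible clamping, untested and offered only as a sketch of the idea, is to
bound the range to the overlap with the current binding before translating to
gfns, i.e. a fragment of the xa_for_each_range() body above:)

	pgoff_t lo = max(start, slot->gmem.index);
	pgoff_t hi = min(end, slot->gmem.index + slot->npages);
	struct kvm_gfn_range gfn_range = {
		.start = slot->base_gfn + lo - slot->gmem.index,
		.end = slot->base_gfn + hi - slot->gmem.index,
		.slot = slot,
		.pte = __pte(0),
		.may_block = true,
	};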

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2022-07-06  8:20 ` [PATCH v7 08/14] KVM: Rename mmu_notifier_* Chao Peng
  2022-07-29 19:02   ` Sean Christopherson
@ 2023-05-23  7:19   ` Kautuk Consul
  2023-05-23 14:19     ` Sean Christopherson
  1 sibling, 1 reply; 398+ messages in thread
From: Kautuk Consul @ 2023-05-23  7:19 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, linux-kselftest, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On 2022-07-06 16:20:10, Chao Peng wrote:
> The sync mechanism between mmu_notifier and page fault handler employs
> fields mmu_notifier_seq/count and mmu_notifier_range_start/end. The
> to-be-added private memory needs the same mechanism but does not rely on
> mmu_notifier (it uses the newly introduced memfile_notifier). This
> patch renames the existing fields and related helper functions to the
> neutral name mmu_updating_* so that private memory can reuse them.
> 
> No functional change intended.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/arm64/kvm/mmu.c                     |  8 ++---
>  arch/mips/kvm/mmu.c                      | 10 +++---
>  arch/powerpc/include/asm/kvm_book3s_64.h |  2 +-
>  arch/powerpc/kvm/book3s_64_mmu_host.c    |  4 +--
>  arch/powerpc/kvm/book3s_64_mmu_hv.c      |  4 +--
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   |  6 ++--
>  arch/powerpc/kvm/book3s_hv_nested.c      |  2 +-
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  8 ++---
>  arch/powerpc/kvm/e500_mmu_host.c         |  4 +--
>  arch/riscv/kvm/mmu.c                     |  4 +--
>  arch/x86/kvm/mmu/mmu.c                   | 14 ++++----
>  arch/x86/kvm/mmu/paging_tmpl.h           |  4 +--
>  include/linux/kvm_host.h                 | 38 ++++++++++-----------
>  virt/kvm/kvm_main.c                      | 42 +++++++++++-------------
>  virt/kvm/pfncache.c                      | 14 ++++----
>  15 files changed, 81 insertions(+), 83 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 87f1cd0df36e..7ee6fafc24ee 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -993,7 +993,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  		 * THP doesn't start to split while we are adjusting the
>  		 * refcounts.
>  		 *
> -		 * We are sure this doesn't happen, because mmu_notifier_retry
> +		 * We are sure this doesn't happen, because mmu_updating_retry
>  		 * was successful and we are holding the mmu_lock, so if this
>  		 * THP is trying to split, it will be blocked in the mmu
>  		 * notifier before touching any of the pages, specifically
> @@ -1188,9 +1188,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			return ret;
>  	}
>  
> -	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> +	mmu_seq = vcpu->kvm->mmu_updating_seq;
>  	/*
> -	 * Ensure the read of mmu_notifier_seq happens before we call
> +	 * Ensure the read of mmu_updating_seq happens before we call
>  	 * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
>  	 * the page we just got a reference to gets unmapped before we have a
>  	 * chance to grab the mmu_lock, which ensure that if the page gets
> @@ -1246,7 +1246,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	else
>  		write_lock(&kvm->mmu_lock);
>  	pgt = vcpu->arch.hw_mmu->pgt;
> -	if (mmu_notifier_retry(kvm, mmu_seq))
> +	if (mmu_updating_retry(kvm, mmu_seq))
>  		goto out_unlock;
>  
>  	/*
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index 1bfd1b501d82..abd468c6a749 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -615,17 +615,17 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
>  	 * Used to check for invalidations in progress, of the pfn that is
>  	 * returned by pfn_to_pfn_prot below.
>  	 */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	/*
> -	 * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
> +	 * Ensure the read of mmu_updating_seq isn't reordered with PTE reads in
>  	 * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
>  	 * risk the page we get a reference to getting unmapped before we have a
> -	 * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
> +	 * chance to grab the mmu_lock without mmu_updating_retry () noticing.
>  	 *
>  	 * This smp_rmb() pairs with the effective smp_wmb() of the combination
>  	 * of the pte_unmap_unlock() after the PTE is zapped, and the
>  	 * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
> -	 * mmu_notifier_seq is incremented.
> +	 * mmu_updating_seq is incremented.
>  	 */
>  	smp_rmb();
>  
> @@ -638,7 +638,7 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
>  
>  	spin_lock(&kvm->mmu_lock);
>  	/* Check if an invalidation has taken place since we got pfn */
> -	if (mmu_notifier_retry(kvm, mmu_seq)) {
> +	if (mmu_updating_retry(kvm, mmu_seq)) {
>  		/*
>  		 * This can happen when mappings are changed asynchronously, but
>  		 * also synchronously if a COW is triggered by
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 4def2bd17b9b..4d35fb913de5 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -666,7 +666,7 @@ static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq,
>  	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
>  		"%s called with kvm mmu_lock not held \n", __func__);
>  
> -	if (mmu_notifier_retry(kvm, mmu_seq))
> +	if (mmu_updating_retry(kvm, mmu_seq))
>  		return NULL;
>  
>  	pte = __find_linux_pte(kvm->mm->pgd, ea, NULL, hshift);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
> index 1ae09992c9ea..78f1aae8cb60 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_host.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
> @@ -90,7 +90,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
>  	unsigned long pfn;
>  
>  	/* used to check for invalidations in progress */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	/* Get host physical address for gpa */
> @@ -151,7 +151,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
>  	cpte = kvmppc_mmu_hpte_cache_next(vcpu);
>  
>  	spin_lock(&kvm->mmu_lock);
> -	if (!cpte || mmu_notifier_retry(kvm, mmu_seq)) {
> +	if (!cpte || mmu_updating_retry(kvm, mmu_seq)) {
>  		r = -EAGAIN;
>  		goto out_unlock;
>  	}
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 514fd45c1994..bcdec6a6f2a7 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -578,7 +578,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
>  		return -EFAULT;
>  
>  	/* used to check for invalidations in progress */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	ret = -EFAULT;
> @@ -693,7 +693,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
>  
>  	/* Check if we might have been invalidated; let the guest retry if so */
>  	ret = RESUME_GUEST;
> -	if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
> +	if (mmu_updating_retry(vcpu->kvm, mmu_seq)) {
>  		unlock_rmap(rmap);
>  		goto out_unlock;
>  	}
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 42851c32ff3b..c8890ccc3f40 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -639,7 +639,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
>  	/* Check if we might have been invalidated; let the guest retry if so */
>  	spin_lock(&kvm->mmu_lock);
>  	ret = -EAGAIN;
> -	if (mmu_notifier_retry(kvm, mmu_seq))
> +	if (mmu_updating_retry(kvm, mmu_seq))
>  		goto out_unlock;
>  
>  	/* Now traverse again under the lock and change the tree */
> @@ -829,7 +829,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
>  	bool large_enable;
>  
>  	/* used to check for invalidations in progress */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	/*
> @@ -1190,7 +1190,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
>  	 * Increase the mmu notifier sequence number to prevent any page
>  	 * fault that read the memslot earlier from writing a PTE.
>  	 */
> -	kvm->mmu_notifier_seq++;
> +	kvm->mmu_updating_seq++;
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 0644732d1a25..09f841f730da 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -1579,7 +1579,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
>  	/* 2. Find the host pte for this L1 guest real address */
>  
>  	/* Used to check for invalidations in progress */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	/* See if can find translation in our partition scoped tables for L1 */
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 2257fb18cb72..952b504dc98a 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -219,7 +219,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
>  	g_ptel = ptel;
>  
>  	/* used later to detect if we might have been invalidated */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	/* Find the memslot (if any) for this address */
> @@ -366,7 +366,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
>  			rmap = real_vmalloc_addr(rmap);
>  		lock_rmap(rmap);
>  		/* Check for pending invalidations under the rmap chain lock */
> -		if (mmu_notifier_retry(kvm, mmu_seq)) {
> +		if (mmu_updating_retry(kvm, mmu_seq)) {
>  			/* inval in progress, write a non-present HPTE */
>  			pteh |= HPTE_V_ABSENT;
>  			pteh &= ~HPTE_V_VALID;
> @@ -932,7 +932,7 @@ static long kvmppc_do_h_page_init_zero(struct kvm_vcpu *vcpu,
>  	int i;
>  
>  	/* Used later to detect if we might have been invalidated */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
> @@ -960,7 +960,7 @@ static long kvmppc_do_h_page_init_copy(struct kvm_vcpu *vcpu,
>  	long ret = H_SUCCESS;
>  
>  	/* Used later to detect if we might have been invalidated */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index 7f16afc331ef..d7636b926f25 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -339,7 +339,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>  	unsigned long flags;
>  
>  	/* used to check for invalidations in progress */
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	/*
> @@ -460,7 +460,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>  	}
>  
>  	spin_lock(&kvm->mmu_lock);
> -	if (mmu_notifier_retry(kvm, mmu_seq)) {
> +	if (mmu_updating_retry(kvm, mmu_seq)) {
>  		ret = -EAGAIN;
>  		goto out;
>  	}
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index 081f8d2b9cf3..a7db374d3861 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -654,7 +654,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
>  		return ret;
>  	}
>  
> -	mmu_seq = kvm->mmu_notifier_seq;
> +	mmu_seq = kvm->mmu_updating_seq;
>  
>  	hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable);
>  	if (hfn == KVM_PFN_ERR_HWPOISON) {
> @@ -674,7 +674,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
>  
>  	spin_lock(&kvm->mmu_lock);
>  
> -	if (mmu_notifier_retry(kvm, mmu_seq))
> +	if (mmu_updating_retry(kvm, mmu_seq))
>  		goto out_unlock;
>  
>  	if (writeable) {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0d882fad4bc1..545eb74305fe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2908,7 +2908,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>  	 * If addresses are being invalidated, skip prefetching to avoid
>  	 * accidentally prefetching those addresses.
>  	 */
> -	if (unlikely(vcpu->kvm->mmu_notifier_count))
> +	if (unlikely(vcpu->kvm->mmu_updating_count))
>  		return;
>  
>  	__direct_pte_prefetch(vcpu, sp, sptep);
> @@ -2950,7 +2950,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	/*
>  	 * Lookup the mapping level in the current mm.  The information
>  	 * may become stale soon, but it is safe to use as long as
> -	 * 1) mmu_notifier_retry was checked after taking mmu_lock, and
> +	 * 1) mmu_updating_retry was checked after taking mmu_lock, and
>  	 * 2) mmu_lock is taken now.
>  	 *
>  	 * We still need to disable IRQs to prevent concurrent tear down
> @@ -3035,7 +3035,7 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  		return;
>  
>  	/*
> -	 * mmu_notifier_retry() was successful and mmu_lock is held, so
> +	 * mmu_updating_retry was successful and mmu_lock is held, so
>  	 * the pmd can't be split from under us.
>  	 */
>  	fault->goal_level = fault->req_level;
> @@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
>  
>  	return fault->slot &&
> -	       mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> +	       mmu_updating_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -4206,7 +4206,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	if (r)
>  		return r;
>  
> -	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> +	mmu_seq = vcpu->kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	r = kvm_faultin_pfn(vcpu, fault);
> @@ -6023,7 +6023,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
> +	kvm_mmu_updating_begin(kvm, gfn_start, gfn_end);
>  
>  	flush = __kvm_zap_rmaps(kvm, gfn_start, gfn_end);
>  
> @@ -6037,7 +6037,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
>  						   gfn_end - gfn_start);
>  
> -	kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
> +	kvm_mmu_updating_end(kvm, gfn_start, gfn_end);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 2448fa8d8438..acf7e41aa02b 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -589,7 +589,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>  	 * If addresses are being invalidated, skip prefetching to avoid
>  	 * accidentally prefetching those addresses.
>  	 */
> -	if (unlikely(vcpu->kvm->mmu_notifier_count))
> +	if (unlikely(vcpu->kvm->mmu_updating_count))
>  		return;
>  
>  	if (sp->role.direct)
> @@ -838,7 +838,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	else
>  		fault->max_level = walker.level;
>  
> -	mmu_seq = vcpu->kvm->mmu_notifier_seq;
> +	mmu_seq = vcpu->kvm->mmu_updating_seq;
>  	smp_rmb();
>  
>  	r = kvm_faultin_pfn(vcpu, fault);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e9153b54e2a4..c262ebb168a7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -765,10 +765,10 @@ struct kvm {
>  
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  	struct mmu_notifier mmu_notifier;
> -	unsigned long mmu_notifier_seq;
> -	long mmu_notifier_count;
> -	gfn_t mmu_notifier_range_start;
> -	gfn_t mmu_notifier_range_end;
> +	unsigned long mmu_updating_seq;
> +	long mmu_updating_count;

Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?
I see that not all accesses to these are under the kvm->mmu_lock
spinlock. This will also remove the need for putting separate smp_wmb() and
smp_rmb() memory barriers while accessing these structure members.

> +	gfn_t mmu_updating_range_start;
> +	gfn_t mmu_updating_range_end;
>  #endif
>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
> @@ -1362,8 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>  
> -void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> -void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end);
>  
>  long kvm_arch_dev_ioctl(struct file *filp,
>  			unsigned int ioctl, unsigned long arg);
> @@ -1901,42 +1901,42 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>  
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> -static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
> +static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq)
>  {
> -	if (unlikely(kvm->mmu_notifier_count))
> +	if (unlikely(kvm->mmu_updating_count))
>  		return 1;
>  	/*
> -	 * Ensure the read of mmu_notifier_count happens before the read
> -	 * of mmu_notifier_seq.  This interacts with the smp_wmb() in
> +	 * Ensure the read of mmu_updating_count happens before the read
> +	 * of mmu_updating_seq.  This interacts with the smp_wmb() in
>  	 * mmu_notifier_invalidate_range_end to make sure that the caller
> -	 * either sees the old (non-zero) value of mmu_notifier_count or
> -	 * the new (incremented) value of mmu_notifier_seq.
> +	 * either sees the old (non-zero) value of mmu_updating_count or
> +	 * the new (incremented) value of mmu_updating_seq.
>  	 * PowerPC Book3s HV KVM calls this under a per-page lock
>  	 * rather than under kvm->mmu_lock, for scalability, so
>  	 * can't rely on kvm->mmu_lock to keep things ordered.
>  	 */
>  	smp_rmb();
> -	if (kvm->mmu_notifier_seq != mmu_seq)
> +	if (kvm->mmu_updating_seq != mmu_seq)
>  		return 1;
>  	return 0;
>  }
>  
> -static inline int mmu_notifier_retry_gfn(struct kvm *kvm,
> +static inline int mmu_updating_retry_gfn(struct kvm *kvm,
>  					 unsigned long mmu_seq,
>  					 gfn_t gfn)
>  {
>  	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
> -	 * If mmu_notifier_count is non-zero, then the range maintained by
> +	 * If mmu_updating_count is non-zero, then the range maintained by
>  	 * kvm_mmu_notifier_invalidate_range_start contains all addresses that
>  	 * might be being invalidated. Note that it may include some false
>  	 * positives, due to shortcuts when handing concurrent invalidations.
>  	 */
> -	if (unlikely(kvm->mmu_notifier_count) &&
> -	    gfn >= kvm->mmu_notifier_range_start &&
> -	    gfn < kvm->mmu_notifier_range_end)
> +	if (unlikely(kvm->mmu_updating_count) &&
> +	    gfn >= kvm->mmu_updating_range_start &&
> +	    gfn < kvm->mmu_updating_range_end)
>  		return 1;
> -	if (kvm->mmu_notifier_seq != mmu_seq)
> +	if (kvm->mmu_updating_seq != mmu_seq)
>  		return 1;
>  	return 0;
>  }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4d7f0e72366f..3ae4944b9f15 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -698,30 +698,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  
>  	/*
>  	 * .change_pte() must be surrounded by .invalidate_range_{start,end}().
> -	 * If mmu_notifier_count is zero, then no in-progress invalidations,
> +	 * If mmu_updating_count is zero, then no in-progress invalidations,
>  	 * including this one, found a relevant memslot at start(); rechecking
>  	 * memslots here is unnecessary.  Note, a false positive (count elevated
>  	 * by a different invalidation) is sub-optimal but functionally ok.
>  	 */
>  	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> -	if (!READ_ONCE(kvm->mmu_notifier_count))
> +	if (!READ_ONCE(kvm->mmu_updating_count))
>  		return;
>  
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>  
> -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
> -				   unsigned long end)
> +void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end)
>  {
>  	/*
>  	 * The count increase must become visible at unlock time as no
>  	 * spte can be established without taking the mmu_lock and
>  	 * count is also read inside the mmu_lock critical section.
>  	 */
> -	kvm->mmu_notifier_count++;
> -	if (likely(kvm->mmu_notifier_count == 1)) {
> -		kvm->mmu_notifier_range_start = start;
> -		kvm->mmu_notifier_range_end = end;
> +	kvm->mmu_updating_count++;
> +	if (likely(kvm->mmu_updating_count == 1)) {
> +		kvm->mmu_updating_range_start = start;
> +		kvm->mmu_updating_range_end = end;
>  	} else {
>  		/*
>  		 * Fully tracking multiple concurrent ranges has diminishing
> @@ -732,10 +731,10 @@ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
>  		 * accumulate and persist until all outstanding invalidates
>  		 * complete.
>  		 */
> -		kvm->mmu_notifier_range_start =
> -			min(kvm->mmu_notifier_range_start, start);
> -		kvm->mmu_notifier_range_end =
> -			max(kvm->mmu_notifier_range_end, end);
> +		kvm->mmu_updating_range_start =
> +			min(kvm->mmu_updating_range_start, start);
> +		kvm->mmu_updating_range_end =
> +			max(kvm->mmu_updating_range_end, end);
>  	}
>  }
>  
> @@ -748,7 +747,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  		.end		= range->end,
>  		.pte		= __pte(0),
>  		.handler	= kvm_unmap_gfn_range,
> -		.on_lock	= kvm_inc_notifier_count,
> +		.on_lock	= kvm_mmu_updating_begin,
>  		.on_unlock	= kvm_arch_guest_memory_reclaimed,
>  		.flush_on_ret	= true,
>  		.may_block	= mmu_notifier_range_blockable(range),
> @@ -759,7 +758,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	/*
>  	 * Prevent memslot modification between range_start() and range_end()
>  	 * so that conditionally locking provides the same result in both
> -	 * functions.  Without that guarantee, the mmu_notifier_count
> +	 * functions.  Without that guarantee, the mmu_updating_count
>  	 * adjustments will be imbalanced.
>  	 *
>  	 * Pairs with the decrement in range_end().
> @@ -775,7 +774,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	 * any given time, and the caches themselves can check for hva overlap,
>  	 * i.e. don't need to rely on memslot overlap checks for performance.
>  	 * Because this runs without holding mmu_lock, the pfn caches must use
> -	 * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
> +	 * mn_active_invalidate_count (see above) instead of mmu_updating_count.
>  	 */
>  	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
>  					  hva_range.may_block);
> @@ -785,22 +784,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	return 0;
>  }
>  
> -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
> -				   unsigned long end)
> +void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end)
>  {
>  	/*
>  	 * This sequence increase will notify the kvm page fault that
>  	 * the page that is going to be mapped in the spte could have
>  	 * been freed.
>  	 */
> -	kvm->mmu_notifier_seq++;
> +	kvm->mmu_updating_seq++;
>  	smp_wmb();
>  	/*
>  	 * The above sequence increase must be visible before the
>  	 * below count decrease, which is ensured by the smp_wmb above
> -	 * in conjunction with the smp_rmb in mmu_notifier_retry().
> +	 * in conjunction with the smp_rmb in mmu_updating_retry().
>  	 */
> -	kvm->mmu_notifier_count--;
> +	kvm->mmu_updating_count--;
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -812,7 +810,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  		.end		= range->end,
>  		.pte		= __pte(0),
>  		.handler	= (void *)kvm_null_fn,
> -		.on_lock	= kvm_dec_notifier_count,
> +		.on_lock	= kvm_mmu_updating_end,
>  		.on_unlock	= (void *)kvm_null_fn,
>  		.flush_on_ret	= false,
>  		.may_block	= mmu_notifier_range_blockable(range),
> @@ -833,7 +831,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  	if (wake)
>  		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
>  
> -	BUG_ON(kvm->mmu_notifier_count < 0);
> +	BUG_ON(kvm->mmu_updating_count < 0);
>  }
>  
>  static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index ab519f72f2cd..aa6d24966a76 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -112,27 +112,27 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
>  {
>  	/*
>  	 * mn_active_invalidate_count acts for all intents and purposes
> -	 * like mmu_notifier_count here; but the latter cannot be used
> +	 * like mmu_updating_count here; but the latter cannot be used
>  	 * here because the invalidation of caches in the mmu_notifier
> -	 * event occurs _before_ mmu_notifier_count is elevated.
> +	 * event occurs _before_ mmu_updating_count is elevated.
>  	 *
>  	 * Note, it does not matter that mn_active_invalidate_count
>  	 * is not protected by gpc->lock.  It is guaranteed to
>  	 * be elevated before the mmu_notifier acquires gpc->lock, and
> -	 * isn't dropped until after mmu_notifier_seq is updated.
> +	 * isn't dropped until after mmu_updating_seq is updated.
>  	 */
>  	if (kvm->mn_active_invalidate_count)
>  		return true;
>  
>  	/*
>  	 * Ensure mn_active_invalidate_count is read before
> -	 * mmu_notifier_seq.  This pairs with the smp_wmb() in
> +	 * mmu_updating_seq.  This pairs with the smp_wmb() in
>  	 * mmu_notifier_invalidate_range_end() to guarantee either the
>  	 * old (non-zero) value of mn_active_invalidate_count or the
> -	 * new (incremented) value of mmu_notifier_seq is observed.
> +	 * new (incremented) value of mmu_updating_seq is observed.
>  	 */
>  	smp_rmb();
> -	return kvm->mmu_notifier_seq != mmu_seq;
> +	return kvm->mmu_updating_seq != mmu_seq;
>  }
>  
>  static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
> @@ -155,7 +155,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
>  	gpc->valid = false;
>  
>  	do {
> -		mmu_seq = kvm->mmu_notifier_seq;
> +		mmu_seq = kvm->mmu_updating_seq;
>  		smp_rmb();
>  
>  		write_unlock_irq(&gpc->lock);
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-23  7:19   ` Kautuk Consul
@ 2023-05-23 14:19     ` Sean Christopherson
  2023-05-24  6:12       ` Kautuk Consul
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-05-23 14:19 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Tue, May 23, 2023, Kautuk Consul wrote:
> On 2022-07-06 16:20:10, Chao Peng wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index e9153b54e2a4..c262ebb168a7 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -765,10 +765,10 @@ struct kvm {
> >  
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> >  	struct mmu_notifier mmu_notifier;
> > -	unsigned long mmu_notifier_seq;
> > -	long mmu_notifier_count;
> > -	gfn_t mmu_notifier_range_start;
> > -	gfn_t mmu_notifier_range_end;
> > +	unsigned long mmu_updating_seq;
> > +	long mmu_updating_count;
> 
> Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?

Heh, can we?  Yes.  Should we?  No.

> I see that not all accesses to these are under the kvm->mmu_lock
> spinlock.

Ya, working as intended.  Ignoring gfn_to_pfn_cache for the moment, all accesses
to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above)
are done under mmu_lock.  And for mmu_notifier_seq (mmu_updating_seq above),
all writes and some reads are done under mmu_lock.  The only reads that are done
outside of mmu_lock are the initial snapshots of the sequence number.

gfn_to_pfn_cache uses a different locking scheme, the comments in
mmu_notifier_retry_cache() do a good job explaining the ordering.

> This will also remove the need for putting separate smp_wmb() and
> smp_rmb() memory barriers while accessing these structure members.

No, the memory barriers aren't there to provide any kind of atomicity.  The barriers
exist to ensure that stores and loads to/from the sequence and invalidate in-progress
counts are ordered relative to the invalidation (stores to counts) and creation (loads)
of SPTEs.  Making the counts atomic changes nothing because atomic operations don't
guarantee the necessary ordering.

E.g. when handling a page fault, KVM snapshots the sequence outside of mmu_lock
_before_ touching any state that is involved in resolving the host pfn, e.g. primary
MMU state (VMAs, host page tables, etc.).   After the page fault task acquires
mmu_lock, KVM checks that there are no in-progress invalidations and that the sequence
count is the same.  This ensures that if there is a concurrent page fault and
invalidation event, the page fault task will either acquire mmu_lock and create SPTEs
_before_ the invalidation is processed, or the page fault task will observe either an
elevated mmu_invalidate_in_progress or a different sequence count, and thus retry the
page fault, if the page fault task acquires mmu_lock after the invalidation event.
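
Condensed into a sketch (using the pre-rename field/helper names from the quoted
patch; resolve_host_pfn() and install_spte() are stand-ins for the real
fault-path steps, not actual KVM functions):

  static int example_page_fault(struct kvm_vcpu *vcpu, gfn_t gfn)
  {
          struct kvm *kvm = vcpu->kvm;
          unsigned long mmu_seq;
          kvm_pfn_t pfn;

          mmu_seq = kvm->mmu_notifier_seq;        /* snapshot outside mmu_lock */
          smp_rmb();                              /* pairs with smp_wmb() in invalidate_range_end() */

          pfn = resolve_host_pfn(vcpu, gfn);      /* may consult VMAs, host page tables, gmem */

          write_lock(&kvm->mmu_lock);
          if (mmu_notifier_retry(kvm, mmu_seq)) {
                  /* Raced with an invalidation; drop the pfn and retry the fault. */
                  write_unlock(&kvm->mmu_lock);
                  return RET_PF_RETRY;
          }
          /* Safe: no invalidation can complete until mmu_lock is dropped. */
          install_spte(vcpu, gfn, pfn);
          write_unlock(&kvm->mmu_lock);
          return RET_PF_FIXED;
  }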

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
  2023-05-19 18:23     ` Sean Christopherson
  2023-05-19 19:49       ` Nicolas Saenz Julienne
@ 2023-05-23 18:59       ` Nicolas Saenz Julienne
  1 sibling, 0 replies; 398+ messages in thread
From: Nicolas Saenz Julienne @ 2023-05-23 18:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, graf,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, wei.w.wang, anelkz

Hi Sean,

On Fri May 19, 2023 at 6:23 PM UTC, Sean Christopherson wrote:
> On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> > Hi,
> > On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:

[...]

> > VSM introduces isolated guest execution contexts called Virtual Trust
> > Levels (VTL) [2]. Each VTL has its own memory access protections,
> > virtual processor states, interrupt controllers and overlay pages. VTLs
> > are hierarchical and might enforce memory protections on less privileged
> > VTLs. Memory protections are enforced on a per-GPA granularity.
> >
> > We implemented this in the past by using a separate address space per
> > VTL and updating memory regions on protection changes. But having to
> > update the memory slot layout for every permission change scales poorly,
> > especially as we have to perform 100,000s of these operations at boot
> > (see [1] for a little more context).
> >
> > I believe the biggest barrier for us to use memory attributes is not
> > having the ability to target specific address spaces, or to the very
> > least having some mechanism to maintain multiple independent layers of
> > attributes.
>
> Can you elaborate on "specific address spaces"?  In KVM, that usually means SMM,
> but the VTL comment above makes me think you're talking about something entirely
> different.  E.g. can you provide a brief summary of the requirements/expectations?

Let me refresh some concepts first. VTLs are vCPU modes implemented by
the hypervisor. Lower VTLs switch into higher VTLs [1] through a
hypercall or asynchronously through interrupts. Each VTL has its own CPU
architectural state, lapic and MSR state (applies to only some MSRs).
These are saved/restored when switching VTLs [2]. Additionally, VTLs
share a common GPA->HPA mapping, but protection bits differ depending on
which VTL the CPU is on. Privileged VTLs might revoke R/W/X(+MBEC,
optional) access bits from lower VTLs on a per-GPA basis.

In order to deal with the per-VTL memory protection bits, we extended
the number of KVM address spaces and assigned one to each VTL. The
hypervisor initializes all VTLs' address spaces with the same mappings
and protections; they are expected to diverge at runtime. Operations
that rely on memory slots for GPA->HPA/HVA translations (including page
faults) are already address space aware, so adding VTL support was
fairly simple.

Ultimately, when a privileged VTL enforces memory protections on lower
VTLs we update that VTL's address space memory regions to reflect them.
Protection changes are requested through a hypercall, which expects the
new protection to be visible system wide upon returning from it. These
hypercalls happen 100,000+ times during boot, so we introduced an
"atomic memory slot update" API similar to Emanuele's [3] that allows
splitting memory regions/changing permissions concurrent with other
vCPUs.

Now, if we had a way to map memory attributes to specific VTLs, we could
use that instead. Actually, we wouldn't need to extend address spaces at
all to support this (we might still need them to support Overlay Pages,
but that's another story).

Hope it makes a little more sense now. :)

Nicolas

[1] In practice we've only seen VTL0 and VTL1 being used. The spec
    supports up to 16 VTLs.

[2] One can draw an analogy with arm's TrustZone. The hypervisor plays
    the role of EL3. Windows (VTL0) runs in Non-Secure (EL0/EL1) and the
    secure kernel (VTL1) in Secure World (EL1s/EL0s).

[3] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@redhat.com/

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-23 14:19     ` Sean Christopherson
@ 2023-05-24  6:12       ` Kautuk Consul
  2023-05-24 20:16         ` Sean Christopherson
  2023-05-24 20:28         ` Peter Zijlstra
  0 siblings, 2 replies; 398+ messages in thread
From: Kautuk Consul @ 2023-05-24  6:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On 2023-05-23 07:19:43, Sean Christopherson wrote:
> On Tue, May 23, 2023, Kautuk Consul wrote:
> > On 2022-07-06 16:20:10, Chao Peng wrote:
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index e9153b54e2a4..c262ebb168a7 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -765,10 +765,10 @@ struct kvm {
> > >  
> > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > >  	struct mmu_notifier mmu_notifier;
> > > -	unsigned long mmu_notifier_seq;
> > > -	long mmu_notifier_count;
> > > -	gfn_t mmu_notifier_range_start;
> > > -	gfn_t mmu_notifier_range_end;
> > > +	unsigned long mmu_updating_seq;
> > > +	long mmu_updating_count;
> > 
> > Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?
> 
> Heh, can we?  Yes.  Should we?  No.
> 
> > I see that not all accesses to these are under the kvm->mmu_lock
> > spinlock.
> 
> Ya, working as intended.  Ignoring gfn_to_pfn_cache for the moment, all accesses
> to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above)
> are done under mmu_lock.  And for mmu_notifier_seq (mmu_updating_seq above),
> all writes and some reads are done under mmu_lock.  The only reads that are done
> outside of mmu_lock are the initial snapshots of the sequence number.
> 
> gfn_to_pfn_cache uses a different locking scheme, the comments in
> mmu_notifier_retry_cache() do a good job explaining the ordering.
> 
> > This will also remove the need for putting separate smp_wmb() and
> > smp_rmb() memory barriers while accessing these structure members.
> 
> No, the memory barriers aren't there to provide any kind of atomicity.  The barriers
> exist to ensure that stores and loads to/from the sequence and invalidate in-progress
> counts are ordered relative to the invalidation (stores to counts) and creation (loads)
> of SPTEs.  Making the counts atomic changes nothing because atomic operations don't
> guarantee the necessary ordering.
I'm not saying that the memory barriers provide atomicity.
My comment was based on the assumption that "all atomic operations are
implicit memory barriers". If that assumption is true then we won't need
the memory barriers here if we use atomic operations for protecting
these 2 structure members.
> 
> E.g. when handling a page fault, KVM snapshots the sequence outside of mmu_lock
> _before_ touching any state that is involved in resolving the host pfn, e.g. primary
> MMU state (VMAs, host page tables, etc.).   After the page fault task acquires
> mmu_lock, KVM checks that there are no in-progress invalidations and that the sequence
> count is the same.  This ensures that if there is a concurrent page fault and
> invalidation event, the page fault task will either acquire mmu_lock and create SPTEs
> _before_ the invalidation is processed, or the page fault task will observe either an
> elevated mmu_invalidate_in_progress or a different sequence count, and thus retry the
> page fault, if the page fault task acquires mmu_lock after the invalidation event.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-24  6:12       ` Kautuk Consul
@ 2023-05-24 20:16         ` Sean Christopherson
  2023-05-24 20:33           ` Peter Zijlstra
  2023-05-24 20:28         ` Peter Zijlstra
  1 sibling, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-05-24 20:16 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, linux-kselftest, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song

On Wed, May 24, 2023, Kautuk Consul wrote:
> On 2023-05-23 07:19:43, Sean Christopherson wrote:
> > On Tue, May 23, 2023, Kautuk Consul wrote:
> > > On 2022-07-06 16:20:10, Chao Peng wrote:
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index e9153b54e2a4..c262ebb168a7 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -765,10 +765,10 @@ struct kvm {
> > > >  
> > > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > >  	struct mmu_notifier mmu_notifier;
> > > > -	unsigned long mmu_notifier_seq;
> > > > -	long mmu_notifier_count;
> > > > -	gfn_t mmu_notifier_range_start;
> > > > -	gfn_t mmu_notifier_range_end;
> > > > +	unsigned long mmu_updating_seq;
> > > > +	long mmu_updating_count;
> > > 
> > > Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?
> > 
> > Heh, can we?  Yes.  Should we?  No.
> > 
> > > I see that not all accesses to these are under the kvm->mmu_lock
> > > spinlock.
> > 
> > Ya, working as intended.  Ignoring gfn_to_pfn_cache for the moment, all accesses
> > to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above)
> > are done under mmu_lock.  And for mmu_notifier_seq (mmu_updating_seq above),
> > all writes and some reads are done under mmu_lock.  The only reads that are done
> > outside of mmu_lock are the initial snapshots of the sequence number.
> > 
> > gfn_to_pfn_cache uses a different locking scheme, the comments in
> > mmu_notifier_retry_cache() do a good job explaining the ordering.
> > 
> > > This will also remove the need for putting separate smp_wmb() and
> > > smp_rmb() memory barriers while accessing these structure members.
> > 
> > No, the memory barriers aren't there to provide any kind of atomicity.  The barriers
> > exist to ensure that stores and loads to/from the sequence and invalidate in-progress
> > counts are ordered relative to the invalidation (stores to counts) and creation (loads)
> > of SPTEs.  Making the counts atomic changes nothing because atomic operations don't
> > guarantee the necessary ordering.
> I'm not saying that the memory barriers provide atomicity.
> My comment was based on the assumption that "all atomic operations are
> implicit memory barriers". If that assumption is true then we won't need
> the memory barriers here if we use atomic operations for protecting
> these 2 structure members.

Atomics aren't memory barriers on all architectures, e.g. see the various
definitions of smp_mb__after_atomic().

Even if atomic operations did provide barriers, using an atomic would be overkill
and a net negative.  On strongly ordered architectures like x86, memory barriers are
just compiler barriers, whereas atomics may be more expensive.  Of course, the only
accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a
READ_ONCE() load, but that's not the case for all architectures.

Anyways, the point is that atomics and memory barriers are different things that
serve different purposes.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-24  6:12       ` Kautuk Consul
  2023-05-24 20:16         ` Sean Christopherson
@ 2023-05-24 20:28         ` Peter Zijlstra
  1 sibling, 0 replies; 398+ messages in thread
From: Peter Zijlstra @ 2023-05-24 20:28 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: Sean Christopherson, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Wed, May 24, 2023 at 11:42:15AM +0530, Kautuk Consul wrote:

> My comment was based on the assumption that "all atomic operations are
> implicit memory barriers". If that assumption is true then we won't need

It is not -- also see Documentation/atomic_t.txt.

Specifically atomic_read() doesn't imply any ordering on any
architecture including the strongly ordered TSO-archs (like x86).

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-24 20:16         ` Sean Christopherson
@ 2023-05-24 20:33           ` Peter Zijlstra
  2023-05-24 21:39             ` Sean Christopherson
  2023-05-25  3:52             ` Kautuk Consul
  0 siblings, 2 replies; 398+ messages in thread
From: Peter Zijlstra @ 2023-05-24 20:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kautuk Consul, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:

> Atomics aren't memory barriers on all architectures, e.g. see the various
> definitions of smp_mb__after_atomic().
> 
> Even if atomic operations did provide barriers, using an atomic would be overkill
> and a net negative.  On strongly ordered architectures like x86, memory barriers are
> just compiler barriers, whereas atomics may be more expensive. 

Not quite, smp_{r,w}mb() and smp_mb__{before,after}_atomic() are
compiler barriers on the TSO archs, but smp_mb() very much isn't. TSO
still allows stores to be delayed vs later loads (iow it doesn't pretend
to hide the store buffer).
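
To make that concrete, here's the classic store-buffer litmus test (a sketch in
the style of Documentation/memory-barriers.txt; X, Y, r1 and r2 are assumed to
start out zero):

	CPU 0				CPU 1
	=====				=====
	WRITE_ONCE(X, 1);		WRITE_ONCE(Y, 1);
	smp_mb();			smp_mb();
	r1 = READ_ONCE(Y);		r2 = READ_ONCE(X);

Without the smp_mb()s, the outcome r1 == 0 && r2 == 0 is permitted even on
x86/TSO, because each CPU's store can still be sitting in its store buffer when
the subsequent load executes; smp_wmb()/smp_rmb(), being mere compiler barriers
on TSO, do not forbid that outcome.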

> Of course, the only
> accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a
> READ_ONCE() load, but that's not the case for all architectures.

This is true on *all* archs. atomic_set() and atomic_read() are no more
and no less than WRITE_ONCE() / READ_ONCE().
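
For reference, the generic fallbacks are literally just that (paraphrasing
include/asm-generic/atomic.h; check the tree for the exact names/location on
any given kernel version):

#define arch_atomic_read(v)	READ_ONCE((v)->counter)
#define arch_atomic_set(v, i)	WRITE_ONCE(((v)->counter), (i))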

> Anyways, the point is that atomics and memory barriers are different things that
> serve different purposes.

This is true; esp. on the weakly ordered architectures where atomics do
not naturally imply any ordering.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-24 20:33           ` Peter Zijlstra
@ 2023-05-24 21:39             ` Sean Christopherson
  2023-05-25  8:54               ` Peter Zijlstra
  2023-05-25  3:52             ` Kautuk Consul
  1 sibling, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-05-24 21:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kautuk Consul, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Wed, May 24, 2023, Peter Zijlstra wrote:
> On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
> > Of course, the only accesses outside of mmu_lock are reads, so on x86 that
> > "atomic" access is just a READ_ONCE() load, but that's not the case for all
> > architectures.
> 
> This is true on *all* archs. atomic_set() and atomic_read() are no more
> and no less than WRITE_ONCE() / READ_ONCE().

Ah, I take it s390's handcoded assembly routines are just paranoid equivalents
and not truly special?  "l" and "st" do sound quite generic...

  commit 7657e41a0bd16c9d8b3cefe8fd5d6ac3c25ae4bf
  Author: Heiko Carstens <hca@linux.ibm.com>
  Date:   Thu Feb 17 13:13:58 2011 +0100

    [S390] atomic: use inline asm
    
    Use inline assemblies for atomic_read/set(). This way there shouldn't
    be any questions or subtle volatile semantics left.

static inline int __atomic_read(const atomic_t *v)
{
	int c;

	asm volatile(
		"	l	%0,%1\n"
		: "=d" (c) : "R" (v->counter));
	return c;
}

static inline void __atomic_set(atomic_t *v, int i)
{
	asm volatile(
		"	st	%1,%0\n"
		: "=R" (v->counter) : "d" (i));
}

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-24 20:33           ` Peter Zijlstra
  2023-05-24 21:39             ` Sean Christopherson
@ 2023-05-25  3:52             ` Kautuk Consul
  1 sibling, 0 replies; 398+ messages in thread
From: Kautuk Consul @ 2023-05-25  3:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sean Christopherson, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On 2023-05-24 22:33:36, Peter Zijlstra wrote:
> On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
> 
> > Atomics aren't memory barriers on all architectures, e.g. see the various
> > definitions of smp_mb__after_atomic().
> > 
> > Even if atomic operations did provide barriers, using an atomic would be overkill
> > and a net negative.  On strongly ordered architectures like x86, memory barriers are
> > just compiler barriers, whereas atomics may be more expensive. 
> 
> Not quite, smp_{r,w}mb() and smp_mb__{before,after}_atomic() are
> compiler barriers on the TSO archs, but smp_mb() very much isn't. TSO
> still allows stores to be delayed vs later loads (iow it doesn't pretend
> to hide the store buffer).
> 
> > Of course, the only
> > accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a
> > READ_ONCE() load, but that's not the case for all architectures.
> 
> This is true on *all* archs. atomic_set() and atomic_read() are no more
> and no less than WRITE_ONCE() / READ_ONCE().
> 
> > Anyways, the point is that atomics and memory barriers are different things that
> > serve different purposes.
> 
> This is true; esp. on the weakly ordered architectures where atomics do
> not naturally imply any ordering.

Thanks for the information, everyone.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*
  2023-05-24 21:39             ` Sean Christopherson
@ 2023-05-25  8:54               ` Peter Zijlstra
  0 siblings, 0 replies; 398+ messages in thread
From: Peter Zijlstra @ 2023-05-25  8:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kautuk Consul, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, linux-kselftest,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song

On Wed, May 24, 2023 at 02:39:50PM -0700, Sean Christopherson wrote:
> On Wed, May 24, 2023, Peter Zijlstra wrote:
> > On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
> > > Of course, the only accesses outside of mmu_lock are reads, so on x86 that
> > > "atomic" access is just a READ_ONCE() load, but that's not the case for all
> > > architectures.
> > 
> > This is true on *all* archs. atomic_set() and atomic_read() are no more
> > and no less than WRITE_ONCE() / READ_ONCE().
> 
> Ah, I take it s390's handcoded assembly routines are just paranoid equivalents
> and not truly special?  "l" and "st" do sound quite generic...

Yep, compiler *should* generate the same with READ_ONCE/WRITE_ONCE.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-04-22  1:33                 ` Sean Christopherson
                                     ` (2 preceding siblings ...)
  2023-05-12  0:21                   ` Michael Roth
@ 2023-06-06 19:14                   ` Ackerley Tng
  2023-06-06 23:25                     ` Sean Christopherson
  3 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-06-06 19:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner


I've ported the selftests from Chao and me [1] while working on hugetlb
support for guest_mem [2].

In the process, I found some bugs and have some suggestions for guest_mem.
Please see separate commits at [3].

Here are some highlights/questions:

+ "KVM: guest_mem: Explain the use of the uptodate flag for gmem"
     + Generally, the uptodate flag means that the contents of this page match
       the backing store. Since gmem is memory-backed, does "uptodate" for gmem
       mean "zeroed"?
+ "KVM: guest_mem: Don't re-mark accessed after getting a folio" and "KVM:
   guest_mem: Don't set dirty flag for folio"
     + Do we need folio_mark_accessed() when the folio was created with
       FGP_ACCESSED?
     + What is the significance of these LRU flags when gmem doesn't support
       swapping/eviction?
+ "KVM: guest_mem: Align so that at least 1 page is allocated"
     + Bug in current implementation: without this alignment, fallocate() of
       a size less than the gmem page size will result in no allocation at all
     + Both shmem and hugetlbfs perform this alignment
+ "KVM: guest_mem: Add alignment checks"
     + Implemented the alignment checks for guest_mem because hugetlb on gmem
       would hit a BUG_ON without this check
+ "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
     + Sean fixed a bug in the offset-to-gfn conversion in
       kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()
     + Code will always hit WARN_ON_ONCE() when the entire file is closed
       and all offsets are invalidated, so WARN_ON_ONCE() should be removed
     + Vishal noticed that the conversion might result in an overflow, so I
       fixed that
+ And of course, hugetlb support! Please let me know what you think of the
   approach proposed at [2].

[1] https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/lkml/cover.1686077275.git.ackerleytng@google.com/T/
[3] https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-06-06 19:14                   ` Ackerley Tng
@ 2023-06-06 23:25                     ` Sean Christopherson
  2023-06-08 17:13                       ` Ackerley Tng
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-06-06 23:25 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Tue, Jun 06, 2023, Ackerley Tng wrote:
> 
> I've ported selftests from Chao and I [1] while working on hugetlb support
> for guest_mem [2].
> 
> In the process, I found some bugs and have some suggestions for guest_mem.
> Please see separate commits at [3].
> 
> Here are some highlights/questions:
> 
> + "KVM: guest_mem: Explain the use of the uptodate flag for gmem"
>     + Generally, uptodate flags means that the contents of this page match the
>       backing store. Since gmem is memory-backed, does "uptodate" for gmem mean
>       "zeroed"?

Don't read too much into the code, my POC was very much a "beat on it until it
works" scenario.

> + "KVM: guest_mem: Don't re-mark accessed after getting a folio" and "KVM:
>   guest_mem: Don't set dirty flag for folio"
>     + Do we need to folio_mark_accessed(), when it was created with
>       FGP_ACCESSED?

Probably not.  And as you note below, it's all pretty nonsensical anyways.

>     + What is the significance of these LRU flags when gmem doesn't support
>       swapping/eviction?

Likely none.  I used the filemap APIs in my POC because it was easy, not because
it was necessarily the best approach, i.e. that the folios/pages show up in the
LRUs is an unwanted side effect, not a feature.  If guest_memfd only needs a small
subset of the filemap support, going with a true from-scratch implementation on
top of xarray might be cleaner overall, e.g. would avoid the need for a new flag
to say "this folio can't be migrated even though it's on the LRUs".

> + "KVM: guest_mem: Align so that at least 1 page is allocated"
>     + Bug in current implementation: without this alignment, fallocate() of
>       a size less than the gmem page size will result in no allocation at all

I'm not convinced this is a bug.  I don't see any reason to allow allocating and
punching holes in sub-page granularity.

>     + Both shmem and hugetlbfs perform this alignment
> + "KVM: guest_mem: Add alignment checks"
>     + Implemented the alignment checks for guest_mem because hugetlb on gmem
>       would hit a BUG_ON without this check
> + "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
>     + Sean fixed a bug in the offset-to-gfn conversion in
>       kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()

As Mike pointed out, there's likely still a bug here[*].  I was planning on
diving into that last week, but that never happened.  If you or anyone else can
take a peek and/or write a testcase, that would be awesome.

 : Heh, only if there's a testcase for it.  Assuming start >= the slot offset does
 : seem broken, e.g. if the range-to-invalidate overlaps multiple slots, later slots
 : will have index==slot->gmem.index > start.
 : 
 : > Since 'index' corresponds to the gmem offset of the current slot, is there any
 : > reason not to do something like this?:
 : >
 : >   .start = slot->base_gfn + index - slot->gmem.index,
 : >
 : > But then, if that's the case, wouldn't index == slot->gmem.index? Suggesting
 : > we can just simplify to this?:
 : >
 : >   .start = slot->base_gfn,
 : 
 : No, e.g. if start is partway through a memslot, there's no need to invalidate
 : the entire memslot.  I'll stare at this tomorrow when my brain is hopefully a
 : bit more functional, I suspect there is a min() and/or max() needed somewhere.

[*] https://lore.kernel.org/all/20230512002124.3sap3kzxpegwj3n2@amd.com

>     + Code will always hit WARN_ON_ONCE() when the entire file is closed and
>       all offsets are invalidated, so WARN_ON_ONCE() should be removed
>     + Vishal noticed that the conversion might result in an overflow, so I
>       fixed that
> + And of course, hugetlb support! Please let me know what you think of the
>   approach proposed at [2].
> 
> [1] https://lore.kernel.org/all/cover.1678926164.git.ackerleytng@google.com/T/
> [2] https://lore.kernel.org/lkml/cover.1686077275.git.ackerleytng@google.com/T/
> [3] https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-06-06 23:25                     ` Sean Christopherson
@ 2023-06-08 17:13                       ` Ackerley Tng
  0 siblings, 0 replies; 398+ messages in thread
From: Ackerley Tng @ 2023-06-08 17:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

Sean Christopherson <seanjc@google.com> writes:

> ...

> Probably not.  And as you note below, it's all pretty nonsensical anyways.

>>      + What is the significance of these LRU flags when gmem doesn't
>>        support swapping/eviction?

> Likely none.  I used the filemap APIs in my POC because it was easy, not
> because it was necessarily the best approach, i.e. that the folios/pages
> show up in the LRUs is an unwanted side effect, not a feature.  If
> guest_memfd only needs a small subset of the filemap support, going with a
> true from-scratch implementation on top of xarray might be cleaner overall,
> e.g. would avoid the need for a new flag to say "this folio can't be
> migrated even though it's on the LRUs".

For hugetlb support on gmem, using an xarray in place of a filemap should work
fine. Page migration could come up in future - perhaps migration code works
better with filemap? Not sure.


>> + "KVM: guest_mem: Align so that at least 1 page is allocated"
>>      + Bug in current implementation: without this alignment, fallocate()
>>        of a size less than the gmem page size will result in no allocation
>>        at all

> I'm not convinced this is a bug.  I don't see any reason to allow
> allocating and punching holes in sub-page granularity.


I looked at the code more closely; you're right. len is checked to be
4K-aligned. When userspace requests a gmem page size larger than 4K (gmem THP),
the allocation loop still does the right thing.

This issue only arises for hugetlb pages. I'll rebase the next revision of the
hugetlb series accordingly.


>>      + Both shmem and hugetlbfs perform this alignment
>> + "KVM: guest_mem: Add alignment checks"
>>      + Implemented the alignment checks for guest_mem because hugetlb on
>>        gmem would hit a BUG_ON without this check
>> + "KVM: guest_mem: Prevent overflows in kvm_gmem_invalidate_begin()"
>>      + Sean fixed a bug in the offset-to-gfn conversion in
>>        kvm_gmem_invalidate_begin() earlier, adding a WARN_ON_ONCE()

> As Mike pointed out, there's likely still a bug here[*].  I was planning on
> diving into that last week, but that never happened.  If you or anyone else
> can take a peek and/or write a testcase, that would be awesome.

>   : Heh, only if there's a testcase for it.  Assuming start >= the slot
>   : offset does seem broken, e.g. if the range-to-invalidate overlaps
>   : multiple slots, later slots will have index==slot->gmem.index > start.
>   :
>   : > Since 'index' corresponds to the gmem offset of the current slot, is
>   : > there any reason not to do something like this?:
>   : >
>   : >   .start = slot->base_gfn + index - slot->gmem.index,
>   : >
>   : > But then, if that's the case, wouldn't index == slot->gmem.index?
>   : > Suggesting we can just simplify to this?:
>   : >
>   : >   .start = slot->base_gfn,
>   :
>   : No, e.g. if start is partway through a memslot, there's no need to
>   : invalidate the entire memslot.  I'll stare at this tomorrow when my
>   : brain is hopefully a bit more functional, I suspect there is a min()
>   : and/or max() needed somewhere.

> [*] https://lore.kernel.org/all/20230512002124.3sap3kzxpegwj3n2@amd.com


I think I have fixed this, please see "KVM: guest_mem: Prevent overflows in
kvm_gmem_invalidate_begin()" [1].

This patch does take into account that start could be greater than
slot->gmem.index, when userspace chooses to punch holes beginning in the
middle of the memslot.

The process could be split into figuring out file indices, then GFNs:

1. Figure out the start and end in terms of index in the file
     + index_start: taking max(start, slot->gmem.index)
         + start will only be greater than slot->gmem.index but not greater
           than the end of the slot, due to the nature of xa_for_each_range()
     + index_end: taking min(end, slot->gmem.index + slot->npages)
2. Convert indices to GFNs

This also prevents overflows as described at [1].
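
In code, the clamp-then-convert idea amounts to roughly the following sketch
(field names follow the gmem POC; this is an illustration of [1], not the
patch itself):

	pgoff_t index_start = max(start, slot->gmem.index);
	pgoff_t index_end   = min(end, slot->gmem.index + slot->npages);

	/* Translate clamped file indices into the slot's GFN range. */
	struct kvm_gfn_range gfn_range = {
		.slot  = slot,
		.start = slot->base_gfn + (index_start - slot->gmem.index),
		.end   = slot->base_gfn + (index_end - slot->gmem.index),
	};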

>> ...

[1] https://github.com/googleprodkernel/linux-cc/commit/bcc304e3657a998b8f61aa1b841754fbb90d8994

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-05-06  0:55                     ` Sean Christopherson
  2023-05-06  1:17                       ` Vishal Annapurve
  2023-05-15 23:46                       ` Sean Christopherson
@ 2023-07-13 22:46                       ` Ackerley Tng
  2023-07-14 19:29                         ` Sean Christopherson
  2 siblings, 1 reply; 398+ messages in thread
From: Ackerley Tng @ 2023-07-13 22:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

Sean Christopherson <seanjc@google.com> writes:

> On Fri, May 05, 2023, Ackerley Tng wrote:
>>
>> Hi Sean,
>>
>> Thanks for implementing this POC!
>>
>> ... snip ...
>>
>
> I don't love either approach idea because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close().  I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell.  And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
>
> ... snip ...
>
> My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> and not a common syscall.  If the file isn't tightly coupled to a single VM, then
> punching a hole is further complicated by needing to deal with invalidating multiple
> regions that are bound to different @kvm instances.  It's not super complex, but
> AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> having one VM own the memory will complicate even if/when we get to the point where
> VMs can share "private" memory, and the gmem code would still need to deal with
> grabbing a module reference.

I’d like to follow up on this discussion about a guest_mem file
outliving the VM and whether to have a VM-scoped ioctl or a KVM ioctl.

Here's a POC of delayed binding of a guest_mem file to a memslot, where
the guest_mem file outlives the VM [1].

I also hope to raise some points before we do the first integration of
guest_mem patches!


A use case for guest_mem inodes outliving the VM is when the host VMM
needs to be upgraded. The guest_mem inode is passed between two VMs on
the same host machine and all memory associated with the inode needs to
be retained.

To support the above use case, binding of memslots is delayed until
first use, so that the following inode passing flow can be used:

1. Source (old version of host VMM) process passes guest_mem inodes to
   destination (new version of host VMM) process via unix sockets.
2. Destination process initializes memslots identical to source process.
3. Destination process invokes ioctl to migrate guest_mem inode over to
   destination process by unbinding all memslots from the source VM and
   binding them to the destination VM. (The kvm pointer is updated in
   this process)

Without delayed binding, step 2 will fail since initialization of
memslots would check and find that the kvm pointer in the guest_mem
inode points to the kvm in the source process.
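
(For concreteness, the bind-time check being referred to is conceptually just
something like the sketch below; the gmem->kvm field name follows the POC and
is purely illustrative:)

	/* Reject binding a memslot to a gmem file owned by another VM. */
	if (gmem->kvm && gmem->kvm != kvm)
		return -EINVAL;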


These two patches contain the meat of the changes required to support
delayed binding:

https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df

Some things to highlight for the approach set out in these two patches:

1. Previously, closing the guest_mem file in userspace is taken to mean
   that all associated memory is to be removed and cleared. With these
   two patches, each memslot also holds a reference to the file (and
   therefore inode) and so even if the host VMM closes the fd, the VM
   will be able to continue to function.

   This is desirable to userspace since closing the file should not be
   interpreted as a command to clear memory. This is aligned with the
   way tmpfs files are used with KVM before guest_mem: when the file is
   closed in userspace, the memory contents are still mapped and can
   still be used by the VM. fallocate(PUNCH_HOLE) is how userspace
   should command memory to be removed (see the sketch after this list),
   just like munmap() would be used to remove memory from use by KVM.

2. Creating a guest mem file no longer depends on a specific VM and
   hence the guest_mem creation ioctl can be a system ioctl instead of a
   VM specific ioctl. This will also address Chao's concern at [3].
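
To make point 1 concrete, the userspace side of "command memory to be removed"
is just a hole punch on the gmem fd (a minimal sketch; gmem_discard() and its
arguments are illustrative, taken from whatever the VMM tracks):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Discard [offset, offset + len) of guest memory backed by gmem_fd. */
static int gmem_discard(int gmem_fd, off_t offset, off_t len)
{
	int ret = fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			    offset, len);
	if (ret)
		perror("fallocate(PUNCH_HOLE)");
	return ret;
}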


I also separated cleaning up files vs inodes in
https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
which I believe is more aligned with how files and inodes are cleaned up
in FSes. This alignment makes it easier to extend gmem to hugetlb, for
one. While working on this, I was also wondering if we should perhaps be
storing the inode pointer in slot->gmem instead of the file pointer? The
memory is associated with an inode->mapping rather than the file. Are we
binding to a userspace handle on the inode (store file pointer), or are
we really referencing the inode (store inode pointer)?

The branch [1] doesn't handle the bug Sean previously mentioned at [2]:
Need to take a reference on the KVM module, so that even if guest_mem
files are not bound to any VM, the KVM module cannot be unloaded. If the
KVM module can be unloaded while guest_mem files are open, then
userspace may be able to crash the kernel by invoking guest_mem
functions that had been unloaded together with KVM.


[1] https://github.com/googleprodkernel/linux-cc/tree/gmem-delayed-binding
[2] https://lore.kernel.org/lkml/ZFWli2%2FH5M8MZRiY@google.com/
[3] https://lore.kernel.org/lkml/20230509124428.GA217130@chaop.bj.intel.com/

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-07-13 22:46                       ` Ackerley Tng
@ 2023-07-14 19:29                         ` Sean Christopherson
  2023-07-14 23:09                           ` Vishal Annapurve
  0 siblings, 1 reply; 398+ messages in thread
From: Sean Christopherson @ 2023-07-14 19:29 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: david, chao.p.peng, pbonzini, vkuznets, jmattson, joro, mail,
	vbabka, vannapurve, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Thu, Jul 13, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Fri, May 05, 2023, Ackerley Tng wrote:
> >>
> >> Hi Sean,
> >>
> >> Thanks for implementing this POC!
> >>
> >> ... snip ...
> >>
> >
> > I don't love either approach idea because it means a file created in the context
> > of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> > that it can't do anything with except close().  I doubt that matters in practice
> > though, e.g. when the VM dies, all memory can be freed so that the file ends up
> > being little more than a shell.  And if we go that route, there's no need to grab
> > a reference to the file during bind, KVM can just grab a longterm reference when
> > the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
> >
> > ... snip ...
> >
> > My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> > and not a common syscall.  If the file isn't tightly coupled to a single VM, then
> > punching a hole is further complicated by needing to deal with invalidating multiple
> > regions that are bound to different @kvm instances.  It's not super complex, but
> > AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> > having one VM own the memory will complicate even if/when we get to the point where
> > VMs can share "private" memory, and the gmem code would still need to deal with
> > grabbing a module reference.
> 
> I’d like to follow up on this discussion about a guest_mem file
> outliving the VM and whether to have a VM-scoped ioctl or a KVM ioctl.
> 
> Here's a POC of delayed binding of a guest_mem file to a memslot, where
> the guest_mem file outlives the VM [1].
> 
> I also hope to raise some points before we do the first integration of
> guest_mem patches!
> 
> A use case for guest_mem inodes outliving the VM is when the host VMM
> needs to be upgraded.

Translation: to support intra-host migration, a.k.a. KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM

> The guest_mem inode is passed between two VMs on the same host machine and
> all memory associated with the inode needs to be retained.
> 
> To support the above use case, binding of memslots is delayed until
> first use, so that the following inode passing flow can be used:
> 
> 1. Source (old version of host VMM) process passes guest_mem inodes to
>    destination (new version of host VMM) process via unix sockets.
> 2. Destination process initializes memslots identical to source process.
> 3. Destination process invokes ioctl to migrate guest_mem inode over to
>    destination process by unbinding all memslots from the source VM and
>    binding them to the destination VM. (The kvm pointer is updated in
>    this process)
> 
> Without delayed binding, step 2 will fail since initialization of
> memslots would check and find that the kvm pointer in the guest_mem
> inode points to the kvm in the source process.
> 
> 
> These two patches contain the meat of the changes required to support
> delayed binding:
> 
> https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
> https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df
> 
> Some things to highlight for the approach set out in these two patches:
> 
> 1. Previously, closing the guest_mem file in userspace is taken to mean
>    that all associated memory is to be removed and cleared. With these
>    two patches, each memslot also holds a reference to the file (and
>    therefore inode) and so even if the host VMM closes the fd, the VM
>    will be able to continue to function.
> 
>    This is desirable to userspace since closing the file should not be
>    interpreted as a command to clear memory.

100% agreed.  However, after more thought since we first discussed this, I think
that a deferred binding is the wrong way to solve this particular problem.  More
below.

>    This is aligned with the
>    way tmpfs files are used with KVM before guest_mem: when the file is
>    closed in userspace, the memory contents are still mapped and can
>    still be used by the VM. fallocate(PUNCH_HOLE) is how userspace
>    should command memory to be removed, just like munmap() would be used
>    to remove memory from use by KVM.
>
> 2. Creating a guest mem file no longer depends on a specific VM and
>    hence the guest_mem creation ioctl can be a system ioctl instead of a
>    VM specific ioctl. This will also address Chao's concern at [3].

That concern is a non-issue for QEMU as memory backends are created after
accelerators are initialized, and AFAICT Google's VMM behaves similarly.

And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
to play nice with the required ordering.  If userspace can get at /dev/kvm, then
it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
the only dependencies for KVM_CREATE_VM should be known/resolved long before the
VMM knows it wants to use gmem.

Using a non-KVM syscall would eliminate any such dependencies, but in my very
strong opinion, that is not a good reason to go with a syscall.

> I also separated cleaning up files vs inodes in
> https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
> which I believe is more aligned with how files and inodes are cleaned up
> in FSes.

I didn't take these, though I am in the process of incorporating parts of the
underlying feedback (see below).

> This alignment makes it easier to extend gmem to hugetlb, for one.

I am not convinced that utilizing hugetlb is the best way to provide 1GiB support
in gmem.  I'm not necessarily against it, but there's a fair bit of uncertainty
around the future of hugetlb, and there are fundamental aspects of hugetlb that
may be non-goals for the gmem use case, e.g. mapping the memory with 1GiB pages
in the userspace page tables likely isn't necessary, and might even be
undesirable.

And practically speaking, the changes aren't _that_ invasive, i.e. punting any 
necessary refactoring should not substantially increase the size/complexity of
hugetlb support (*if* we end up adding it).

That said, I do think we'll implement .evict_inode() very early on in order to
support SNP and TDX, because (in addition to PUNCH_HOLE) that's when the backing
memory will be freed and thus reclaimed, i.e. unassigned in the RMP (SNP) / zeroed
with the shared key ID (TDX).

> While working on this, I was also wondering if we should perhaps be
> storing the inode pointer in slot->gmem instead of the file pointer? The
> memory is associated with an inode->mapping rather than the file. Are we
> binding to a userspace handle on the inode (store file pointer), or are
> we really referencing the inode (store inode pointer)?

Conceptually, I think KVM should bind to the file.  The inode is effectively
the raw underlying physical storage, while the file is the VM's view of that
storage. 

Practically, I think that gives us a clean, intuitive way to handle intra-host
migration.  Rather than transfer ownership of the file, instantiate a new file
for the target VM, using the gmem inode from the source VM, i.e. create a hard
link.  That'd probably require new uAPI, but I don't think that will be hugely
problematic.  KVM would need to ensure the new VM's guest_memfd can't be mapped
until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
memslots/bindings are identical), but that should be easy enough to enforce.

That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
the memory and the *contents* of memory to outlive the VM, i.e. be effectively
transfered to the new target VM.  And we'll maintain the invariant that each
guest_memfd is bound 1:1 with a single VM.

As above, that should also help us draw the line between mapping memory into a
VM (file), and freeing/reclaiming the memory (inode).

There will be extra complexity/overhead as we'll have to play nice with the
possibility of multiple files per inode, e.g. to zap mappings across all files
when punching a hole, but the extra complexity is quite small, e.g. we can use
address_space.private_list to keep track of the guest_memfd instances associated
with the inode.
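
Very roughly, the hole-punch side of that could look like the sketch below
(the kvm_gmem layout and gmem_zap_range() are hypothetical stand-ins):

struct kvm_gmem {
	struct kvm *kvm;
	struct list_head entry;	/* linked on inode->i_mapping->private_list */
};

static void gmem_punch_hole(struct inode *inode, pgoff_t start, pgoff_t end)
{
	struct address_space *mapping = inode->i_mapping;
	struct kvm_gmem *gmem;

	/* Zap the GPA mappings of every guest_memfd instance bound to this inode. */
	list_for_each_entry(gmem, &mapping->private_list, entry)
		gmem_zap_range(gmem, start, end);	/* hypothetical helper */
}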

Setting aside TDX and SNP for the moment, as it's not clear how they'll support
memory that is "private" but shared between multiple VMs, I think per-VM files
would work well for sharing gmem between two VMs.  E.g. would allow a given page
to be bound to a different gfn for each VM, would allow having different permissions
for each file (e.g. to allow fallocate() only from the original owner).

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-07-14 19:29                         ` Sean Christopherson
@ 2023-07-14 23:09                           ` Vishal Annapurve
  2023-07-15  0:30                             ` Sean Christopherson
  0 siblings, 1 reply; 398+ messages in thread
From: Vishal Annapurve @ 2023-07-14 23:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, david, chao.p.peng, pbonzini, vkuznets, jmattson,
	joro, mail, vbabka, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <seanjc@google.com> wrote:
> ...
> And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
> the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
> to play nice with the required ordering.  If userspace can get at /dev/kvm, then
> it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
> the only dependencies for KVM_CREATE_VM should be known/resolved long before the
> VMM knows it wants to use gmem.

I am not sure about the benefits of tying gmem creation to any given
kvm instance. I think the most important requirement here is that a
given gmem range is always tied to a single VM - This can be enforced
when memslots are bound to the gmem files.

I believe "Required ordering" is that gmem files are created first and
then supplied while creating the memslots whose gpa ranges can
generate private memory accesses.
Is there any other ordering we want to enforce here?

> ...
> Practically, I think that gives us a clean, intuitive way to handle intra-host
> migration.  Rather than transfer ownership of the file, instantiate a new file
> for the target VM, using the gmem inode from the source VM, i.e. create a hard
> link.  That'd probably require new uAPI, but I don't think that will be hugely
> problematic.  KVM would need to ensure the new VM's guest_memfd can't be mapped
> until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> memslots/bindings are identical), but that should be easy enough to enforce.
>
> That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> transfered to the new target VM.  And we'll maintain the invariant that each
> guest_memfd is bound 1:1 with a single VM.
>
> As above, that should also help us draw the line between mapping memory into a
> VM (file), and freeing/reclaiming the memory (inode).
>
> There will be extra complexity/overhead as we'll have to play nice with the
> possibility of multiple files per inode, e.g. to zap mappings across all files
> when punching a hole, but the extra complexity is quite small, e.g. we can use
> address_space.private_list to keep track of the guest_memfd instances associated
> with the inode.

Are we talking about a different usecase of sharing gmem fd across VMs
other than intra-host migration?
If not, ideally only one of the files should be catering to the guest
memory mappings at any given time. i.e. any inode should be ideally
bound to (through the file) a single kvm instance, as we are
planning to ensure that guest_memfd can't be mapped until
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.

^ permalink raw reply	[flat|nested] 398+ messages in thread

* Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)
  2023-07-14 23:09                           ` Vishal Annapurve
@ 2023-07-15  0:30                             ` Sean Christopherson
  0 siblings, 0 replies; 398+ messages in thread
From: Sean Christopherson @ 2023-07-15  0:30 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, david, chao.p.peng, pbonzini, vkuznets, jmattson,
	joro, mail, vbabka, yu.c.zhang, kirill.shutemov, dhildenb,
	qperret, tabba, michael.roth, wei.w.wang, rppt, liam.merwick,
	isaku.yamahata, jarkko, kvm, linux-kernel, hughd, brauner

On Fri, Jul 14, 2023, Vishal Annapurve wrote:
> On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <seanjc@google.com> wrote:
> > ...
> > And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
> > the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
> > to play nice with the required ordering.  If userspace can get at /dev/kvm, then
> > it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
> > the only dependencies for KVM_CREATE_VM should be known/resolved long before the
> > VMM knows it wants to use gmem.
> 
> I am not sure about the benefits of tying gmem creation to any given
> kvm instance.

IMO, making gmem->kvm immutable is very nice to have, e.g. gmem->kvm will always be
valid and the refcounting rules are fairly straightforward.  

> I think the most important requirement here is that a given gmem range is always
> tied to a single VM 

I'm not convinced that that requirement will always hold true (see below).

> This can be enforced when memslots are bound to the gmem files.

Yeah, but TBH, waiting until the guest faults in memory to detect an invalid memslot
is gross.  And looking more closely, taking filemap_invalidate_lock(), i.e. taking
a semaphore for write, in the page fault path is a complete non-starter.  The
"if (existing_slot == slot)" check is likely a non-starter, because KVM handles
FLAGS_ONLY memslot updates, e.g. toggling dirty logging, by duplicating and
replacing the memslot, not by updating the live memslot.

> I believe "Required ordering" is that gmem files are created first and
> then supplied while creating the memslots whose gpa ranges can
> generate private memory accesses.
> Is there any other ordering we want to enforce here?

I wasn't talking about enforcing arbitrary ordering, I was simply talking about
what userspace literally needs to be able to do KVM_CREATE_GUEST_MEMFD.

> > Practically, I think that gives us a clean, intuitive way to handle intra-host
> > migration.  Rather than transfer ownership of the file, instantiate a new file
> > for the target VM, using the gmem inode from the source VM, i.e. create a hard
> > link.  That'd probably require new uAPI, but I don't think that will be hugely
> > problematic.  KVM would need to ensure the new VM's guest_memfd can't be mapped
> > until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> > memslots/bindings are identical), but that should be easy enough to enforce.
> >
> > That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> > the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> > transfered to the new target VM.  And we'll maintain the invariant that each
> > guest_memfd is bound 1:1 with a single VM.
> >
> > As above, that should also help us draw the line between mapping memory into a
> > VM (file), and freeing/reclaiming the memory (inode).
> >
> > There will be extra complexity/overhead as we'll have to play nice with the
> > possibility of multiple files per inode, e.g. to zap mappings across all files
> > when punching a hole, but the extra complexity is quite small, e.g. we can use
> > address_space.private_list to keep track of the guest_memfd instances associated
> > with the inode.
> 
> Are we talking about a different usecase of sharing gmem fd across VMs
> other than intra-host migration?

Well, I am :-)  I don't want to build all of this on an assumption that we'll
never ever want to share a guest_memfd across multiple VMs.  E.g. SEV (and SEV-ES?)
already have the migration helper concept, and I've heard more than a few rumblings
of TDX utilizing helper TDs.  IMO, it's not far fetched at all to think that there
will eventually be a need to let multiple VMs share a guest_memfd.

> If not, ideally only one of the files should be catering to the guest
> memory mappings at any given time. i.e. any inode should be ideally
> bound to (through the file) a single kvm instance,

Why?  Honest question, what does it buy us?

For TDX and SNP intra-host migration, it should be easy enough to ensure the new
VM can't create mappings before migration, and that the old VM can't create mappings
or run after migration.  I don't see that being any harder if the source and
dest use different files.
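
E.g. something as simple as a flag on the destination that faults check, set
only once ownership has actually been transferred.  Purely illustrative, the
arch field below is made up, while kvm_vm_dead()/kvm->vm_dead is the existing
mechanism that already fences off the source:

  static bool kvm_gmem_mappable(struct kvm *kvm)
  {
  	/*
  	 * Set when the VM is created as a migration target, cleared by
  	 * KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM once the source's context (and
  	 * bindings) have been verified and moved over.
  	 */
  	return !READ_ONCE(kvm->arch.gmem_migration_pending);
  }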

FWIW, it might be easier to hold off on this discussion until I post the RFC
(which is going to happen on Monday at this point), as then we'll have actual code
to discuss.

> as we are planning to ensure that guest_memfd can't be mapped until
> KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.

Thread overview: 398+ messages
2022-07-06  8:20 [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
2022-07-06  8:20 ` [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Chao Peng
2022-07-21  9:44   ` David Hildenbrand
2022-07-21  9:50     ` David Hildenbrand
2022-07-21 15:05       ` Sean Christopherson
2022-07-25 13:46         ` Chao Peng
2022-07-21 10:27     ` Gupta, Pankaj
2022-07-25 13:54       ` Chao Peng
2022-07-25 14:49         ` Gupta, Pankaj
2022-07-25 13:42     ` Chao Peng
2022-08-05 17:55     ` Paolo Bonzini
2022-08-05 18:06       ` David Hildenbrand
2022-08-10  9:40         ` Chao Peng
2022-08-10  9:38       ` Chao Peng
2022-08-17 23:41       ` Kirill A. Shutemov
2022-08-18  9:09         ` Paolo Bonzini
2022-08-23  7:36         ` David Hildenbrand
2022-08-24 10:20           ` Chao Peng
2022-08-26 15:19   ` Fuad Tabba
2022-08-29 15:18     ` Chao Peng
2022-07-06  8:20 ` [PATCH v7 02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE Chao Peng
2022-08-05 13:11   ` David Hildenbrand
2022-07-06  8:20 ` [PATCH v7 03/14] mm: Introduce memfile_notifier Chao Peng
2022-08-05 13:22   ` David Hildenbrand
2022-08-10  9:22     ` Chao Peng
2022-08-10 10:05       ` David Hildenbrand
2022-08-10 14:38         ` Sean Christopherson
2022-08-11 12:27           ` Quentin Perret
2022-08-11 13:39             ` Chao Peng
2022-07-06  8:20 ` [PATCH v7 04/14] mm/shmem: Support memfile_notifier Chao Peng
2022-07-12 18:02   ` Gupta, Pankaj
2022-07-13  7:44     ` Chao Peng
2022-07-13 10:01       ` Gupta, Pankaj
2022-07-13 23:49         ` Chao Peng
2022-07-14  4:15           ` Gupta, Pankaj
2022-08-05 13:26   ` David Hildenbrand
2022-08-10  9:25     ` Chao Peng
2022-07-06  8:20 ` [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
2022-08-05 13:28   ` David Hildenbrand
2022-08-10  9:37     ` Chao Peng
2022-08-10  9:55       ` David Hildenbrand
2022-08-11 13:17         ` Chao Peng
2022-09-07 16:18     ` Kirill A. Shutemov
2022-07-06  8:20 ` [PATCH v7 06/14] KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS Chao Peng
2022-07-06  8:20 ` [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
2022-07-15 11:36   ` Gupta, Pankaj
2022-07-18 13:29     ` Chao Peng
2022-07-18 15:26       ` Sean Christopherson
2022-07-19 14:02         ` Chao Peng
2022-08-04  7:10   ` Isaku Yamahata
2022-08-10  8:19     ` Chao Peng
2022-07-06  8:20 ` [PATCH v7 08/14] KVM: Rename mmu_notifier_* Chao Peng
2022-07-29 19:02   ` Sean Christopherson
2022-08-03 10:13     ` Chao Peng
2022-08-05 19:54     ` Paolo Bonzini
2022-08-10  8:09       ` Chao Peng
2023-05-23  7:19   ` Kautuk Consul
2023-05-23 14:19     ` Sean Christopherson
2023-05-24  6:12       ` Kautuk Consul
2023-05-24 20:16         ` Sean Christopherson
2023-05-24 20:33           ` Peter Zijlstra
2023-05-24 21:39             ` Sean Christopherson
2023-05-25  8:54               ` Peter Zijlstra
2023-05-25  3:52             ` Kautuk Consul
2023-05-24 20:28         ` Peter Zijlstra
2022-07-06  8:20 ` [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-07-29 19:51   ` Sean Christopherson
2022-08-03 10:08     ` Chao Peng
2022-08-03 14:42       ` Sean Christopherson
2022-07-06  8:20 ` [PATCH v7 10/14] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
2022-07-06  8:20 ` [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Chao Peng
2022-07-19  8:00   ` Gupta, Pankaj
2022-07-19 14:08     ` Chao Peng
2022-07-19 14:23       ` Gupta, Pankaj
2022-07-20 15:07         ` Chao Peng
2022-07-20 15:31           ` Gupta, Pankaj
2022-07-20 16:21             ` Sean Christopherson
2022-07-20 17:41               ` Gupta, Pankaj
2022-07-21  7:34               ` Wei Wang
2022-07-21  9:29                 ` Chao Peng
2022-07-21 17:58                   ` Sean Christopherson
2022-07-25 13:04                     ` Chao Peng
2022-07-29 19:54                       ` Sean Christopherson
2022-08-02  0:49                         ` Sean Christopherson
2022-08-02 16:38                           ` Sean Christopherson
2022-08-03  9:48                             ` Chao Peng
2022-08-03 15:51                               ` Sean Christopherson
2022-08-04  7:58                                 ` Chao Peng
2022-07-20 16:44   ` Sean Christopherson
2022-07-21  9:37     ` Chao Peng
2022-08-19 19:37   ` Vishal Annapurve
2022-08-24 10:37     ` Chao Peng
2022-08-26 15:19   ` Fuad Tabba
2022-08-29 15:21     ` Chao Peng
2022-07-06  8:20 ` [PATCH v7 12/14] KVM: Handle page fault for private memory Chao Peng
2022-07-29 20:58   ` Sean Christopherson
2022-08-03  9:52     ` Chao Peng
2022-07-06  8:20 ` [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-07-19  9:55   ` Gupta, Pankaj
2022-07-19 14:12     ` Chao Peng
2022-07-06  8:20 ` [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag Chao Peng
2022-08-01 14:40   ` Dave Hansen
2022-08-03  9:53     ` Chao Peng
2022-07-13  3:58 ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Gupta, Pankaj
2022-07-13  7:57   ` Chao Peng
2022-07-13 10:35     ` Gupta, Pankaj
2022-07-13 23:59       ` Chao Peng
2022-07-14  4:39         ` Gupta, Pankaj
2022-07-14  5:06           ` Gupta, Pankaj
2022-07-14  4:29       ` Andy Lutomirski
2022-07-14  5:13         ` Gupta, Pankaj
2022-08-11 10:02 ` Nikunj A. Dadhania
2022-08-11 11:30   ` Gupta, Pankaj
2022-08-11 13:32     ` Chao Peng
2022-08-11 17:28       ` Nikunj A. Dadhania
2022-08-12  3:22       ` Nikunj A. Dadhania
2022-08-11 17:18     ` Nikunj A. Dadhania
2022-08-11 23:02       ` Gupta, Pankaj
2022-08-12  6:02         ` Gupta, Pankaj
2022-08-12  7:18           ` Gupta, Pankaj
2022-08-12  8:48             ` Nikunj A. Dadhania
2022-08-12  9:33               ` Gupta, Pankaj
2022-08-15 13:04               ` Chao Peng
2022-08-16  4:28                 ` Nikunj A. Dadhania
2022-08-16 11:33                 ` Gupta, Pankaj
2022-08-16 12:24                   ` Kirill A . Shutemov
2022-08-16 13:03                     ` Gupta, Pankaj
2022-08-16 15:38                       ` Sean Christopherson
2022-08-17 15:27                         ` Michael Roth
2022-08-23  1:25                           ` Isaku Yamahata
2022-08-23 17:41                         ` Gupta, Pankaj
2022-08-18  5:40 ` Hugh Dickins
2022-08-18 13:24   ` Kirill A . Shutemov
2022-08-19  0:20     ` Sean Christopherson
2022-08-19  3:38       ` Hugh Dickins
2022-08-19 22:53         ` Sean Christopherson
2022-08-23  7:55         ` David Hildenbrand
2022-08-23 16:05           ` Sean Christopherson
2022-08-24  9:41             ` Chao Peng
2022-09-09  4:55               ` Andy Lutomirski
2022-08-19  3:00     ` Hugh Dickins
2022-08-20  0:27       ` Kirill A. Shutemov
2022-08-21  5:15         ` Hugh Dickins
2022-08-31 14:24           ` Kirill A . Shutemov
2022-09-02 10:27             ` Chao Peng
2022-09-02 12:30               ` Kirill A . Shutemov
2022-09-08  1:10             ` Kirill A. Shutemov
2022-09-13  9:44               ` Sean Christopherson
2022-09-13 13:28                 ` Kirill A. Shutemov
2022-09-13 14:53                   ` Sean Christopherson
2022-09-13 16:00                     ` Kirill A. Shutemov
2022-09-13 16:12                       ` Sean Christopherson
2022-09-09  4:48         ` Andy Lutomirski
2022-09-09 14:32           ` Kirill A . Shutemov
2022-09-09 19:11             ` Andy Lutomirski
2022-09-09 23:02               ` Kirill A . Shutemov
2022-08-21 10:27       ` Matthew Wilcox
2022-08-24 10:27         ` Chao Peng
2022-09-09  4:44     ` Andy Lutomirski
2022-08-26 15:19 ` Fuad Tabba
2022-08-29 15:17   ` Chao Peng
2022-08-31  9:12     ` Fuad Tabba
2022-09-02 10:19       ` Chao Peng
2022-09-09 15:35 ` Michael Roth
2022-12-02  6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
2022-12-02  6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
2022-12-06 14:57   ` Fuad Tabba
2022-12-07 13:50     ` Chao Peng
2022-12-13 23:49   ` Huang, Kai
2022-12-19  7:53     ` Chao Peng
2022-12-19  8:48       ` Huang, Kai
2022-12-20  7:22         ` Chao Peng
2022-12-20  8:33           ` Huang, Kai
2022-12-21 13:39             ` Chao Peng
2022-12-22  0:37               ` Huang, Kai
2022-12-23  8:20                 ` Chao Peng
2023-01-23 14:03                 ` Vlastimil Babka
2023-01-23 15:18                   ` Kirill A. Shutemov
2023-02-13 14:23                     ` Vlastimil Babka
2023-01-23 23:01                   ` Huang, Kai
2023-01-23 23:38                     ` Sean Christopherson
2023-01-24  7:51                       ` Vlastimil Babka
2022-12-22 18:15               ` Sean Christopherson
2022-12-23  0:50                 ` Huang, Kai
2022-12-23  8:24                 ` Chao Peng
2023-01-23 15:43                 ` Kirill A. Shutemov
2023-02-13 11:43                   ` Vlastimil Babka
2023-02-13 13:10                   ` Michael Roth
2023-01-13 21:54   ` Sean Christopherson
2023-01-17 12:41     ` Chao Peng
2023-01-17 16:34       ` Sean Christopherson
2023-01-18  8:16         ` Chao Peng
2023-01-18 10:17           ` Isaku Yamahata
2023-02-22  2:07     ` Alexey Kardashevskiy
2023-02-24  5:42       ` Chao Peng
2023-01-30  5:26   ` Ackerley Tng
2023-01-30  6:04     ` Wang, Wei W
2023-02-16  9:51   ` Nikunj A. Dadhania
2023-03-20 19:08     ` Michael Roth
2023-04-13 15:25   ` [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory Christian Brauner
2023-04-13 22:28     ` Sean Christopherson
2023-04-14 22:38       ` Ackerley Tng
2023-04-14 23:26         ` Sean Christopherson
2023-04-15  0:06           ` Sean Christopherson
2023-04-19  8:29       ` Christian Brauner
2023-04-20  0:49         ` Sean Christopherson
2023-04-20  8:35           ` Christian Brauner
2023-04-13 17:22   ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Ackerley Tng
2022-12-02  6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
2022-12-06 13:34   ` Fabiano Rosas
2022-12-07 14:31     ` Chao Peng
2022-12-06 15:07   ` Fuad Tabba
2022-12-07 14:51     ` Chao Peng
2022-12-16 15:09   ` Borislav Petkov
2022-12-19  8:15     ` Chao Peng
2022-12-19 10:17       ` Borislav Petkov
2022-12-20  7:24         ` Chao Peng
2022-12-28  8:28   ` Chenyi Qiang
2023-01-03  1:39     ` Chao Peng
2023-01-03  3:32       ` Wang, Wei W
2023-01-03 23:06         ` Sean Christopherson
2023-01-05  4:39           ` Chao Peng
2023-01-13 22:02   ` Sean Christopherson
2023-01-17  3:21   ` Binbin Wu
2023-01-17 13:30     ` Chao Peng
2023-01-17 17:25       ` Sean Christopherson
2023-02-09  7:25   ` Isaku Yamahata
2023-02-10  0:35     ` Sean Christopherson
2023-02-13 23:53       ` Isaku Yamahata
2023-02-14 18:07         ` Sean Christopherson
2023-05-19 17:32   ` Nicolas Saenz Julienne
2023-05-19 18:23     ` Sean Christopherson
2023-05-19 19:49       ` Nicolas Saenz Julienne
2023-05-19 19:57         ` Sean Christopherson
2023-05-23 18:59       ` Nicolas Saenz Julienne
2022-12-02  6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-12-05  9:03   ` Fuad Tabba
2022-12-06 11:53     ` Chao Peng
2022-12-06 12:39       ` Fuad Tabba
2022-12-07 15:10         ` Chao Peng
2022-12-08  8:37   ` Xiaoyao Li
2022-12-08 11:30     ` Chao Peng
2022-12-13 12:04       ` Xiaoyao Li
2022-12-19  7:50         ` Chao Peng
2022-12-19 14:36   ` Borislav Petkov
2022-12-20  7:43     ` Chao Peng
2022-12-20  9:55       ` Borislav Petkov
2022-12-21 13:42         ` Chao Peng
2023-01-05 11:23   ` Jarkko Sakkinen
2023-01-06  9:40     ` Chao Peng
2023-01-09 19:32       ` Sean Christopherson
2023-01-10  9:14         ` Chao Peng
2023-01-10 22:51           ` Vishal Annapurve
2023-01-13 22:37           ` Sean Christopherson
2023-01-17 12:42             ` Chao Peng
2023-01-20 23:42           ` Jarkko Sakkinen
2023-01-20 23:28         ` Jarkko Sakkinen
2022-12-02  6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
2022-12-06 15:47   ` Fuad Tabba
2022-12-07 15:11     ` Chao Peng
2023-01-13 23:13   ` Sean Christopherson
2022-12-02  6:13 ` [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
2022-12-05  9:23   ` Fuad Tabba
2022-12-06 11:56     ` Chao Peng
2022-12-06 15:48       ` Fuad Tabba
2022-12-09  6:24         ` Chao Peng
2022-12-07  6:34       ` Isaku Yamahata
2022-12-07 15:14         ` Chao Peng
2022-12-02  6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
2022-12-07  8:13   ` Yuan Yao
2022-12-08 11:20     ` Chao Peng
2022-12-09  5:43       ` Yuan Yao
2022-12-07 17:16   ` Fuad Tabba
2022-12-08 11:13     ` Chao Peng
2022-12-09  8:57       ` Fuad Tabba
2022-12-12  7:22         ` Chao Peng
2022-12-13 23:51   ` Huang, Kai
2022-12-19  7:54     ` Chao Peng
2023-01-13 22:50   ` Sean Christopherson
2022-12-02  6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
2022-12-05 22:49   ` Isaku Yamahata
2022-12-06 12:02     ` Chao Peng
2022-12-07  6:42       ` Isaku Yamahata
2022-12-08 11:17         ` Chao Peng
2023-01-13 23:12   ` Sean Christopherson
2023-01-13 23:16   ` Sean Christopherson
2023-01-28 13:54     ` Chao Peng
2022-12-02  6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
2022-12-08  2:29   ` Yuan Yao
2022-12-08 11:23     ` Chao Peng
2022-12-09  5:45       ` Yuan Yao
2022-12-09  9:01   ` Fuad Tabba
2022-12-12  7:23     ` Chao Peng
2023-01-13 23:29   ` Sean Christopherson
2022-12-02  6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-12-09  9:11   ` Fuad Tabba
2023-01-05 20:38   ` Vishal Annapurve
2023-01-06  4:13     ` Chao Peng
2023-01-14  0:01   ` Sean Christopherson
2023-01-17 13:12     ` Chao Peng
2023-01-17 19:35       ` Sean Christopherson
2023-01-18  8:23         ` Chao Peng
2023-01-28 14:00     ` Chao Peng
2023-03-08  0:13       ` Ackerley Tng
2023-03-08  7:40         ` Chao Peng
2023-03-23  0:41           ` Isaku Yamahata
2023-03-24  2:10             ` Chao Peng
2023-03-24  2:29               ` Xiaoyao Li
2023-03-28 10:41                 ` Chao Peng
2023-04-14 21:08                   ` Sean Christopherson
2023-04-18 23:38                     ` Ackerley Tng
2023-04-25 23:01                       ` Sean Christopherson
2023-03-07 19:14   ` Ackerley Tng
2023-03-07 20:27     ` Sean Christopherson
2023-01-14  0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
2023-01-16 13:48   ` Kirill A. Shutemov
2023-01-17 13:19   ` Chao Peng
2023-01-17 14:32   ` Fuad Tabba
2023-01-19 11:13   ` Isaku Yamahata
2023-01-19 15:25     ` Sean Christopherson
2023-01-19 22:37       ` Isaku Yamahata
2023-01-24  1:27         ` Sean Christopherson
2023-02-08 12:24           ` Isaku Yamahata
2023-02-13 13:01           ` Michael Roth
2023-02-21 12:11             ` Chao Peng
2023-03-23  1:27               ` Michael Roth
2023-03-24  2:13                 ` Chao Peng
2023-04-12 22:01                 ` Sean Christopherson
2023-04-17 14:37           ` Chao Peng
2023-04-17 15:01             ` Sean Christopherson
2023-01-24 16:08   ` Liam Merwick
2023-01-25  0:20     ` Sean Christopherson
2023-01-25 12:53       ` Kirill A. Shutemov
2023-01-25 16:01         ` Liam Merwick
2023-04-13  1:07         ` Sean Christopherson
2023-04-13 16:04           ` Kirill A. Shutemov
2023-02-16  5:13 ` Mike Rapoport
2023-02-16  9:41   ` David Hildenbrand
2023-02-22 21:53     ` Sean Christopherson
2023-04-17 15:40 ` Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM) Sean Christopherson
2023-04-17 15:48   ` David Hildenbrand
2023-04-17 16:40     ` Sean Christopherson
2023-04-17 17:09       ` David Hildenbrand
2023-04-17 19:16         ` Sean Christopherson
2023-04-18  8:53           ` Fuad Tabba
2023-04-18  9:10           ` David Hildenbrand
2023-04-19  0:47             ` Sean Christopherson
2023-04-19  7:21               ` David Hildenbrand
2023-04-19 15:17                 ` Sean Christopherson
2023-04-19 15:27                   ` David Hildenbrand
2023-04-22  1:33                 ` Sean Christopherson
2023-05-05 19:39                   ` Ackerley Tng
2023-05-06  0:55                     ` Sean Christopherson
2023-05-06  1:17                       ` Vishal Annapurve
2023-05-15 23:46                       ` Sean Christopherson
2023-07-13 22:46                       ` Ackerley Tng
2023-07-14 19:29                         ` Sean Christopherson
2023-07-14 23:09                           ` Vishal Annapurve
2023-07-15  0:30                             ` Sean Christopherson
2023-05-09 12:44                     ` Chao Peng
2023-05-10 17:26                   ` Vishal Annapurve
2023-05-10 20:23                     ` Vishal Annapurve
2023-05-10 21:39                     ` Sean Christopherson
2023-05-10 23:03                       ` Vishal Annapurve
2023-05-11 20:22                         ` Sean Christopherson
2023-05-19  1:07                           ` Vishal Annapurve
2023-05-12  0:21                   ` Michael Roth
2023-05-12 18:01                     ` Sean Christopherson
2023-05-22 13:50                       ` Michael Roth
2023-05-22 17:09                         ` Sean Christopherson
2023-05-22 23:58                           ` Michael Roth
2023-05-23  0:21                             ` Sean Christopherson
2023-06-06 19:14                   ` Ackerley Tng
2023-06-06 23:25                     ` Sean Christopherson
2023-06-08 17:13                       ` Ackerley Tng
2023-04-17 17:11       ` Ackerley Tng
2023-04-17 18:17         ` Sean Christopherson
2023-04-18 17:01       ` Ackerley Tng
2023-04-23 13:28     ` Jarkko Sakkinen
2023-05-05 20:00       ` David Hildenbrand
2023-05-06  7:44         ` Vlastimil Babka
2023-05-06  9:16           ` David Hildenbrand
2023-04-23 13:14   ` Jarkko Sakkinen
2023-03-31 23:50 [RFC PATCH v3 0/2] Providing mount in memfd_restricted() syscall Ackerley Tng
2023-03-31 23:50 ` [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted Ackerley Tng
2023-04-03  8:21   ` David Hildenbrand
2023-04-05 22:29     ` Ackerley Tng
2023-04-04  8:25   ` Kirill A. Shutemov
2023-04-05 22:32     ` Ackerley Tng
2023-04-04 13:53   ` Christian Brauner
2023-04-04 14:58     ` Christian Brauner
2023-04-05 21:58       ` Ackerley Tng
2023-04-12  9:59         ` Christian Brauner
2023-04-13 22:53           ` Ackerley Tng
2023-04-13 23:07             ` Sean Christopherson
2023-03-31 23:50 ` [RFC PATCH v3 2/2] selftests: restrictedmem: Check hugepage-ness of shmem file backing restrictedmem fd Ackerley Tng
2023-04-03  8:24   ` David Hildenbrand
2023-04-11  1:35     ` Ackerley Tng
