* [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
@ 2022-10-25 15:13 Chao Peng
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                   ` (9 more replies)
  0 siblings, 10 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can occur and crash the
running host system, which is unacceptable for multi-tenant
configurations. Such host accesses include those from KVM userspace,
e.g. QEMU. This series addresses the KVM userspace induced crashes by
introducing new mm and KVM interfaces so that KVM userspace can still
manage guest memory via an fd-based approach, but can never access the
guest memory content.

The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick up these patches.
Any other reviews are always welcome.
  - 01: mm change, targeted for the mm tree
  - 02-08: KVM changes, targeted for the KVM tree

Given that KVM is currently the only user of the mm part, I have chatted
with Paolo and he is OK with merging the mm change through the KVM tree,
but reviewed-by/acked-by tags are still expected from the mm people.

The patches have been verified in an Intel TDX environment. In addition,
Vishal has done excellent work on selftests[4] dedicated to this series,
making it possible to test the series without special hardware or the
elaborate steps of building a VM environment. See the Test section below
for more info.


Introduction
============
KVM userspace being able to crash the host is unacceptable. Under the
current KVM architecture, all guest memory is inherently accessible from
KVM userspace and is therefore exposed to the crash issue mentioned
above. The goal of this series is to align mm and KVM on an approach
that exposes guest memory without making it accessible to userspace.

Normally, KVM populates the secondary page table (e.g. EPT) by using a
host virtual address (hva) from the core mm page table (e.g. the x86
userspace page table). This requires guest memory to be mmapped into KVM
userspace, which is also where the mentioned crash issue can originate.
In theory, apart from the 'shared' memory used for device emulation
etc., guest memory does not have to be mmapped into KVM userspace at all.

This series introduces fd-based guest memory which is never mmapped into
KVM userspace. KVM populates the secondary page table by using an
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs, and KVM
interacts with it directly through a newly introduced in-kernel
interface, thereby removing KVM userspace from the path of
accessing/mmapping guest memory.

Kirill had a patch [2] that addressed the same issue in a different way:
it tracks guest encrypted memory at the 'struct page' level and relies
on HWPOISON to reject userspace access. That patch was discussed in
several online and offline threads and resulted in a design document [3],
which is also the original proposal for this series. This series has
since evolved as more comments were received from the community, but the
major concepts in [3] still hold true, so it is recommended reading.

The patch series may also be useful for other usages; for example, a
pure software approach may use it to harden itself against unintentional
access to guest memory. This series is designed with such usages in mind
but does not include code to directly support them, so extensions may be
needed.


mm change
=========
Introduces a new memfd_restricted system call which creates a memory
file that is restricted from ordinary userspace access (read(), write(),
mmap(), etc.); the only way to use it is to pass the fd to another
kernel component like KVM, which accesses it through the newly added
restrictedmem kernel interface. The restrictedmem interface bridges the
memory file subsystems (tmpfs/hugetlbfs etc.) and their users (KVM in
this case) and provides bi-directional communication between them.
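
For reference, below is a minimal, illustrative sketch (not code from
this series) of how an in-kernel consumer such as KVM might plug into
the restrictedmem interface; the my_* names and empty callback bodies
are placeholders:

  #include <linux/restrictedmem.h>

  static void my_invalidate_start(struct restrictedmem_notifier *notifier,
                                  pgoff_t start, pgoff_t end)
  {
          /* Zap secondary page table entries covering [start, end). */
  }

  static void my_invalidate_end(struct restrictedmem_notifier *notifier,
                                pgoff_t start, pgoff_t end)
  {
          /* Allow new mappings for the range to be established again. */
  }

  static const struct restrictedmem_notifier_ops my_notifier_ops = {
          .invalidate_start = my_invalidate_start,
          .invalidate_end   = my_invalidate_end,
  };

  static struct restrictedmem_notifier my_notifier = {
          .ops = &my_notifier_ops,
  };

  /* 'file' is the struct file behind the fd from memfd_restricted(). */
  static int my_map_page(struct file *file, pgoff_t offset)
  {
          struct page *page;
          int order, ret;

          if (!file_is_restrictedmem(file))
                  return -EINVAL;

          restrictedmem_register_notifier(file, &my_notifier);

          ret = restrictedmem_get_page(file, offset, &page, &order);
          if (ret)
                  return ret;

          /* Use page_to_pfn(page)/order to populate the secondary page
           * table, then drop the reference taken by restrictedmem_get_page().
           */
          put_page(page);
          return 0;
  }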


KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory from
an fd. With this extension, a single memslot can maintain both private
memory through a private fd (restricted_fd/restricted_offset) and shared
(unencrypted) memory through a userspace-mmapped host virtual address
(userspace_addr). For a particular guest page, the corresponding page in
the KVM memslot can only be either private or shared, and only one of
the shared/private parts of the memslot is visible to the guest. For how
this new extension is used in QEMU, please refer to kvm_set_phys_mem()
in the TDX-enabled QEMU repo below.
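
For a rough idea of the userspace side, here is a hedged sketch (not the
QEMU code referenced above) of wiring a restricted fd into a memslot
with the extended struct introduced by this series; set_private_slot()
is an illustrative name and error handling is omitted:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Illustrative only; assumes the uapi headers from this series and
   * that KVM_MEM_PRIVATE has been enabled by the final patch. */
  static int set_private_slot(int vm_fd, int restricted_fd, __u64 gpa,
                              __u64 size, void *shared_hva)
  {
          struct kvm_userspace_memory_region_ext ext = {
                  .region = {
                          .slot            = 0,
                          .flags           = KVM_MEM_PRIVATE,
                          .guest_phys_addr = gpa,
                          .memory_size     = size,
                          /* shared part of the slot */
                          .userspace_addr  = (__u64)(unsigned long)shared_hva,
                  },
                  .restricted_fd     = restricted_fd,
                  .restricted_offset = 0,
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
  }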

Introduces a new KVM_EXIT_MEMORY_FAULT exit that gives userspace the
chance to make decisions about shared <-> private memory conversion. The
exit can be triggered by an implicit conversion in the KVM page fault
handler or by an explicit conversion request from the guest OS.

Extends the existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
convert a guest page between private <-> shared. The data maintained by
these ioctls records whether a guest page is private or shared, and this
information is used in the KVM page fault handler to decide whether the
private or the shared part of the memslot is visible to the guest.
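
A hedged sketch of the userspace side of such a conversion (illustrative
helper name; with this series the addr/size of kvm_enc_region are
interpreted as guest physical addresses):

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Illustrative only: flip [gpa, gpa + size) between private and shared. */
  static int convert_range(int vm_fd, __u64 gpa, __u64 size, bool to_private)
  {
          struct kvm_enc_region range = {
                  .addr = gpa,
                  .size = size,
          };

          return ioctl(vm_fd, to_private ? KVM_MEMORY_ENCRYPT_REG_REGION :
                                           KVM_MEMORY_ENCRYPT_UNREG_REGION,
                       &range);
  }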


Test
====
Ran two kinds of tests:
  - Selftests [4] from Vishal and VM boot tests in non-TDX environment
    Code also in below repo: https://github.com/chao-p/linux/tree/privmem-v9

  - Functional tests in TDX capable environment
    Tested the new functionalities in TDX environment. Code repos:
    Linux: https://github.com/chao-p/linux/tree/privmem-v9-tdx
    QEMU: https://github.com/chao-p/qemu/tree/privmem-v9

    An example QEMU command line for TDX test:
    -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
    -machine confidential-guest-support=tdx \
    -object memory-backend-memfd-private,id=ram1,size=${mem} \
    -machine memory-backend=ram1


TODO
====
  - Page accounting and limiting for encrypted memory
  - hugetlbfs support


Changelog
=========
v9:
  - mm: move inaccessible memfd into separated syscall.
  - mm: return page instead of pfn_t for inaccessible_get_pfn and remove
    inaccessible_put_pfn.
  - KVM: rename inaccessible/private to restricted and CONFIG change to
    make the code friendly to pKVM.
  - KVM: add invalidate_begin/end pair to fix race contention and revise
    the lock protection for invalidation path.
  - KVM: optimize setting lpage_info for > 2M level by direct accessing
    lower level's result.
  - KVM: avoid load xarray in kvm_mmu_max_mapping_level() and instead let
    the caller to pass in is_private.
  - KVM: API doc improvement.
v8:
  - mm: redesign the mm part by introducing a shim layer (inaccessible_memfd)
    in memfd to avoid touching the memory file systems directly.
  - mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and causes
    confusion in this series; it will be sent out separately.
  - doc: exclude the man page change, it's not kernel patch and will
    send out separately.
  - KVM: adapt to use the new mm inaccessible_memfd interface.
  - KVM: update lpage_info when setting mem_attr_array to support
    large page.
  - KVM: change from xa_store_range to xa_store for mem_attr_array because
    xa_store_range overrides all entries, which is not the intended
    behavior for us.
  - KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
  - KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
    handling code suggested by Sean.
v7:
  - mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
  - KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
    private/shared info.
  - KVM: use similar sync mechanism between zap/page fault paths as
    mmu_notifier for memfile_notifier based invalidation.
v6:
  - mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
    feature consistency among all memfile_notifier users and get rid of
    internal flags like SHM_F_INACCESSIBLE.
  - mm: make pfn_ops callbacks being members of memfile_backing_store
    and then refer to it directly in memfile_notifier.
  - mm: remove backing store unregister.
  - mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
  - KVM: reorganize patch sequence for page fault handling and private
    memory enabling.
v5:
  - Add a man page for the MFD_INACCESSIBLE flag and improve the KVM API doc
    for the new memslot extensions.
  - mm: introduce memfile_{un}register_backing_store to allow memory
    backing store to register/unregister it from memfile_notifier.
  - mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
    (SHM_F_INACCESSIBLE for shmem) instead. 
  - mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
    MFD_INACCESSIBLE memory.
  - KVM: remove the overlap check for mapping the same file+offset into
    multiple gfns due to perf consideration, warned in document.
v4:
  - mm: rename memfd_ops to memfile_notifier and separate it from
    memfd.c to standalone memfile-notifier.c.
  - KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
    registering multiple memslots to the same memory backing store.
  - KVM: add a 'kvm' reference in memslot so that we can recover kvm in
    memfile_notifier handlers.
  - KVM: add 'private_' prefix for the new fields in memslot.
  - KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
  - Remove 'RFC' prefix.
  - Fix race condition between memfile_notifier handlers and kvm destroy.
  - mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
    setting F_SEAL_INACCESSIBLE when the fd is created.
  - KVM: add the shared part of the memslot back to make private/shared
    pages live in one memslot.

Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com/T/ 
[3] Original design proposal:
https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com/  
[4] Selftest:
https://lore.kernel.org/all/20220819174659.2427983-1-vannapurve@google.com/ 


Chao Peng (7):
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Register/unregister the guest private memory regions
  KVM: Update lpage info when private/shared memory are mixed
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm: Introduce memfd_restricted system call to create restricted user
    memory

 Documentation/virt/kvm/api.rst         |  88 ++++-
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/asm/kvm_host.h        |   8 +
 arch/x86/kvm/Kconfig                   |   3 +
 arch/x86/kvm/mmu/mmu.c                 | 170 +++++++++-
 arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
 arch/x86/kvm/mmu/mmutrace.h            |   1 +
 arch/x86/kvm/mmu/spte.h                |   6 +
 arch/x86/kvm/mmu/tdp_mmu.c             |   3 +-
 arch/x86/kvm/x86.c                     |   4 +-
 include/linux/kvm_host.h               |  89 ++++-
 include/linux/restrictedmem.h          |  62 ++++
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/kvm.h               |  38 +++
 include/uapi/linux/magic.h             |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Kconfig                             |   4 +
 mm/Makefile                            |   1 +
 mm/restrictedmem.c                     | 250 ++++++++++++++
 virt/kvm/Kconfig                       |   7 +
 virt/kvm/kvm_main.c                    | 453 +++++++++++++++++++++----
 23 files changed, 1121 insertions(+), 92 deletions(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c


base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354
-- 
2.25.1




* [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-26 17:31   ` Isaku Yamahata
                     ` (5 more replies)
  2022-10-25 15:13 ` [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
                   ` (8 subsequent siblings)
  9 siblings, 6 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Introduce the 'memfd_restricted' system call, which creates memory areas
that are restricted from ordinary userspace access (e.g. read/write/mmap).
The memory content is expected to be used through the new in-kernel
interface by another kernel component.

memfd_restricted() is useful for scenarios where a file descriptor (fd)
can be used as an interface into mm, but where userspace's ability to
operate on the fd needs to be restricted. Initially it is designed to
provide protection for KVM encrypted guest memory.

Normally KVM uses memfd memory by mmapping the memfd into KVM userspace
(e.g. QEMU) and then using the mmapped virtual address to set up the
mapping in the KVM secondary page table (e.g. EPT). With confidential
computing technologies like Intel TDX, the memfd memory may be encrypted
with a key dedicated to a particular software domain (e.g. a KVM guest)
and is not expected to be directly accessed by userspace. More
precisely, userspace access to such encrypted memory may lead to a host
crash, so it should be prevented.

memfd_restricted() provides the semantics required for KVM guest
encrypted memory support: an fd created with memfd_restricted() is used
as the source of guest memory in a confidential computing environment,
and KVM can interact directly with core mm without the need to expose
the memory content to KVM userspace.

KVM userspace is still in charge of the lifecycle of the fd. It should
pass the created fd to KVM, and KVM uses the new restrictedmem_get_page()
to obtain the physical memory page and then populate the KVM secondary
page table entries.

The restricted memfd can be fallocate()-ed or hole-punched from
userspace. When these operations happen, KVM is notified through
restrictedmem_notifier and gets the chance to remove any mapped entries
of the range in the secondary page tables.
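
For illustration, a hedged userspace-side sketch of freeing a range of
restricted memory; the page-alignment requirement comes from
restrictedmem_fallocate() below, and the helper name is hypothetical:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <fcntl.h>

  /* Illustrative only: punch a hole in the restricted fd, which triggers
   * the restrictedmem_notifier invalidation callbacks described above.
   * offset/len must be page aligned. */
  static int free_restricted_range(int restricted_fd, off_t offset, off_t len)
  {
          if (fallocate(restricted_fd,
                        FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        offset, len))
                  return -errno;
          return 0;
  }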

memfd_restricted() itself is implemented as a shim layer on top of real
memory file systems (currently tmpfs). Pages in restrictedmem are marked
as unmovable and unevictable; this is required for the current
confidential usage, but may be changed in the future.

By default memfd_restricted() prevents userspace read, write and mmap.
By defining new bits in 'flags', it can be extended to support other
restricted semantics in the future.

The system call is currently wired up for the x86 arch.
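
A hedged userspace sketch of creating and sizing such an fd (there is no
libc wrapper; the syscall number is the one added by this patch for x86,
the helper name is illustrative, and sizing goes through the setattr
path implemented below):

  #define _GNU_SOURCE
  #include <unistd.h>
  #include <sys/syscall.h>

  #ifndef __NR_memfd_restricted
  #define __NR_memfd_restricted 451     /* per this patch, x86 */
  #endif

  /* Illustrative only: create a restricted memfd and set its size once.
   * size must be page aligned; read()/write()/mmap() on the fd are
   * expected to fail. */
  static int memfd_restricted_create(size_t size)
  {
          long fd = syscall(__NR_memfd_restricted, 0);

          if (fd < 0)
                  return (int)fd;
          if (ftruncate((int)fd, size)) {
                  close((int)fd);
                  return -1;
          }
          return (int)fd;
  }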

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/restrictedmem.h          |  62 ++++++
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/magic.h             |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Kconfig                             |   4 +
 mm/Makefile                            |   1 +
 mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
 10 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..dc70ba90247e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	i386	memfd_restricted	sys_memfd_restricted
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..06516abc8318 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	memfd_restricted	sys_memfd_restricted
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
new file mode 100644
index 000000000000..9c37c3ea3180
--- /dev/null
+++ b/include/linux/restrictedmem.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_RESTRICTEDMEM_H
+#define _LINUX_RESTRICTEDMEM_H
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/pfn_t.h>
+
+struct restrictedmem_notifier;
+
+struct restrictedmem_notifier_ops {
+	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
+				 pgoff_t start, pgoff_t end);
+	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
+			       pgoff_t start, pgoff_t end);
+};
+
+struct restrictedmem_notifier {
+	struct list_head list;
+	const struct restrictedmem_notifier_ops *ops;
+};
+
+#ifdef CONFIG_RESTRICTEDMEM
+
+void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier);
+void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+			   struct page **pagep, int *order);
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
+}
+
+#else
+
+static inline void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
+					 struct page **pagep, int *order)
+{
+	return -1;
+}
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return false;
+}
+
+#endif /* CONFIG_RESTRICTEDMEM */
+
+#endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..f9e9e0c820c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_memfd_restricted(unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..e93cd35e46d0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_memfd_restricted 451
+__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..8aa38324b90a 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..7c4a32cbd2e7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* memfd_restricted */
+COND_SYSCALL(memfd_restricted);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/mm/Kconfig b/mm/Kconfig
index 0331f1461f81..0177d53676c7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1076,6 +1076,10 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config RESTRICTEDMEM
+	bool
+	depends on TMPFS
+
 config ANON_VMA_NAME
 	bool "Anonymous VMA name support"
 	depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..6cb6403ffd40 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_SECRETMEM) += secretmem.o
+obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
 obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
new file mode 100644
index 000000000000..e5bf8907e0f8
--- /dev/null
+++ b/mm/restrictedmem.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/sbitmap.h"
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <linux/syscalls.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+#include <linux/restrictedmem.h>
+
+struct restrictedmem_data {
+	struct mutex lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
+				 pgoff_t start, pgoff_t end, bool notify_start)
+{
+	struct restrictedmem_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		if (notify_start)
+			notifier->ops->invalidate_start(notifier, start, end);
+		else
+			notifier->ops->invalidate_end(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static int restrictedmem_release(struct inode *inode, struct file *file)
+{
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long restrictedmem_fallocate(struct file *file, int mode,
+				    loff_t offset, loff_t len)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+			return -EINVAL;
+	}
+
+	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
+	return ret;
+}
+
+static const struct file_operations restrictedmem_fops = {
+	.release = restrictedmem_release,
+	.fallocate = restrictedmem_fallocate,
+};
+
+static int restrictedmem_getattr(struct user_namespace *mnt_userns,
+				 const struct path *path, struct kstat *stat,
+				 u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int restrictedmem_setattr(struct user_namespace *mnt_userns,
+				 struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size)
+			return -EPERM;
+
+		if (!PAGE_ALIGNED(attr->ia_size))
+			return -EINVAL;
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+	return ret;
+}
+
+static const struct inode_operations restrictedmem_iops = {
+	.getattr = restrictedmem_getattr,
+	.setattr = restrictedmem_setattr,
+};
+
+static int restrictedmem_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type restrictedmem_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "memfd:restrictedmem",
+	.init_fs_context = restrictedmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *restrictedmem_mnt;
+
+static __init int restrictedmem_init(void)
+{
+	restrictedmem_mnt = kern_mount(&restrictedmem_fs);
+	if (IS_ERR(restrictedmem_mnt))
+		return PTR_ERR(restrictedmem_mnt);
+	return 0;
+}
+fs_initcall(restrictedmem_init);
+
+static struct file *restrictedmem_file_create(struct file *memfd)
+{
+	struct restrictedmem_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	mutex_init(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &restrictedmem_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, restrictedmem_mnt,
+				 "restrictedmem", O_RDWR,
+				 &restrictedmem_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+		return ERR_CAST(file);
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
+{
+	struct file *file, *restricted_file;
+	int fd, err;
+
+	if (flags)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0)
+		return fd;
+
+	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_fd;
+	}
+	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+	file->f_flags |= O_LARGEFILE;
+
+	restricted_file = restrictedmem_file_create(file);
+	if (IS_ERR(restricted_file)) {
+		err = PTR_ERR(restricted_file);
+		fput(file);
+		goto err_fd;
+	}
+
+	fd_install(fd, restricted_file);
+	return fd;
+err_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
+
+void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_del(&notifier->list);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+			   struct page **pagep, int *order)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	*pagep = page;
+	if (order)
+		*order = thp_order(compound_head(page));
+
+	SetPageUptodate(page);
+	unlock_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(restrictedmem_get_page);
-- 
2.25.1




* [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-27 10:25   ` Fuad Tabba
                     ` (2 more replies)
  2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

In memory encryption usage, guest memory may be encrypted with a special
key and can be accessed only by the guest itself. We call such memory
private memory. Allowing userspace to access guest private memory has no
value and can sometimes cause problems. This new KVM memslot extension
allows guest private memory to be provided through a restrictedmem-backed
file descriptor (fd), with userspace restricted from accessing the
memory content in the fd.

This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields, restricted_fd/restricted_offset, to allow
userspace to instruct KVM to provide guest memory through restricted_fd.
'guest_phys_addr' is mapped at 'restricted_offset' of 'restricted_fd'
and the size is 'memory_size'.

The extended memslot can still have the userspace_addr (hva). When in
use, a single memslot can maintain both private memory through
restricted_fd and shared memory through userspace_addr. Whether the
private or the shared part is visible to the guest is maintained by
other KVM code.

A restrictedmem_notifier field is also added to the memslot structure to
allow the restricted_fd's backing store to notify KVM of memory changes,
so that KVM can then invalidate its page table entries.

Together with this change, a new config HAVE_KVM_RESTRICTED_MEM is
added; right now it is selected for X86_64 only. A KVM_CAP_PRIVATE_MEM
capability is also introduced to indicate KVM support for
KVM_MEM_PRIVATE.

To make code maintenance easier, internally we use a binary-compatible
alias struct, kvm_user_mem_region, to handle both the normal and the
'_ext' variants.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 48 ++++++++++++++++++++++++++++-----
 arch/x86/kvm/Kconfig           |  2 ++
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       | 13 +++++++--
 include/uapi/linux/kvm.h       | 29 ++++++++++++++++++++
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
 7 files changed, 128 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index eee9f857a986..f3fa75649a78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The kvm_userspace_memory_region_ext struct includes all fields of the
+kvm_userspace_memory_region struct, and adds additional fields for some other
+features. See the description of the flags field below for more information.
+It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
+  within the slot.  For more details, see KVM_GET_DIRTY_LOG ioctl.
+
+- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
+  read-only.  In this case, writes to this memory will be posted to userspace as
+  KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE, if KVM_CAP_PRIVATE_MEM allows, to indicate a new slot has
+  private memory backed by a file descriptor (fd) and userspace access to the
+  fd may be restricted. Userspace should use restricted_fd/restricted_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
+  to the guest. Userspace should guarantee not to map the same pfn indicated by
+  restricted_fd/restricted_offset to different gfns with multiple memslots.
+  Failure to do so may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -8215,6 +8239,16 @@ structure.
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
+8.36 KVM_CAP_PRIVATE_MEM
+------------------------
+
+:Architectures: x86
+
+This capability indicates that private memory is supported and userspace can
+set KVM_MEM_PRIVATE flag for KVM_SET_USER_MEMORY_REGION ioctl.  See
+KVM_SET_USER_MEMORY_REGION for details on the usage of KVM_MEM_PRIVATE and
+kvm_userspace_memory_region_ext fields.
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 67be7f217e37..8d2bd455c0cd 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,8 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_RESTRICTED_MEM if X86_64
+	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4bd5f8a751de..02ad31f46dd7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12425,7 +12425,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 32f259fa5801..739a7562a1f3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/restrictedmem.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -575,8 +576,16 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *restricted_file;
+	loff_t restricted_offset;
+	struct restrictedmem_notifier notifier;
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -1103,9 +1112,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..f1ae45c10c94 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 restricted_offset;
+	__u32 restricted_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1178,6 +1206,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_PRIVATE_MEM 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..9ff164c7e0cc 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -86,3 +86,6 @@ config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
        bool
+
+config HAVE_KVM_RESTRICTED_MEM
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e30f1b4ecfa5..8dace78a0278 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4627,6 +4627,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
+		     sizeof_field(struct kvm_userspace_memory_region, field));	\
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
+do {											\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
+		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
+		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4650,14 +4677,20 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+
+		kvm_sanity_check_user_mem_region_alias();
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if (mem.flags & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.25.1




* [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
  2022-10-25 15:13 ` [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-25 15:26   ` Peter Maydell
                     ` (3 more replies)
  2022-10-25 15:13 ` [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
                   ` (6 subsequent siblings)
  9 siblings, 4 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM for the guest memory range
[gpa, gpa+size). The 'flags' field includes additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates it happened due to a shared memory access.

When private memory is enabled, this new exit is used by KVM to exit to
userspace for shared <-> private memory conversion in memory encryption
usage. In such usage, there are typically two kinds of memory conversion
(a userspace handling sketch follows the list below):
  - explicit conversion: happens when guest explicitly calls into KVM
    to map a range (as private or shared), KVM then exits to userspace
    to perform the map/unmap operations.
  - implicit conversion: happens in KVM page fault handler where KVM
    exits to userspace for an implicit conversion when the page is in a
    different state than requested (private or shared).
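
The sketch referenced above: a hedged illustration of how a VMM loop
might consume this exit. handle_conversion() is a hypothetical policy
hook, e.g. issuing KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and updating
memslots; it is not code from this series:

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Hypothetical VMM policy hook: flip [gpa, gpa + size) to the
   * requested state. */
  extern void handle_conversion(int vm_fd, __u64 gpa, __u64 size,
                                bool to_private);

  /* 'run' is the mmap()-ed kvm_run structure of the vcpu. */
  static int run_vcpu(int vm_fd, int vcpu_fd, struct kvm_run *run)
  {
          for (;;) {
                  if (ioctl(vcpu_fd, KVM_RUN, NULL) < 0)
                          return -1;

                  if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
                          break;  /* other exit reasons handled elsewhere */

                  handle_conversion(vm_fd, run->memory.gpa, run->memory.size,
                                    run->memory.flags &
                                          KVM_MEMORY_EXIT_FLAG_PRIVATE);
          }
          return 0;
  }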

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++++
 2 files changed, 32 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f3fa75649a78..975688912b8c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+If the exit reason is KVM_EXIT_MEMORY_FAULT, it indicates that the VCPU has
+encountered a memory error which is not handled by the KVM kernel module and
+which userspace may choose to handle. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+   private memory access when the bit is set. Otherwise the memory error is
+   caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range where the error occurred. Userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f1ae45c10c94..fa60b032a405 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -538,6 +539,14 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1




* [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (2 preceding siblings ...)
  2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-27 10:29   ` Fuad Tabba
  2022-11-10 20:06   ` Sean Christopherson
  2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

Currently in the mmu_notifier invalidate path, the hva range is recorded
and then checked against in mmu_invalidate_retry_hva() in the page fault
path. However, for the soon-to-be-introduced private memory, a page
fault may not have an associated hva, so checking the gfn (gpa) makes
more sense.

For the existing non-private memory case, gfn is expected to continue to
work. The only downside is that when multiple gfns alias a single hva,
the current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.

This also fixes a bug in kvm_zap_gfn_range(), which already passes gfns
when calling kvm_mmu_invalidate_begin/end() while these functions accept
hvas in the current code.
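
For context, a hedged sketch of the retry pattern that this patch
converts to gfn; x86-style rwlock locking is assumed, the function name
is illustrative, and the pfn lookup is elided:

  #include <linux/kvm_host.h>

  /*
   * Illustrative sketch only, not the real fault handler.  'pfn' stands
   * in for the result of the hva- or fd-based lookup, which real code
   * resolves outside mmu_lock.
   */
  static int sketch_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
  {
          unsigned long mmu_seq;

  retry:
          /* Sample the invalidation sequence before resolving the pfn. */
          mmu_seq = kvm->mmu_invalidate_seq;
          smp_rmb();

          /* ... pfn lookup happens here in real code ... */

          write_lock(&kvm->mmu_lock);
          if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
                  /* An invalidation covering this gfn raced with us. */
                  write_unlock(&kvm->mmu_lock);
                  goto retry;
          }
          /* ... install 'pfn' into the secondary page table ... */
          write_unlock(&kvm->mmu_lock);
          return 0;
  }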

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 18 +++++++---------
 virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++--------------
 3 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f81539061d6..33b1aec44fb8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4217,7 +4217,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+	       mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 739a7562a1f3..79e5cbc35fcf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -775,8 +775,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
-	unsigned long mmu_invalidate_range_start;
-	unsigned long mmu_invalidate_range_end;
+	gfn_t mmu_invalidate_range_start;
+	gfn_t mmu_invalidate_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1365,10 +1365,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1937,9 +1935,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 					   unsigned long mmu_seq,
-					   unsigned long hva)
+					   gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1949,8 +1947,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
 	if (unlikely(kvm->mmu_invalidate_in_progress) &&
-	    hva >= kvm->mmu_invalidate_range_start &&
-	    hva < kvm->mmu_invalidate_range_end)
+	    gfn >= kvm->mmu_invalidate_range_start &&
+	    gfn < kvm->mmu_invalidate_range_end)
 		return 1;
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8dace78a0278..09c9cdeb773c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
+typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
@@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm, gfn_range.start,
+							    gfn_range.end);
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
@@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
+							    gfn_t end)
 {
-	/*
-	 * The count increase must become visible at unlock time as no
-	 * spte can be established without taking the mmu_lock and
-	 * count is also read inside the mmu_lock critical section.
-	 */
-	kvm->mmu_invalidate_in_progress++;
 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
 		kvm->mmu_invalidate_range_start = start;
 		kvm->mmu_invalidate_range_end = end;
@@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	}
 }
 
+static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	/*
+	 * The count increase must become visible at unlock time as no
+	 * spte can be established without taking the mmu_lock and
+	 * count is also read inside the mmu_lock critical section.
+	 */
+	kvm->mmu_invalidate_in_progress++;
+}
+
+static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	update_invalidate_range(kvm, range->start, range->end);
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	mark_invalidate_in_progress(kvm, start, end);
+	update_invalidate_range(kvm, start, end);
+}
+
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -752,8 +768,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
-		.on_lock	= kvm_mmu_invalidate_begin,
+		.handler	= kvm_mmu_handle_gfn_range,
+		.on_lock	= mark_invalidate_in_progress,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
@@ -791,8 +807,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (3 preceding siblings ...)
  2022-10-25 15:13 ` [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-27 10:31   ` Fuad Tabba
                     ` (3 more replies)
  2022-10-25 15:13 ` [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
                   ` (4 subsequent siblings)
  9 siblings, 4 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

Introduce generic private memory register/unregister by reusing the
existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from
the SEV case by treating the address in the region as a gpa instead of
an hva. Which path these ioctls take is determined by
kvm_arch_has_private_mem(); an architecture that supports
KVM_PRIVATE_MEM should override this function.

KVM internally defaults all guest memory to private memory and only
maintains the shared memory in 'mem_attr_array'. The above ioctls
operate on this field and unmap any existing mappings.
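
For illustration, a minimal userspace sketch of the intended usage (not
part of this patch; 'vm_fd' is an assumed VM file descriptor and error
handling is omitted):

  struct kvm_enc_region region = {
          .addr = 0x100000,       /* guest physical address, not hva */
          .size = 0x200000,
  };

  /* Mark the range shared; it is removed from the default-private set. */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);

  /* Convert it back to private; existing mappings in KVM are zapped. */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);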

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst |  17 ++-
 arch/x86/kvm/Kconfig           |   1 +
 include/linux/kvm_host.h       |  10 +-
 virt/kvm/Kconfig               |   4 +
 virt/kvm/kvm_main.c            | 227 +++++++++++++++++++++++++--------
 5 files changed, 198 insertions(+), 61 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 975688912b8c..08253cf498d1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
 This ioctl can be used to register a guest memory region which may
 contain encrypted data (e.g. guest RAM, SMRAM etc).
 
-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
+Currently this ioctl supports registering memory regions for two usages:
+private memory and SEV-encrypted memory.
+
+When private memory is enabled, this ioctl is used to register a guest private
+memory region and the addr/size of kvm_enc_region represent the guest physical
+address (GPA). In this usage, this ioctl zaps any existing guest memory
+mappings in KVM that fall into the region.
+
+When SEV-encrypted memory is enabled, this ioctl is used to register a guest
+memory region which may contain encrypted data for a SEV-enabled guest. The
+addr/size of kvm_enc_region represents userspace address (HVA). The SEV
+memory encryption engine uses a tweak such that two identical plaintext pages,
+each at different locations will have differing ciphertexts. So swapping or
 moving ciphertext of those pages will not result in plaintext being
 swapped. So relocating (or migrating) physical backing pages for the SEV
 guest will require some additional steps.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d2bd455c0cd..73fdfa429b20 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -51,6 +51,7 @@ config KVM
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select HAVE_KVM_RESTRICTED_MEM if X86_64
 	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
+	select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 79e5cbc35fcf..4ce98fa0153c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -254,6 +255,9 @@ struct kvm_gfn_range {
 	bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+#endif
+
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -794,6 +798,9 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+	struct xarray mem_attr_array;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
@@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 9ff164c7e0cc..69ca59e82149 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
 
 config HAVE_KVM_RESTRICTED_MEM
        bool
+
+config KVM_GENERIC_PRIVATE_MEM
+       bool
+       depends on HAVE_KVM_RESTRICTED_MEM
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 09c9cdeb773c..fc3835826ace 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
 
+static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
+							    gfn_t end)
+{
+	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+		kvm->mmu_invalidate_range_start = start;
+		kvm->mmu_invalidate_range_end = end;
+	} else {
+		/*
+		 * Fully tracking multiple concurrent ranges has diminishing
+		 * returns. Keep things simple and just find the minimal range
+		 * which includes the current and new ranges. As there won't be
+		 * enough information to subtract a range after its invalidate
+		 * completes, any ranges invalidated concurrently will
+		 * accumulate and persist until all outstanding invalidates
+		 * complete.
+		 */
+		kvm->mmu_invalidate_range_start =
+			min(kvm->mmu_invalidate_range_start, start);
+		kvm->mmu_invalidate_range_end =
+			max(kvm->mmu_invalidate_range_end, end);
+	}
+}
+
+static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	/*
+	 * The count increase must become visible at unlock time as no
+	 * spte can be established without taking the mmu_lock and
+	 * count is also read inside the mmu_lock critical section.
+	 */
+	kvm->mmu_invalidate_in_progress++;
+}
+
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	mark_invalidate_in_progress(kvm, start, end);
+	update_invalidate_range(kvm, start, end);
+}
+
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	/*
+	 * This sequence increase will notify the kvm page fault that
+	 * the page that is going to be mapped in the spte could have
+	 * been freed.
+	 */
+	kvm->mmu_invalidate_seq++;
+	smp_wmb();
+	/*
+	 * The above sequence increase must be visible before the
+	 * below count decrease, which is ensured by the smp_wmb above
+	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
+	 */
+	kvm->mmu_invalidate_in_progress--;
+}
+
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
@@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
-							    gfn_t end)
-{
-	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
-		kvm->mmu_invalidate_range_start = start;
-		kvm->mmu_invalidate_range_end = end;
-	} else {
-		/*
-		 * Fully tracking multiple concurrent ranges has diminishing
-		 * returns. Keep things simple and just find the minimal range
-		 * which includes the current and new ranges. As there won't be
-		 * enough information to subtract a range after its invalidate
-		 * completes, any ranges invalidated concurrently will
-		 * accumulate and persist until all outstanding invalidates
-		 * complete.
-		 */
-		kvm->mmu_invalidate_range_start =
-			min(kvm->mmu_invalidate_range_start, start);
-		kvm->mmu_invalidate_range_end =
-			max(kvm->mmu_invalidate_range_end, end);
-	}
-}
-
-static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
-{
-	/*
-	 * The count increase must become visible at unlock time as no
-	 * spte can be established without taking the mmu_lock and
-	 * count is also read inside the mmu_lock critical section.
-	 */
-	kvm->mmu_invalidate_in_progress++;
-}
-
 static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	update_invalidate_range(kvm, range->start, range->end);
 	return kvm_unmap_gfn_range(kvm, range);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
-{
-	mark_invalidate_in_progress(kvm, start, end);
-	update_invalidate_range(kvm, start, end);
-}
-
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
-{
-	/*
-	 * This sequence increase will notify the kvm page fault that
-	 * the page that is going to be mapped in the spte could have
-	 * been freed.
-	 */
-	kvm->mmu_invalidate_seq++;
-	smp_wmb();
-	/*
-	 * The above sequence increase must be visible before the
-	 * below count decrease, which is ensured by the smp_wmb above
-	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
-	 */
-	kvm->mmu_invalidate_in_progress--;
-}
-
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+
+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_gfn_range gfn_range;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	int i;
+	int r = 0;
+
+	gfn_range.pte = __pte(0);
+	gfn_range.may_block = true;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+			slot = iter.slot;
+			gfn_range.start = max(start, slot->base_gfn);
+			gfn_range.end = min(end, slot->base_gfn + slot->npages);
+			if (gfn_range.start >= gfn_range.end)
+				continue;
+			gfn_range.slot = slot;
+
+			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+		}
+	}
+
+	if (r)
+		kvm_flush_remote_tlbs(kvm);
+}
+
+#define KVM_MEM_ATTR_SHARED	0x0001
+static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
+				     bool is_private)
+{
+	gfn_t start, end;
+	unsigned long i;
+	void *entry;
+	int idx;
+	int r = 0;
+
+	if (size == 0 || gpa + size < gpa)
+		return -EINVAL;
+	if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
+		return -EINVAL;
+
+	start = gpa >> PAGE_SHIFT;
+	end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
+
+	/*
+	 * Guest memory defaults to private, kvm->mem_attr_array only stores
+	 * shared memory.
+	 */
+	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
+
+	idx = srcu_read_lock(&kvm->srcu);
+	KVM_MMU_LOCK(kvm);
+	kvm_mmu_invalidate_begin(kvm, start, end);
+
+	for (i = start; i < end; i++) {
+		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+				    GFP_KERNEL_ACCOUNT));
+		if (r)
+			goto err;
+	}
+
+	kvm_unmap_mem_range(kvm, start, end);
+
+	goto ret;
+err:
+	for (; i > start; i--)
+		xa_erase(&kvm->mem_attr_array, i);
+ret:
+	kvm_mmu_invalidate_end(kvm, start, end);
+	KVM_MMU_UNLOCK(kvm);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	return r;
+}
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+	xa_init(&kvm->mem_attr_array);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+	xa_destroy(&kvm->mem_attr_array);
+#endif
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_free_vm(kvm);
@@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
+bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
+{
+	return false;
+}
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
+
+		if (!kvm_arch_has_private_mem(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
+					      region.size, set);
+		break;
+	}
+#endif
 	case KVM_GET_DIRTY_LOG: {
 		struct kvm_dirty_log log;
 
@@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
 	default:
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+arch_vm_ioctl:
+#endif
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
 out:
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (4 preceding siblings ...)
  2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-26 20:46   ` Isaku Yamahata
  2022-11-08 12:08   ` Yuan Yao
  2022-10-25 15:13 ` [PATCH v9 7/8] KVM: Handle page fault for private memory Chao Peng
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

When private and shared memory are mixed in a large page, the lpage_info
may not be accurate and should be updated with this mixed info. A large
page that contains mixed pages can't really be mapped as a large page
since its private and shared pages come from different physical memory.

Update lpage_info when the private/shared memory attribute is changed.
If both private and shared pages are within a large page region, it
can't be mapped as a large page. It's a bit of a challenge to track the
mixed info in a 'count'-like variable, so this patch instead reserves a
bit in 'disallow_lpage' to indicate that a large page has mixed
private/shared pages.
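
For reference, the resulting encoding of 'disallow_lpage' can be
pictured as follows (an illustrative restatement of the macros added
below, not additional code):

  /*
   * disallow_lpage, for each potential large page in a slot:
   *
   *   bit 31       KVM_LPAGE_PRIVATE_SHARED_MIXED - the large page
   *                covers both private and shared 4K pages and
   *                therefore cannot be mapped as a large page.
   *   bits 0..30   the existing "disallow" reference count, capped
   *                by KVM_LPAGE_COUNT_MAX.
   */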

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |   8 +++
 arch/x86/kvm/mmu/mmu.c          | 112 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   2 +
 include/linux/kvm_host.h        |  19 ++++++
 virt/kvm/kvm_main.c             |  16 +++--
 5 files changed, 152 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7551b6f9c31c..db811a54e3fd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 #include <asm/hyperv-tlfs.h>
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
 
 #define KVM_MAX_VCPUS 1024
 
@@ -952,6 +953,13 @@ struct kvm_vcpu_arch {
 #endif
 };
 
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits are used as a reference count.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
+#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
+
 struct kvm_lpage_info {
 	int disallow_lpage;
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 33b1aec44fb8..67a9823a8c35 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -762,11 +762,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 {
 	struct kvm_lpage_info *linfo;
 	int i;
+	int disallow_count;
 
 	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 		linfo = lpage_info_slot(gfn, slot, i);
+
+		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+		WARN_ON(disallow_count + count < 0 ||
+			disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
 		linfo->disallow_lpage += count;
-		WARN_ON(linfo->disallow_lpage < 0);
 	}
 }
 
@@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_lpage_recovery_thread)
 		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
 }
+
+static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
+{
+	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool mixed)
+{
+	if (mixed)
+		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+	else
+		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
+				 gfn_t start, gfn_t end)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	gfn_t gfn = start;
+	void *entry;
+	bool shared = attr == KVM_MEM_ATTR_SHARED;
+	bool mixed = false;
+
+	rcu_read_lock();
+	entry = xas_load(&xas);
+	while (gfn < end) {
+		if (xas_retry(&xas, entry))
+			continue;
+
+		KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+		if ((entry && !shared) || (!entry && shared)) {
+			mixed = true;
+			goto out;
+		}
+
+		entry = xas_next(&xas);
+		gfn++;
+	}
+out:
+	rcu_read_unlock();
+	return mixed;
+}
+
+static bool mem_attr_is_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
+			      int level, unsigned int attr,
+			      gfn_t start, gfn_t end)
+{
+	unsigned long gfn;
+	void *entry;
+
+	if (level == PG_LEVEL_2M)
+		return mem_attr_is_mixed_2m(kvm, attr, start, end);
+
+	entry = xa_load(&kvm->mem_attr_array, start);
+	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)))
+			return true;
+		if (xa_load(&kvm->mem_attr_array, gfn) != entry)
+			return true;
+	}
+	return false;
+}
+
+void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
+			      unsigned int attr, gfn_t start, gfn_t end)
+{
+
+	unsigned long lpage_start, lpage_end;
+	unsigned long gfn, pages, mask;
+	int level;
+
+	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
+			"Unsupported mem attribute.\n");
+
+	/*
+	 * The sequence matters here: we update the higher level basing on the
+	 * lower level's scanning result.
+	 */
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		pages = KVM_PAGES_PER_HPAGE(level);
+		mask = ~(pages - 1);
+		lpage_start = max(start & mask, slot->base_gfn);
+		lpage_end = (end - 1) & mask;
+
+		/*
+		 * We only need to scan the head and tail page, for middle pages
+		 * we know they are not mixed.
+		 */
+		linfo_update_mixed(lpage_info_slot(lpage_start, slot, level),
+				   mem_attr_is_mixed(kvm, slot, level, attr,
+						     lpage_start, start));
+
+		if (lpage_start == lpage_end)
+			return;
+
+		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
+			linfo_update_mixed(lpage_info_slot(gfn, slot, level),
+					   false);
+
+		linfo_update_mixed(lpage_info_slot(lpage_end, slot, level),
+				   mem_attr_is_mixed(kvm, slot, level, attr,
+						     end, lpage_end + pages));
+	}
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02ad31f46dd7..4276ca73bd7b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12563,6 +12563,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
 			linfo[lpages - 1].disallow_lpage = 1;
 		ugfn = slot->userspace_addr >> PAGE_SHIFT;
+		if (kvm_slot_can_be_private(slot))
+			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
 		/*
 		 * If the gfn and userspace address are not aligned wrt each
 		 * other, disable large page support for this slot.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ce98fa0153c..6ce36065532c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2284,4 +2284,23 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+
+#define KVM_MEM_ATTR_SHARED	0x0001
+#define KVM_MEM_ATTR_PRIVATE	0x0002
+
+#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
+void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
+			      unsigned int attr, gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_update_mem_attr(struct kvm *kvm,
+					    struct kvm_memory_slot *slot,
+					    unsigned int attr,
+					    gfn_t start, gfn_t end)
+{
+}
+#endif
+
+#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fc3835826ace..13a37b4d9e97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -939,7 +939,8 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
 
-static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
+				unsigned int attr)
 {
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
@@ -963,6 +964,7 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
 			gfn_range.slot = slot;
 
 			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+			kvm_arch_update_mem_attr(kvm, slot, attr, start, end);
 		}
 	}
 
@@ -970,7 +972,6 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
 		kvm_flush_remote_tlbs(kvm);
 }
 
-#define KVM_MEM_ATTR_SHARED	0x0001
 static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 				     bool is_private)
 {
@@ -979,6 +980,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 	void *entry;
 	int idx;
 	int r = 0;
+	unsigned int attr;
 
 	if (size == 0 || gpa + size < gpa)
 		return -EINVAL;
@@ -992,7 +994,13 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 	 * Guest memory defaults to private, kvm->mem_attr_array only stores
 	 * shared memory.
 	 */
-	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
+	if (is_private) {
+		attr = KVM_MEM_ATTR_PRIVATE;
+		entry = NULL;
+	} else {
+		attr = KVM_MEM_ATTR_SHARED;
+		entry = xa_mk_value(KVM_MEM_ATTR_SHARED);
+	}
 
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
@@ -1005,7 +1013,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 			goto err;
 	}
 
-	kvm_unmap_mem_range(kvm, start, end);
+	kvm_unmap_mem_range(kvm, start, end, attr);
 
 	goto ret;
 err:
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (5 preceding siblings ...)
  2022-10-25 15:13 ` [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-26 21:54   ` Isaku Yamahata
  2022-11-16 20:50   ` Ackerley Tng
  2022-10-25 15:13 ` [PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

A memslot with KVM_MEM_PRIVATE set can include both fd-based private
memory and hva-based shared memory. Architecture code (like TDX code)
can tell whether the ongoing fault is private or not. This patch adds
an 'is_private' field to kvm_page_fault to indicate this, and
architecture code is expected to set it.

To handle a page fault for such a memslot, the handling logic differs
depending on whether the fault is private or shared. KVM checks whether
'is_private' matches the host's view of the page (maintained in
mem_attr_array).
  - On a successful match, the private pfn is obtained with
    restrictedmem_get_page() from the private fd and the shared pfn is
    obtained with the existing get_user_pages().
  - On a mismatch, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
    userspace. Userspace can then convert the memory between
    private/shared in the host's view and retry the fault (see the
    sketch after this list).
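
A sketch of how a VMM might service the mismatch case (illustration
only; 'vm_fd' is an assumed VM file descriptor, includes and error
handling are omitted, the exit layout comes from the
KVM_EXIT_MEMORY_FAULT patch earlier in this series and the conversion
ioctls are the ones wired up in the previous patch):

  /* Sketch: convert the faulting range to the guest-requested state. */
  static void handle_memory_fault(int vm_fd, struct kvm_run *run)
  {
          struct kvm_enc_region region = {
                  .addr = run->memory.gpa,
                  .size = run->memory.size,
          };
          bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

          /* REG_REGION marks the range private, UNREG_REGION marks it shared. */
          ioctl(vm_fd, to_private ? KVM_MEMORY_ENCRYPT_REG_REGION
                                  : KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);
          /* The caller re-enters KVM_RUN and the fault is retried. */
  }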

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 56 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++-
 arch/x86/kvm/mmu/mmutrace.h     |  1 +
 arch/x86/kvm/mmu/spte.h         |  6 ++++
 arch/x86/kvm/mmu/tdp_mmu.c      |  3 +-
 include/linux/kvm_host.h        | 28 +++++++++++++++++
 6 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 67a9823a8c35..10017a9f26ee 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level)
+			      int max_level, bool is_private)
 {
 	struct kvm_lpage_info *linfo;
 	int host_level;
@@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			break;
 	}
 
+	if (is_private)
+		return max_level;
+
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
@@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * level, which will be used to do precise, accurate accounting.
 	 */
 	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-						     fault->gfn, fault->max_level);
+						     fault->gfn, fault->max_level,
+						     fault->is_private);
 	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
 		return;
 
@@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
+static inline u8 order_to_level(int order)
+{
+	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+		return PG_LEVEL_1G;
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+		return PG_LEVEL_2M;
+
+	return PG_LEVEL_4K;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
+{
+	int order;
+	struct kvm_memory_slot *slot = fault->slot;
+
+	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+		return RET_PF_RETRY;
+
+	fault->max_level = min(order_to_level(order), fault->max_level);
+	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+	return RET_PF_CONTINUE;
+}
+
 static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
@@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			return RET_PF_EMULATE;
 	}
 
+	if (kvm_slot_can_be_private(slot) &&
+	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+		if (fault->is_private)
+			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+		else
+			vcpu->run->memory.flags = 0;
+		vcpu->run->memory.padding = 0;
+		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+		vcpu->run->memory.size = PAGE_SIZE;
+		return RET_PF_USER;
+	}
+
+	if (fault->is_private)
+		return kvm_faultin_pfn_private(fault);
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
 					  fault->write, &fault->map_writable,
@@ -5557,6 +5603,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 			return -EIO;
 	}
 
+	if (r == RET_PF_USER)
+		return 0;
+
 	if (r < 0)
 		return r;
 	if (r != RET_PF_EMULATE)
@@ -6408,7 +6457,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		 */
 		if (sp->role.direct &&
 		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
-							       PG_LEVEL_NUM)) {
+							       PG_LEVEL_NUM,
+							       false)) {
 			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
 
 			if (kvm_available_flush_tlb_with_range())
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 582def531d4d..5cdff5ca546c 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -188,6 +188,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
+	const bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
@@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
@@ -252,6 +254,7 @@ enum {
 	RET_PF_RETRY,
 	RET_PF_EMULATE,
 	RET_PF_INVALID,
+	RET_PF_USER,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
 };
@@ -309,7 +312,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn,
-			      int max_level);
+			      int max_level, bool is_private);
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
 
@@ -318,4 +321,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+					gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
 TRACE_DEFINE_ENUM(RET_PF_RETRY);
 TRACE_DEFINE_ENUM(RET_PF_EMULATE);
 TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
 TRACE_DEFINE_ENUM(RET_PF_FIXED);
 TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 7670c13ce251..9acdf72537ce 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
 	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_private_spte(u64 spte)
+{
+	/* FIXME: Query C-bit/S-bit for SEV/TDX. */
+	return false;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
 				int level)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 672f0432d777..9f97aac90606 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1768,7 +1768,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-							      iter.gfn, PG_LEVEL_NUM);
+						iter.gfn, PG_LEVEL_NUM,
+						is_private_spte(iter.old_spte));
 		if (max_mapping_level < iter.level)
 			continue;
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6ce36065532c..69300fc6d572 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2301,6 +2301,34 @@ static inline void kvm_arch_update_mem_attr(struct kvm *kvm,
 }
 #endif
 
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return !xa_load(&kvm->mem_attr_array, gfn);
+}
+
+#else /* !CONFIG_KVM_GENERIC_PRIVATE_MEM */
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
+
 #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
 
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+					gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	int ret;
+	struct page *page;
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->restricted_offset >> PAGE_SHIFT);
+
+	ret = restrictedmem_get_page(slot->restricted_file, index,
+				     &page, order);
+	*pfn = page_to_pfn(page);
+	return ret;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
 #endif
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (6 preceding siblings ...)
  2022-10-25 15:13 ` [PATCH v9 7/8] KVM: Handle page fault for private memory Chao Peng
@ 2022-10-25 15:13 ` Chao Peng
  2022-10-27 10:31   ` Fuad Tabba
  2022-11-03 12:13 ` [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Vishal Annapurve
  2022-11-14 11:43 ` Alex Bennée
  9 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-10-25 15:13 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

Expose KVM_MEM_PRIVATE and the memslot fields restricted_fd/offset to
userspace. KVM registers/unregisters the private memslot with the
fd-based memory backing store and responds to invalidation events from
restrictedmem_notifier by zapping the existing memory mappings in the
secondary page table.

Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
by architecture code, which can turn it on by overriding the default
kvm_arch_has_private_mem().

A 'kvm' reference is added to the memslot structure since in the
restrictedmem_notifier callback we can only obtain a memslot reference,
but 'kvm' is needed to do the zapping.
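
For illustration, a userspace sketch of creating a private memslot (not
part of this patch; the exact kvm_userspace_memory_region_ext layout is
defined in an earlier patch of this series and is assumed here,
'vm_fd'/'shared_va'/'mem_size' are placeholders, and error handling is
omitted):

  /* Restricted fd from the memfd_restricted() syscall added in patch 1. */
  int restricted_fd = syscall(__NR_memfd_restricted, 0);

  struct kvm_userspace_memory_region_ext region_ext = {
          .region = {
                  .slot = 0,
                  .flags = KVM_MEM_PRIVATE,
                  .guest_phys_addr = 0,
                  .memory_size = mem_size,
                  /* hva backing the shared part of the slot */
                  .userspace_addr = (__u64)shared_va,
          },
          .restricted_fd = restricted_fd,
          .restricted_offset = 0,
  };

  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);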

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |   3 +-
 virt/kvm/kvm_main.c      | 174 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 171 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 69300fc6d572..e27d62c30484 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -246,7 +246,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 
-#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_HAVE_KVM_RESTRICTED_MEM)
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -583,6 +583,7 @@ struct kvm_memory_slot {
 	struct file *restricted_file;
 	loff_t restricted_offset;
 	struct restrictedmem_notifier notifier;
+	struct kvm *kvm;
 };
 
 static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13a37b4d9e97..dae6a2c196ad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1028,6 +1028,111 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 }
 #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
 
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
+					 pgoff_t start, pgoff_t end,
+					 gfn_t *gfn_start, gfn_t *gfn_end)
+{
+	unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+
+	if (start > base_pgoff)
+		*gfn_start = slot->base_gfn + start - base_pgoff;
+	else
+		*gfn_start = slot->base_gfn;
+
+	if (end < base_pgoff + slot->npages)
+		*gfn_end = slot->base_gfn + end - base_pgoff;
+	else
+		*gfn_end = slot->base_gfn + slot->npages;
+
+	if (*gfn_start >= *gfn_end)
+		return false;
+
+	return true;
+}
+
+static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *notifier,
+					       pgoff_t start, pgoff_t end)
+{
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	struct kvm *kvm = slot->kvm;
+	gfn_t gfn_start, gfn_end;
+	struct kvm_gfn_range gfn_range;
+	int idx;
+
+	if (!restrictedmem_range_is_valid(slot, start, end,
+						&gfn_start, &gfn_end))
+		return;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	KVM_MMU_LOCK(kvm);
+
+	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+
+	gfn_range.start = gfn_start;
+	gfn_range.end = gfn_end;
+	gfn_range.slot = slot;
+	gfn_range.pte = __pte(0);
+	gfn_range.may_block = true;
+
+	if (kvm_unmap_gfn_range(kvm, &gfn_range))
+		kvm_flush_remote_tlbs(kvm);
+
+	KVM_MMU_UNLOCK(kvm);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier *notifier,
+					     pgoff_t start, pgoff_t end)
+{
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	struct kvm *kvm = slot->kvm;
+	gfn_t gfn_start, gfn_end;
+
+	if (!restrictedmem_range_is_valid(slot, start, end,
+						&gfn_start, &gfn_end))
+		return;
+
+	KVM_MMU_LOCK(kvm);
+	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+	KVM_MMU_UNLOCK(kvm);
+}
+
+static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
+	.invalidate_start = kvm_restrictedmem_invalidate_begin,
+	.invalidate_end = kvm_restrictedmem_invalidate_end,
+};
+
+static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
+{
+	slot->notifier.ops = &kvm_restrictedmem_notifier_ops;
+	restrictedmem_register_notifier(slot->restricted_file, &slot->notifier);
+}
+
+static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
+{
+	restrictedmem_unregister_notifier(slot->restricted_file,
+					  &slot->notifier);
+}
+
+#else /* !CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
+static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
+static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -1072,6 +1177,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE) {
+		kvm_restrictedmem_unregister(slot);
+		fput(slot->restricted_file);
+	}
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1643,10 +1753,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
 	return false;
 }
 
-static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+	if (kvm_arch_has_private_mem(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+#endif
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1722,6 +1838,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 {
 	int r;
 
+	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+		kvm_restrictedmem_register(new);
+
 	/*
 	 * If dirty logging is disabled, nullify the bitmap; the old bitmap
 	 * will be freed on "commit".  If logging is enabled in both old and
@@ -1750,6 +1869,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 	if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
 		kvm_destroy_dirty_bitmap(new);
 
+	if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+		kvm_restrictedmem_unregister(new);
+
 	return r;
 }
 
@@ -2047,7 +2169,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -2066,6 +2188,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_PRIVATE &&
+		(mem->restricted_offset & (PAGE_SIZE - 1) ||
+		 mem->restricted_offset > U64_MAX - mem->memory_size))
+		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2104,6 +2230,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2132,10 +2261,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		new->restricted_file = fget(mem->restricted_fd);
+		if (!new->restricted_file ||
+		    !file_is_restrictedmem(new->restricted_file)) {
+			r = -EINVAL;
+			goto out;
+		}
+		new->restricted_offset = mem->restricted_offset;
+	}
+
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out;
+
+	return 0;
+
+out:
+	if (new->restricted_file)
+		fput(new->restricted_file);
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -4604,6 +4751,11 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 		return 1;
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+	case KVM_CAP_PRIVATE_MEM:
+		return 1;
+#endif
+
 	default:
 		break;
 	}
@@ -4795,16 +4947,28 @@ static long kvm_vm_ioctl(struct file *filp,
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
 		struct kvm_user_mem_region mem;
-		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+		unsigned int flags_offset = offsetof(typeof(mem), flags);
+		unsigned long size;
+		u32 flags;
 
 		kvm_sanity_check_user_mem_region_alias();
 
+		memset(&mem, 0, sizeof(mem));
+
 		r = -EFAULT;
+		if (get_user(flags, (u32 __user *)(argp + flags_offset)))
+			goto out;
+
+		if (flags & KVM_MEM_PRIVATE)
+			size = sizeof(struct kvm_userspace_memory_region_ext);
+		else
+			size = sizeof(struct kvm_userspace_memory_region);
+
 		if (copy_from_user(&mem, argp, size))
 			goto out;
 
 		r = -EINVAL;
-		if (mem.flags & KVM_MEM_PRIVATE)
+		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
 			goto out;
 
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-10-25 15:26   ` Peter Maydell
  2022-10-25 16:17     ` Sean Christopherson
  2022-10-27 10:27   ` Fuad Tabba
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 101+ messages in thread
From: Peter Maydell @ 2022-10-25 15:26 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, 25 Oct 2022 at 16:21, Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> The flags includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory' where '1'
> indicates error happens due to private memory access and '0' indicates
> error happens due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kind of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
>     to map a range (as private or shared), KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> +                       __u32 flags;
> +                       __u32 padding;
> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> +   private memory access when the bit is set. Otherwise the memory error is
> +   caused by shared memory access when the bit is clear.
> +
> +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> +may handle the error and return to KVM to retry the previous memory access.
> +

What's the difference between this and a plain old MMIO exit ?
Just that we can specify a wider size and some flags ?

-- PMM


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-10-25 15:26   ` Peter Maydell
@ 2022-10-25 16:17     ` Sean Christopherson
  0 siblings, 0 replies; 101+ messages in thread
From: Sean Christopherson @ 2022-10-25 16:17 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022, Peter Maydell wrote:
> On Tue, 25 Oct 2022 at 16:21, Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +                       __u32 flags;
> > +                       __u32 padding;
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> 
> What's the difference between this and a plain old MMIO exit ?
> Just that we can specify a wider size and some flags ?

KVM_EXIT_MMIO is purely for cases where there is no memslot.  KVM_EXIT_MEMORY_FAULT
will be used for scenarios where there is a valid memslot for a GPA, but for
whatever reason KVM cannot map the memslot into the guest.

In this series, the new exit type is used to handle guest-initiated conversions
between shared and private memory.  By design, conversion requires explicit action
from userspace, and so even though KVM has a valid memslot, KVM needs to exit to
userspace to effectively forward the conversion request to userspace.

Long term, I also hope to convert all guest-triggered -EFAULT paths to instead
return KVM_EXIT_MEMORY_FAULT.  At minimum, returning KVM_EXIT_MEMORY_FAULT instead
of -EFAULT will allow KVM to provide userspace with the "bad" GPA when something
goes sideways, e.g. if faulting in the page failed because there's no valid
userspace mapping.

There have also been two potential use cases[1][2], though they both appear to have
been abandoned, where userspace would do something more than just kill the guest
in response to KVM_EXIT_MEMORY_FAULT.

[1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
[2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
@ 2022-10-26 17:31   ` Isaku Yamahata
  2022-10-28  6:12     ` Chao Peng
  2022-10-27 10:20   ` Fuad Tabba
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 101+ messages in thread
From: Isaku Yamahata @ 2022-10-26 17:31 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, isaku.yamahata

On Tue, Oct 25, 2022 at 11:13:37PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);

shmem_getpage() was removed.
https://lkml.kernel.org/r/20220902194653.1739778-34-willy@infradead.org

I needed the following fix to compile.

thanks,

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index e5bf8907e0f8..4694dd5609d6 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -231,13 +231,15 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
 {
        struct restrictedmem_data *data = file->f_mapping->private_data;
        struct file *memfd = data->memfd;
+       struct folio *folio = NULL;
        struct page *page;
        int ret;
 
-       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+       ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
        if (ret)
                return ret;
 
+       page = folio_file_page(folio, offset);
        *pagep = page;
        if (order)
                *order = thp_order(compound_head(page));
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-10-25 15:13 ` [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-10-26 20:46   ` Isaku Yamahata
  2022-10-28  6:38     ` Chao Peng
  2022-11-08 12:08   ` Yuan Yao
  1 sibling, 1 reply; 101+ messages in thread
From: Isaku Yamahata @ 2022-10-26 20:46 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, isaku.yamahata

On Tue, Oct 25, 2022 at 11:13:42PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> When private and shared memory are mixed in a large page, the lpage_info
> may not be accurate and should be updated with this mixed info. A large
> page that contains mixed pages can't really be mapped as a large page,
> since its private and shared pages come from different physical memory.
> 
> Update lpage_info when the private/shared memory attribute is changed. If
> both private and shared pages are within a large page region, it can't
> be mapped as a large page. It's a bit challenging to track the mixed
> info in a 'count'-like variable, so this patch instead reserves a bit in
> 'disallow_lpage' to indicate that a large page has mixed private/shared
> pages.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   8 +++
>  arch/x86/kvm/mmu/mmu.c          | 112 +++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   2 +
>  include/linux/kvm_host.h        |  19 ++++++
>  virt/kvm/kvm_main.c             |  16 +++--
>  5 files changed, 152 insertions(+), 5 deletions(-)
> 
...
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 33b1aec44fb8..67a9823a8c35 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
...
> @@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  	if (kvm->arch.nx_lpage_recovery_thread)
>  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
>  }
> +
> +static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> +{
> +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> +{
> +	if (mixed)
> +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +	else
> +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
> +				 gfn_t start, gfn_t end)
> +{
> +	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	gfn_t gfn = start;
> +	void *entry;
> +	bool shared = attr == KVM_MEM_ATTR_SHARED;
> +	bool mixed = false;
> +
> +	rcu_read_lock();
> +	entry = xas_load(&xas);
> +	while (gfn < end) {
> +		if (xas_retry(&xas, entry))
> +			continue;
> +
> +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> +
> +		if ((entry && !shared) || (!entry && shared)) {
> +			mixed = true;
> +			goto out;

nitpick: goto isn't needed. break should work.

> +		}
> +
> +		entry = xas_next(&xas);
> +		gfn++;
> +	}
> +out:
> +	rcu_read_unlock();
> +	return mixed;
> +}
> +
> +static bool mem_attr_is_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			      int level, unsigned int attr,
> +			      gfn_t start, gfn_t end)
> +{
> +	unsigned long gfn;
> +	void *entry;
> +
> +	if (level == PG_LEVEL_2M)
> +		return mem_attr_is_mixed_2m(kvm, attr, start, end);
> +
> +	entry = xa_load(&kvm->mem_attr_array, start);
> +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)))
> +			return true;
> +		if (xa_load(&kvm->mem_attr_array, gfn) != entry)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			      unsigned int attr, gfn_t start, gfn_t end)
> +{
> +
> +	unsigned long lpage_start, lpage_end;
> +	unsigned long gfn, pages, mask;
> +	int level;
> +
> +	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> +			"Unsupported mem attribute.\n");
> +
> +	/*
> +	 * The sequence matters here: we update the higher level basing on the
> +	 * lower level's scanning result.
> +	 */
> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +		pages = KVM_PAGES_PER_HPAGE(level);
> +		mask = ~(pages - 1);

nitpick: KVM_HPAGE_MASK(level).  Maybe matter of preference.


> +		lpage_start = max(start & mask, slot->base_gfn);
> +		lpage_end = (end - 1) & mask;
> +
> +		/*
> +		 * We only need to scan the head and tail page, for middle pages
> +		 * we know they are not mixed.
> +		 */
> +		linfo_update_mixed(lpage_info_slot(lpage_start, slot, level),
> +				   mem_attr_is_mixed(kvm, slot, level, attr,
> +						     lpage_start, start));
> +
> +		if (lpage_start == lpage_end)
> +			return;
> +
> +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> +			linfo_update_mixed(lpage_info_slot(gfn, slot, level),
> +					   false);
> +
> +		linfo_update_mixed(lpage_info_slot(lpage_end, slot, level),
> +				   mem_attr_is_mixed(kvm, slot, level, attr,
> +						     end, lpage_end + pages));
> +	}
> +}

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-10-25 15:13 ` [PATCH v9 7/8] KVM: Handle page fault for private memory Chao Peng
@ 2022-10-26 21:54   ` Isaku Yamahata
  2022-10-28  6:55     ` Chao Peng
  2022-11-16 20:50   ` Ackerley Tng
  1 sibling, 1 reply; 101+ messages in thread
From: Isaku Yamahata @ 2022-10-26 21:54 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, isaku.yamahata

On Tue, Oct 25, 2022 at 11:13:43PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> A memslot with KVM_MEM_PRIVATE set can include both fd-based private
> memory and hva-based shared memory. Architecture code (like TDX code)
> can tell whether the ongoing fault is private or not. This patch adds
> an 'is_private' field to kvm_page_fault to indicate this, and
> architecture code is expected to set it.
> 
> To handle a page fault for such a memslot, the handling logic differs
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
>   - For a successful match, the private pfn is obtained with
>     restrictedmem_get_page() from the private fd and the shared pfn is
>     obtained with the existing get_user_pages().
>   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
>     userspace. Userspace can then convert the memory between
>     private/shared in the host's view and retry the fault.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 56 +++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++-
>  arch/x86/kvm/mmu/mmutrace.h     |  1 +
>  arch/x86/kvm/mmu/spte.h         |  6 ++++
>  arch/x86/kvm/mmu/tdp_mmu.c      |  3 +-
>  include/linux/kvm_host.h        | 28 +++++++++++++++++
>  6 files changed, 103 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 67a9823a8c35..10017a9f26ee 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>  
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> -			      int max_level)
> +			      int max_level, bool is_private)
>  {
>  	struct kvm_lpage_info *linfo;
>  	int host_level;
> @@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			break;
>  	}
>  
> +	if (is_private)
> +		return max_level;

Below PG_LEVEL_NUM is passed by zap_collapsible_spte_range().  It doesn't make
sense.

> +
>  	if (max_level == PG_LEVEL_4K)
>  		return PG_LEVEL_4K;
>  
> @@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	 * level, which will be used to do precise, accurate accounting.
>  	 */
>  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -						     fault->gfn, fault->max_level);
> +						     fault->gfn, fault->max_level,
> +						     fault->is_private);
>  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>  		return;
>  
> @@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
>  }
>  
> +static inline u8 order_to_level(int order)
> +{
> +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> +		return PG_LEVEL_1G;
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> +		return PG_LEVEL_2M;
> +
> +	return PG_LEVEL_4K;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> +{
> +	int order;
> +	struct kvm_memory_slot *slot = fault->slot;
> +
> +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> +		return RET_PF_RETRY;
> +
> +	fault->max_level = min(order_to_level(order), fault->max_level);
> +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> +	return RET_PF_CONTINUE;
> +}
> +
>  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_memory_slot *slot = fault->slot;
> @@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			return RET_PF_EMULATE;
>  	}
>  
> +	if (kvm_slot_can_be_private(slot) &&
> +	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +		if (fault->is_private)
> +			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +		else
> +			vcpu->run->memory.flags = 0;
> +		vcpu->run->memory.padding = 0;
> +		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> +		vcpu->run->memory.size = PAGE_SIZE;
> +		return RET_PF_USER;
> +	}
> +
> +	if (fault->is_private)
> +		return kvm_faultin_pfn_private(fault);
> +
>  	async = false;
>  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
>  					  fault->write, &fault->map_writable,
> @@ -5557,6 +5603,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  			return -EIO;
>  	}
>  
> +	if (r == RET_PF_USER)
> +		return 0;
> +
>  	if (r < 0)
>  		return r;
>  	if (r != RET_PF_EMULATE)
> @@ -6408,7 +6457,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		 */
>  		if (sp->role.direct &&
>  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> -							       PG_LEVEL_NUM)) {
> +							       PG_LEVEL_NUM,
> +							       false)) {
>  			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>  
>  			if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 582def531d4d..5cdff5ca546c 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -188,6 +188,7 @@ struct kvm_page_fault {
>  
>  	/* Derived from mmu and global state.  */
>  	const bool is_tdp;
> +	const bool is_private;
>  	const bool nx_huge_page_workaround_enabled;
>  
>  	/*
> @@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * RET_PF_RETRY: let CPU fault again on the address.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
>   * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
> @@ -252,6 +254,7 @@ enum {
>  	RET_PF_RETRY,
>  	RET_PF_EMULATE,
>  	RET_PF_INVALID,
> +	RET_PF_USER,
>  	RET_PF_FIXED,
>  	RET_PF_SPURIOUS,
>  };
> @@ -309,7 +312,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> -			      int max_level);
> +			      int max_level, bool is_private);
>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>  
> @@ -318,4 +321,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
> +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
>  TRACE_DEFINE_ENUM(RET_PF_RETRY);
>  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
>  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
>  TRACE_DEFINE_ENUM(RET_PF_FIXED);
>  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>  
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 7670c13ce251..9acdf72537ce 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
>  	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
>  }
>  
> +static inline bool is_private_spte(u64 spte)
> +{
> +	/* FIXME: Query C-bit/S-bit for SEV/TDX. */
> +	return false;
> +}
> +

Deriving private-vs-shared from the PFN encoded in the SPTE doesn't make sense.
In the VMM for TDX, private-vs-shared is determined by the S-bit of the GFN.
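
A sketch of the alternative being suggested (the gfn_shared_mask name is
borrowed from the TDX enabling work and is illustrative only, not something
introduced by this series):

	static inline bool kvm_gfn_is_private(struct kvm *kvm, gfn_t gfn)
	{
		/* For TDX, a GFN with the shared bit (S-bit) clear is private. */
		return !(gfn & kvm->arch.gfn_shared_mask);
	}

zap_collapsible_spte_range() could then pass something like
kvm_gfn_is_private(kvm, iter.gfn) instead of trying to derive the answer
from the SPTE.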


>  static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
>  				int level)
>  {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 672f0432d777..9f97aac90606 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1768,7 +1768,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  			continue;
>  
>  		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> -							      iter.gfn, PG_LEVEL_NUM);
> +						iter.gfn, PG_LEVEL_NUM,
> +						is_private_spte(iter.old_spte));
>  		if (max_mapping_level < iter.level)
>  			continue;

This is to merge pages into a large page on the next KVM page fault.  Large
page support for private memory is not yet implemented, so let's skip the
private slot until it is.
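
A minimal sketch of that interim skip (not code from this series), placed at
the top of zap_collapsible_spte_range() and reusing the helper from patch 2/8:

	/* Skip private memslots until private large page support lands. */
	if (kvm_slot_can_be_private(slot))
		return;
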
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
  2022-10-26 17:31   ` Isaku Yamahata
@ 2022-10-27 10:20   ` Fuad Tabba
  2022-10-31 17:47   ` Michael Roth
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 101+ messages in thread
From: Fuad Tabba @ 2022-10-27 10:20 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,


On Tue, Oct 25, 2022 at 4:18 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
>
> memfd_restricted() is useful for scenarios where a file descriptor (fd)
> can be used as an interface into mm but userspace's ability to access
> the fd needs to be restricted. Initially it is designed to provide
> protections for KVM encrypted guest memory.
>
> Normally KVM uses memfd memory by mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmapped virtual address to set up the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with a special key for a special software domain (e.g. a KVM guest) and
> is not expected to be directly accessed by userspace. More precisely,
> userspace access to such encrypted memory may lead to a host crash, so
> it should be prevented.
>
> memfd_restricted() provides the semantics required for KVM guest
> encrypted memory support: an fd created with memfd_restricted() is going
> to be used as the source of guest memory in a confidential computing
> environment, and KVM can directly interact with core-mm without the need
> to expose the memory content to KVM userspace.
>
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
>
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier and then gets a chance to remove any
> mapped entries of the range from the secondary page tables.
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable; this is required for the current
> confidential usage, but it might be changed in the future.
>
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining a new bit in 'flags', it can be extended to support other
> restricted semantics in the future.
>
> The system call is currently wired up for x86 arch.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>

And I'm working on porting to arm64 and testing V9.

Cheers,
/fuad


>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448    i386    process_mrelease        sys_process_mrelease
>  449    i386    futex_waitv             sys_futex_waitv
>  450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> +451    i386    memfd_restricted        sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
>  450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
> +451    common  memfd_restricted        sys_memfd_restricted
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +       void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +                                pgoff_t start, pgoff_t end);
> +       void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +                              pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +       struct list_head list;
> +       const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                                        struct page **pagep, int *order)
> +{
> +       return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>                                             unsigned long home_node,
>                                             unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
>  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
>  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> +#define RESTRICTEDMEM_MAGIC    0x5245534d      /* "RESM" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>         def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config RESTRICTEDMEM
> +       bool
> +       depends on TMPFS
> +
>  config ANON_VMA_NAME
>         bool "Anonymous VMA name support"
>         depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +                                pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +       struct restrictedmem_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               if (notify_start)
> +                       notifier->ops->invalidate_start(notifier, start, end);
> +               else
> +                       notifier->ops->invalidate_end(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +                                   loff_t offset, loff_t len)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +       return ret;
> +}
> +
> +static const struct file_operations restrictedmem_fops = {
> +       .release = restrictedmem_release,
> +       .fallocate = restrictedmem_fallocate,
> +};
> +
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> +                                const struct path *path, struct kstat *stat,
> +                                u32 request_mask, unsigned int query_flags)
> +{
> +       struct inode *inode = d_inode(path->dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +                                            request_mask, query_flags);
> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> +                                struct dentry *dentry, struct iattr *attr)
> +{
> +       struct inode *inode = d_inode(dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (attr->ia_valid & ATTR_SIZE) {
> +               if (memfd->f_inode->i_size)
> +                       return -EPERM;
> +
> +               if (!PAGE_ALIGNED(attr->ia_size))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +                                           file_dentry(memfd), attr);
> +       return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> +       .getattr = restrictedmem_getattr,
> +       .setattr = restrictedmem_setattr,
> +};
> +
> +static int restrictedmem_init_fs_context(struct fs_context *fc)
> +{
> +       if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> +               return -ENOMEM;
> +
> +       fc->s_iflags |= SB_I_NOEXEC;
> +       return 0;
> +}
> +
> +static struct file_system_type restrictedmem_fs = {
> +       .owner          = THIS_MODULE,
> +       .name           = "memfd:restrictedmem",
> +       .init_fs_context = restrictedmem_init_fs_context,
> +       .kill_sb        = kill_anon_super,
> +};
> +
> +static struct vfsmount *restrictedmem_mnt;
> +
> +static __init int restrictedmem_init(void)
> +{
> +       restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> +       if (IS_ERR(restrictedmem_mnt))
> +               return PTR_ERR(restrictedmem_mnt);
> +       return 0;
> +}
> +fs_initcall(restrictedmem_init);
> +
> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +       struct restrictedmem_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &restrictedmem_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +                                "restrictedmem", O_RDWR,
> +                                &restrictedmem_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);
> +               return ERR_CAST(file);
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}
> +
> +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +{
> +       struct file *file, *restricted_file;
> +       int fd, err;
> +
> +       if (flags)
> +               return -EINVAL;
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0)
> +               return fd;
> +
> +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +       if (IS_ERR(file)) {
> +               err = PTR_ERR(file);
> +               goto err_fd;
> +       }
> +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +       file->f_flags |= O_LARGEFILE;
> +
> +       restricted_file = restrictedmem_file_create(file);
> +       if (IS_ERR(restricted_file)) {
> +               err = PTR_ERR(restricted_file);
> +               fput(file);
> +               goto err_fd;
> +       }
> +
> +       fd_install(fd, restricted_file);
> +       return fd;
> +err_fd:
> +       put_unused_fd(fd);
> +       return err;
> +}
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> +
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pagep = page;
> +       if (order)
> +               *order = thp_order(compound_head(page));
> +
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> --
> 2.25.1
>
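
For reference, a minimal userspace sketch of how the new syscall could be
exercised (there is no libc wrapper, so the raw syscall number from the
tables above is used; the KVM side that actually consumes the fd comes in
the later patches of this series):

	#define _GNU_SOURCE
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdio.h>

	#ifndef __NR_memfd_restricted
	#define __NR_memfd_restricted 451
	#endif

	int main(void)
	{
		/* flags must currently be 0; anything else returns -EINVAL */
		int fd = syscall(__NR_memfd_restricted, 0);

		if (fd < 0) {
			perror("memfd_restricted");
			return 1;
		}

		/* Size the backing file; the fd itself can't be read/written/mmapped. */
		if (ftruncate(fd, 2UL << 20) < 0)
			perror("ftruncate");

		/* The fd would then be handed to KVM, see patch 2/8. */
		close(fd);
		return 0;
	}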


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-25 15:13 ` [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-10-27 10:25   ` Fuad Tabba
  2022-10-28  7:04   ` Xiaoyao Li
  2022-11-14 16:04   ` Alex Bennée
  2 siblings, 0 replies; 101+ messages in thread
From: Fuad Tabba @ 2022-10-27 10:25 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 4:18 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In memory encryption usage, guest memory may be encrypted with a special
> key and can be accessed only by the guest itself. We call such memory
> private memory. Allowing userspace to access guest private memory is
> useless and can sometimes cause problems. This new KVM memslot extension
> allows guest private memory to be provided through a restrictedmem-backed
> file descriptor (fd), and userspace is restricted from accessing the
> memory bookmarked in the fd.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr (hva). When in
> use, a single memslot can maintain both private memory through
> restricted_fd and shared memory through userspace_addr. Whether the
> private or shared part is visible to the guest is maintained by other
> KVM code.
>
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM of memory changes,
> so KVM can then invalidate its page table entries.
>
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only. A KVM_CAP_PRIVATE_MEM is
> also introduced to indicate KVM support for KVM_MEM_PRIVATE.
>
> To make code maintenance easy, internally we use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>

Reviewed-by: Fuad Tabba <tabba@google.com>

I have tested the V8 version of this patch on arm64/qemu (which has
the fix to copy_from_user included in this patch), and considering
this hasn't changed much:
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad



> ---
>  Documentation/virt/kvm/api.rst | 48 ++++++++++++++++++++++++++++-----
>  arch/x86/kvm/Kconfig           |  2 ++
>  arch/x86/kvm/x86.c             |  2 +-
>  include/linux/kvm_host.h       | 13 +++++++--
>  include/uapi/linux/kvm.h       | 29 ++++++++++++++++++++
>  virt/kvm/Kconfig               |  3 +++
>  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
>  7 files changed, 128 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index eee9f857a986..f3fa75649a78 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>
>  ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>
> +  struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +  };
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
>    #define KVM_MEM_READONLY     (1UL << 1)
> +  #define KVM_MEM_PRIVATE              (1UL << 2)
>
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +kvm_userspace_memory_region_ext struct includes all fields of
> +kvm_userspace_memory_region struct, while also adds additional fields for some
> +other features. See below description of flags field for more information.
> +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> +
> +The flags field supports following flags:
> +
> +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> +  within the slot.  For more details, see KVM_GET_DIRTY_LOG ioctl.
> +
> +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> +  read-only.  In this case, writes to this memory will be posted to userspace as
> +  KVM_EXIT_MMIO exits.
> +
> +- KVM_MEM_PRIVATE, if KVM_CAP_PRIVATE_MEM allows, to indicate a new slot has
> +  private memory backed by a file descriptor(fd) and userspace access to the
> +  fd may be restricted. Userspace should use restricted_fd/restricted_offset in
> +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> +  to guest. Userspace should guarantee not to map the same pfn indicated by
> +  restricted_fd/restricted_offset to different gfns with multiple memslots.
> +  Failed to do this may result undefined behavior.
>
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> @@ -8215,6 +8239,16 @@ structure.
>  When getting the Modified Change Topology Report value, the attr->addr
>  must point to a byte where the value will be stored or retrieved from.
>
> +8.36 KVM_CAP_PRIVATE_MEM
> +------------------------
> +
> +:Architectures: x86
> +
> +This capability indicates that private memory is supported and userspace can
> +set KVM_MEM_PRIVATE flag for KVM_SET_USER_MEMORY_REGION ioctl.  See
> +KVM_SET_USER_MEMORY_REGION for details on the usage of KVM_MEM_PRIVATE and
> +kvm_userspace_memory_region_ext fields.
> +
>  9. Known KVM API problems
>  =========================
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 67be7f217e37..8d2bd455c0cd 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,8 @@ config KVM
>         select SRCU
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
> +       select HAVE_KVM_RESTRICTED_MEM if X86_64
> +       select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bd5f8a751de..02ad31f46dd7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12425,7 +12425,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>         }
>
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -               struct kvm_userspace_memory_region m;
> +               struct kvm_user_mem_region m;
>
>                 m.slot = id | (i << 16);
>                 m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 32f259fa5801..739a7562a1f3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -44,6 +44,7 @@
>
>  #include <asm/kvm_host.h>
>  #include <linux/kvm_dirty_ring.h>
> +#include <linux/restrictedmem.h>
>
>  #ifndef KVM_MAX_VCPU_IDS
>  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> @@ -575,8 +576,16 @@ struct kvm_memory_slot {
>         u32 flags;
>         short id;
>         u16 as_id;
> +       struct file *restricted_file;
> +       loff_t restricted_offset;
> +       struct restrictedmem_notifier notifier;
>  };
>
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> +       return slot && (slot->flags & KVM_MEM_PRIVATE);
> +}
> +
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
>  {
>         return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -1103,9 +1112,9 @@ enum kvm_mr_change {
>  };
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem);
> +                         const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem);
> +                           const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0d5d4419139a..f1ae45c10c94 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>
> +struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/*
> + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> + * all fields from the top-level "extended" region.
> + */
> +struct kvm_user_mem_region {
> +       __u32 slot;
> +       __u32 flags;
> +       __u64 guest_phys_addr;
> +       __u64 memory_size;
> +       __u64 userspace_addr;
> +       __u64 restricted_offset;
> +       __u32 restricted_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +#endif
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
>  #define KVM_MEM_READONLY       (1UL << 1)
> +#define KVM_MEM_PRIVATE                (1UL << 2)
>
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> @@ -1178,6 +1206,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_ZPCI_OP 221
>  #define KVM_CAP_S390_CPU_TOPOLOGY 222
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> +#define KVM_CAP_PRIVATE_MEM 224
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..9ff164c7e0cc 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -86,3 +86,6 @@ config KVM_XFER_TO_GUEST_WORK
>
>  config HAVE_KVM_PM_NOTIFIER
>         bool
> +
> +config HAVE_KVM_RESTRICTED_MEM
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e30f1b4ecfa5..8dace78a0278 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   * Must be called holding kvm->slots_lock for write.
>   */
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem)
> +                           const struct kvm_user_mem_region *mem)
>  {
>         struct kvm_memory_slot *old, *new;
>         struct kvm_memslots *slots;
> @@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem)
> +                         const struct kvm_user_mem_region *mem)
>  {
>         int r;
>
> @@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
>  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -                                         struct kvm_userspace_memory_region *mem)
> +                                         struct kvm_user_mem_region *mem)
>  {
>         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
> @@ -4627,6 +4627,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region, field));      \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> +do {                                                                                   \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4650,14 +4677,20 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
> -               struct kvm_userspace_memory_region kvm_userspace_mem;
> +               struct kvm_user_mem_region mem;
> +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> +               kvm_sanity_check_user_mem_region_alias();
>
>                 r = -EFAULT;
> -               if (copy_from_user(&kvm_userspace_mem, argp,
> -                                               sizeof(kvm_userspace_mem)))
> +               if (copy_from_user(&mem, argp, size))
> +                       goto out;
> +
> +               r = -EINVAL;
> +               if (mem.flags & KVM_MEM_PRIVATE)
>                         goto out;
>
> -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
>         case KVM_GET_DIRTY_LOG: {
> --
> 2.25.1
>
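
Tying this together with patch 1/8, a rough userspace sketch of registering a
private-capable slot with the extended struct is below. vm_fd, shared_mem and
restricted_fd are assumed to exist (the VM ioctl fd, an mmapped buffer for the
shared part, and a memfd_restricted() fd for the private part); note that
this particular patch still rejects KVM_MEM_PRIVATE in kvm_vm_ioctl(), so the
flag only becomes usable once the rest of the series is applied.

	struct kvm_userspace_memory_region_ext region_ext = {
		.region = {
			.slot = 0,
			.flags = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0,
			.memory_size = 2UL << 20,
			/* hva backing the shared part of the slot */
			.userspace_addr = (__u64)(unsigned long)shared_mem,
		},
		/* private part backed by the restrictedmem fd */
		.restricted_fd = restricted_fd,
		.restricted_offset = 0,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext) < 0)
		perror("KVM_SET_USER_MEMORY_REGION");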


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
  2022-10-25 15:26   ` Peter Maydell
@ 2022-10-27 10:27   ` Fuad Tabba
  2022-10-28  6:14     ` Chao Peng
  2022-11-15 16:56   ` Alex Bennée
  2022-11-16 18:15   ` Andy Lutomirski
  3 siblings, 1 reply; 101+ messages in thread
From: Fuad Tabba @ 2022-10-27 10:27 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates that an error happened in KVM at the guest memory range
> [gpa, gpa+size). The 'flags' field includes additional information for
> userspace to handle the error. Currently bit 0 is defined as 'private
> memory', where '1' indicates the error happened due to private memory
> access and '0' indicates the error happened due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, there are typically two kinds of memory
> conversions:
>   - explicit conversion: happens when the guest explicitly calls into
>     KVM to map a range (as private or shared); KVM then exits to
>     userspace to perform the map/unmap operations.
>   - implicit conversion: happens in the KVM page fault handler, where
>     KVM exits to userspace for an implicit conversion when the page is
>     in a different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>

I have tested the V8 version of this patch on arm64/qemu, and
considering this hasn't changed:
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad



>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> +                       __u32 flags;
> +                       __u32 padding;
> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> +   private memory access when the bit is set. Otherwise the memory error is
> +   caused by shared memory access when the bit is clear.
> +
> +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> +may handle the error and return to KVM to retry the previous memory access.
> +
>  ::
>
>      /* KVM_EXIT_NOTIFY */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f1ae45c10c94..fa60b032a405 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -538,6 +539,14 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
>                         __u32 flags;
>                 } notify;
> +               /* KVM_EXIT_MEMORY_FAULT */
> +               struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> +                       __u32 flags;
> +                       __u32 padding;
> +                       __u64 gpa;
> +                       __u64 size;
> +               } memory;
>                 /* Fix the size of the union. */
>                 char padding[256];
>         };
> --
> 2.25.1
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-10-25 15:13 ` [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-10-27 10:29   ` Fuad Tabba
  2022-11-04  2:28     ` Chao Peng
  2022-11-10 20:06   ` Sean Christopherson
  1 sibling, 1 reply; 101+ messages in thread
From: Fuad Tabba @ 2022-10-27 10:29 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Currently in the mmu_notifier invalidate path, the hva range is recorded
> and then checked against in mmu_invalidate_retry_hva() in the page fault
> path. However, for the to-be-introduced private memory, a page fault may
> not have an hva associated with it, so checking the gfn (gpa) makes more
> sense.
>
> For the existing non-private memory case, gfn is expected to continue to
> work. The only downside is that when multiple gfns alias a single hva,
> the current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected to be small.
>
> It also fixes a bug in kvm_zap_gfn_range() which has already been using

nit: Now it's kvm_unmap_gfn_range().

> gfns when calling kvm_mmu_invalidate_begin/end(), while these functions
> accept hvas in the current code.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

Based on reading this code and my limited knowledge of the x86 MMU code:
Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
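
For reference, the consumer side of this change is the usual
invalidation-retry pattern, now keyed on the gfn. A condensed sketch is
below; mmu_invalidate_seq, mmu_lock and mmu_invalidate_retry_gfn() are
from this patch/KVM proper, while faultin_pfn() and the install step are
illustrative stand-ins:

    /* Sketch of the fault-path retry pattern with the gfn-based helper. */
    mmu_seq = vcpu->kvm->mmu_invalidate_seq;
    smp_rmb();

    /* Sleepable pfn lookup (gup for shared, restrictedmem for private). */
    r = faultin_pfn(vcpu, fault);

    write_lock(&vcpu->kvm->mmu_lock);
    if (fault->slot &&
        mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn)) {
            /* An invalidation raced with the lookup; retry the fault. */
            write_unlock(&vcpu->kvm->mmu_lock);
            return RET_PF_RETRY;
    }
    /* ... install the spte under mmu_lock ... */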


>  arch/x86/kvm/mmu/mmu.c   |  2 +-
>  include/linux/kvm_host.h | 18 +++++++---------
>  virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++--------------
>  3 files changed, 39 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f81539061d6..33b1aec44fb8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4217,7 +4217,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>                 return true;
>
>         return fault->slot &&
> -              mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +              mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
>  }
>
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 739a7562a1f3..79e5cbc35fcf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -775,8 +775,8 @@ struct kvm {
>         struct mmu_notifier mmu_notifier;
>         unsigned long mmu_invalidate_seq;
>         long mmu_invalidate_in_progress;
> -       unsigned long mmu_invalidate_range_start;
> -       unsigned long mmu_invalidate_range_end;
> +       gfn_t mmu_invalidate_range_start;
> +       gfn_t mmu_invalidate_range_end;
>  #endif
>         struct list_head devices;
>         u64 manual_dirty_log_protect;
> @@ -1365,10 +1365,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -                             unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -                           unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
>
>  long kvm_arch_dev_ioctl(struct file *filp,
>                         unsigned int ioctl, unsigned long arg);
> @@ -1937,9 +1935,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>         return 0;
>  }
>
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>                                            unsigned long mmu_seq,
> -                                          unsigned long hva)
> +                                          gfn_t gfn)
>  {
>         lockdep_assert_held(&kvm->mmu_lock);
>         /*
> @@ -1949,8 +1947,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>          * positives, due to shortcuts when handing concurrent invalidations.
>          */
>         if (unlikely(kvm->mmu_invalidate_in_progress) &&
> -           hva >= kvm->mmu_invalidate_range_start &&
> -           hva < kvm->mmu_invalidate_range_end)
> +           gfn >= kvm->mmu_invalidate_range_start &&
> +           gfn < kvm->mmu_invalidate_range_end)
>                 return 1;
>         if (kvm->mmu_invalidate_seq != mmu_seq)
>                 return 1;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8dace78a0278..09c9cdeb773c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>
>  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -                            unsigned long end);
> +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
>
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
> @@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>                                 locked = true;
>                                 KVM_MMU_LOCK(kvm);
>                                 if (!IS_KVM_NULL_FN(range->on_lock))
> -                                       range->on_lock(kvm, range->start, range->end);
> +                                       range->on_lock(kvm, gfn_range.start,
> +                                                           gfn_range.end);
>                                 if (IS_KVM_NULL_FN(range->handler))
>                                         break;
>                         }
> @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -                             unsigned long end)
> +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> +                                                           gfn_t end)
>  {
> -       /*
> -        * The count increase must become visible at unlock time as no
> -        * spte can be established without taking the mmu_lock and
> -        * count is also read inside the mmu_lock critical section.
> -        */
> -       kvm->mmu_invalidate_in_progress++;
>         if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>                 kvm->mmu_invalidate_range_start = start;
>                 kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>         }
>  }
>
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       /*
> +        * The count increase must become visible at unlock time as no
> +        * spte can be established without taking the mmu_lock and
> +        * count is also read inside the mmu_lock critical section.
> +        */
> +       kvm->mmu_invalidate_in_progress++;
> +}
> +
> +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +       update_invalidate_range(kvm, range->start, range->end);
> +       return kvm_unmap_gfn_range(kvm, range);
> +}
> +
> +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       mark_invalidate_in_progress(kvm, start, end);
> +       update_invalidate_range(kvm, start, end);
> +}
> +
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
> @@ -752,8 +768,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 .start          = range->start,
>                 .end            = range->end,
>                 .pte            = __pte(0),
> -               .handler        = kvm_unmap_gfn_range,
> -               .on_lock        = kvm_mmu_invalidate_begin,
> +               .handler        = kvm_mmu_handle_gfn_range,
> +               .on_lock        = mark_invalidate_in_progress,
>                 .on_unlock      = kvm_arch_guest_memory_reclaimed,
>                 .flush_on_ret   = true,
>                 .may_block      = mmu_notifier_range_blockable(range),
> @@ -791,8 +807,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>         return 0;
>  }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -                           unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
>  {
>         /*
>          * This sequence increase will notify the kvm page fault that
> --
> 2.25.1
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
@ 2022-10-27 10:31   ` Fuad Tabba
  2022-11-03 23:04   ` Sean Christopherson
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 101+ messages in thread
From: Fuad Tabba @ 2022-10-27 10:31 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Introduce generic private memory register/unregister by reusing the
> existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from
> the SEV case by treating the address in the region as a gpa instead of
> an hva. Which path these ioctls take is determined by
> kvm_arch_has_private_mem(); an architecture which supports
> KVM_PRIVATE_MEM should override this function.
>
> KVM internally defaults all guest memory to private memory and maintains
> the shared memory in 'mem_attr_array'. The above ioctls operate on this
> field and unmap existing mappings, if any.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad
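
For illustration, the userspace side of a conversion then reduces to the
existing SEV ioctls with a GPA range. A minimal sketch (vm_fd is assumed
to be an open VM descriptor; error handling omitted):

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Mark [gpa, gpa + size) private (REG) or shared (UNREG) in KVM's view. */
    static int set_range_private(int vm_fd, __u64 gpa, __u64 size, bool private)
    {
            struct kvm_enc_region region = {
                    .addr = gpa,    /* interpreted as a GPA for private memory */
                    .size = size,
            };

            return ioctl(vm_fd, private ? KVM_MEMORY_ENCRYPT_REG_REGION :
                                          KVM_MEMORY_ENCRYPT_UNREG_REGION,
                         &region);
    }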


>  Documentation/virt/kvm/api.rst |  17 ++-
>  arch/x86/kvm/Kconfig           |   1 +
>  include/linux/kvm_host.h       |  10 +-
>  virt/kvm/Kconfig               |   4 +
>  virt/kvm/kvm_main.c            | 227 +++++++++++++++++++++++++--------
>  5 files changed, 198 insertions(+), 61 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 975688912b8c..08253cf498d1 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
>  This ioctl can be used to register a guest memory region which may
>  contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register guest private
> +memory region and the addr/size of kvm_enc_region represents guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fall into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
>  moving ciphertext of those pages will not result in plaintext being
>  swapped. So relocating (or migrating) physical backing pages for the SEV
>  guest will require some additional steps.
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 8d2bd455c0cd..73fdfa429b20 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -51,6 +51,7 @@ config KVM
>         select HAVE_KVM_PM_NOTIFIER if PM
>         select HAVE_KVM_RESTRICTED_MEM if X86_64
>         select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> +       select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 79e5cbc35fcf..4ce98fa0153c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
>  struct kvm_gfn_range {
>         struct kvm_memory_slot *slot;
>         gfn_t start;
> @@ -254,6 +255,9 @@ struct kvm_gfn_range {
>         bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +#endif
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -794,6 +798,9 @@ struct kvm {
>         struct notifier_block pm_notifier;
>  #endif
>         char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +       struct xarray mem_attr_array;
> +#endif
>  };
>
>  #define kvm_err(fmt, ...) \
> @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 9ff164c7e0cc..69ca59e82149 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
>
>  config HAVE_KVM_RESTRICTED_MEM
>         bool
> +
> +config KVM_GENERIC_PRIVATE_MEM
> +       bool
> +       depends on HAVE_KVM_RESTRICTED_MEM
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 09c9cdeb773c..fc3835826ace 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> +                                                           gfn_t end)
> +{
> +       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +               kvm->mmu_invalidate_range_start = start;
> +               kvm->mmu_invalidate_range_end = end;
> +       } else {
> +               /*
> +                * Fully tracking multiple concurrent ranges has diminishing
> +                * returns. Keep things simple and just find the minimal range
> +                * which includes the current and new ranges. As there won't be
> +                * enough information to subtract a range after its invalidate
> +                * completes, any ranges invalidated concurrently will
> +                * accumulate and persist until all outstanding invalidates
> +                * complete.
> +                */
> +               kvm->mmu_invalidate_range_start =
> +                       min(kvm->mmu_invalidate_range_start, start);
> +               kvm->mmu_invalidate_range_end =
> +                       max(kvm->mmu_invalidate_range_end, end);
> +       }
> +}
> +
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       /*
> +        * The count increase must become visible at unlock time as no
> +        * spte can be established without taking the mmu_lock and
> +        * count is also read inside the mmu_lock critical section.
> +        */
> +       kvm->mmu_invalidate_in_progress++;
> +}
> +
> +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       mark_invalidate_in_progress(kvm, start, end);
> +       update_invalidate_range(kvm, start, end);
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       /*
> +        * This sequence increase will notify the kvm page fault that
> +        * the page that is going to be mapped in the spte could have
> +        * been freed.
> +        */
> +       kvm->mmu_invalidate_seq++;
> +       smp_wmb();
> +       /*
> +        * The above sequence increase must be visible before the
> +        * below count decrease, which is ensured by the smp_wmb above
> +        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> +        */
> +       kvm->mmu_invalidate_in_progress--;
> +}
> +
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
> @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>
> -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> -                                                           gfn_t end)
> -{
> -       if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> -               kvm->mmu_invalidate_range_start = start;
> -               kvm->mmu_invalidate_range_end = end;
> -       } else {
> -               /*
> -                * Fully tracking multiple concurrent ranges has diminishing
> -                * returns. Keep things simple and just find the minimal range
> -                * which includes the current and new ranges. As there won't be
> -                * enough information to subtract a range after its invalidate
> -                * completes, any ranges invalidated concurrently will
> -                * accumulate and persist until all outstanding invalidates
> -                * complete.
> -                */
> -               kvm->mmu_invalidate_range_start =
> -                       min(kvm->mmu_invalidate_range_start, start);
> -               kvm->mmu_invalidate_range_end =
> -                       max(kvm->mmu_invalidate_range_end, end);
> -       }
> -}
> -
> -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -       /*
> -        * The count increase must become visible at unlock time as no
> -        * spte can be established without taking the mmu_lock and
> -        * count is also read inside the mmu_lock critical section.
> -        */
> -       kvm->mmu_invalidate_in_progress++;
> -}
> -
>  static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         update_invalidate_range(kvm, range->start, range->end);
>         return kvm_unmap_gfn_range(kvm, range);
>  }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -       mark_invalidate_in_progress(kvm, start, end);
> -       update_invalidate_range(kvm, start, end);
> -}
> -
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
> @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>         return 0;
>  }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -       /*
> -        * This sequence increase will notify the kvm page fault that
> -        * the page that is going to be mapped in the spte could have
> -        * been freed.
> -        */
> -       kvm->mmu_invalidate_seq++;
> -       smp_wmb();
> -       /*
> -        * The above sequence increase must be visible before the
> -        * below count decrease, which is ensured by the smp_wmb above
> -        * in conjunction with the smp_rmb in mmu_invalidate_retry().
> -        */
> -       kvm->mmu_invalidate_in_progress--;
> -}
> -
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
> @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       struct kvm_gfn_range gfn_range;
> +       struct kvm_memory_slot *slot;
> +       struct kvm_memslots *slots;
> +       struct kvm_memslot_iter iter;
> +       int i;
> +       int r = 0;
> +
> +       gfn_range.pte = __pte(0);
> +       gfn_range.may_block = true;
> +
> +       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +               slots = __kvm_memslots(kvm, i);
> +
> +               kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> +                       slot = iter.slot;
> +                       gfn_range.start = max(start, slot->base_gfn);
> +                       gfn_range.end = min(end, slot->base_gfn + slot->npages);
> +                       if (gfn_range.start >= gfn_range.end)
> +                               continue;
> +                       gfn_range.slot = slot;
> +
> +                       r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +               }
> +       }
> +
> +       if (r)
> +               kvm_flush_remote_tlbs(kvm);
> +}
> +
> +#define KVM_MEM_ATTR_SHARED    0x0001
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> +                                    bool is_private)
> +{
> +       gfn_t start, end;
> +       unsigned long i;
> +       void *entry;
> +       int idx;
> +       int r = 0;
> +
> +       if (size == 0 || gpa + size < gpa)
> +               return -EINVAL;
> +       if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> +               return -EINVAL;
> +
> +       start = gpa >> PAGE_SHIFT;
> +       end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +       /*
> +        * Guest memory defaults to private, kvm->mem_attr_array only stores
> +        * shared memory.
> +        */
> +       entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> +       idx = srcu_read_lock(&kvm->srcu);
> +       KVM_MMU_LOCK(kvm);
> +       kvm_mmu_invalidate_begin(kvm, start, end);
> +
> +       for (i = start; i < end; i++) {
> +               r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +                                   GFP_KERNEL_ACCOUNT));
> +               if (r)
> +                       goto err;
> +       }
> +
> +       kvm_unmap_mem_range(kvm, start, end);
> +
> +       goto ret;
> +err:
> +       for (; i > start; i--)
> +               xa_erase(&kvm->mem_attr_array, i);
> +ret:
> +       kvm_mmu_invalidate_end(kvm, start, end);
> +       KVM_MMU_UNLOCK(kvm);
> +       srcu_read_unlock(&kvm->srcu, idx);
> +
> +       return r;
> +}
> +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>                                 unsigned long state,
> @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>         spin_lock_init(&kvm->mn_invalidate_lock);
>         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>         xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +       xa_init(&kvm->mem_attr_array);
> +#endif
>
>         INIT_LIST_HEAD(&kvm->gpc_list);
>         spin_lock_init(&kvm->gpc_lock);
> @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>         }
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +       xa_destroy(&kvm->mem_attr_array);
> +#endif
>         cleanup_srcu_struct(&kvm->irq_srcu);
>         cleanup_srcu_struct(&kvm->srcu);
>         kvm_arch_free_vm(kvm);
> @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +       return false;
> +}
> +
>  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +       case KVM_MEMORY_ENCRYPT_REG_REGION:
> +       case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +               struct kvm_enc_region region;
> +               bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> +               if (!kvm_arch_has_private_mem(kvm))
> +                       goto arch_vm_ioctl;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&region, argp, sizeof(region)))
> +                       goto out;
> +
> +               r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> +                                             region.size, set);
> +               break;
> +       }
> +#endif
>         case KVM_GET_DIRTY_LOG: {
>                 struct kvm_dirty_log log;
>
> @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_get_stats_fd(kvm);
>                 break;
>         default:
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +arch_vm_ioctl:
> +#endif
>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>         }
>  out:
> --
> 2.25.1
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-10-25 15:13 ` [PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-10-27 10:31   ` Fuad Tabba
  0 siblings, 0 replies; 101+ messages in thread
From: Fuad Tabba @ 2022-10-27 10:31 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

On Tue, Oct 25, 2022 at 4:20 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Expose KVM_MEM_PRIVATE and the memslot fields restricted_fd/offset to
> userspace. KVM registers/unregisters the private memslot with the
> fd-based memory backing store and responds to invalidation events from
> restrictedmem_notifier to zap the existing memory mappings in the
> secondary page table.
>
> Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> by architecture code, which can turn it on by overriding the default
> kvm_arch_has_private_mem().
>
> A 'kvm' reference is added to the memslot structure since in the
> restrictedmem_notifier callback we can only obtain a memslot reference,
> but 'kvm' is needed to do the zapping.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |   3 +-
>  virt/kvm/kvm_main.c      | 174 +++++++++++++++++++++++++++++++++++++--
>  2 files changed, 171 insertions(+), 6 deletions(-)

Reviewed-by: Fuad Tabba <tabba@google.com>

Thanks,
/fuad
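
For completeness, a hedged userspace sketch of wiring up such a slot. It
assumes a memfd_restricted() wrapper and the
kvm_userspace_memory_region_ext layout from the earlier patches in this
series (base region embedded, followed by restricted_offset/restricted_fd);
the addresses and sizes are illustrative only:

    /* Sketch: back a private memslot with a restricted memfd. */
    int restricted_fd = memfd_restricted(0);

    struct kvm_userspace_memory_region_ext region_ext = {
            .region = {
                    .slot            = 0,
                    .flags           = KVM_MEM_PRIVATE,
                    .guest_phys_addr = 0x80000000ULL,
                    .memory_size     = 0x40000000ULL,        /* 1G */
                    .userspace_addr  = (__u64)shared_hva,    /* shared part */
            },
            .restricted_fd     = restricted_fd,
            .restricted_offset = 0,
    };

    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);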


>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 69300fc6d572..e27d62c30484 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -246,7 +246,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
>
> -#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_HAVE_KVM_RESTRICTED_MEM)
>  struct kvm_gfn_range {
>         struct kvm_memory_slot *slot;
>         gfn_t start;
> @@ -583,6 +583,7 @@ struct kvm_memory_slot {
>         struct file *restricted_file;
>         loff_t restricted_offset;
>         struct restrictedmem_notifier notifier;
> +       struct kvm *kvm;
>  };
>
>  static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13a37b4d9e97..dae6a2c196ad 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1028,6 +1028,111 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  }
>  #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> +                                        pgoff_t start, pgoff_t end,
> +                                        gfn_t *gfn_start, gfn_t *gfn_end)
> +{
> +       unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +
> +       if (start > base_pgoff)
> +               *gfn_start = slot->base_gfn + start - base_pgoff;
> +       else
> +               *gfn_start = slot->base_gfn;
> +
> +       if (end < base_pgoff + slot->npages)
> +               *gfn_end = slot->base_gfn + end - base_pgoff;
> +       else
> +               *gfn_end = slot->base_gfn + slot->npages;
> +
> +       if (*gfn_start >= *gfn_end)
> +               return false;
> +
> +       return true;
> +}
> +
> +static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *notifier,
> +                                              pgoff_t start, pgoff_t end)
> +{
> +       struct kvm_memory_slot *slot = container_of(notifier,
> +                                                   struct kvm_memory_slot,
> +                                                   notifier);
> +       struct kvm *kvm = slot->kvm;
> +       gfn_t gfn_start, gfn_end;
> +       struct kvm_gfn_range gfn_range;
> +       int idx;
> +
> +       if (!restrictedmem_range_is_valid(slot, start, end,
> +                                               &gfn_start, &gfn_end))
> +               return;
> +
> +       idx = srcu_read_lock(&kvm->srcu);
> +       KVM_MMU_LOCK(kvm);
> +
> +       kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> +
> +       gfn_range.start = gfn_start;
> +       gfn_range.end = gfn_end;
> +       gfn_range.slot = slot;
> +       gfn_range.pte = __pte(0);
> +       gfn_range.may_block = true;
> +
> +       if (kvm_unmap_gfn_range(kvm, &gfn_range))
> +               kvm_flush_remote_tlbs(kvm);
> +
> +       KVM_MMU_UNLOCK(kvm);
> +       srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier *notifier,
> +                                            pgoff_t start, pgoff_t end)
> +{
> +       struct kvm_memory_slot *slot = container_of(notifier,
> +                                                   struct kvm_memory_slot,
> +                                                   notifier);
> +       struct kvm *kvm = slot->kvm;
> +       gfn_t gfn_start, gfn_end;
> +
> +       if (!restrictedmem_range_is_valid(slot, start, end,
> +                                               &gfn_start, &gfn_end))
> +               return;
> +
> +       KVM_MMU_LOCK(kvm);
> +       kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> +       KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
> +       .invalidate_start = kvm_restrictedmem_invalidate_begin,
> +       .invalidate_end = kvm_restrictedmem_invalidate_end,
> +};
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
> +{
> +       slot->notifier.ops = &kvm_restrictedmem_notifier_ops;
> +       restrictedmem_register_notifier(slot->restricted_file, &slot->notifier);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
> +{
> +       restrictedmem_unregister_notifier(slot->restricted_file,
> +                                         &slot->notifier);
> +}
> +
> +#else /* !CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
> +{
> +       WARN_ON_ONCE(1);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
> +{
> +       WARN_ON_ONCE(1);
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>                                 unsigned long state,
> @@ -1072,6 +1177,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>  /* This does not remove the slot from struct kvm_memslots data structures */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
> +       if (slot->flags & KVM_MEM_PRIVATE) {
> +               kvm_restrictedmem_unregister(slot);
> +               fput(slot->restricted_file);
> +       }
> +
>         kvm_destroy_dirty_bitmap(slot);
>
>         kvm_arch_free_memslot(kvm, slot);
> @@ -1643,10 +1753,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
>         return false;
>  }
>
> -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> +                                    const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +       if (kvm_arch_has_private_mem(kvm))
> +               valid_flags |= KVM_MEM_PRIVATE;
> +#endif
> +
>  #ifdef __KVM_HAVE_READONLY_MEM
>         valid_flags |= KVM_MEM_READONLY;
>  #endif
> @@ -1722,6 +1838,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>  {
>         int r;
>
> +       if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +               kvm_restrictedmem_register(new);
> +
>         /*
>          * If dirty logging is disabled, nullify the bitmap; the old bitmap
>          * will be freed on "commit".  If logging is enabled in both old and
> @@ -1750,6 +1869,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>         if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
>                 kvm_destroy_dirty_bitmap(new);
>
> +       if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +               kvm_restrictedmem_unregister(new);
> +
>         return r;
>  }
>
> @@ -2047,7 +2169,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>         int as_id, id;
>         int r;
>
> -       r = check_memory_region_flags(mem);
> +       r = check_memory_region_flags(kvm, mem);
>         if (r)
>                 return r;
>
> @@ -2066,6 +2188,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>              !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>                         mem->memory_size))
>                 return -EINVAL;
> +       if (mem->flags & KVM_MEM_PRIVATE &&
> +               (mem->restricted_offset & (PAGE_SIZE - 1) ||
> +                mem->restricted_offset > U64_MAX - mem->memory_size))
> +               return -EINVAL;
>         if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
>                 return -EINVAL;
>         if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2104,6 +2230,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>                 if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>                         return -EINVAL;
>         } else { /* Modify an existing slot. */
> +               /* Private memslots are immutable, they can only be deleted. */
> +               if (mem->flags & KVM_MEM_PRIVATE)
> +                       return -EINVAL;
>                 if ((mem->userspace_addr != old->userspace_addr) ||
>                     (npages != old->npages) ||
>                     ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2132,10 +2261,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
>         new->npages = npages;
>         new->flags = mem->flags;
>         new->userspace_addr = mem->userspace_addr;
> +       if (mem->flags & KVM_MEM_PRIVATE) {
> +               new->restricted_file = fget(mem->restricted_fd);
> +               if (!new->restricted_file ||
> +                   !file_is_restrictedmem(new->restricted_file)) {
> +                       r = -EINVAL;
> +                       goto out;
> +               }
> +               new->restricted_offset = mem->restricted_offset;
> +       }
> +
> +       new->kvm = kvm;
>
>         r = kvm_set_memslot(kvm, old, new, change);
>         if (r)
> -               kfree(new);
> +               goto out;
> +
> +       return 0;
> +
> +out:
> +       if (new->restricted_file)
> +               fput(new->restricted_file);
> +       kfree(new);
>         return r;
>  }
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> @@ -4604,6 +4751,11 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>         case KVM_CAP_BINARY_STATS_FD:
>         case KVM_CAP_SYSTEM_EVENT_DATA:
>                 return 1;
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +       case KVM_CAP_PRIVATE_MEM:
> +               return 1;
> +#endif
> +
>         default:
>                 break;
>         }
> @@ -4795,16 +4947,28 @@ static long kvm_vm_ioctl(struct file *filp,
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
>                 struct kvm_user_mem_region mem;
> -               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +               unsigned int flags_offset = offsetof(typeof(mem), flags);
> +               unsigned long size;
> +               u32 flags;
>
>                 kvm_sanity_check_user_mem_region_alias();
>
> +               memset(&mem, 0, sizeof(mem));
> +
>                 r = -EFAULT;
> +               if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> +                       goto out;
> +
> +               if (flags & KVM_MEM_PRIVATE)
> +                       size = sizeof(struct kvm_userspace_memory_region_ext);
> +               else
> +                       size = sizeof(struct kvm_userspace_memory_region);
> +
>                 if (copy_from_user(&mem, argp, size))
>                         goto out;
>
>                 r = -EINVAL;
> -               if (mem.flags & KVM_MEM_PRIVATE)
> +               if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
>                         goto out;
>
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> --
> 2.25.1
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-26 17:31   ` Isaku Yamahata
@ 2022-10-28  6:12     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-28  6:12 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Oct 26, 2022 at 10:31:45AM -0700, Isaku Yamahata wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +			   struct page **pagep, int *order)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	struct page *page;
> > +	int ret;
> > +
> > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> 
> shmem_getpage() was removed.
> https://lkml.kernel.org/r/20220902194653.1739778-34-willy@infradead.org

Thanks for pointing that out. My current base (kvm/queue) has not included
this change yet, so it still uses shmem_getpage().

Chao
> 
> I needed the following fix to compile.
> 
> thanks,
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index e5bf8907e0f8..4694dd5609d6 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -231,13 +231,15 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
>  {
>         struct restrictedmem_data *data = file->f_mapping->private_data;
>         struct file *memfd = data->memfd;
> +       struct folio *folio = NULL;
>         struct page *page;
>         int ret;
>  
> -       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
>         if (ret)
>                 return ret;
>  
> +       page = folio_file_page(folio, offset);
>         *pagep = page;
>         if (order)
>                 *order = thp_order(compound_head(page));
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-10-27 10:27   ` Fuad Tabba
@ 2022-10-28  6:14     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-28  6:14 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 27, 2022 at 11:27:05AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates that an error happened in KVM at the guest memory range
> > [gpa, gpa+size). The 'flags' field includes additional information for
> > userspace to handle the error. Currently bit 0 is defined as 'private
> > memory': '1' indicates the error happened due to a private memory access
> > and '0' indicates it happened due to a shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, there are typically two kinds of memory
> > conversions:
> >   - explicit conversion: happens when the guest explicitly calls into KVM
> >     to map a range (as private or shared); KVM then exits to userspace
> >     to perform the map/unmap operations.
> >   - implicit conversion: happens in the KVM page fault handler, where KVM
> >     exits to userspace for an implicit conversion when the page is in a
> >     different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> 
> I have tested the V8 version of this patch on arm64/qemu, and
> considering this hasn't changed:
> Tested-by: Fuad Tabba <tabba@google.com>

Appreciate your review and testing!

Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +                       __u32 flags;
> > +                       __u32 padding;
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >
> >      /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f1ae45c10c94..fa60b032a405 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI        35
> >  #define KVM_EXIT_RISCV_CSR        36
> >  #define KVM_EXIT_NOTIFY           37
> > +#define KVM_EXIT_MEMORY_FAULT     38
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -538,6 +539,14 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
> >                         __u32 flags;
> >                 } notify;
> > +               /* KVM_EXIT_MEMORY_FAULT */
> > +               struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> > +                       __u32 flags;
> > +                       __u32 padding;
> > +                       __u64 gpa;
> > +                       __u64 size;
> > +               } memory;
> >                 /* Fix the size of the union. */
> >                 char padding[256];
> >         };
> > --
> > 2.25.1
> >


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-10-26 20:46   ` Isaku Yamahata
@ 2022-10-28  6:38     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-28  6:38 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Oct 26, 2022 at 01:46:20PM -0700, Isaku Yamahata wrote:
> On Tue, Oct 25, 2022 at 11:13:42PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > When private/shared memory are mixed in a large page, the lpage_info may
> > not be accurate and should be updated with this mixed info. A large page
> > that has mixed pages can't really be mapped as a large page since its
> > private/shared pages are from different physical memory.
> >
> > Update lpage_info when the private/shared memory attribute is changed. If
> > both private and shared pages are within a large page region, it can't
> > be mapped as a large page. It's a bit of a challenge to track the mixed
> > info in a 'count'-like variable, so this patch instead reserves a bit in
> > 'disallow_lpage' to indicate that a large page has mixed private/shared pages.
> > 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |   8 +++
> >  arch/x86/kvm/mmu/mmu.c          | 112 +++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/x86.c              |   2 +
> >  include/linux/kvm_host.h        |  19 ++++++
> >  virt/kvm/kvm_main.c             |  16 +++--
> >  5 files changed, 152 insertions(+), 5 deletions(-)
> > 
> ...
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 33b1aec44fb8..67a9823a8c35 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> ...
> > @@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> >  	if (kvm->arch.nx_lpage_recovery_thread)
> >  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> >  }
> > +
> > +static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > +{
> > +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> > +{
> > +	if (mixed)
> > +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +	else
> > +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
> > +				 gfn_t start, gfn_t end)
> > +{
> > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > +	gfn_t gfn = start;
> > +	void *entry;
> > +	bool shared = attr == KVM_MEM_ATTR_SHARED;
> > +	bool mixed = false;
> > +
> > +	rcu_read_lock();
> > +	entry = xas_load(&xas);
> > +	while (gfn < end) {
> > +		if (xas_retry(&xas, entry))
> > +			continue;
> > +
> > +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > +		if ((entry && !shared) || (!entry && shared)) {
> > +			mixed = true;
> > +			goto out;
> 
> nitpick: goto isn't needed. break should work.

Thanks.

> 
> > +		}
> > +
> > +		entry = xas_next(&xas);
> > +		gfn++;
> > +	}
> > +out:
> > +	rcu_read_unlock();
> > +	return mixed;
> > +}
> > +
> > +static bool mem_attr_is_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			      int level, unsigned int attr,
> > +			      gfn_t start, gfn_t end)
> > +{
> > +	unsigned long gfn;
> > +	void *entry;
> > +
> > +	if (level == PG_LEVEL_2M)
> > +		return mem_attr_is_mixed_2m(kvm, attr, start, end);
> > +
> > +	entry = xa_load(&kvm->mem_attr_array, start);
> > +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> > +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)))
> > +			return true;
> > +		if (xa_load(&kvm->mem_attr_array, gfn) != entry)
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			      unsigned int attr, gfn_t start, gfn_t end)
> > +{
> > +
> > +	unsigned long lpage_start, lpage_end;
> > +	unsigned long gfn, pages, mask;
> > +	int level;
> > +
> > +	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> > +			"Unsupported mem attribute.\n");
> > +
> > +	/*
> > +	 * The sequence matters here: we update the higher level basing on the
> > +	 * lower level's scanning result.
> > +	 */
> > +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +		pages = KVM_PAGES_PER_HPAGE(level);
> > +		mask = ~(pages - 1);
> 
> nitpick: KVM_HPAGE_MASK(level).  Maybe matter of preference.

Yes, I hadn't noticed there is a KVM_HPAGE_MASK defined. I have no
strong preference here; since I already have KVM_PAGES_PER_HPAGE(level),
getting the mask is straightforward.

A bare KVM_HPAGE_MASK(level) will not give me what I need since the value
here is a gfn; KVM_HPAGE_MASK(level) >> PAGE_SHIFT would be the right
equivalent.

Chao
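
Spelling out the two formulations being compared, both computed on gfns:

    mask = ~(KVM_PAGES_PER_HPAGE(level) - 1);      /* as in the patch */
    mask = KVM_HPAGE_MASK(level) >> PAGE_SHIFT;    /* suggested form */

They agree for any valid gfn, since a gfn never has the top PAGE_SHIFT
bits set.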
> 
> 
> > +		lpage_start = max(start & mask, slot->base_gfn);
> > +		lpage_end = (end - 1) & mask;
> > +
> > +		/*
> > +		 * We only need to scan the head and tail page, for middle pages
> > +		 * we know they are not mixed.
> > +		 */
> > +		linfo_update_mixed(lpage_info_slot(lpage_start, slot, level),
> > +				   mem_attr_is_mixed(kvm, slot, level, attr,
> > +						     lpage_start, start));
> > +
> > +		if (lpage_start == lpage_end)
> > +			return;
> > +
> > +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> > +			linfo_update_mixed(lpage_info_slot(gfn, slot, level),
> > +					   false);
> > +
> > +		linfo_update_mixed(lpage_info_slot(lpage_end, slot, level),
> > +				   mem_attr_is_mixed(kvm, slot, level, attr,
> > +						     end, lpage_end + pages));
> > +	}
> > +}
> 
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-10-26 21:54   ` Isaku Yamahata
@ 2022-10-28  6:55     ` Chao Peng
  2022-11-01  0:02       ` Isaku Yamahata
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-10-28  6:55 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Oct 26, 2022 at 02:54:25PM -0700, Isaku Yamahata wrote:
> On Tue, Oct 25, 2022 at 11:13:43PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > A memslot with KVM_MEM_PRIVATE being set can include both fd-based
> > private memory and hva-based shared memory. Architecture code (like TDX
> > code) can tell whether the on-going fault is private or not. This patch
> > adds a 'is_private' field to kvm_page_fault to indicate this and
> > architecture code is expected to set it.
> > 
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> >   - For a successful match, private pfn is obtained with
> >     restrictedmem_get_page () from private fd and shared pfn is obtained
> >     with existing get_user_pages().
> >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> >     userspace. Userspace then can convert memory between private/shared
> >     in host's view and retry the fault.
> > 
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 56 +++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++-
> >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> >  arch/x86/kvm/mmu/spte.h         |  6 ++++
> >  arch/x86/kvm/mmu/tdp_mmu.c      |  3 +-
> >  include/linux/kvm_host.h        | 28 +++++++++++++++++
> >  6 files changed, 103 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 67a9823a8c35..10017a9f26ee 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> >  
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > -			      int max_level)
> > +			      int max_level, bool is_private)
> >  {
> >  	struct kvm_lpage_info *linfo;
> >  	int host_level;
> > @@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			break;
> >  	}
> >  
> > +	if (is_private)
> > +		return max_level;
> 
> Below PG_LEVEL_NUM is passed by zap_collapsible_spte_range().  It doesn't make
> sense.
> 
> > +
> >  	if (max_level == PG_LEVEL_4K)
> >  		return PG_LEVEL_4K;
> >  
> > @@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	 * level, which will be used to do precise, accurate accounting.
> >  	 */
> >  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > -						     fault->gfn, fault->max_level);
> > +						     fault->gfn, fault->max_level,
> > +						     fault->is_private);
> >  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> >  		return;
> >  
> > @@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >  
> > +static inline u8 order_to_level(int order)
> > +{
> > +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +		return PG_LEVEL_1G;
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +		return PG_LEVEL_2M;
> > +
> > +	return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> > +{
> > +	int order;
> > +	struct kvm_memory_slot *slot = fault->slot;
> > +
> > +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > +		return RET_PF_RETRY;
> > +
> > +	fault->max_level = min(order_to_level(order), fault->max_level);
> > +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > +	return RET_PF_CONTINUE;
> > +}
> > +
> >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >  	struct kvm_memory_slot *slot = fault->slot;
> > @@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  			return RET_PF_EMULATE;
> >  	}
> >  
> > +	if (kvm_slot_can_be_private(slot) &&
> > +	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +		if (fault->is_private)
> > +			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > +		else
> > +			vcpu->run->memory.flags = 0;
> > +		vcpu->run->memory.padding = 0;
> > +		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > +		vcpu->run->memory.size = PAGE_SIZE;
> > +		return RET_PF_USER;
> > +	}
> > +
> > +	if (fault->is_private)
> > +		return kvm_faultin_pfn_private(fault);
> > +
> >  	async = false;
> >  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> >  					  fault->write, &fault->map_writable,
> > @@ -5557,6 +5603,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> >  			return -EIO;
> >  	}
> >  
> > +	if (r == RET_PF_USER)
> > +		return 0;
> > +
> >  	if (r < 0)
> >  		return r;
> >  	if (r != RET_PF_EMULATE)
> > @@ -6408,7 +6457,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >  		 */
> >  		if (sp->role.direct &&
> >  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > -							       PG_LEVEL_NUM)) {
> > +							       PG_LEVEL_NUM,
> > +							       false)) {
> >  			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> >  
> >  			if (kvm_available_flush_tlb_with_range())
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 582def531d4d..5cdff5ca546c 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -188,6 +188,7 @@ struct kvm_page_fault {
> >  
> >  	/* Derived from mmu and global state.  */
> >  	const bool is_tdp;
> > +	const bool is_private;
> >  	const bool nx_huge_page_workaround_enabled;
> >  
> >  	/*
> > @@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >   * RET_PF_RETRY: let CPU fault again on the address.
> >   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> >   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > + * RET_PF_USER: need to exit to userspace to handle this fault.
> >   * RET_PF_FIXED: The faulting entry has been fixed.
> >   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> >   *
> > @@ -252,6 +254,7 @@ enum {
> >  	RET_PF_RETRY,
> >  	RET_PF_EMULATE,
> >  	RET_PF_INVALID,
> > +	RET_PF_USER,
> >  	RET_PF_FIXED,
> >  	RET_PF_SPURIOUS,
> >  };
> > @@ -309,7 +312,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > -			      int max_level);
> > +			      int max_level, bool is_private);
> >  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> >  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> >  
> > @@ -318,4 +321,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  
> > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > +	WARN_ON_ONCE(1);
> > +	return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index ae86820cef69..2d7555381955 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> >  TRACE_DEFINE_ENUM(RET_PF_RETRY);
> >  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> >  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > +TRACE_DEFINE_ENUM(RET_PF_USER);
> >  TRACE_DEFINE_ENUM(RET_PF_FIXED);
> >  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> >  
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index 7670c13ce251..9acdf72537ce 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
> >  	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
> >  }
> >  
> > +static inline bool is_private_spte(u64 spte)
> > +{
> > +	/* FIXME: Query C-bit/S-bit for SEV/TDX. */
> > +	return false;
> > +}
> > +
> 
> PFN encoded in spte doesn't make sense.  In VMM for TDX, private-vs-shared is
> determined by S-bit of GFN.

My understanding is that we will have a software bit in the spte, won't
we? In the current TDX code I see an SPTE_SHARED_MASK bit defined.

> 
> 
> >  static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
> >  				int level)
> >  {
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 672f0432d777..9f97aac90606 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1768,7 +1768,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >  			continue;
> >  
> >  		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > -							      iter.gfn, PG_LEVEL_NUM);
> > +						iter.gfn, PG_LEVEL_NUM,
> > +						is_private_spte(iter.old_spte));
> >  		if (max_mapping_level < iter.level)
> >  			continue;
> 
> This is to merge pages into a large page on the next kvm page fault.  large page
> support is not yet supported.  Let's skip the private slot until large page
> support is done.

So your suggestion is to pass in 'false' for 'is_private' at this time?
Unless we decide not to use the above is_private_spte(), this code does
not hurt, right? is_private_spte() returns false until we finally get
the chance to add large page support.

Thanks,
Chao
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-25 15:13 ` [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-10-27 10:25   ` Fuad Tabba
@ 2022-10-28  7:04   ` Xiaoyao Li
  2022-10-31 14:14     ` Chao Peng
  2022-11-14 16:04   ` Alex Bennée
  2 siblings, 1 reply; 101+ messages in thread
From: Xiaoyao Li @ 2022-10-28  7:04 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On 10/25/2022 11:13 PM, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided though a restrictedmem
                                                  ^

typo

> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
> 
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
> 
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
> 
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries.
> 
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only. A KVM_CAP_PRIVATE_MEM is
> also introduced to indicate KVM support for KVM_MEM_PRIVATE.
> 
> To make code maintenance easy, internally we use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>   Documentation/virt/kvm/api.rst | 48 ++++++++++++++++++++++++++++-----
>   arch/x86/kvm/Kconfig           |  2 ++
>   arch/x86/kvm/x86.c             |  2 +-
>   include/linux/kvm_host.h       | 13 +++++++--
>   include/uapi/linux/kvm.h       | 29 ++++++++++++++++++++
>   virt/kvm/Kconfig               |  3 +++
>   virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
>   7 files changed, 128 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index eee9f857a986..f3fa75649a78 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
>   :Capability: KVM_CAP_USER_MEMORY
>   :Architectures: all
>   :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>   :Returns: 0 on success, -1 on error
>   
>   ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>   	__u64 userspace_addr; /* start of the userspace allocated memory */
>     };
>   
> +  struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 restricted_offset;
> +	__u32 restricted_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +  };
> +
>     /* for kvm_memory_region::flags */
>     #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>     #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_PRIVATE		(1UL << 2)
>   
>   This ioctl allows the user to create, modify or delete a guest physical
>   memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>   be identical.  This allows large pages in the guest to be backed by large
>   pages in the host.
>   
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +kvm_userspace_memory_region_ext struct includes all fields of
> +kvm_userspace_memory_region struct, while also adds additional fields for some
> +other features. See below description of flags field for more information.
> +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> +
> +The flags field supports following flags:
> +
> +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> +  within the slot.  For more details, see KVM_GET_DIRTY_LOG ioctl.
> +
> +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> +  read-only.  In this case, writes to this memory will be posted to userspace as
> +  KVM_EXIT_MMIO exits.
> +
> +- KVM_MEM_PRIVATE, if KVM_CAP_PRIVATE_MEM allows, to indicate a new slot has
> +  private memory backed by a file descriptor(fd) and userspace access to the
> +  fd may be restricted. Userspace should use restricted_fd/restricted_offset in
> +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> +  to guest. Userspace should guarantee not to map the same pfn indicated by
> +  restricted_fd/restricted_offset to different gfns with multiple memslots.
> +  Failed to do this may result undefined behavior.
>   
>   When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>   the memory region are automatically reflected into the guest.  For example, an
> @@ -8215,6 +8239,16 @@ structure.
>   When getting the Modified Change Topology Report value, the attr->addr
>   must point to a byte where the value will be stored or retrieved from.
>   
> +8.36 KVM_CAP_PRIVATE_MEM
> +------------------------
> +
> +:Architectures: x86
> +
> +This capability indicates that private memory is supported and userspace can
> +set KVM_MEM_PRIVATE flag for KVM_SET_USER_MEMORY_REGION ioctl.  See
> +KVM_SET_USER_MEMORY_REGION for details on the usage of KVM_MEM_PRIVATE and
> +kvm_userspace_memory_region_ext fields.
> +
>   9. Known KVM API problems
>   =========================
>   
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 67be7f217e37..8d2bd455c0cd 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,8 @@ config KVM
>   	select SRCU
>   	select INTERVAL_TREE
>   	select HAVE_KVM_PM_NOTIFIER if PM
> +	select HAVE_KVM_RESTRICTED_MEM if X86_64
> +	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
>   	help
>   	  Support hosting fully virtualized guest machines using hardware
>   	  virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bd5f8a751de..02ad31f46dd7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12425,7 +12425,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>   	}
>   
>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -		struct kvm_userspace_memory_region m;
> +		struct kvm_user_mem_region m;
>   
>   		m.slot = id | (i << 16);
>   		m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 32f259fa5801..739a7562a1f3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -44,6 +44,7 @@
>   
>   #include <asm/kvm_host.h>
>   #include <linux/kvm_dirty_ring.h>
> +#include <linux/restrictedmem.h>
>   
>   #ifndef KVM_MAX_VCPU_IDS
>   #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> @@ -575,8 +576,16 @@ struct kvm_memory_slot {
>   	u32 flags;
>   	short id;
>   	u16 as_id;
> +	struct file *restricted_file;
> +	loff_t restricted_offset;
> +	struct restrictedmem_notifier notifier;
>   };
>   
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> +	return slot && (slot->flags & KVM_MEM_PRIVATE);
> +}
> +

We can introduce this function in patch 6 when it's first used.





^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-28  7:04   ` Xiaoyao Li
@ 2022-10-31 14:14     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-10-31 14:14 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Oct 28, 2022 at 03:04:27PM +0800, Xiaoyao Li wrote:
> On 10/25/2022 11:13 PM, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided though a restrictedmem
>                                                  ^
> 
> typo

Thanks!

> 
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> > 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> > 
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries.
> > 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only. A KVM_CAP_PRIVATE_MEM is
> > also introduced to indicate KVM support for KVM_MEM_PRIVATE.
> > 
> > To make code maintenance easy, internally we use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> > 
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >   Documentation/virt/kvm/api.rst | 48 ++++++++++++++++++++++++++++-----
> >   arch/x86/kvm/Kconfig           |  2 ++
> >   arch/x86/kvm/x86.c             |  2 +-
> >   include/linux/kvm_host.h       | 13 +++++++--
> >   include/uapi/linux/kvm.h       | 29 ++++++++++++++++++++
> >   virt/kvm/Kconfig               |  3 +++
> >   virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
> >   7 files changed, 128 insertions(+), 18 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index eee9f857a986..f3fa75649a78 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> >   :Capability: KVM_CAP_USER_MEMORY
> >   :Architectures: all
> >   :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >   :Returns: 0 on success, -1 on error
> >   ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> >   	__u64 userspace_addr; /* start of the userspace allocated memory */
> >     };
> > +  struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 restricted_offset;
> > +	__u32 restricted_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +  };
> > +
> >     /* for kvm_memory_region::flags */
> >     #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >     #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_PRIVATE		(1UL << 2)
> >   This ioctl allows the user to create, modify or delete a guest physical
> >   memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> >   be identical.  This allows large pages in the guest to be backed by large
> >   pages in the host.
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only.  In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +kvm_userspace_memory_region_ext struct includes all fields of
> > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > +other features. See below description of flags field for more information.
> > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > +
> > +The flags field supports following flags:
> > +
> > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > +  within the slot.  For more details, see KVM_GET_DIRTY_LOG ioctl.
> > +
> > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > +  read-only.  In this case, writes to this memory will be posted to userspace as
> > +  KVM_EXIT_MMIO exits.
> > +
> > +- KVM_MEM_PRIVATE, if KVM_CAP_PRIVATE_MEM allows, to indicate a new slot has
> > +  private memory backed by a file descriptor(fd) and userspace access to the
> > +  fd may be restricted. Userspace should use restricted_fd/restricted_offset in
> > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > +  to guest. Userspace should guarantee not to map the same pfn indicated by
> > +  restricted_fd/restricted_offset to different gfns with multiple memslots.
> > +  Failed to do this may result undefined behavior.
> >   When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >   the memory region are automatically reflected into the guest.  For example, an
> > @@ -8215,6 +8239,16 @@ structure.
> >   When getting the Modified Change Topology Report value, the attr->addr
> >   must point to a byte where the value will be stored or retrieved from.
> > +8.36 KVM_CAP_PRIVATE_MEM
> > +------------------------
> > +
> > +:Architectures: x86
> > +
> > +This capability indicates that private memory is supported and userspace can
> > +set KVM_MEM_PRIVATE flag for KVM_SET_USER_MEMORY_REGION ioctl.  See
> > +KVM_SET_USER_MEMORY_REGION for details on the usage of KVM_MEM_PRIVATE and
> > +kvm_userspace_memory_region_ext fields.
> > +
> >   9. Known KVM API problems
> >   =========================
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 67be7f217e37..8d2bd455c0cd 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -49,6 +49,8 @@ config KVM
> >   	select SRCU
> >   	select INTERVAL_TREE
> >   	select HAVE_KVM_PM_NOTIFIER if PM
> > +	select HAVE_KVM_RESTRICTED_MEM if X86_64
> > +	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> >   	help
> >   	  Support hosting fully virtualized guest machines using hardware
> >   	  virtualization extensions.  You will need a fairly recent
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4bd5f8a751de..02ad31f46dd7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12425,7 +12425,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> >   	}
> >   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > -		struct kvm_userspace_memory_region m;
> > +		struct kvm_user_mem_region m;
> >   		m.slot = id | (i << 16);
> >   		m.flags = 0;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 32f259fa5801..739a7562a1f3 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -44,6 +44,7 @@
> >   #include <asm/kvm_host.h>
> >   #include <linux/kvm_dirty_ring.h>
> > +#include <linux/restrictedmem.h>
> >   #ifndef KVM_MAX_VCPU_IDS
> >   #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > @@ -575,8 +576,16 @@ struct kvm_memory_slot {
> >   	u32 flags;
> >   	short id;
> >   	u16 as_id;
> > +	struct file *restricted_file;
> > +	loff_t restricted_offset;
> > +	struct restrictedmem_notifier notifier;
> >   };
> > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> > +{
> > +	return slot && (slot->flags & KVM_MEM_PRIVATE);
> > +}
> > +
> 
> We can introduce this function in patch 6 when it's first used.

Sounds good to me.

Chao
> 
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
  2022-10-26 17:31   ` Isaku Yamahata
  2022-10-27 10:20   ` Fuad Tabba
@ 2022-10-31 17:47   ` Michael Roth
  2022-11-01 11:37     ` Chao Peng
  2022-11-02 21:14     ` Kirill A. Shutemov
  2022-11-29  0:06   ` Michael Roth
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 101+ messages in thread
From: Michael Roth @ 2022-10-31 17:47 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
> 
> memfd_restricted() is useful for scenarios where a file descriptor(fd)
> can be used as an interface into mm but want to restrict userspace's
> ability on the fd. Initially it is designed to provide protections for
> KVM encrypted guest memory.
> 
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with special key for special software domain (e.g. KVM guest) and is not
> expected to be directly accessed by userspace. Precisely, userspace
> access to such encrypted memory may lead to host crash so should be
> prevented.
> 
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memoy content into KVM userspace.
> 
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
> 
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier, it then gets chance to remove any
> mapped entries of the range in the secondary page tables.
> 
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
> 
> The system call is currently wired up for x86 arch.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448	i386	process_mrelease	sys_process_mrelease
>  449	i386	futex_waitv		sys_futex_waitv
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	i386	memfd_restricted	sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	memfd_restricted	sys_memfd_restricted
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +				 pgoff_t start, pgoff_t end);
> +	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +			       pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +	struct list_head list;
> +	const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +					 struct page **pagep, int *order)
> +{
> +	return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>  
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
>  
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>  
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>  
> +config RESTRICTEDMEM
> +	bool
> +	depends on TMPFS
> +
>  config ANON_VMA_NAME
>  	bool "Anonymous VMA name support"
>  	depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +	struct mutex lock;
> +	struct file *memfd;
> +	struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +				 pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);
> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		if (notify_start)
> +			notifier->ops->invalidate_start(notifier, start, end);
> +		else
> +			notifier->ops->invalidate_end(notifier, start, end);
> +	}
> +	mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +	fput(data->memfd);
> +	kfree(data);
> +	return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +				    loff_t offset, loff_t len)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +			return -EINVAL;
> +	}
> +
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +	return ret;
> +}

In v8 there was some discussion about potentially passing the page/folio
and order as part of the invalidation callback. I ended up needing
something similar for SEV-SNP, and I think it might make sense for other
platforms. The main reasoning is:

  1) Restoring the kernel direct map:

      Currently SNP (and I believe TDX) needs to either split or remove the
      kernel direct mappings for restricted PFNs, since there is no guarantee
      that other PFNs within a 2MB range won't be used for non-restricted
      memory (which will cause an RMP #PF in the case of SNP since the 2MB
      mapping overlaps with guest-owned pages).

      Previously we were able to restore 2MB mappings to some degree,
      since shared/restricted pages were all pinned, so anything backed
      by a THP (or hugetlb page once that is implemented) at guest
      teardown could be restored as a 2MB direct mapping.

      Invalidation seems like the most logical time to have this happen,
      but whether or not to restore as 2MB requires the order to be 2MB
      or larger, and the GPA range being invalidated to cover the entire
      2MB (otherwise it means the page was potentially split and some
      subpages were already freed back to the host, in which case it
      can't be restored as 2MB).

  2) Potentially fewer invalidations:

      If we pass the entire folio or compound_page as part of the
      invalidation, we only need to issue one invalidation per folio.

  3) Potentially useful for hugetlbfs support:

      One issue with hugetlbfs is that we don't support splitting the
      hugepage in such cases, which was a big obstacle prior to UPM. Now,
      however, we may have the option of doing "lazy" invalidations, where
      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocated page unless
      all the subpages within the 2M range are either hole-punched or the
      guest is shut down, so in that way we never have to split it. Sean
      was pondering something similar in another thread:

       https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/

     Issuing invalidations with folio-granularity ties in fairly well
     with this sort of approach if we end up going that route.
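
As a purely illustrative sketch of one possible shape for such a
folio-aware callback (the op name and signature below are assumptions
for discussion, not taken from the proof-of-concept commit linked
further down):

struct restrictedmem_notifier_ops {
	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
				 pgoff_t start, pgoff_t end);
	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
			       pgoff_t start, pgoff_t end);
	/*
	 * Hypothetical addition: invoked once per backing folio that
	 * intersects [start, end), so the callback can see the folio
	 * order and decide e.g. whether the whole 2MB range is going
	 * away and the 2MB direct mapping can be restored.
	 */
	void (*invalidate_folio)(struct restrictedmem_notifier *notifier,
				 struct folio *folio,
				 pgoff_t start, pgoff_t end);
};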

I need to rework things for v9, and we'll probably want to use struct
folio instead of struct page now, but as a proof-of-concept of sorts this
is what I'd added on top of v8 of your patchset to implement 1) and 2):

  https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b

Does an approach like this seem reasonable? Should we work this into the
base restricted memslot support?

Thanks,

Mike


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-10-28  6:55     ` Chao Peng
@ 2022-11-01  0:02       ` Isaku Yamahata
  2022-11-01 11:38         ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Isaku Yamahata @ 2022-11-01  0:02 UTC (permalink / raw)
  To: Chao Peng
  Cc: Isaku Yamahata, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	Muchun Song, wei.w.wang

On Fri, Oct 28, 2022 at 02:55:45PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> On Wed, Oct 26, 2022 at 02:54:25PM -0700, Isaku Yamahata wrote:
> > On Tue, Oct 25, 2022 at 11:13:43PM +0800,
> > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > 
> > > A memslot with KVM_MEM_PRIVATE being set can include both fd-based
> > > private memory and hva-based shared memory. Architecture code (like TDX
> > > code) can tell whether the on-going fault is private or not. This patch
> > > adds a 'is_private' field to kvm_page_fault to indicate this and
> > > architecture code is expected to set it.
> > > 
> > > To handle page fault for such memslot, the handling logic is different
> > > depending on whether the fault is private or shared. KVM checks if
> > > 'is_private' matches the host's view of the page (maintained in
> > > mem_attr_array).
> > >   - For a successful match, private pfn is obtained with
> > >     restrictedmem_get_page () from private fd and shared pfn is obtained
> > >     with existing get_user_pages().
> > >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > >     userspace. Userspace then can convert memory between private/shared
> > >     in host's view and retry the fault.
> > > 
> > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c          | 56 +++++++++++++++++++++++++++++++--
> > >  arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++-
> > >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> > >  arch/x86/kvm/mmu/spte.h         |  6 ++++
> > >  arch/x86/kvm/mmu/tdp_mmu.c      |  3 +-
> > >  include/linux/kvm_host.h        | 28 +++++++++++++++++
> > >  6 files changed, 103 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 67a9823a8c35..10017a9f26ee 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> > >  
> > >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > > -			      int max_level)
> > > +			      int max_level, bool is_private)
> > >  {
> > >  	struct kvm_lpage_info *linfo;
> > >  	int host_level;
> > > @@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  			break;
> > >  	}
> > >  
> > > +	if (is_private)
> > > +		return max_level;
> > 
> > Below PG_LEVEL_NUM is passed by zap_collapsible_spte_range().  It doesn't make
> > sense.
> > 
> > > +
> > >  	if (max_level == PG_LEVEL_4K)
> > >  		return PG_LEVEL_4K;
> > >  
> > > @@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >  	 * level, which will be used to do precise, accurate accounting.
> > >  	 */
> > >  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > > -						     fault->gfn, fault->max_level);
> > > +						     fault->gfn, fault->max_level,
> > > +						     fault->is_private);
> > >  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > >  		return;
> > >  
> > > @@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > >  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > >  }
> > >  
> > > +static inline u8 order_to_level(int order)
> > > +{
> > > +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > > +
> > > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > > +		return PG_LEVEL_1G;
> > > +
> > > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > > +		return PG_LEVEL_2M;
> > > +
> > > +	return PG_LEVEL_4K;
> > > +}
> > > +
> > > +static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> > > +{
> > > +	int order;
> > > +	struct kvm_memory_slot *slot = fault->slot;
> > > +
> > > +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > > +		return RET_PF_RETRY;
> > > +
> > > +	fault->max_level = min(order_to_level(order), fault->max_level);
> > > +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > > +	return RET_PF_CONTINUE;
> > > +}
> > > +
> > >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  {
> > >  	struct kvm_memory_slot *slot = fault->slot;
> > > @@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  			return RET_PF_EMULATE;
> > >  	}
> > >  
> > > +	if (kvm_slot_can_be_private(slot) &&
> > > +	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > > +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > +		if (fault->is_private)
> > > +			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > > +		else
> > > +			vcpu->run->memory.flags = 0;
> > > +		vcpu->run->memory.padding = 0;
> > > +		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > > +		vcpu->run->memory.size = PAGE_SIZE;
> > > +		return RET_PF_USER;
> > > +	}
> > > +
> > > +	if (fault->is_private)
> > > +		return kvm_faultin_pfn_private(fault);
> > > +
> > >  	async = false;
> > >  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> > >  					  fault->write, &fault->map_writable,
> > > @@ -5557,6 +5603,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > >  			return -EIO;
> > >  	}
> > >  
> > > +	if (r == RET_PF_USER)
> > > +		return 0;
> > > +
> > >  	if (r < 0)
> > >  		return r;
> > >  	if (r != RET_PF_EMULATE)
> > > @@ -6408,7 +6457,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > >  		 */
> > >  		if (sp->role.direct &&
> > >  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > > -							       PG_LEVEL_NUM)) {
> > > +							       PG_LEVEL_NUM,
> > > +							       false)) {
> > >  			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> > >  
> > >  			if (kvm_available_flush_tlb_with_range())
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index 582def531d4d..5cdff5ca546c 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -188,6 +188,7 @@ struct kvm_page_fault {
> > >  
> > >  	/* Derived from mmu and global state.  */
> > >  	const bool is_tdp;
> > > +	const bool is_private;
> > >  	const bool nx_huge_page_workaround_enabled;
> > >  
> > >  	/*
> > > @@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > >   * RET_PF_RETRY: let CPU fault again on the address.
> > >   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > >   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > >   * RET_PF_FIXED: The faulting entry has been fixed.
> > >   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > >   *
> > > @@ -252,6 +254,7 @@ enum {
> > >  	RET_PF_RETRY,
> > >  	RET_PF_EMULATE,
> > >  	RET_PF_INVALID,
> > > +	RET_PF_USER,
> > >  	RET_PF_FIXED,
> > >  	RET_PF_SPURIOUS,
> > >  };
> > > @@ -309,7 +312,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >  
> > >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > > -			      int max_level);
> > > +			      int max_level, bool is_private);
> > >  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > >  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> > >  
> > > @@ -318,4 +321,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >  
> > > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > +{
> > > +	WARN_ON_ONCE(1);
> > > +	return -EOPNOTSUPP;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > +
> > >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > > index ae86820cef69..2d7555381955 100644
> > > --- a/arch/x86/kvm/mmu/mmutrace.h
> > > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > >  TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > >  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > >  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > >  TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > >  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > >  
> > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > index 7670c13ce251..9acdf72537ce 100644
> > > --- a/arch/x86/kvm/mmu/spte.h
> > > +++ b/arch/x86/kvm/mmu/spte.h
> > > @@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
> > >  	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
> > >  }
> > >  
> > > +static inline bool is_private_spte(u64 spte)
> > > +{
> > > +	/* FIXME: Query C-bit/S-bit for SEV/TDX. */
> > > +	return false;
> > > +}
> > > +
> > 
> > PFN encoded in spte doesn't make sense.  In VMM for TDX, private-vs-shared is
> > determined by S-bit of GFN.
> 
> My understanding is that we will have a software bit in the spte, won't
> we? In the current TDX code I see an SPTE_SHARED_MASK bit defined.

I'm afraid you're referring to an old version; that's no longer the case.
For TDX, the gfn needs to be checked, and that isn't encoded in the spte.
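
As a rough sketch of what a gfn-based check could look like (purely
illustrative; kvm_gfn_shared_mask() is assumed here as a helper from the
TDX series and is not part of this patchset):

static inline bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn)
{
	/* For a TDX guest, a gfn with the shared bit clear is private. */
	return !(gfn & kvm_gfn_shared_mask(kvm));
}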


> > >  static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
> > >  				int level)
> > >  {
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 672f0432d777..9f97aac90606 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1768,7 +1768,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > >  			continue;
> > >  
> > >  		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > > -							      iter.gfn, PG_LEVEL_NUM);
> > > +						iter.gfn, PG_LEVEL_NUM,
> > > +						is_private_spte(iter.old_spte));
> > >  		if (max_mapping_level < iter.level)
> > >  			continue;
> > 
> > This is to merge pages into a large page on the next kvm page fault.  large page
> > support is not yet supported.  Let's skip the private slot until large page
> > support is done.
> 
> So your suggestion is to pass in 'false' for 'is_private' at this time?
> Unless we decide not to use the above is_private_spte(), this code does
> not hurt, right? is_private_spte() returns false until we finally get
> the chance to add large page support.

Let's pass false always for now.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-31 17:47   ` Michael Roth
@ 2022-11-01 11:37     ` Chao Peng
  2022-11-01 15:19       ` Michael Roth
  2022-11-02 21:14     ` Kirill A. Shutemov
  1 sibling, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-01 11:37 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through a new in-kernel interface by a third kernel module.
> > 
> > memfd_restricted() is useful for scenarios where a file descriptor (fd)
> > can be used as an interface into mm but userspace's ability to operate on
> > the fd needs to be restricted. Initially it is designed to provide
> > protections for KVM encrypted guest memory.
> > 
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with a special key for a special software domain (e.g. a KVM guest) and is
> > not expected to be directly accessed by userspace. More precisely, userspace
> > access to such encrypted memory may lead to a host crash, so it should be
> > prevented.
> > 
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memory content to KVM userspace.
> > 
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> > 
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When these operations happen, KVM can get notified
> > through restrictedmem_notifier; it then gets a chance to remove any
> > mapped entries for the range in the secondary page tables.
> > 
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable; this is required for the current confidential
> > usage, but in the future this might change.
> > 
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> > 
> > The system call is currently wired up for x86 arch.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h          |  62 ++++++
> >  include/linux/syscalls.h               |   1 +
> >  include/uapi/asm-generic/unistd.h      |   5 +-
> >  include/uapi/linux/magic.h             |   1 +
> >  kernel/sys_ni.c                        |   3 +
> >  mm/Kconfig                             |   4 +
> >  mm/Makefile                            |   1 +
> >  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
> >  10 files changed, 328 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448	i386	process_mrelease	sys_process_mrelease
> >  449	i386	futex_waitv		sys_futex_waitv
> >  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> > +451	i386	memfd_restricted	sys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448	common	process_mrelease	sys_process_mrelease
> >  449	common	futex_waitv		sys_futex_waitv
> >  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> > +451	common	memfd_restricted	sys_memfd_restricted
> >  
> >  #
> >  # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..9c37c3ea3180
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> > +
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> > +
> > +struct restrictedmem_notifier;
> > +
> > +struct restrictedmem_notifier_ops {
> > +	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> > +				 pgoff_t start, pgoff_t end);
> > +	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> > +			       pgoff_t start, pgoff_t end);
> > +};
> > +
> > +struct restrictedmem_notifier {
> > +	struct list_head list;
> > +	const struct restrictedmem_notifier_ops *ops;
> > +};
> > +
> > +#ifdef CONFIG_RESTRICTEDMEM
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +				     struct restrictedmem_notifier *notifier);
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +				       struct restrictedmem_notifier *notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +			   struct page **pagep, int *order);
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> > +}
> > +
> > +#else
> > +
> > +static inline void restrictedmem_register_notifier(struct file *file,
> > +				     struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline void restrictedmem_unregister_notifier(struct file *file,
> > +				       struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +					 struct page **pagep, int *order)
> > +{
> > +	return -1;
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +	return false;
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index a34b0f9a9972..f9e9e0c820c5 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> >  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> >  					    unsigned long home_node,
> >  					    unsigned long flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags);
> >  
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 45fa180cc56a..e93cd35e46d0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> >  #define __NR_set_mempolicy_home_node 450
> >  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> >  
> > +#define __NR_memfd_restricted 451
> > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> > +
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 451
> > +#define __NR_syscalls 452
> >  
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..8aa38324b90a 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
> >  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
> >  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> > +#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
> >  
> >  #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 860b2dcf3ac4..7c4a32cbd2e7 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> >  /* memfd_secret */
> >  COND_SYSCALL(memfd_secret);
> >  
> > +/* memfd_restricted */
> > +COND_SYSCALL(memfd_restricted);
> > +
> >  /*
> >   * Architecture specific weak syscall entries.
> >   */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 0331f1461f81..0177d53676c7 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1076,6 +1076,10 @@ config IO_MAPPING
> >  config SECRETMEM
> >  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> >  
> > +config RESTRICTEDMEM
> > +	bool
> > +	depends on TMPFS
> > +
> >  config ANON_VMA_NAME
> >  	bool "Anonymous VMA name support"
> >  	depends on PROC_FS && ADVISE_SYSCALLS && MMU
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 9a564f836403..6cb6403ffd40 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> >  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >  obj-$(CONFIG_SECRETMEM) += secretmem.o
> > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> >  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> >  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..e5bf8907e0f8
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,250 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> > +	struct mutex lock;
> > +	struct file *memfd;
> > +	struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> > +				 pgoff_t start, pgoff_t end, bool notify_start)
> > +{
> > +	struct restrictedmem_notifier *notifier;
> > +
> > +	mutex_lock(&data->lock);
> > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > +		if (notify_start)
> > +			notifier->ops->invalidate_start(notifier, start, end);
> > +		else
> > +			notifier->ops->invalidate_end(notifier, start, end);
> > +	}
> > +	mutex_unlock(&data->lock);
> > +}
> > +
> > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > +{
> > +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +
> > +	fput(data->memfd);
> > +	kfree(data);
> > +	return 0;
> > +}
> > +
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +				    loff_t offset, loff_t len)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	int ret;
> > +
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +			return -EINVAL;
> > +	}
> > +
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > +	return ret;
> > +}
> 
> In v8 there was some discussion about potentially passing the page/folio
> and order as part of the invalidation callback, I ended up needing
> something similar for SEV-SNP, and think it might make sense for other
> platforms. This main reasoning is:

In that context what we talked about was inaccessible_get_pfn(); I was
not aware there is a need for this in the invalidation callback as well.

> 
>   1) restoring kernel directmap:
> 
>      Currently SNP (and I believe TDX) need to either split or remove kernel
>      direct mappings for restricted PFNs, since there is no guarantee that
>      other PFNs within a 2MB range won't be used for non-restricted
>      (which will cause an RMP #PF in the case of SNP since the 2MB
>      mapping overlaps with guest-owned pages)

Has the splitting and restoring been a well-discussed direction? I'm
just curious whether there are other options to solve this issue.

> 
>      Previously we were able to restore 2MB mappings to some degree
>      since both shared/restricted pages were all pinned, so anything
>      backed by a THP (or hugetlb page once that is implemented) at guest
>      teardown could be restored as 2MB direct mapping.
> 
>      Invalidation seems like the most logical time to have this happen,

Currently invalidation only happens at user-initiated fallocate(). It
does not cover the VM teardown case, where the restoring might also be
expected to happen.

>      but whether or not to restore as 2MB requires the order to be 2MB
>      or larger, and for GPA range being invalidated to cover the entire
>      2MB (otherwise it means the page was potentially split and some
>      subpages free back to host already, in which case it can't be
>      restored as 2MB).
> 
>   2) Potentially less invalidations:
>       
>      If we pass the entire folio or compound_page as part of
>      invalidation, we only needed to issue 1 invalidation per folio.

I'm not sure I agree; the current invalidation covers the whole range
that was passed from userspace, and the invalidation is invoked only once
for each userspace fallocate().

> 
>   3) Potentially useful for hugetlbfs support:
> 
>      One issue with hugetlbfs is that we don't support splitting the
>      hugepage in such cases, which was a big obstacle prior to UPM. Now
>      however, we may have the option of doing "lazy" invalidations where
>      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
>      all the subpages within the 2M range are either hole-punched, or the
>      guest is shut down, so in that way we never have to split it. Sean
>      was pondering something similar in another thread:
> 
>        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> 
>      Issuing invalidations with folio-granularity ties in fairly well
>      with this sort of approach if we end up going that route.

There is semantics difference between the current one and the proposed
one: The invalidation range is exactly what userspace passed down to the
kernel (being fallocated) while the proposed one will be subset of that
(if userspace-provided addr/size is not aligned to power of two), I'm
not quite confident this difference has no side effect.

> 
> I need to rework things for v9, and we'll probably want to use struct
> folio instead of struct page now, but as a proof-of-concept of sorts this
> is what I'd added on top of v8 of your patchset to implement 1) and 2):
> 
>   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> 
> Does an approach like this seem reasonable? Should be work this into the
> base restricted memslot support?

If the above mentioned semantics difference is not a problem, I don't
have strong objection on this.

Sean, since you have much better understanding on this, what is your
take on this?

Chao
> 
> Thanks,
> 
> Mike


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-11-01  0:02       ` Isaku Yamahata
@ 2022-11-01 11:38         ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-01 11:38 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Oct 31, 2022 at 05:02:50PM -0700, Isaku Yamahata wrote:
> On Fri, Oct 28, 2022 at 02:55:45PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > On Wed, Oct 26, 2022 at 02:54:25PM -0700, Isaku Yamahata wrote:
> > > On Tue, Oct 25, 2022 at 11:13:43PM +0800,
> > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > 
> > > > A memslot with KVM_MEM_PRIVATE being set can include both fd-based
> > > > private memory and hva-based shared memory. Architecture code (like TDX
> > > > code) can tell whether the on-going fault is private or not. This patch
> > > > adds a 'is_private' field to kvm_page_fault to indicate this and
> > > > architecture code is expected to set it.
> > > > 
> > > > To handle page fault for such memslot, the handling logic is different
> > > > depending on whether the fault is private or shared. KVM checks if
> > > > 'is_private' matches the host's view of the page (maintained in
> > > > mem_attr_array).
> > > >   - For a successful match, private pfn is obtained with
> > > >     restrictedmem_get_page () from private fd and shared pfn is obtained
> > > >     with existing get_user_pages().
> > > >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > > >     userspace. Userspace then can convert memory between private/shared
> > > >     in host's view and retry the fault.
> > > > 
> > > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c          | 56 +++++++++++++++++++++++++++++++--
> > > >  arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++-
> > > >  arch/x86/kvm/mmu/mmutrace.h     |  1 +
> > > >  arch/x86/kvm/mmu/spte.h         |  6 ++++
> > > >  arch/x86/kvm/mmu/tdp_mmu.c      |  3 +-
> > > >  include/linux/kvm_host.h        | 28 +++++++++++++++++
> > > >  6 files changed, 103 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 67a9823a8c35..10017a9f26ee 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> > > >  
> > > >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > > > -			      int max_level)
> > > > +			      int max_level, bool is_private)
> > > >  {
> > > >  	struct kvm_lpage_info *linfo;
> > > >  	int host_level;
> > > > @@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > >  			break;
> > > >  	}
> > > >  
> > > > +	if (is_private)
> > > > +		return max_level;
> > > 
> > > Below PG_LEVEL_NUM is passed by zap_collapsible_spte_range().  It doesn't make
> > > sense.
> > > 
> > > > +
> > > >  	if (max_level == PG_LEVEL_4K)
> > > >  		return PG_LEVEL_4K;
> > > >  
> > > > @@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > > >  	 * level, which will be used to do precise, accurate accounting.
> > > >  	 */
> > > >  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > > > -						     fault->gfn, fault->max_level);
> > > > +						     fault->gfn, fault->max_level,
> > > > +						     fault->is_private);
> > > >  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > > >  		return;
> > > >  
> > > > @@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > > >  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > > >  }
> > > >  
> > > > +static inline u8 order_to_level(int order)
> > > > +{
> > > > +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > > > +
> > > > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > > > +		return PG_LEVEL_1G;
> > > > +
> > > > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > > > +		return PG_LEVEL_2M;
> > > > +
> > > > +	return PG_LEVEL_4K;
> > > > +}
> > > > +
> > > > +static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> > > > +{
> > > > +	int order;
> > > > +	struct kvm_memory_slot *slot = fault->slot;
> > > > +
> > > > +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > > > +		return RET_PF_RETRY;
> > > > +
> > > > +	fault->max_level = min(order_to_level(order), fault->max_level);
> > > > +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > > > +	return RET_PF_CONTINUE;
> > > > +}
> > > > +
> > > >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > >  {
> > > >  	struct kvm_memory_slot *slot = fault->slot;
> > > > @@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > >  			return RET_PF_EMULATE;
> > > >  	}
> > > >  
> > > > +	if (kvm_slot_can_be_private(slot) &&
> > > > +	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > > > +		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > > +		if (fault->is_private)
> > > > +			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > > > +		else
> > > > +			vcpu->run->memory.flags = 0;
> > > > +		vcpu->run->memory.padding = 0;
> > > > +		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > > > +		vcpu->run->memory.size = PAGE_SIZE;
> > > > +		return RET_PF_USER;
> > > > +	}
> > > > +
> > > > +	if (fault->is_private)
> > > > +		return kvm_faultin_pfn_private(fault);
> > > > +
> > > >  	async = false;
> > > >  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> > > >  					  fault->write, &fault->map_writable,
> > > > @@ -5557,6 +5603,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > > >  			return -EIO;
> > > >  	}
> > > >  
> > > > +	if (r == RET_PF_USER)
> > > > +		return 0;
> > > > +
> > > >  	if (r < 0)
> > > >  		return r;
> > > >  	if (r != RET_PF_EMULATE)
> > > > @@ -6408,7 +6457,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > > >  		 */
> > > >  		if (sp->role.direct &&
> > > >  		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > > > -							       PG_LEVEL_NUM)) {
> > > > +							       PG_LEVEL_NUM,
> > > > +							       false)) {
> > > >  			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> > > >  
> > > >  			if (kvm_available_flush_tlb_with_range())
> > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > > index 582def531d4d..5cdff5ca546c 100644
> > > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > > @@ -188,6 +188,7 @@ struct kvm_page_fault {
> > > >  
> > > >  	/* Derived from mmu and global state.  */
> > > >  	const bool is_tdp;
> > > > +	const bool is_private;
> > > >  	const bool nx_huge_page_workaround_enabled;
> > > >  
> > > >  	/*
> > > > @@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > > >   * RET_PF_RETRY: let CPU fault again on the address.
> > > >   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > > >   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > > > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > > >   * RET_PF_FIXED: The faulting entry has been fixed.
> > > >   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > > >   *
> > > > @@ -252,6 +254,7 @@ enum {
> > > >  	RET_PF_RETRY,
> > > >  	RET_PF_EMULATE,
> > > >  	RET_PF_INVALID,
> > > > +	RET_PF_USER,
> > > >  	RET_PF_FIXED,
> > > >  	RET_PF_SPURIOUS,
> > > >  };
> > > > @@ -309,7 +312,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > >  
> > > >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > >  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> > > > -			      int max_level);
> > > > +			      int max_level, bool is_private);
> > > >  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > > >  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> > > >  
> > > > @@ -318,4 +321,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > >  
> > > > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > > +					gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > > +{
> > > > +	WARN_ON_ONCE(1);
> > > > +	return -EOPNOTSUPP;
> > > > +}
> > > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > > +
> > > >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > > > index ae86820cef69..2d7555381955 100644
> > > > --- a/arch/x86/kvm/mmu/mmutrace.h
> > > > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > > > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > > >  TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > > >  TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > > >  TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > > > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > > >  TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > > >  TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > > >  
> > > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > > index 7670c13ce251..9acdf72537ce 100644
> > > > --- a/arch/x86/kvm/mmu/spte.h
> > > > +++ b/arch/x86/kvm/mmu/spte.h
> > > > @@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
> > > >  	return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
> > > >  }
> > > >  
> > > > +static inline bool is_private_spte(u64 spte)
> > > > +{
> > > > +	/* FIXME: Query C-bit/S-bit for SEV/TDX. */
> > > > +	return false;
> > > > +}
> > > > +
> > > 
> > > PFN encoded in spte doesn't make sense.  In VMM for TDX, private-vs-shared is
> > > determined by S-bit of GFN.
> > 
> > My understanding is we will have a software bit in the spte, won't we? In
> > the current TDX code I see we have the SPTE_SHARED_MASK bit defined.
> 
> I'm afraid you're referring to an old version; that's no longer the case.  For
> TDX, the gfn needs to be checked, and that isn't encoded in the spte.

Okay.

> 
> 
> > > >  static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
> > > >  				int level)
> > > >  {
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > index 672f0432d777..9f97aac90606 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -1768,7 +1768,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > > >  			continue;
> > > >  
> > > >  		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > > > -							      iter.gfn, PG_LEVEL_NUM);
> > > > +						iter.gfn, PG_LEVEL_NUM,
> > > > +						is_private_spte(iter.old_spte));
> > > >  		if (max_mapping_level < iter.level)
> > > >  			continue;
> > > 
> > > This is to merge pages into a large page on the next kvm page fault.  large page
> > > support is not yet supported.  Let's skip the private slot until large page
> > > support is done.
> > 
> > So your suggestion is to pass in 'false' for 'is_private' at this time?
> > Unless we decide not to use the above is_private_spte(), this code does
> > not hurt, right? is_private_spte() returns false until we finally get the
> > chance to add large page support.
> 
> Let's pass false always for now.

Good to me. Thanks.

Chao
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-01 11:37     ` Chao Peng
@ 2022-11-01 15:19       ` Michael Roth
  2022-11-01 19:30         ` Michael Roth
  2022-11-14 14:02         ` Vlastimil Babka
  0 siblings, 2 replies; 101+ messages in thread
From: Michael Roth @ 2022-11-01 15:19 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > +struct restrictedmem_data {
> > > +	struct mutex lock;
> > > +	struct file *memfd;
> > > +	struct list_head notifiers;
> > > +};
> > > +
> > > +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> > > +				 pgoff_t start, pgoff_t end, bool notify_start)
> > > +{
> > > +	struct restrictedmem_notifier *notifier;
> > > +
> > > +	mutex_lock(&data->lock);
> > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > +		if (notify_start)
> > > +			notifier->ops->invalidate_start(notifier, start, end);
> > > +		else
> > > +			notifier->ops->invalidate_end(notifier, start, end);
> > > +	}
> > > +	mutex_unlock(&data->lock);
> > > +}
> > > +
> > > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > > +{
> > > +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> > > +
> > > +	fput(data->memfd);
> > > +	kfree(data);
> > > +	return 0;
> > > +}
> > > +
> > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > +				    loff_t offset, loff_t len)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	int ret;
> > > +
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > +	return ret;
> > > +}
> > 
> > In v8 there was some discussion about potentially passing the page/folio
> > and order as part of the invalidation callback, I ended up needing
> > something similar for SEV-SNP, and think it might make sense for other
> > platforms. This main reasoning is:
> 
> In that context what we talked about was inaccessible_get_pfn(); I was
> not aware there is a need for this in the invalidation callback as well.

Right, your understanding is correct. I think Sean had only mentioned in
passing that it was something we could potentially do, and in the cases I
was looking at it ended up being useful. I only mentioned it so I don't
seem like I'm too far out in the weeds here :)

> 
> > 
> >   1) restoring kernel directmap:
> > 
> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >      direct mappings for restricted PFNs, since there is no guarantee that
> >      other PFNs within a 2MB range won't be used for non-restricted
> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >      mapping overlaps with guest-owned pages)
> 
> Has the splitting and restoring been a well-discussed direction? I'm
> just curious whether there are other options to solve this issue.

For SNP it's been discussed for quite some time, and either splitting or
removing private entries from the directmap are the well-discussed ways I'm
aware of to avoid RMP violations due to some other kernel process using
a 2MB mapping to access shared memory if there are private pages that
happen to be within that range.

In both cases the issue of how to restore directmap as 2M becomes a
problem.
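
As a rough sketch of what I mean (this is not from the patchset, and whether
SNP/TDX would reuse the existing secretmem-style helpers exactly like this is
an assumption on my part):

	#include <linux/set_memory.h>

	/* Drop a restricted page from the kernel direct map when it becomes
	 * guest-private; this splits the covering 2MB mapping if needed. */
	static int restricted_page_unmap_direct(struct page *page)
	{
		return set_direct_map_invalid_noflush(page);
	}

	/* Re-establish a 4K direct mapping when the page is given back to
	 * the host; restoring the 2MB mapping needs extra tracking. */
	static void restricted_page_remap_direct(struct page *page)
	{
		set_direct_map_default_noflush(page);
	}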

I was also under the impression TDX had similar requirements. If so,
do you know what the plan is for handling this for TDX?

There are also 2 potential alternatives I'm aware of, but these haven't
been discussed in much detail AFAIK:

a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
   request 2MB THP pages, but I'm not sure how reliably we can guarantee
   that enough THPs are available, so if we went that route we'd probably
   be better off requiring the use of hugetlbfs as the backing store. But
   obviously that's a bit limiting and it would be nice to have the option
   of using normal pages as well. One nice thing with invalidation
   scheme proposed here is that this would "Just Work" if implement
   hugetlbfs support, so an admin that doesn't want any directmap
   splitting has this option available, otherwise it's done as a
   best-effort.

b) Implement general support for restoring directmap as 2M even when
   subpages might be in use by other kernel threads. This would be the
   most flexible approach since it requires no special handling during
   invalidations, but I think it's only possible if all the CPA
   attributes for the 2M range are the same at the time the mapping is
   restored/unsplit, so some potential locking issues there and still
   chance for splitting directmap over time.

> 
> > 
> >      Previously we were able to restore 2MB mappings to some degree
> >      since both shared/restricted pages were all pinned, so anything
> >      backed by a THP (or hugetlb page once that is implemented) at guest
> >      teardown could be restored as 2MB direct mapping.
> > 
> >      Invalidation seems like the most logical time to have this happen,
> 
> Currently invalidation only happens at user-initiated fallocate(). It
> does not cover the VM teardown case, where the restoring might also be
> expected to happen.

Right, I forgot to add that in my proposed changes I added invalidations
for any still-allocated private pages present when the restricted memfd
notifier is unregistered. This was needed to avoid leaking pages back to
the kernel that still need directmap or RMP table fixups. I also added
similar invalidations for memfd->release(), since it seems possible that
userspace might close() it before shutting down the guest, but maybe the
latter is not needed if KVM takes a reference on the FD for the life of
the guest.
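
A rough sketch of that unregister-time invalidation, based on the v9
structures (invalidating the whole file here, and doing it in unregister
itself, are assumptions about my changes, not what v9 does today):

	void restrictedmem_unregister_notifier(struct file *file,
					       struct restrictedmem_notifier *notifier)
	{
		struct restrictedmem_data *data = file->f_mapping->private_data;

		/* Final invalidation over the whole file so arch code can do
		 * RMP/directmap fixups before pages go back to the host. */
		notifier->ops->invalidate_start(notifier, 0, ULONG_MAX);
		notifier->ops->invalidate_end(notifier, 0, ULONG_MAX);

		mutex_lock(&data->lock);
		list_del(&notifier->list);
		mutex_unlock(&data->lock);
	}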

> 
> >      but whether or not to restore as 2MB requires the order to be 2MB
> >      or larger, and for GPA range being invalidated to cover the entire
> >      2MB (otherwise it means the page was potentially split and some
> >      subpages free back to host already, in which case it can't be
> >      restored as 2MB).
> > 
> >   2) Potentially less invalidations:
> >       
> >      If we pass the entire folio or compound_page as part of
> >      invalidation, we only needed to issue 1 invalidation per folio.
> 
> I'm not sure I agree; the current invalidation covers the whole range
> that was passed from userspace, and the invalidation is invoked only once
> for each userspace fallocate().

That's true, it only reduces invalidations if we decide to provide a
struct page/folio as part of the invalidation callbacks, which isn't
the case yet. Sorry for the confusion.

> 
> > 
> >   3) Potentially useful for hugetlbfs support:
> > 
> >      One issue with hugetlbfs is that we don't support splitting the
> >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> >      however, we may have the option of doing "lazy" invalidations where
> >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> >      all the subpages within the 2M range are either hole-punched, or the
> >      guest is shut down, so in that way we never have to split it. Sean
> >      was pondering something similar in another thread:
> > 
> >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > 
> >      Issuing invalidations with folio-granularity ties in fairly well
> >      with this sort of approach if we end up going that route.
> 
> There is semantics difference between the current one and the proposed
> one: The invalidation range is exactly what userspace passed down to the
> kernel (being fallocated) while the proposed one will be subset of that
> (if userspace-provided addr/size is not aligned to power of two), I'm
> not quite confident this difference has no side effect.

In theory userspace should not be allocating/hole-punching restricted
pages for GPA ranges that are already mapped as private in the xarray,
and KVM could potentially fail such requests (though it doesn't currently).

But if we somehow enforced that, then we could rely on
KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
which would free up the restricted fd invalidation callbacks to be used
purely to handle doing things like RMP/directmap fixups prior to returning
restricted pages back to the host. So that was sort of my thinking why the
new semantics would still cover all the necessary cases.

-Mike

> 
> > 
> > I need to rework things for v9, and we'll probably want to use struct
> > folio instead of struct page now, but as a proof-of-concept of sorts this
> > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > 
> >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > 
> > Does an approach like this seem reasonable? Should be work this into the
> > base restricted memslot support?
> 
> If the above mentioned semantics difference is not a problem, I don't
> have strong objection on this.
> 
> Sean, since you have much better understanding on this, what is your
> take on this?
> 
> Chao
> > 
> > Thanks,
> > 
> > Mike


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-01 15:19       ` Michael Roth
@ 2022-11-01 19:30         ` Michael Roth
  2022-11-02 14:53           ` Chao Peng
  2022-11-14 14:02         ` Vlastimil Babka
  1 sibling, 1 reply; 101+ messages in thread
From: Michael Roth @ 2022-11-01 19:30 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > 
> > > 
> > >   3) Potentially useful for hugetlbfs support:
> > > 
> > >      One issue with hugetlbfs is that we don't support splitting the
> > >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> > >      however, we may have the option of doing "lazy" invalidations where
> > >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> > >      all the subpages within the 2M range are either hole-punched, or the
> > >      guest is shut down, so in that way we never have to split it. Sean
> > >      was pondering something similar in another thread:
> > > 
> > >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > > 
> > >      Issuing invalidations with folio-granularity ties in fairly well
> > >      with this sort of approach if we end up going that route.
> > 
> > There is semantics difference between the current one and the proposed
> > one: The invalidation range is exactly what userspace passed down to the
> > kernel (being fallocated) while the proposed one will be subset of that
> > (if userspace-provided addr/size is not aligned to power of two), I'm
> > not quite confident this difference has no side effect.
> 
> In theory userspace should not be allocating/hole-punching restricted
> pages for GPA ranges that are already mapped as private in the xarray,
> and KVM could potentially fail such requests (though it doesn't currently).
> 
> But if we somehow enforced that, then we could rely on
> KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> which would free up the restricted fd invalidation callbacks to be used
> purely to handle doing things like RMP/directmap fixups prior to returning
> restricted pages back to the host. So that was sort of my thinking why the
> new semantics would still cover all the necessary cases.

Sorry, this explanation is if we rely on userspace to fallocate() on 2MB
boundaries, and ignore any non-aligned requests in the kernel. But
that's not how I actually ended up implementing things, so I'm not sure
why I answered that way...

In my implementation we actually do issue invalidations for fallocate()
even for non-2M-aligned GPA/offset ranges. For instance (assuming
restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
range 0x1000-0x402000 would result in the following invalidations being
issued if everything was backed by a 2MB page:

  invalidate GPA: 0x001000-0x200000, Page: pfn_to_page(I), order:9
  invalidate GPA: 0x200000-0x400000, Page: pfn_to_page(J), order:9
  invalidate GPA: 0x400000-0x402000, Page: pfn_to_page(K), order:9

So you still cover the same range, but the arch/platform callbacks can
then, as a best effort, do things like restore 2M directmap if they see
that the backing page is 2MB+ and the GPA range covers the entire range.
If the GPA range doesn't cover the whole 2MB, or the backing page is
order:0, then in that case we are still forced to leave the directmap
split.
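
In pseudo-C, that best-effort check in the arch callback is something
roughly like the below (a sketch, not the actual code; the
snp_restore_direct_map_2m()/_4k() names are placeholders for the real
RMP/directmap fixup code):

	static void snp_invalidate_range(gfn_t start, gfn_t end,
					 struct page *page, int order)
	{
		unsigned long npages_2m = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);
		gfn_t aligned_start = ALIGN_DOWN(start, npages_2m);

		if (order >= 9 && start == aligned_start &&
		    end >= aligned_start + npages_2m) {
			/* backing page is 2MB+ and the GPA range covers the
			 * whole 2MB: safe to restore a 2MB direct mapping */
			snp_restore_direct_map_2m(page);
		} else {
			/* partial range or order-0 backing page: restore only
			 * the invalidated subrange, directmap stays split */
			snp_restore_direct_map_4k(start, end);
		}
	}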

But with that in place we can then improve on that by allowing for the
use of hugetlbfs.

We'd still be somewhat reliant on userspace to issue fallocate()'s on
2M-aligned boundaries to some degree (guest teardown invalidations
could be issued as 2M-aligned, which would be the bulk of the pages
in most cases, but for discarding pages after private->shared
conversion we could still get fragmentation). This could maybe be
addressed by keeping track of those partial/non-2M-aligned fallocate()
requests and then issuing them as a batched 2M invalidation once all
the subpages have been fallocate(HOLE_PUNCH)'d. We'd need to enforce
that fallocate(PUNCH_HOLE) is preceded by
KVM_MEMORY_ENCRYPT_UNREG_REGION to make sure MMU invalidations happen
though.
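
Not thought through in detail, but the tracking could be as simple as a
per-2M-chunk bitmap of punched subpages (all names below are made up for
illustration):

	/* One bit per 4K subpage within a 2M chunk of the restricted file. */
	struct rmem_2m_chunk {
		DECLARE_BITMAP(punched, 512);
	};

	/* Called for each 4K fallocate(PUNCH_HOLE); once every subpage of
	 * the chunk has been punched, issue one batched 2M invalidation. */
	static void rmem_track_punch(struct rmem_2m_chunk *chunk,
				     pgoff_t offset_in_chunk)
	{
		set_bit(offset_in_chunk, chunk->punched);
		if (bitmap_full(chunk->punched, 512))
			rmem_invalidate_2m_chunk(chunk);	/* hypothetical */
	}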

Not sure on these potential follow-ups, but they all at least seem
compatible with the proposed invalidation scheme.

-Mike

> 
> -Mike
> 
> > 
> > > 
> > > I need to rework things for v9, and we'll probably want to use struct
> > > folio instead of struct page now, but as a proof-of-concept of sorts this
> > > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > > 
> > >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > > 
> > > Does an approach like this seem reasonable? Should be work this into the
> > > base restricted memslot support?
> > 
> > If the above mentioned semantics difference is not a problem, I don't
> > have strong objection on this.
> > 
> > Sean, since you have much better understanding on this, what is your
> > take on this?
> > 
> > Chao
> > > 
> > > Thanks,
> > > 
> > > Mike


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-01 19:30         ` Michael Roth
@ 2022-11-02 14:53           ` Chao Peng
  2022-11-02 21:19             ` Michael Roth
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-02 14:53 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Nov 01, 2022 at 02:30:58PM -0500, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > 
> > > > 
> > > >   3) Potentially useful for hugetlbfs support:
> > > > 
> > > >      One issue with hugetlbfs is that we don't support splitting the
> > > >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> > > >      however, we may have the option of doing "lazy" invalidations where
> > > >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> > > >      all the subpages within the 2M range are either hole-punched, or the
> > > >      guest is shut down, so in that way we never have to split it. Sean
> > > >      was pondering something similar in another thread:
> > > > 
> > > >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > > > 
> > > >      Issuing invalidations with folio-granularity ties in fairly well
> > > >      with this sort of approach if we end up going that route.
> > > 
> > > There is semantics difference between the current one and the proposed
> > > one: The invalidation range is exactly what userspace passed down to the
> > > kernel (being fallocated) while the proposed one will be subset of that
> > > (if userspace-provided addr/size is not aligned to power of two), I'm
> > > not quite confident this difference has no side effect.
> > 
> > In theory userspace should not be allocating/hole-punching restricted
> > pages for GPA ranges that are already mapped as private in the xarray,
> > and KVM could potentially fail such requests (though it doesn't currently).
> > 
> > But if we somehow enforced that, then we could rely on
> > KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> > which would free up the restricted fd invalidation callbacks to be used
> > purely to handle doing things like RMP/directmap fixups prior to returning
> > restricted pages back to the host. So that was sort of my thinking why the
> > new semantics would still cover all the necessary cases.
> 
> Sorry, this explanation is if we rely on userspace to fallocate() on 2MB
> boundaries, and ignore any non-aligned requests in the kernel. But
> that's not how I actually ended up implementing things, so I'm not sure
> why answered that way...
> 
> In my implementation we actually do issue invalidations for fallocate()
> even for non-2M-aligned GPA/offset ranges. For instance (assuming
> restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
> range 0x1000-0x402000 would result in the following invalidations being
> issued if everything was backed by a 2MB page:
> 
>   invalidate GPA: 0x001000-0x200000, Page: pfn_to_page(I), order:9
>   invalidate GPA: 0x200000-0x400000, Page: pfn_to_page(J), order:9
>   invalidate GPA: 0x400000-0x402000, Page: pfn_to_page(K), order:9

Only after seeing this do I understand what you are actually proposing ;)

So the memory range (start/end) will still be there and cover exactly
what it should from the userspace point of view; the page+order (or just
folio) is really just a _hint_ for the invalidation callbacks. Looks a
bit ugly though.

In v9 we use an invalidate_start/invalidate_end pair to solve a race
contention issue (https://lore.kernel.org/kvm/Y1LOe4JvnTbFNs4u@google.com/).
To work with this, I believe we only need to pass this hint info to
invalidate_start(), since by invalidate_end() time the page has
already been discarded.
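
Just to make sure we mean the same thing, something like the below is what
I have in mind (the extra folio argument is the proposed hint, not what the
current patch defines; it could be NULL when no page is resident):

	struct restrictedmem_notifier_ops {
		void (*invalidate_start)(struct restrictedmem_notifier *notifier,
					 pgoff_t start, pgoff_t end,
					 struct folio *folio);	/* hint only */
		void (*invalidate_end)(struct restrictedmem_notifier *notifier,
				       pgoff_t start, pgoff_t end);
	};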

Another thing worth mentioning is that invalidate_start/end is not just
invoked for hole punching, but also for allocation (e.g. a default
fallocate()), and for allocation we can get the page only at
invalidate_end() time. But AFAICS, the reason invalidate() is called for
fallocate(allocation) is that previously we relied on the existence of a
page in the memory backing store to tell whether it is private, and we
needed to notify KVM that the page was being converted from shared to
private. That is no longer true for the current code, and fallocate() is
also not mandatory since KVM can call restrictedmem_get_page() to
allocate dynamically, so I think we can remove the invalidation path for
fallocate(allocation).

> 
> So you still cover the same range, but the arch/platform callbacks can
> then, as a best effort, do things like restore 2M directmap if they see
> that the backing page is 2MB+ and the GPA range covers the entire range.
> If the GPA doesn't covers the whole range, or the backing page is
> order:0, then in that case we are still forced to leave the directmap
> split.
> 
> But with that in place we can then improve on that by allowing for the
> use of hugetlbfs.
> 
> We'd still be somewhat reliant on userspace to issue fallocate()'s on
> 2M-aligned boundaries to some degree (guest teardown invalidations
> could be issued as 2M-aligned, which would be the bulk of the pages
> in most cases, but for discarding pages after private->shared
> conversion we could still get fragmentation). This could maybe be
> addressed by keeping track of those partial/non-2M-aligned fallocate()
> requests and then issuing them as a batched 2M invalidation once all
> the subpages have been fallocate(HOLE_PUNCH)'d. We'd need to enforce
> that fallocate(PUNCH_HOLE) is preceeded by
> KVM_MEMORY_ENCRYPT_UNREG_REGION to make sure MMU invalidations happen
> though.

I don't understand why the sequence matters here; we should do MMU
invalidation for both fallocate(PUNCH_HOLE) and
KVM_MEMORY_ENCRYPT_UNREG_REGION, right?

Thanks,
Chao
> 
> Not sure on these potential follow-ups, but they all at least seem
> compatible with the proposed invalidation scheme.
> 
> -Mike
> 
> > 
> > -Mike
> > 
> > > 
> > > > 
> > > > I need to rework things for v9, and we'll probably want to use struct
> > > > folio instead of struct page now, but as a proof-of-concept of sorts this
> > > > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > > > 
> > > >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > > > 
> > > > Does an approach like this seem reasonable? Should be work this into the
> > > > base restricted memslot support?
> > > 
> > > If the above mentioned semantics difference is not a problem, I don't
> > > have strong objection on this.
> > > 
> > > Sean, since you have much better understanding on this, what is your
> > > take on this?
> > > 
> > > Chao
> > > > 
> > > > Thanks,
> > > > 
> > > > Mike


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-31 17:47   ` Michael Roth
  2022-11-01 11:37     ` Chao Peng
@ 2022-11-02 21:14     ` Kirill A. Shutemov
  2022-11-02 21:26       ` Michael Roth
  2022-11-02 22:07       ` Michael Roth
  1 sibling, 2 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2022-11-02 21:14 UTC (permalink / raw)
  To: Michael Roth
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> 
> In v8 there was some discussion about potentially passing the page/folio
> and order as part of the invalidation callback, I ended up needing
> something similar for SEV-SNP, and think it might make sense for other
> platforms. This main reasoning is:
> 
>   1) restoring kernel directmap:
> 
>      Currently SNP (and I believe TDX) need to either split or remove kernel
>      direct mappings for restricted PFNs, since there is no guarantee that
>      other PFNs within a 2MB range won't be used for non-restricted
>      (which will cause an RMP #PF in the case of SNP since the 2MB
>      mapping overlaps with guest-owned pages)

That's news to me. Where does the restriction for SNP come from? There's no
such limitation on the TDX side AFAIK.

Could you point me to the relevant documentation, if there is any?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-02 14:53           ` Chao Peng
@ 2022-11-02 21:19             ` Michael Roth
  0 siblings, 0 replies; 101+ messages in thread
From: Michael Roth @ 2022-11-02 21:19 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Wed, Nov 02, 2022 at 10:53:25PM +0800, Chao Peng wrote:
> On Tue, Nov 01, 2022 at 02:30:58PM -0500, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
> > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > > 
> > > > > 
> > > > >   3) Potentially useful for hugetlbfs support:
> > > > > 
> > > > >      One issue with hugetlbfs is that we don't support splitting the
> > > > >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> > > > >      however, we may have the option of doing "lazy" invalidations where
> > > > >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocated page unless
> > > > >      all the subpages within the 2M range are either hole-punched, or the
> > > > >      guest is shut down, so in that way we never have to split it. Sean
> > > > >      was pondering something similar in another thread:
> > > > > 
> > > > >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > > > > 
> > > > >      Issuing invalidations with folio-granularity ties in fairly well
> > > > >      with this sort of approach if we end up going that route.
> > > > 
> > > > There is a semantics difference between the current one and the proposed
> > > > one: the invalidation range is exactly what userspace passed down to the
> > > > kernel (being fallocated), while the proposed one will be a subset of that
> > > > (if the userspace-provided addr/size is not aligned to a power of two). I'm
> > > > not quite confident this difference has no side effect.
> > > 
> > > In theory userspace should not be allocating/hole-punching restricted
> > > pages for GPA ranges that are already mapped as private in the xarray,
> > > and KVM could potentially fail such requests (though it doesn't currently).
> > > 
> > > But if we somehow enforced that, then we could rely on
> > > KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> > > which would free up the restricted fd invalidation callbacks to be used
> > > purely to handle doing things like RMP/directmap fixups prior to returning
> > > restricted pages back to the host. So that was sort of my thinking why the
> > > new semantics would still cover all the necessary cases.
> > 
> > Sorry, this explanation is if we rely on userspace to fallocate() on 2MB
> > boundaries, and ignore any non-aligned requests in the kernel. But
> > that's not how I actually ended up implementing things, so I'm not sure
> > why I answered that way...
> > 
> > In my implementation we actually do issue invalidations for fallocate()
> > even for non-2M-aligned GPA/offset ranges. For instance (assuming
> > restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
> > range 0x1000-0x402000 would result in the following invalidations being
> > issued if everything was backed by a 2MB page:
> > 
> >   invalidate GPA: 0x001000-0x200000, Page: pfn_to_page(I), order:9
> >   invalidate GPA: 0x200000-0x400000, Page: pfn_to_page(J), order:9
> >   invalidate GPA: 0x400000-0x402000, Page: pfn_to_page(K), order:9
> 
> Only after seeing this do I understand what you are actually going to propose ;)
> 
> So the memory range (start/end) will still be there and cover exactly
> what it should from userspace's point of view; the page+order (or just
> folio) is really just a _hint_ for the invalidation callbacks. Looks
> ugly though.

Yes that's accurate: callbacks still need to handle partial ranges, so
it's more of a hint/optimization for cases where callbacks can benefit
from knowing the entire backing hugepage is being invalidated/freed.

> 
> In v9 we use an invalidate_start/invalidate_end pair to solve a race
> contention issue (https://lore.kernel.org/kvm/Y1LOe4JvnTbFNs4u@google.com/).
> To work with this, I believe we only need to pass this hint info for
> invalidate_start(), since at invalidate_end() time the page has
> already been discarded.

Ok, yah, that's the approach I'm looking at for v9: pass the page/order
for invalidate_start, but keep invalidate_end as-is.
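
Roughly, the shape I have in mind is below (a sketch only: every name here
is illustrative rather than actual v9 code; the start/end range stays
authoritative and the page/order pair is purely a hint):

  struct restrictedmem_notifier_ops {
	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
				 pgoff_t start, pgoff_t end,
				 struct page *page, unsigned int order);
	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
			       pgoff_t start, pgoff_t end);
  };

  static void snp_invalidate_start(struct restrictedmem_notifier *notifier,
				   pgoff_t start, pgoff_t end,
				   struct page *page, unsigned int order)
  {
	const unsigned int pmd_order = PMD_SHIFT - PAGE_SHIFT;

	/* Always act on the exact range being punched (hypothetical helper). */
	snp_cleanup_restricted_range(notifier, start, end);

	/*
	 * Best effort: only restore the 2M directmap mapping when the hint
	 * says the entire backing hugepage is going away (a real version
	 * would also check alignment).
	 */
	if (page && order >= pmd_order && end - start >= (1UL << order))
		snp_restore_directmap_2m(page_to_pfn(page));
  }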

> 
> Another thing worth mentioning is that invalidate_start/end is not just
> invoked for hole punching, but also for allocation (e.g. default
> fallocate), while for allocation we can get the page only at
> invalidate_end() time. But AFAICS, the reason invalidate() is called for
> fallocate(allocation) is that previously we relied on existence in the
> memory backing store to tell whether a page is private and needed to
> notify KVM that the page was being converted from shared to private.
> That is no longer true for the current code, and fallocate() is also not
> mandatory since KVM can call restrictedmem_get_page() to allocate
> dynamically, so I think we can remove the invalidation path for
> fallocate(allocation).

I actually ended up doing that for the v8 implementation. I figured it
was a holdover from before {REG,UNREG}_REGION were used, but wasn't too
sure about that, so it's good to have some confirmation there.

> 
> > 
> > So you still cover the same range, but the arch/platform callbacks can
> > then, as a best effort, do things like restore the 2M directmap if they see
> > that the backing page is 2MB+ and the GPA range covers the entire range.
> > If the GPA range doesn't cover the whole range, or the backing page is
> > order:0, then we are still forced to leave the directmap
> > split.
> > 
> > But with that in place we can then improve on that by allowing for the
> > use of hugetlbfs.
> > 
> > We'd still be somewhat reliant on userspace to issue fallocate()'s on
> > 2M-aligned boundaries to some degree (guest teardown invalidations
> > could be issued as 2M-aligned, which would be the bulk of the pages
> > in most cases, but for discarding pages after private->shared
> > conversion we could still get fragmentation). This could maybe be
> > addressed by keeping track of those partial/non-2M-aligned fallocate()
> > requests and then issuing them as a batched 2M invalidation once all
> > the subpages have been fallocate(HOLE_PUNCH)'d. We'd need to enforce
> > that fallocate(PUNCH_HOLE) is preceded by
> > KVM_MEMORY_ENCRYPT_UNREG_REGION to make sure MMU invalidations happen
> > though.
> 
> I don't understand why the sequence matters here; we should do MMU
> invalidation for both fallocate(PUNCH_HOLE) and
> KVM_MEMORY_ENCRYPT_UNREG_REGION, right?

It should happen in both places as long as it's possible for userspace
to fallocate(PUNCH_HOLE) a private page while it is mapped to a guest.
I'm not necessarily suggesting we should make any changes there right
now, but...

We might need to consider changing that if we decide we don't want to
allow userspace to basically force splitting the directmap by always
issuing fallocate(PUNCH_HOLE) for each 4K page rather than trying to do
it in 2M intervals when possible: that would still result in 4K
invalidations being issued, so optimizations like restoring the
2M directmap can't be done, even when backed by THPs or hugetlbfs pages.
One approach to deal with this is to introduce a bitmap (for instance) to
track which subpages have been fallocate(PUNCH_HOLE)'d, and defer the
actual freeing/invalidation until a whole page has been marked for
deallocation. This would keep huge pages/huge invalidations intact even
if userspace is malicious/non-optimal and actively trying to slow the
host down by forcing the directmap to get split.
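
Purely as a sketch of that bitmap idea (every structure and helper name
below is invented for illustration): track punched 4K subpages in a
per-hugepage bitmap and only do the real invalidate + free once the whole
2M page has been punched.

  #define SUBPAGES_PER_2M	512	/* 4K subpages per 2M hugepage */

  struct deferred_punch {
	pgoff_t base;			/* 2M-aligned starting offset */
	DECLARE_BITMAP(punched, SUBPAGES_PER_2M);
  };

  /* Called for each fallocate(PUNCH_HOLE) sub-range within [base, base + 512). */
  static void punch_subrange(struct deferred_punch *dp, pgoff_t start, pgoff_t end)
  {
	pgoff_t i;

	for (i = start; i < end; i++)
		set_bit(i - dp->base, dp->punched);

	/* Defer invalidation and the actual free until the whole 2M page is gone. */
	if (bitmap_full(dp->punched, SUBPAGES_PER_2M))
		restrictedmem_free_2m(dp->base);	/* hypothetical helper */
  }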

*If* we took that approach though, then the MMU invalidations would also
get deferred, which is bad. But if we added a check/callback that
restrictedmem.c could use to confirm that the page is already in a
shared/non-private state, then we'd know the MMU invalidation for the
private page must have already happened, so if anything got faulted in
afterward it should be a shared page. (Though I guess update_mem_attr
would also need to check this bitmap and fail for cases where a
shared->private conversion is issued for a page/range that's been
recorded as having a deferred PUNCH_HOLE pending.)

But that's only an optimization and probably needs a lot more thought,
not necessarily something I think we need to implement now.

-Mike

> 
> Thanks,
> Chao
> > 
> > Not sure on these potential follow-ups, but they all at least seem
> > compatible with the proposed invalidation scheme.
> > 
> > -Mike
> > 
> > > 
> > > -Mike
> > > 
> > > > 
> > > > > 
> > > > > I need to rework things for v9, and we'll probably want to use struct
> > > > > folio instead of struct page now, but as a proof-of-concept of sorts this
> > > > > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > > > > 
> > > > >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > > > > 
> > > > > Does an approach like this seem reasonable? Should we work this into the
> > > > > base restricted memslot support?
> > > > 
> > > > If the above-mentioned semantics difference is not a problem, I don't
> > > > have a strong objection to this.
> > > > 
> > > > Sean, since you have much better understanding on this, what is your
> > > > take on this?
> > > > 
> > > > Chao
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Mike


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-02 21:14     ` Kirill A. Shutemov
@ 2022-11-02 21:26       ` Michael Roth
  2022-11-02 22:07       ` Michael Roth
  1 sibling, 0 replies; 101+ messages in thread
From: Michael Roth @ 2022-11-02 21:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Thu, Nov 03, 2022 at 12:14:04AM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > 
> > In v8 there was some discussion about potentially passing the page/folio
> > and order as part of the invalidation callback. I ended up needing
> > something similar for SEV-SNP, and think it might make sense for other
> > platforms. The main reasoning is:
> > 
> >   1) restoring kernel directmap:
> > 
> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >      direct mappings for restricted PFNs, since there is no guarantee that
> >      other PFNs within a 2MB range won't be used for non-restricted memory
> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >      mapping overlaps with guest-owned pages)
> 
> That's news to me. Where does the restriction for SNP come from? There's
> no such limitation on the TDX side AFAIK?
> 
> Could you point me to relevant documentation if there's any?

I could be mistaken; I haven't looked into the specific documentation and
was going off of this discussion from a ways back:

  https://lore.kernel.org/all/YWb8WG6Ravbs1nbx@google.com/

Sean, is my read of that correct? Do you happen to know where there's
some documentation on that for the TDX side?

Thanks,

Mike

> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-02 21:14     ` Kirill A. Shutemov
  2022-11-02 21:26       ` Michael Roth
@ 2022-11-02 22:07       ` Michael Roth
  2022-11-03 16:30         ` Kirill A. Shutemov
  1 sibling, 1 reply; 101+ messages in thread
From: Michael Roth @ 2022-11-02 22:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Thu, Nov 03, 2022 at 12:14:04AM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > 
> > In v8 there was some discussion about potentially passing the page/folio
> > and order as part of the invalidation callback. I ended up needing
> > something similar for SEV-SNP, and think it might make sense for other
> > platforms. The main reasoning is:
> > 
> >   1) restoring kernel directmap:
> > 
> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >      direct mappings for restricted PFNs, since there is no guarantee that
> >      other PFNs within a 2MB range won't be used for non-restricted memory
> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >      mapping overlaps with guest-owned pages)
> 
> That's news to me. Where does the restriction for SNP come from?

Sorry, missed your first question.

For SNP at least, the restriction is documented in APM Volume 2, Section
15.36.10, first row of Table 15-36 (the preceding paragraph has more
context). I forgot to mention this only pertains to writes by the
host to 2MB pages that contain guest-owned subpages; for reads it's
not an issue, but I think the implementation requirements end up being
the same either way:

  https://www.amd.com/system/files/TechDocs/24593.pdf

-Mike

> That's news to me. Where does the restriction for SNP come from? There's
> no such limitation on the TDX side AFAIK?
> 
> Could you point me to relevant documentation if there's any?
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (7 preceding siblings ...)
  2022-10-25 15:13 ` [PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-11-03 12:13 ` Vishal Annapurve
  2022-11-08  0:41   ` Isaku Yamahata
  2022-11-14 11:43 ` Alex Bennée
  9 siblings, 1 reply; 101+ messages in thread
From: Vishal Annapurve @ 2022-11-03 12:13 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
>
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
>   - 01: mm change, target for mm tree
>   - 02-08: KVM change, target for KVM tree
>
> Given KVM is the only current user for the mm part, I have chatted with
> Paolo and he is OK to merge the mm change through KVM tree, but
> reviewed-by/acked-by is still expected from the mm people.
>
> The patches have been verified in Intel TDX environment, but Vishal has
> done an excellent work on the selftests[4] which are dedicated for this
> series, making it possible to test this series without innovative
> hardware and fancy steps of building a VM environment. See Test section
> below for more info.
>
>
> Introduction
> ============
> KVM userspace being able to crash the host is horrible. Under current
> KVM architecture, all guest memory is inherently accessible from KVM
> userspace and is exposed to the mentioned crash issue. The goal of this
> series is to provide a solution to align mm and KVM, on a userspace
> inaccessible approach of exposing guest memory.
>
> Normally, KVM populates secondary page table (e.g. EPT) by using a host
> virtual address (hva) from core mm page table (e.g. x86 userspace page
> table). This requires guest memory being mmaped into KVM userspace, but
> this is also the source where the mentioned crash issue can happen. In
> theory, apart from those 'shared' memory for device emulation etc, guest
> memory doesn't have to be mmaped into KVM userspace.
>
> This series introduces fd-based guest memory which will not be mmaped
> into KVM userspace. KVM populates secondary page table by using a

With no mappings in place for the userspace VMM, IIUC, it looks like the
host kernel will not be able to find the culprit userspace process in
case of a machine check error on guest private memory. As implemented in
hwpoison_user_mappings(), the host kernel tries to look at the processes
which have mapped the pfns with the hardware error.

Is a modification needed in the MCE handling logic of the host kernel to
immediately send a signal to the vcpu thread accessing the faulting pfn
backing guest private memory?


> fd/offset pair backed by a memory file system. The fd can be created
> from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> directly interact with them with newly introduced in-kernel interface,
> therefore remove the KVM userspace from the path of accessing/mmaping
> the guest memory.
>
> Kirill had a patch [2] to address the same issue in a different way. It
> tracks guest encrypted memory at the 'struct page' level and relies on
> HWPOISON to reject the userspace access. The patch has been discussed in
> several online and offline threads and resulted in a design document [3]
> which is also the original proposal for this series. Later this patch
> series evolved as more comments received in community but the major
> concepts in [3] still hold true so recommend reading.
>
> The patch series may also be useful for other usages, for example, pure
> software approach may use it to harden itself against unintentional
> access to guest memory. This series is designed with these usages in
> mind but doesn't have code directly support them and extension might be
> needed.
>
>
> mm change
> =========
> Introduces a new memfd_restricted system call which can create memory
> file that is restricted from userspace access via normal MMU operations
> like read(), write() or mmap() etc and the only way to use it is
> passing it to a third kernel module like KVM and relying on it to
> access the fd through the newly added restrictedmem kernel interface.
> The restrictedmem interface bridges the memory file subsystems
> (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> bi-directional communication between them.
>
>
> KVM change
> ==========
> Extends the KVM memslot to provide guest private (encrypted) memory from
> a fd. With this extension, a single memslot can maintain both private
> memory through private fd (restricted_fd/restricted_offset) and shared
> (unencrypted) memory through userspace mmaped host virtual address
> (userspace_addr). For a particular guest page, the corresponding page in
> KVM memslot can be only either private or shared and only one of the
> shared/private parts of the memslot is visible to guest. For how this
> new extension is used in QEMU, please refer to kvm_set_phys_mem() in
> below TDX-enabled QEMU repo.
>
> Introduces new KVM_EXIT_MEMORY_FAULT exit to allow userspace to get the
> chance on decision-making for shared <-> private memory conversion. The
> exit can be an implicit conversion in KVM page fault handler or an
> explicit conversion from guest OS.
>
> Extends existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
> convert a guest page between private <-> shared. The data maintained in
> these ioctls tells the truth whether a guest page is private or shared
> and this information will be used in KVM page fault handler to decide
> whether the private or the shared part of the memslot is visible to
> guest.
>
>
> Test
> ====
> Ran two kinds of tests:
>   - Selftests [4] from Vishal and VM boot tests in non-TDX environment
>     Code also in below repo: https://github.com/chao-p/linux/tree/privmem-v9
>
>   - Functional tests in TDX capable environment
>     Tested the new functionalities in TDX environment. Code repos:
>     Linux: https://github.com/chao-p/linux/tree/privmem-v9-tdx
>     QEMU: https://github.com/chao-p/qemu/tree/privmem-v9
>
>     An example QEMU command line for TDX test:
>     -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
>     -machine confidential-guest-support=tdx \
>     -object memory-backend-memfd-private,id=ram1,size=${mem} \
>     -machine memory-backend=ram1
>
>
> TODO
> ====
>   - Page accounting and limiting for encrypted memory
>   - hugetlbfs support
>
>
> Changelog
> =========
> v9:
>   - mm: move inaccessible memfd into separated syscall.
>   - mm: return page instead of pfn_t for inaccessible_get_pfn and remove
>     inaccessible_put_pfn.
>   - KVM: rename inaccessible/private to restricted and CONFIG change to
>     make the code friendly to pKVM.
>   - KVM: add invalidate_begin/end pair to fix race contention and revise
>     the lock protection for invalidation path.
>   - KVM: optimize setting lpage_info for > 2M level by direct accessing
>     lower level's result.
>   - KVM: avoid load xarray in kvm_mmu_max_mapping_level() and instead let
>     the caller to pass in is_private.
>   - KVM: API doc improvement.
> v8:
>   - mm: redesign mm part by introducing a shim layer(inaccessible_memfd)
>     in memfd to avoid touch the memory file systems directly.
>   - mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
>     cause confusion in this series, will send out separately.
>   - doc: exclude the man page change, it's not kernel patch and will
>     send out separately.
>   - KVM: adapt to use the new mm inaccessible_memfd interface.
>   - KVM: update lpage_info when setting mem_attr_array to support
>     large page.
>   - KVM: change from xa_store_range to xa_store for mem_attr_array due
>     to xa_store_range overrides all entries which is not intended
>     behavior for us.
>   - KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
>   - KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
>     handling code suggested by Sean.
> v7:
>   - mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
>   - KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
>     private/shared info.
>   - KVM: use similar sync mechanism between zap/page fault paths as
>     mmu_notifier for memfile_notifier based invalidation.
> v6:
>   - mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
>     feature consistence among all memfile_notifier users and get rid of
>     internal flags like SHM_F_INACCESSIBLE.
>   - mm: make pfn_ops callbacks being members of memfile_backing_store
>     and then refer to it directly in memfile_notifier.
>   - mm: remove backing store unregister.
>   - mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
>   - KVM: reorganize patch sequence for page fault handling and private
>     memory enabling.
> v5:
>   - Add man page for MFD_INACCESSIBLE flag and improve KVM API do for
>     the new memslot extensions.
>   - mm: introduce memfile_{un}register_backing_store to allow memory
>     backing store to register/unregister it from memfile_notifier.
>   - mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
>     (SHM_F_INACCESSIBLE for shmem) instead.
>   - mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
>     MFD_INACCESSIBLE memory.
>   - KVM: remove the overlap check for mapping the same file+offset into
>     multiple gfns due to perf consideration, warned in document.
> v4:
>   - mm: rename memfd_ops to memfile_notifier and separate it from
>     memfd.c to standalone memfile-notifier.c.
>   - KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
>     registering multiple memslots to the same memory backing store.
>   - KVM: add a 'kvm' reference in memslot so that we can recover kvm in
>     memfile_notifier handlers.
>   - KVM: add 'private_' prefix for the new fields in memslot.
>   - KVM: reshape the 'type' to 'flag' for kvm_memory_exit
> v3:
>   - Remove 'RFC' prefix.
>   - Fix race condition between memfile_notifier handlers and kvm destroy.
>   - mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
>     setting F_SEAL_INACCESSIBLE when the fd is created.
>   - KVM: add the shared part of the memslot back to make private/shared
>     pages live in one memslot.
>
> Reference
> =========
> [1] Intel TDX:
> https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> [2] Kirill's implementation:
> https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com/T/
> [3] Original design proposal:
> https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com/
> [4] Selftest:
> https://lore.kernel.org/all/20220819174659.2427983-1-vannapurve@google.com/
>
>
> Chao Peng (7):
>   KVM: Extend the memslot to support fd-based private memory
>   KVM: Add KVM_EXIT_MEMORY_FAULT exit
>   KVM: Use gfn instead of hva for mmu_notifier_retry
>   KVM: Register/unregister the guest private memory regions
>   KVM: Update lpage info when private/shared memory are mixed
>   KVM: Handle page fault for private memory
>   KVM: Enable and expose KVM_MEM_PRIVATE
>
> Kirill A. Shutemov (1):
>   mm: Introduce memfd_restricted system call to create restricted user
>     memory
>
>  Documentation/virt/kvm/api.rst         |  88 ++++-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/asm/kvm_host.h        |   8 +
>  arch/x86/kvm/Kconfig                   |   3 +
>  arch/x86/kvm/mmu/mmu.c                 | 170 +++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h        |  14 +-
>  arch/x86/kvm/mmu/mmutrace.h            |   1 +
>  arch/x86/kvm/mmu/spte.h                |   6 +
>  arch/x86/kvm/mmu/tdp_mmu.c             |   3 +-
>  arch/x86/kvm/x86.c                     |   4 +-
>  include/linux/kvm_host.h               |  89 ++++-
>  include/linux/restrictedmem.h          |  62 ++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/kvm.h               |  38 +++
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 ++++++++++++++
>  virt/kvm/Kconfig                       |   7 +
>  virt/kvm/kvm_main.c                    | 453 +++++++++++++++++++++----
>  23 files changed, 1121 insertions(+), 92 deletions(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
>
>
> base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354
> --
> 2.25.1
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-02 22:07       ` Michael Roth
@ 2022-11-03 16:30         ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2022-11-03 16:30 UTC (permalink / raw)
  To: Michael Roth
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Wed, Nov 02, 2022 at 05:07:00PM -0500, Michael Roth wrote:
> On Thu, Nov 03, 2022 at 12:14:04AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > 
> > > In v8 there was some discussion about potentially passing the page/folio
> > > and order as part of the invalidation callback. I ended up needing
> > > something similar for SEV-SNP, and think it might make sense for other
> > > platforms. The main reasoning is:
> > > 
> > >   1) restoring kernel directmap:
> > > 
> > >      Currently SNP (and I believe TDX) need to either split or remove kernel
> > >      direct mappings for restricted PFNs, since there is no guarantee that
> > >      other PFNs within a 2MB range won't be used for non-restricted memory
> > >      (which will cause an RMP #PF in the case of SNP since the 2MB
> > >      mapping overlaps with guest-owned pages)
> > 
> > That's news to me. Where does the restriction for SNP come from?
> 
> Sorry, missed your first question.
> 
> For SNP at least, the restriction is documented in APM Volume 2, Section
> 15.36.10, first row of Table 15-36 (the preceding paragraph has more
> context). I forgot to mention this only pertains to writes by the
> host to 2MB pages that contain guest-owned subpages; for reads it's
> not an issue, but I think the implementation requirements end up being
> the same either way:
> 
>   https://www.amd.com/system/files/TechDocs/24593.pdf

Looks like you want the restricted memfd to be backed by secretmem rather
than a normal memfd. It would help preserve the directmap.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
  2022-10-27 10:31   ` Fuad Tabba
@ 2022-11-03 23:04   ` Sean Christopherson
  2022-11-04  8:28     ` Chao Peng
  2022-11-08  1:35   ` Yuan Yao
  2022-11-16 22:24   ` Sean Christopherson
  3 siblings, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-03 23:04 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022, Chao Peng wrote:
> @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {

I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION.  Aside
from the fact that restricted/protected memory may not be encrypted, there are
other potential use cases for per-page memory attributes[*], e.g. to make memory
read-only (or no-exec, or exec-only, etc...) without having to modify memslots.

Any paravirt use case where the attributes of a page are effectively dictated by
the guest is going to run into the exact same performance problems with memslots,
which isn't surprising in hindsight since shared vs. private is really just an
attribute, albeit with extra special semantics.

And if we go with a brand new ioctl(), maybe someday in the very distant future
we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.

Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
of a wrench into things.

Something like:

  KVM_SET_MEMORY_ATTRIBUTES

  struct kvm_memory_attributes {
	__u64 address;
	__u64 size;
	__u64 flags;
  }

[*] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com
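
A hypothetical userspace invocation, purely to illustrate the shape of the
API (the ioctl number and the KVM_MEMORY_ATTRIBUTE_PRIVATE flag below are
made-up names, nothing here is final):

  #include <sys/ioctl.h>
  #include <err.h>
  #include <linux/types.h>

  static void set_range_private(int vm_fd, __u64 gpa, __u64 size)
  {
	struct kvm_memory_attributes attr = {
		.address = gpa,
		.size    = size,	/* bytes, page-aligned */
		.flags   = KVM_MEMORY_ATTRIBUTE_PRIVATE,	/* made-up flag */
	};

	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr))
		err(1, "KVM_SET_MEMORY_ATTRIBUTES");
  }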

> +		struct kvm_enc_region region;
> +		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> +		if (!kvm_arch_has_private_mem(kvm))
> +			goto arch_vm_ioctl;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&region, argp, sizeof(region)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> +					      region.size, set);
> +		break;
> +	}
> +#endif


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-10-27 10:29   ` Fuad Tabba
@ 2022-11-04  2:28     ` Chao Peng
  2022-11-04 22:29       ` Sean Christopherson
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-04  2:28 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Currently in mmu_notifier validate path, hva range is recorded and then
> > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > However, for the to be introduced private memory, a page fault may not
> > have a hva associated, checking gfn(gpa) makes more sense.
> >
> > For existing non private memory case, gfn is expected to continue to
> > work. The only downside is when aliasing multiple gfns to a single hva,
> > the current algorithm of checking multiple ranges could result in a much
> > larger range being rejected. Such aliasing should be uncommon, so the
> > impact is expected small.
> >
> > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> 
> nit: Now it's kvm_unmap_gfn_range().

Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
kvm_mmu_invalidate_begin/end with a gfn range, but before this series
kvm_mmu_invalidate_begin/end actually take an hva range. Note it's
unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
in the following patch (patch 05).

Thanks,
Chao
> 
> > gfn when calling kvm_mmu_invalidate_begin/end() while these functions
> > accept hva in current code.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> Based on reading this code and my limited knowledge of the x86 MMU code:
> Reviewed-by: Fuad Tabba <tabba@google.com>
> 
> Cheers,
> /fuad
> 
> 
> >  arch/x86/kvm/mmu/mmu.c   |  2 +-
> >  include/linux/kvm_host.h | 18 +++++++---------
> >  virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++--------------
> >  3 files changed, 39 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f81539061d6..33b1aec44fb8 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4217,7 +4217,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> >                 return true;
> >
> >         return fault->slot &&
> > -              mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > +              mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> >  }
> >
> >  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 739a7562a1f3..79e5cbc35fcf 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -775,8 +775,8 @@ struct kvm {
> >         struct mmu_notifier mmu_notifier;
> >         unsigned long mmu_invalidate_seq;
> >         long mmu_invalidate_in_progress;
> > -       unsigned long mmu_invalidate_range_start;
> > -       unsigned long mmu_invalidate_range_end;
> > +       gfn_t mmu_invalidate_range_start;
> > +       gfn_t mmu_invalidate_range_end;
> >  #endif
> >         struct list_head devices;
> >         u64 manual_dirty_log_protect;
> > @@ -1365,10 +1365,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> >  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  #endif
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -                             unsigned long end);
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -                           unsigned long end);
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
> >
> >  long kvm_arch_dev_ioctl(struct file *filp,
> >                         unsigned int ioctl, unsigned long arg);
> > @@ -1937,9 +1935,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> >         return 0;
> >  }
> >
> > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> >                                            unsigned long mmu_seq,
> > -                                          unsigned long hva)
> > +                                          gfn_t gfn)
> >  {
> >         lockdep_assert_held(&kvm->mmu_lock);
> >         /*
> > @@ -1949,8 +1947,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> >          * positives, due to shortcuts when handing concurrent invalidations.
> >          */
> >         if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > -           hva >= kvm->mmu_invalidate_range_start &&
> > -           hva < kvm->mmu_invalidate_range_end)
> > +           gfn >= kvm->mmu_invalidate_range_start &&
> > +           gfn < kvm->mmu_invalidate_range_end)
> >                 return 1;
> >         if (kvm->mmu_invalidate_seq != mmu_seq)
> >                 return 1;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 8dace78a0278..09c9cdeb773c 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> >
> >  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> >
> > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > -                            unsigned long end);
> > +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
> >
> >  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> >
> > @@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> >                                 locked = true;
> >                                 KVM_MMU_LOCK(kvm);
> >                                 if (!IS_KVM_NULL_FN(range->on_lock))
> > -                                       range->on_lock(kvm, range->start, range->end);
> > +                                       range->on_lock(kvm, gfn_range.start,
> > +                                                           gfn_range.end);
> >                                 if (IS_KVM_NULL_FN(range->handler))
> >                                         break;
> >                         }
> > @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -                             unsigned long end)
> > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > +                                                           gfn_t end)
> >  {
> > -       /*
> > -        * The count increase must become visible at unlock time as no
> > -        * spte can be established without taking the mmu_lock and
> > -        * count is also read inside the mmu_lock critical section.
> > -        */
> > -       kvm->mmu_invalidate_in_progress++;
> >         if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> >                 kvm->mmu_invalidate_range_start = start;
> >                 kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> >         }
> >  }
> >
> > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +       /*
> > +        * The count increase must become visible at unlock time as no
> > +        * spte can be established without taking the mmu_lock and
> > +        * count is also read inside the mmu_lock critical section.
> > +        */
> > +       kvm->mmu_invalidate_in_progress++;
> > +}
> > +
> > +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > +       update_invalidate_range(kvm, range->start, range->end);
> > +       return kvm_unmap_gfn_range(kvm, range);
> > +}
> > +
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +       mark_invalidate_in_progress(kvm, start, end);
> > +       update_invalidate_range(kvm, start, end);
> > +}
> > +
> >  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >                                         const struct mmu_notifier_range *range)
> >  {
> > @@ -752,8 +768,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >                 .start          = range->start,
> >                 .end            = range->end,
> >                 .pte            = __pte(0),
> > -               .handler        = kvm_unmap_gfn_range,
> > -               .on_lock        = kvm_mmu_invalidate_begin,
> > +               .handler        = kvm_mmu_handle_gfn_range,
> > +               .on_lock        = mark_invalidate_in_progress,
> >                 .on_unlock      = kvm_arch_guest_memory_reclaimed,
> >                 .flush_on_ret   = true,
> >                 .may_block      = mmu_notifier_range_blockable(range),
> > @@ -791,8 +807,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >         return 0;
> >  }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -                           unsigned long end)
> > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> >  {
> >         /*
> >          * This sequence increase will notify the kvm page fault that
> > --
> > 2.25.1
> >


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-11-03 23:04   ` Sean Christopherson
@ 2022-11-04  8:28     ` Chao Peng
  2022-11-04 21:19       ` Sean Christopherson
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-04  8:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >  		break;
> >  	}
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> 
> I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION.  Aside
> from the fact that restricted/protected memory may not be encrypted, there are
> other potential use cases for per-page memory attributes[*], e.g. to make memory
> read-only (or no-exec, or exec-only, etc...) without having to modify memslots.
> 
> Any paravirt use case where the attributes of a page are effectively dictated by
> the guest is going to run into the exact same performance problems with memslots,
> which isn't suprising in hindsight since shared vs. private is really just an
> attribute, albeit with extra special semantics.
> 
> And if we go with a brand new ioctl(), maybe someday in the very distant future
> we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> 
> Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
> of a wrench into things.
> 
> Something like:
> 
>   KVM_SET_MEMORY_ATTRIBUTES
> 
>   struct kvm_memory_attributes {
> 	__u64 address;
> 	__u64 size;
> 	__u64 flags;
>   }

I like the idea of adding a new ioctl(). But putting all attributes into
a single flags field in the uAPI doesn't sound good to me, e.g. forcing
userspace to set all attributes in one call can cause pain for userspace,
and probably for the KVM implementation as well. For private<->shared
memory conversion, we actually only care about the KVM_MEM_ATTR_SHARED or
KVM_MEM_ATTR_PRIVATE bit, but we force userspace to set other irrelevant
bits as well if it uses this API.

I looked at kvm_device_attr; it sounds like we can do something similar:

  KVM_SET_MEMORY_ATTR

  struct kvm_memory_attr {
	__u64 address;
	__u64 size;
#define KVM_MEM_ATTR_SHARED	BIT(0)
#define KVM_MEM_ATTR_READONLY	BIT(1)
#define KVM_MEM_ATTR_NOEXEC	BIT(2)
	__u32 attr;
	__u32 pad;
  }

I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well,
but it sounds like we need a KVM_UNSET_MEMORY_ATTR.

Since we are exposing the attribute directly to userspace I also think
we'd better treat shared memory as the default, so even when the private
memory is not used, the bit can still be meaningful. So define BIT(0) as
KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED.

Thanks,
Chao

> 
> [*] https://lore.kernel.org/all/Y1a1i9vbJ%2FpVmV9r@google.com
> 
> > +		struct kvm_enc_region region;
> > +		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > +
> > +		if (!kvm_arch_has_private_mem(kvm))
> > +			goto arch_vm_ioctl;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&region, argp, sizeof(region)))
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > +					      region.size, set);
> > +		break;
> > +	}
> > +#endif


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-11-04  8:28     ` Chao Peng
@ 2022-11-04 21:19       ` Sean Christopherson
  2022-11-08  8:24         ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-04 21:19 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Paolo, any thoughts before I lead things further astray?

On Fri, Nov 04, 2022, Chao Peng wrote:
> On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote:
> > On Tue, Oct 25, 2022, Chao Peng wrote:
> > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > >  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > >  		break;
> > >  	}
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > 
> > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION.  Aside
> > from the fact that restricted/protected memory may not be encrypted, there are
> > other potential use cases for per-page memory attributes[*], e.g. to make memory
> > read-only (or no-exec, or exec-only, etc...) without having to modify memslots.
> > 
> > Any paravirt use case where the attributes of a page are effectively dictated by
> > the guest is going to run into the exact same performance problems with memslots,
> > which isn't suprising in hindsight since shared vs. private is really just an
> > attribute, albeit with extra special semantics.
> > 
> > And if we go with a brand new ioctl(), maybe someday in the very distant future
> > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> > 
> > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
> > of a wrench into things.
> > 
> > Something like:
> > 
> >   KVM_SET_MEMORY_ATTRIBUTES
> > 
> >   struct kvm_memory_attributes {
> > 	__u64 address;
> > 	__u64 size;
> > 	__u64 flags;

Oh, this is half-baked.  I lost track of which flags were which.  What I intended
was a separate, initially-unused flags field, e.g.

 struct kvm_memory_attributes {
	__u64 address;
	__u64 size;
	__u64 attributes;
	__u64 flags;
  }

so that KVM can tweak behavior and/or extend the effective size of the struct.

> I like the idea of adding a new ioctl(). But putting all attributes into
> a single flags field in the uAPI doesn't sound good to me, e.g. forcing
> userspace to set all attributes in one call can cause pain for userspace,
> and probably for the KVM implementation as well. For private<->shared
> memory conversion, we actually only care about the KVM_MEM_ATTR_SHARED or
> KVM_MEM_ATTR_PRIVATE bit,

Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED
or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory
while it's accessible from the host.

And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping
WX permissions, would also be a common operation.

Hmm, typing that out makes me think that if we do end up supporting other "attributes",
i.e. protections, we should go straight to full RWX protections instead of doing
things piecemeal, i.e. add individual protections instead of combinations like
NO_EXEC and READ_ONLY.  The protections would have to be inverted for backwards
compatibility, but that's easy enough to handle.  The semantics could be like
protection keys, which also have inverted permissions, where the final protections
are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made
writable via attributes.

E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access
to memory without needing to delete the memslot.  KVM would need to disallow
unsupported combinations, e.g. disallowed effective protections would be:

  - W or WX [unless there's an arch that supports write-only memory]
  - R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware]
  - X       [until KVM plumbs through support for exec-only, or it's unsupported in hardware]

Anyways, that's all future work...
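
For illustration only, a sketch of what enforcing those combinations could
look like if inverted RWX attributes were added (none of the names below
exist today):

  #define KVM_MEM_ATTR_NO_READ	BIT(0)
  #define KVM_MEM_ATTR_NO_WRITE	BIT(1)
  #define KVM_MEM_ATTR_NO_EXEC	BIT(2)

  static bool kvm_mem_attrs_valid(struct kvm *kvm, u64 attrs)
  {
	bool r = !(attrs & KVM_MEM_ATTR_NO_READ);
	bool w = !(attrs & KVM_MEM_ATTR_NO_WRITE);
	bool x = !(attrs & KVM_MEM_ATTR_NO_EXEC);

	/* Write-only memory (with or without exec) isn't supported. */
	if (w && !r)
		return false;

	/* Readable-but-not-executable requires no-exec support (made-up check). */
	if (r && !x && !kvm_arch_has_noexec_memory(kvm))
		return false;

	/* Exec-only requires exec-only support (made-up check). */
	if (x && !r && !kvm_arch_has_execonly_memory(kvm))
		return false;

	return true;
  }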

> but we force userspace to set other irrelevant bits as well if it uses
> this API.

They aren't irrelevant though, as the memory attributes are all describing the
allowed protections for a given page.  If there's a use case where userspace "can't"
keep track of the attributes for whatever reason, then userspace could do a RMW
to set/clear attributes.  Alternatively, the ioctl() could take an "operation" and
support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the
above to be: 
 
 struct kvm_memory_attributes {
	__u64 address;
	__u64 size;
	__u64 attributes;
	__u32 operation;
	__u32 flags;
  }

> I looked at kvm_device_attr; it sounds like we can do something similar:

The device attributes deal with isolated, arbitrary values, whereas memory attributes
are flags, i.e. devices are 1:1 whereas memory is 1:MANY.  There is no "unset" for
device attributes, because they aren't flags.  Device attributes vs. memory attributes
really are two very different things that just happen to use a common name.

If it helped clarify things without creating naming problems, we could even use
PROTECTIONS instead of ATTRIBUTES.

>   KVM_SET_MEMORY_ATTR
> 
>   struct kvm_memory_attr {
> 	__u64 address;
> 	__u64 size;
> #define KVM_MEM_ATTR_SHARED	BIT(0)
> #define KVM_MEM_ATTR_READONLY	BIT(1)
> #define KVM_MEM_ATTR_NOEXEC	BIT(2)
> 	__u32 attr;

As above, letting userspace set only a single attribute would prevent setting
(or clearing) multiple attributes in a single ioctl().

> 	__u32 pad;
>   }
> 
> I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well,

Definitely would need to communicate to userspace that various attributes are
supported.  That doesn't necessarily require a common ioctl(), but I don't see
any reason not to add a common helper, and adding a common helper would mean
KVM_CAP_PRIVATE_MEM can go away.  But it should return a bitmask so that userspace
can do a single query to get all supported attributes, e.g. KVM_SUPPORTED_MEMORY_ATTRIBUTES.  

As for KVM_GET_MEMORY_ATTRIBUTES, we wouldn't necessarily have to provide such an
API, e.g. we could hold off until someone came along with a RMW use case (as above).
That said, debug would likely be a nightmare without KVM_GET_MEMORY_ATTRIBUTES,
so it's probably best to add it straightway.

> but it sounds like we need a KVM_UNSET_MEMORY_ATTR.

No need if the setter operates on all attributes.

> Since we are exposing the attribute directly to userspace I also think
> we'd better treat shared memory as the default, so even when the private
> memory is not used, the bit can still be meaningful. So define BIT(0) as
> KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED.

Ah, right.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-11-04  2:28     ` Chao Peng
@ 2022-11-04 22:29       ` Sean Christopherson
  2022-11-08  7:16         ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-04 22:29 UTC (permalink / raw)
  To: Chao Peng
  Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Nov 04, 2022, Chao Peng wrote:
> On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> > Hi,
> > 
> > On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > > However, for the to-be-introduced private memory, a page fault may not
> > > have a hva associated, checking gfn(gpa) makes more sense.
> > >
> > > For existing non private memory case, gfn is expected to continue to
> > > work. The only downside is when aliasing multiple gfns to a single hva,
> > > the current algorithm of checking multiple ranges could result in a much
> > > larger range being rejected. Such aliasing should be uncommon, so the
> > > impact is expected to be small.
> > >
> > > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> > 
> > nit: Now it's kvm_unmap_gfn_range().
> 
> Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
> kvm_mmu_invalidate_begin/end with a gfn range but before this series
> kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
> unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
> in the following patch (patch 05).

Grr, in the future, if you find an existing bug, please send a patch.  At the
very least, report the bug.  The APICv case that this was added for could very
well be broken because of this, and the resulting failures would be an absolute
nightmare to debug.

Compile tested only...

--
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 4 Nov 2022 22:20:33 +0000
Subject: [PATCH] KVM: x86/mmu: Block all page faults during
 kvm_zap_gfn_range()

When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated
range to effectively block all page faults while the zap is in-progress.
The invalidation helpers take a host virtual address, whereas zapping a
GFN obviously provides a guest physical address, and in the wrong unit
of measurement at that (frame vs. byte).

Alternatively, KVM could walk all memslots to get the associated HVAs,
but thanks to SMM, that would require multiple lookups.  And practically
speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path,
e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0
during boot, and APICv inhibits are similarly infrequent operations.

Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f81539061d6..1ccb769f62af 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6056,7 +6056,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6070,7 +6070,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
 						   gfn_end - gfn_start);
 
-	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_end(kvm, 0, -1ul);
 
 	write_unlock(&kvm->mmu_lock);
 }

base-commit: c12879206e47730ff5ab255bbf625b28ade4028f
-- 



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-11-03 12:13 ` [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Vishal Annapurve
@ 2022-11-08  0:41   ` Isaku Yamahata
  2022-11-09 15:54     ` Kirill A. Shutemov
  0 siblings, 1 reply; 101+ messages in thread
From: Isaku Yamahata @ 2022-11-08  0:41 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	Muchun Song, wei.w.wang, isaku.yamahata

On Thu, Nov 03, 2022 at 05:43:52PM +0530,
Vishal Annapurve <vannapurve@google.com> wrote:

> On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-08: KVM change, target for KVM tree
> >
> > Given KVM is the only current user for the mm part, I have chatted with
> > Paolo and he is OK to merge the mm change through KVM tree, but
> > reviewed-by/acked-by is still expected from the mm people.
> >
> > The patches have been verified in Intel TDX environment, but Vishal has
> > done an excellent work on the selftests[4] which are dedicated for this
> > series, making it possible to test this series without innovative
> > hardware and fancy steps of building a VM environment. See Test section
> > below for more info.
> >
> >
> > Introduction
> > ============
> > KVM userspace being able to crash the host is horrible. Under current
> > KVM architecture, all guest memory is inherently accessible from KVM
> > userspace and is exposed to the mentioned crash issue. The goal of this
> > series is to provide a solution to align mm and KVM, on a userspace
> > inaccessible approach of exposing guest memory.
> >
> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > table). This requires guest memory being mmaped into KVM userspace, but
> > this is also the source where the mentioned crash issue can happen. In
> > theory, apart from those 'shared' memory for device emulation etc, guest
> > memory doesn't have to be mmaped into KVM userspace.
> >
> > This series introduces fd-based guest memory which will not be mmaped
> > into KVM userspace. KVM populates secondary page table by using a
> 
> With no mappings in place for the userspace VMM, IIUC, it looks like the host
> kernel will not be able to find the culprit userspace process in case
> of a machine check error on guest private memory. As implemented in
> hwpoison_user_mappings(), the host kernel tries to look at the processes
> which have mapped the pfns with the hardware error.
> 
> Is there a modification needed in the mce handling logic of the host
> kernel to immediately send a signal to the vcpu thread accessing the
> faulting pfn backing guest private memory?

mce_register_decode_chain() can be used.  The MCE physical address (p->mce_addr)
includes the host key id (hkid) in addition to the real physical address.  By
checking the hkids in use by KVM, we can determine whether the page is assigned
to a guest TD.  If so, send SIGBUS.
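
Roughly something like this (sketch only; tdx_is_private_mce() and the SIGBUS
delivery helper are hypothetical, mce_register_decode_chain() is the existing
interface):

  static int tdx_mce_notifier(struct notifier_block *nb, unsigned long val,
			      void *data)
  {
	struct mce *m = data;

	/*
	 * Hypothetical: m->addr carries the host key id (hkid) in its upper
	 * bits; if that hkid is one KVM assigned to a guest TD, the poisoned
	 * page is guest private memory.
	 */
	if (tdx_is_private_mce(m->addr))
		tdx_queue_sigbus_for_vcpu(m);	/* hypothetical: SIGBUS/BUS_MCEERR_AR */

	return NOTIFY_DONE;
  }

  static struct notifier_block tdx_mce_nb = {
	.notifier_call = tdx_mce_notifier,
  };

  /* at TDX module init time */
  mce_register_decode_chain(&tdx_mce_nb);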

kvm_machine_check() can be enhanced for KVM-specific use.  This is before
memory_failure() is called, though.

Any other ideas?
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
  2022-10-27 10:31   ` Fuad Tabba
  2022-11-03 23:04   ` Sean Christopherson
@ 2022-11-08  1:35   ` Yuan Yao
  2022-11-08  9:41     ` Chao Peng
  2022-11-16 22:24   ` Sean Christopherson
  3 siblings, 1 reply; 101+ messages in thread
From: Yuan Yao @ 2022-11-08  1:35 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote:
> Introduce generic private memory register/unregister by reusing the existing
> SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from the SEV case
> by treating the address in the region as a gpa instead of an hva. Which path
> these ioctls take is determined by kvm_arch_has_private_mem(). An
> architecture which supports KVM_PRIVATE_MEM should override this function.
>
> KVM internally defaults all guest memory to private memory and maintains
> the shared memory in 'mem_attr_array'. The above ioctls operate on this
> field and unmap existing mappings, if any.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst |  17 ++-
>  arch/x86/kvm/Kconfig           |   1 +
>  include/linux/kvm_host.h       |  10 +-
>  virt/kvm/Kconfig               |   4 +
>  virt/kvm/kvm_main.c            | 227 +++++++++++++++++++++++++--------
>  5 files changed, 198 insertions(+), 61 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 975688912b8c..08253cf498d1 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
>  This ioctl can be used to register a guest memory region which may
>  contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register guest private
> +memory region and the addr/size of kvm_enc_region represents guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fall into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
>  moving ciphertext of those pages will not result in plaintext being
>  swapped. So relocating (or migrating) physical backing pages for the SEV
>  guest will require some additional steps.
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 8d2bd455c0cd..73fdfa429b20 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -51,6 +51,7 @@ config KVM
>  	select HAVE_KVM_PM_NOTIFIER if PM
>  	select HAVE_KVM_RESTRICTED_MEM if X86_64
>  	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> +	select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
>  	help
>  	  Support hosting fully virtualized guest machines using hardware
>  	  virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 79e5cbc35fcf..4ce98fa0153c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
>  struct kvm_gfn_range {
>  	struct kvm_memory_slot *slot;
>  	gfn_t start;
> @@ -254,6 +255,9 @@ struct kvm_gfn_range {
>  	bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +#endif
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -794,6 +798,9 @@ struct kvm {
>  	struct notifier_block pm_notifier;
>  #endif
>  	char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +	struct xarray mem_attr_array;
> +#endif
>  };
>
>  #define kvm_err(fmt, ...) \
> @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 9ff164c7e0cc..69ca59e82149 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
>
>  config HAVE_KVM_RESTRICTED_MEM
>         bool
> +
> +config KVM_GENERIC_PRIVATE_MEM
> +       bool
> +       depends on HAVE_KVM_RESTRICTED_MEM
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 09c9cdeb773c..fc3835826ace 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> +							    gfn_t end)
> +{
> +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> +		kvm->mmu_invalidate_range_start = start;
> +		kvm->mmu_invalidate_range_end = end;
> +	} else {
> +		/*
> +		 * Fully tracking multiple concurrent ranges has diminishing
> +		 * returns. Keep things simple and just find the minimal range
> +		 * which includes the current and new ranges. As there won't be
> +		 * enough information to subtract a range after its invalidate
> +		 * completes, any ranges invalidated concurrently will
> +		 * accumulate and persist until all outstanding invalidates
> +		 * complete.
> +		 */
> +		kvm->mmu_invalidate_range_start =
> +			min(kvm->mmu_invalidate_range_start, start);
> +		kvm->mmu_invalidate_range_end =
> +			max(kvm->mmu_invalidate_range_end, end);
> +	}
> +}
> +
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	/*
> +	 * The count increase must become visible at unlock time as no
> +	 * spte can be established without taking the mmu_lock and
> +	 * count is also read inside the mmu_lock critical section.
> +	 */
> +	kvm->mmu_invalidate_in_progress++;
> +}
> +
> +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	mark_invalidate_in_progress(kvm, start, end);
> +	update_invalidate_range(kvm, start, end);
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	/*
> +	 * This sequence increase will notify the kvm page fault that
> +	 * the page that is going to be mapped in the spte could have
> +	 * been freed.
> +	 */
> +	kvm->mmu_invalidate_seq++;
> +	smp_wmb();
> +	/*
> +	 * The above sequence increase must be visible before the
> +	 * below count decrease, which is ensured by the smp_wmb above
> +	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> +	 */
> +	kvm->mmu_invalidate_in_progress--;
> +}
> +
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
> @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>
> -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> -							    gfn_t end)
> -{
> -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> -		kvm->mmu_invalidate_range_start = start;
> -		kvm->mmu_invalidate_range_end = end;
> -	} else {
> -		/*
> -		 * Fully tracking multiple concurrent ranges has diminishing
> -		 * returns. Keep things simple and just find the minimal range
> -		 * which includes the current and new ranges. As there won't be
> -		 * enough information to subtract a range after its invalidate
> -		 * completes, any ranges invalidated concurrently will
> -		 * accumulate and persist until all outstanding invalidates
> -		 * complete.
> -		 */
> -		kvm->mmu_invalidate_range_start =
> -			min(kvm->mmu_invalidate_range_start, start);
> -		kvm->mmu_invalidate_range_end =
> -			max(kvm->mmu_invalidate_range_end, end);
> -	}
> -}
> -
> -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -	/*
> -	 * The count increase must become visible at unlock time as no
> -	 * spte can be established without taking the mmu_lock and
> -	 * count is also read inside the mmu_lock critical section.
> -	 */
> -	kvm->mmu_invalidate_in_progress++;
> -}
> -
>  static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	update_invalidate_range(kvm, range->start, range->end);
>  	return kvm_unmap_gfn_range(kvm, range);
>  }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -	mark_invalidate_in_progress(kvm, start, end);
> -	update_invalidate_range(kvm, start, end);
> -}
> -
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
> @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	return 0;
>  }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> -	/*
> -	 * This sequence increase will notify the kvm page fault that
> -	 * the page that is going to be mapped in the spte could have
> -	 * been freed.
> -	 */
> -	kvm->mmu_invalidate_seq++;
> -	smp_wmb();
> -	/*
> -	 * The above sequence increase must be visible before the
> -	 * below count decrease, which is ensured by the smp_wmb above
> -	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> -	 */
> -	kvm->mmu_invalidate_in_progress--;
> -}
> -
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
> @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_gfn_range gfn_range;
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	struct kvm_memslot_iter iter;
> +	int i;
> +	int r = 0;
> +
> +	gfn_range.pte = __pte(0);
> +	gfn_range.may_block = true;
> +
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		slots = __kvm_memslots(kvm, i);
> +
> +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> +			slot = iter.slot;
> +			gfn_range.start = max(start, slot->base_gfn);
> +			gfn_range.end = min(end, slot->base_gfn + slot->npages);
> +			if (gfn_range.start >= gfn_range.end)
> +				continue;
> +			gfn_range.slot = slot;
> +
> +			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +		}
> +	}
> +
> +	if (r)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> +
> +#define KVM_MEM_ATTR_SHARED	0x0001
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> +				     bool is_private)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	int idx;
> +	int r = 0;
> +
> +	if (size == 0 || gpa + size < gpa)
> +		return -EINVAL;
> +	if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> +		return -EINVAL;
> +
> +	start = gpa >> PAGE_SHIFT;
> +	end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	/*
> +	 * Guest memory defaults to private, kvm->mem_attr_array only stores
> +	 * shared memory.
> +	 */
> +	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	KVM_MMU_LOCK(kvm);
> +	kvm_mmu_invalidate_begin(kvm, start, end);
> +
> +	for (i = start; i < end; i++) {
> +		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT));
> +		if (r)
> +			goto err;
> +	}
> +
> +	kvm_unmap_mem_range(kvm, start, end);

The lock is held by KVM_MMU_LOCK(), so how about doing
kvm_mmu_invalidate_begin() after changing the xarray:

kvm_mmu_invalidate_begin(kvm, start, end);
kvm_unmap_mem_range(kvm, start, end);
kvm_mmu_invalidate_end(kvm, start, end);

Also, the error handling path wouldn't need to care about it then.

> +
> +	goto ret;
> +err:
> +	for (; i > start; i--)
> +		xa_erase(&kvm->mem_attr_array, i);

the entry at 'start' should also be erased here. Since i is an
unsigned long and start can be 0, an 'i >= start' loop would underflow,
so another variable, e.g. j, may be needed for this.
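
E.g. something like this (untested sketch, with an extra variable j declared
next to i):

  err:
	for (j = start; j < i; j++)
		xa_erase(&kvm->mem_attr_array, j);

so that exactly the entries in [start, i), i.e. the ones that were actually
stored, get erased, and the loop can't underflow when start is 0.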

> +ret:
> +	kvm_mmu_invalidate_end(kvm, start, end);
> +	KVM_MMU_UNLOCK(kvm);
> +	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return r;
> +}
> +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>  				unsigned long state,
> @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +	xa_init(&kvm->mem_attr_array);
> +#endif
>
>  	INIT_LIST_HEAD(&kvm->gpc_list);
>  	spin_lock_init(&kvm->gpc_lock);
> @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>  		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>  	}
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +	xa_destroy(&kvm->mem_attr_array);
> +#endif
>  	cleanup_srcu_struct(&kvm->irq_srcu);
>  	cleanup_srcu_struct(&kvm->srcu);
>  	kvm_arch_free_vm(kvm);
> @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
>  	}
>  }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +	return false;
> +}
> +
>  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>  	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +		struct kvm_enc_region region;
> +		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> +		if (!kvm_arch_has_private_mem(kvm))
> +			goto arch_vm_ioctl;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&region, argp, sizeof(region)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> +					      region.size, set);
> +		break;
> +	}
> +#endif
>  	case KVM_GET_DIRTY_LOG: {
>  		struct kvm_dirty_log log;
>
> @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
>  		r = kvm_vm_ioctl_get_stats_fd(kvm);
>  		break;
>  	default:
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +arch_vm_ioctl:
> +#endif
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  	}
>  out:
> --
> 2.25.1
>
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-11-04 22:29       ` Sean Christopherson
@ 2022-11-08  7:16         ` Chao Peng
  2022-11-10 17:53           ` Sean Christopherson
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-08  7:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Nov 04, 2022 at 10:29:48PM +0000, Sean Christopherson wrote:
> On Fri, Nov 04, 2022, Chao Peng wrote:
> > On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> > > Hi,
> > > 
> > > On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > > > However, for the to-be-introduced private memory, a page fault may not
> > > > have a hva associated, checking gfn(gpa) makes more sense.
> > > >
> > > > For existing non private memory case, gfn is expected to continue to
> > > > work. The only downside is when aliasing multiple gfns to a single hva,
> > > > the current algorithm of checking multiple ranges could result in a much
> > > > larger range being rejected. Such aliasing should be uncommon, so the
> > > > impact is expected to be small.
> > > >
> > > > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> > > 
> > > nit: Now it's kvm_unmap_gfn_range().
> > 
> > Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
> > kvm_mmu_invalidate_begin/end with a gfn range but before this series
> > kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
> > unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
> > in the following patch (patch 05).
> 
> Grr, in the future, if you find an existing bug, please send a patch.  At the
> very least, report the bug.

Agreed, this can be sent out separately from this series.

> The APICv case that this was added for could very
> well be broken because of this, and the resulting failures would be an absolute
> nightmare to debug.

Given that apicv_inhibit should be rare, the change looks good to me.
Just to be clear, you will send out this fix, right?

Chao

> 
> Compile tested only...
> 
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 4 Nov 2022 22:20:33 +0000
> Subject: [PATCH] KVM: x86/mmu: Block all page faults during
>  kvm_zap_gfn_range()
> 
> When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated
> range to effectively block all page faults while the zap is in-progress.
> The invalidation helpers take a host virtual address, whereas zapping a
> GFN obviously provides a guest physical address, and in the wrong unit
> of measurement at that (frame vs. byte).
> 
> Alternatively, KVM could walk all memslots to get the associated HVAs,
> but thanks to SMM, that would require multiple lookups.  And practically
> speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path,
> e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0
> during boot, and APICv inhibits are similarly infrequent operations.
> 
> Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
> Cc: stable@vger.kernel.org
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f81539061d6..1ccb769f62af 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6056,7 +6056,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6070,7 +6070,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
>  						   gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_end(kvm, 0, -1ul);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> 
> base-commit: c12879206e47730ff5ab255bbf625b28ade4028f
> -- 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-11-04 21:19       ` Sean Christopherson
@ 2022-11-08  8:24         ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-08  8:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Nov 04, 2022 at 09:19:31PM +0000, Sean Christopherson wrote:
> Paolo, any thoughts before I lead things further astray?
> 
> On Fri, Nov 04, 2022, Chao Peng wrote:
> > On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote:
> > > On Tue, Oct 25, 2022, Chao Peng wrote:
> > > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > > >  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > >  		break;
> > > >  	}
> > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > 
> > > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION.  Aside
> > > from the fact that restricted/protected memory may not be encrypted, there are
> > > other potential use cases for per-page memory attributes[*], e.g. to make memory
> > > read-only (or no-exec, or exec-only, etc...) without having to modify memslots.
> > > 
> > > Any paravirt use case where the attributes of a page are effectively dictated by
> > > the guest is going to run into the exact same performance problems with memslots,
> > > which isn't suprising in hindsight since shared vs. private is really just an
> > > attribute, albeit with extra special semantics.
> > > 
> > > And if we go with a brand new ioctl(), maybe someday in the very distant future
> > > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> > > 
> > > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
> > > of a wrench into things.
> > > 
> > > Something like:
> > > 
> > >   KVM_SET_MEMORY_ATTRIBUTES
> > > 
> > >   struct kvm_memory_attributes {
> > > 	__u64 address;
> > > 	__u64 size;
> > > 	__u64 flags;
> 
> Oh, this is half-baked.  I lost track of which flags were which.  What I intended
> was a separate, initially-unused 'flags' field, e.g.

That makes sense.

> 
>  struct kvm_memory_attributes {
> 	__u64 address;
> 	__u64 size;
> 	__u64 attributes;
> 	__u64 flags;
>   }
> 
> so that KVM can tweak behavior and/or extend the effective size of the struct.
> 
> > I like the idea of adding a new ioctl(). But putting all attributes into
> > a flags in uAPI sounds not good to me, e.g. forcing userspace to set all
> > attributes in one call can cause pain for userspace, probably for KVM
> > implementation as well. For private<->shared memory conversion, we
> > actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit,
> 
> Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED
> or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory
> while it's accessible from the host.
> 
> And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping
> WX permissions, would also be a common operation.
> 
> Hmm, typing that out makes me think that if we do end up supporting other "attributes",
> i.e. protections, we should go straight to full RWX protections instead of doing
> things piecemeal, i.e. add individual protections instead of combinations like
> NO_EXEC and READ_ONLY.  The protections would have to be inverted for backwards
> compatibility, but that's easy enough to handle.  The semantics could be like
> protection keys, which also have inverted permissions, where the final protections
> are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made
> writable via attributes.
> 
> E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access
> to memory without needing to delete the memslot.  KVM would need to disallow
> unsupported combinations, e.g. disallowed effective protections would be:
> 
>   - W or WX [unless there's an arch that supports write-only memory]
>   - R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware]
>   - X       [until KVM plumbs through support for exec-only, or it's unsupported in hardware]
> 
> Anyways, that's all future work...
> 
> > but we force userspace to set other irrelevant bits as well if use this
> > API.
> 
> They aren't irrelevant though, as the memory attributes are all describing the
> allowed protections for a given page.

The 'allowed' protections seem to answer my concern. But once we enable
"NO_READ | NO_WRITE | NO_EXEC", are we going to require that "NO_READ |
NO_WRITE | NO_EXEC" are also set together with the PRIVATE bit? I just
can't imagine what the semantics would be if the PRIVATE bit is set but
the other bits indicate the memory can actually be read/written/executed
from userspace.

> If there's a use case where userspace "can't"
> keep track of the attributes for whatever reason, then userspace could do a RMW
> to set/clear attributes.  Alternatively, the ioctl() could take an "operation" and
> support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the
> above to be: 

A getter would be good; it might also be needed for live migration.

>  
>  struct kvm_memory_attributes {
> 	__u64 address;
> 	__u64 size;
> 	__u64 attributes;
> 	__u32 operation;
> 	__u32 flags;
>   }
> 
> > I looked at kvm_device_attr, sounds we can do similar:
> 
> The device attributes deal with isolated, arbitrary values, whereas memory attributes
> are flags, i.e. devices are 1:1 whereas memory is 1:MANY.  There is no "unset" for
> device attributes, because they aren't flags.  Device attributes vs. memory attributes
> really are two very different things that just happen to use a common name.
> 
> If it helped clarify things without creating naming problems, we could even use
> PROTECTIONS instead of ATTRIBUTES.
> 
> >   KVM_SET_MEMORY_ATTR
> > 
> >   struct kvm_memory_attr {
> > 	__u64 address;
> > 	__u64 size;
> > #define KVM_MEM_ATTR_SHARED	BIT(0)
> > #define KVM_MEM_ATTR_READONLY	BIT(1)
> > #define KVM_MEM_ATTR_NOEXEC	BIT(2)
> > 	__u32 attr;
> 
> As above, letting userspace set only a single attribute would prevent setting
> (or clearing) multiple attributes in a single ioctl().
> 
> > 	__u32 pad;
> >   }
> > 
> > I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well,
> 
> Definitely would need to communicate to userspace that various attributes are
> supported.  That doesn't necessarily require a common ioctl(), but I don't see
> any reason not to add a common helper, and adding a common helper would mean
> KVM_CAP_PRIVATE_MEM can go away.  But it should return a bitmask so that userspace
> can do a single query to get all supported attributes, e.g. KVM_SUPPORTED_MEMORY_ATTRIBUTES.  

Do you have a preference between adding a new ioctl and just keeping it as a cap?
E.g. KVM_CAP_MEMORY_ATTRIBUTES could also return a mask.

> 
> As for KVM_GET_MEMORY_ATTRIBUTES, we wouldn't necessarily have to provide such an
> API, e.g. we could hold off until someone came along with a RMW use case (as above).
> That said, debug would likely be a nightmare without KVM_GET_MEMORY_ATTRIBUTES,
> so it's probably best to add it straightaway.

Diving into the implementation a bit: for KVM_GET_MEMORY_ATTRIBUTES we can
have different attributes for different pages in the same user-provided
range, in which case we would have to either return a list or just an
error number. Or should we only support per-page attributes for the getter?

Chao
> 
> > but sounds like we need a KVM_UNSET_MEMORY_ATTR.
> 
> No need if the setter operates on all attributes.
> 
> > Since we are exposing the attribute directly to userspace I also think
> > we'd better treat shared memory as the default, so even when the private
> > memory is not used, the bit can still be meaningful. So define BIT(0) as
> > KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED.
> 
> Ah, right.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-11-08  1:35   ` Yuan Yao
@ 2022-11-08  9:41     ` Chao Peng
  2022-11-09  5:52       ` Yuan Yao
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-08  9:41 UTC (permalink / raw)
  To: Yuan Yao
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Nov 08, 2022 at 09:35:06AM +0800, Yuan Yao wrote:
> On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote:
> > Introduce generic private memory register/unregister by reusing the existing
> > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from the SEV case
> > by treating the address in the region as a gpa instead of an hva. Which path
> > these ioctls take is determined by kvm_arch_has_private_mem(). An
> > architecture which supports KVM_PRIVATE_MEM should override this function.
> >
> > KVM internally defaults all guest memory to private memory and maintains
> > the shared memory in 'mem_attr_array'. The above ioctls operate on this
> > field and unmap existing mappings, if any.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst |  17 ++-
> >  arch/x86/kvm/Kconfig           |   1 +
> >  include/linux/kvm_host.h       |  10 +-
> >  virt/kvm/Kconfig               |   4 +
> >  virt/kvm/kvm_main.c            | 227 +++++++++++++++++++++++++--------
> >  5 files changed, 198 insertions(+), 61 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 975688912b8c..08253cf498d1 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> >  This ioctl can be used to register a guest memory region which may
> >  contain encrypted data (e.g. guest RAM, SMRAM etc).
> >
> > -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> > -memory region may contain encrypted data. The SEV memory encryption
> > -engine uses a tweak such that two identical plaintext pages, each at
> > -different locations will have differing ciphertexts. So swapping or
> > +Currently this ioctl supports registering memory regions for two usages:
> > +private memory and SEV-encrypted memory.
> > +
> > +When private memory is enabled, this ioctl is used to register guest private
> > +memory region and the addr/size of kvm_enc_region represents guest physical
> > +address (GPA). In this usage, this ioctl zaps the existing guest memory
> > +mappings in KVM that fall into the region.
> > +
> > +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> > +memory region which may contain encrypted data for a SEV-enabled guest. The
> > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> > +memory encryption engine uses a tweak such that two identical plaintext pages,
> > +each at different locations will have differing ciphertexts. So swapping or
> >  moving ciphertext of those pages will not result in plaintext being
> >  swapped. So relocating (or migrating) physical backing pages for the SEV
> >  guest will require some additional steps.
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 8d2bd455c0cd..73fdfa429b20 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -51,6 +51,7 @@ config KVM
> >  	select HAVE_KVM_PM_NOTIFIER if PM
> >  	select HAVE_KVM_RESTRICTED_MEM if X86_64
> >  	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > +	select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
> >  	help
> >  	  Support hosting fully virtualized guest machines using hardware
> >  	  virtualization extensions.  You will need a fairly recent
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 79e5cbc35fcf..4ce98fa0153c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > +
> > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
> >  struct kvm_gfn_range {
> >  	struct kvm_memory_slot *slot;
> >  	gfn_t start;
> > @@ -254,6 +255,9 @@ struct kvm_gfn_range {
> >  	bool may_block;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +#endif
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -794,6 +798,9 @@ struct kvm {
> >  	struct notifier_block pm_notifier;
> >  #endif
> >  	char stats_id[KVM_STATS_NAME_SIZE];
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +	struct xarray mem_attr_array;
> > +#endif
> >  };
> >
> >  #define kvm_err(fmt, ...) \
> > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> >  int kvm_arch_post_init_vm(struct kvm *kvm);
> >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> >  /*
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 9ff164c7e0cc..69ca59e82149 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
> >
> >  config HAVE_KVM_RESTRICTED_MEM
> >         bool
> > +
> > +config KVM_GENERIC_PRIVATE_MEM
> > +       bool
> > +       depends on HAVE_KVM_RESTRICTED_MEM
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 09c9cdeb773c..fc3835826ace 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > +							    gfn_t end)
> > +{
> > +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +		kvm->mmu_invalidate_range_start = start;
> > +		kvm->mmu_invalidate_range_end = end;
> > +	} else {
> > +		/*
> > +		 * Fully tracking multiple concurrent ranges has diminishing
> > +		 * returns. Keep things simple and just find the minimal range
> > +		 * which includes the current and new ranges. As there won't be
> > +		 * enough information to subtract a range after its invalidate
> > +		 * completes, any ranges invalidated concurrently will
> > +		 * accumulate and persist until all outstanding invalidates
> > +		 * complete.
> > +		 */
> > +		kvm->mmu_invalidate_range_start =
> > +			min(kvm->mmu_invalidate_range_start, start);
> > +		kvm->mmu_invalidate_range_end =
> > +			max(kvm->mmu_invalidate_range_end, end);
> > +	}
> > +}
> > +
> > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	/*
> > +	 * The count increase must become visible at unlock time as no
> > +	 * spte can be established without taking the mmu_lock and
> > +	 * count is also read inside the mmu_lock critical section.
> > +	 */
> > +	kvm->mmu_invalidate_in_progress++;
> > +}
> > +
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	mark_invalidate_in_progress(kvm, start, end);
> > +	update_invalidate_range(kvm, start, end);
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	/*
> > +	 * This sequence increase will notify the kvm page fault that
> > +	 * the page that is going to be mapped in the spte could have
> > +	 * been freed.
> > +	 */
> > +	kvm->mmu_invalidate_seq++;
> > +	smp_wmb();
> > +	/*
> > +	 * The above sequence increase must be visible before the
> > +	 * below count decrease, which is ensured by the smp_wmb above
> > +	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > +	 */
> > +	kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> >  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >  {
> > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >
> > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > -							    gfn_t end)
> > -{
> > -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > -		kvm->mmu_invalidate_range_start = start;
> > -		kvm->mmu_invalidate_range_end = end;
> > -	} else {
> > -		/*
> > -		 * Fully tracking multiple concurrent ranges has diminishing
> > -		 * returns. Keep things simple and just find the minimal range
> > -		 * which includes the current and new ranges. As there won't be
> > -		 * enough information to subtract a range after its invalidate
> > -		 * completes, any ranges invalidated concurrently will
> > -		 * accumulate and persist until all outstanding invalidates
> > -		 * complete.
> > -		 */
> > -		kvm->mmu_invalidate_range_start =
> > -			min(kvm->mmu_invalidate_range_start, start);
> > -		kvm->mmu_invalidate_range_end =
> > -			max(kvm->mmu_invalidate_range_end, end);
> > -	}
> > -}
> > -
> > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > -	/*
> > -	 * The count increase must become visible at unlock time as no
> > -	 * spte can be established without taking the mmu_lock and
> > -	 * count is also read inside the mmu_lock critical section.
> > -	 */
> > -	kvm->mmu_invalidate_in_progress++;
> > -}
> > -
> >  static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >  	update_invalidate_range(kvm, range->start, range->end);
> >  	return kvm_unmap_gfn_range(kvm, range);
> >  }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > -	mark_invalidate_in_progress(kvm, start, end);
> > -	update_invalidate_range(kvm, start, end);
> > -}
> > -
> >  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  					const struct mmu_notifier_range *range)
> >  {
> > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	return 0;
> >  }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > -	/*
> > -	 * This sequence increase will notify the kvm page fault that
> > -	 * the page that is going to be mapped in the spte could have
> > -	 * been freed.
> > -	 */
> > -	kvm->mmu_invalidate_seq++;
> > -	smp_wmb();
> > -	/*
> > -	 * The above sequence increase must be visible before the
> > -	 * below count decrease, which is ensured by the smp_wmb above
> > -	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > -	 */
> > -	kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  					const struct mmu_notifier_range *range)
> >  {
> > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	struct kvm_gfn_range gfn_range;
> > +	struct kvm_memory_slot *slot;
> > +	struct kvm_memslots *slots;
> > +	struct kvm_memslot_iter iter;
> > +	int i;
> > +	int r = 0;
> > +
> > +	gfn_range.pte = __pte(0);
> > +	gfn_range.may_block = true;
> > +
> > +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +		slots = __kvm_memslots(kvm, i);
> > +
> > +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > +			slot = iter.slot;
> > +			gfn_range.start = max(start, slot->base_gfn);
> > +			gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > +			if (gfn_range.start >= gfn_range.end)
> > +				continue;
> > +			gfn_range.slot = slot;
> > +
> > +			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > +		}
> > +	}
> > +
> > +	if (r)
> > +		kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > +#define KVM_MEM_ATTR_SHARED	0x0001
> > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > +				     bool is_private)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	int idx;
> > +	int r = 0;
> > +
> > +	if (size == 0 || gpa + size < gpa)
> > +		return -EINVAL;
> > +	if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > +		return -EINVAL;
> > +
> > +	start = gpa >> PAGE_SHIFT;
> > +	end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	/*
> > +	 * Guest memory defaults to private, kvm->mem_attr_array only stores
> > +	 * shared memory.
> > +	 */
> > +	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +
> > +	idx = srcu_read_lock(&kvm->srcu);
> > +	KVM_MMU_LOCK(kvm);
> > +	kvm_mmu_invalidate_begin(kvm, start, end);
> > +
> > +	for (i = start; i < end; i++) {
> > +		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT));
> > +		if (r)
> > +			goto err;
> > +	}
> > +
> > +	kvm_unmap_mem_range(kvm, start, end);
> 
> The lock is held by KVM_MMU_LOCK(), so how about doing
> kvm_mmu_invalidate_begin() after changing the xarray:
> 
> kvm_mmu_invalidate_begin(kvm, start, end);
> kvm_unmap_mem_range(kvm, start, end);
> kvm_mmu_invalidate_end(kvm, start, end);
> 
> Also, the error handling path wouldn't need to care about it then.

The mem_attr_array is consumed in the page fault handler (i.e.
kvm_mem_is_private() in patch 08), so it should also be protected by
kvm_mmu_invalidate_begin/end(). E.g. if we change the mem_attr_array here
after the page fault handler has read it, mmu_invalidate_retry_gfn()
should return 1 to make the page fault handler retry the fault.
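
I.e. roughly the ordering relied on from the fault side (rough sketch only,
not the actual patch code):

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	is_private = kvm_mem_is_private(kvm, gfn);
	/* ... resolve the pfn ... */

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn))
		/* mem_attr_array may have changed under us, retry the fault */
		goto retry;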

> 
> > +
> > +	goto ret;
> > +err:
> > +	for (; i > start; i--)
> > +		xa_erase(&kvm->mem_attr_array, i);
> 
> the entry at 'start' should also be erased here. Since i is an
> unsigned long and start can be 0, an 'i >= start' loop would underflow,
> so another variable, e.g. j, may be needed for this.

Ah, right!

Thanks,
Chao
> 
> > +ret:
> > +	kvm_mmu_invalidate_end(kvm, start, end);
> > +	KVM_MMU_UNLOCK(kvm);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> > +
> > +	return r;
> > +}
> > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> > +
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> >  static int kvm_pm_notifier_call(struct notifier_block *bl,
> >  				unsigned long state,
> > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >  	spin_lock_init(&kvm->mn_invalidate_lock);
> >  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >  	xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +	xa_init(&kvm->mem_attr_array);
> > +#endif
> >
> >  	INIT_LIST_HEAD(&kvm->gpc_list);
> >  	spin_lock_init(&kvm->gpc_lock);
> > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >  		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> >  		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> >  	}
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +	xa_destroy(&kvm->mem_attr_array);
> > +#endif
> >  	cleanup_srcu_struct(&kvm->irq_srcu);
> >  	cleanup_srcu_struct(&kvm->srcu);
> >  	kvm_arch_free_vm(kvm);
> > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> >  	}
> >  }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > +	return false;
> > +}
> > +
> >  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >  {
> >  	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >  		break;
> >  	}
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > +		struct kvm_enc_region region;
> > +		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > +
> > +		if (!kvm_arch_has_private_mem(kvm))
> > +			goto arch_vm_ioctl;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&region, argp, sizeof(region)))
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > +					      region.size, set);
> > +		break;
> > +	}
> > +#endif
> >  	case KVM_GET_DIRTY_LOG: {
> >  		struct kvm_dirty_log log;
> >
> > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		r = kvm_vm_ioctl_get_stats_fd(kvm);
> >  		break;
> >  	default:
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +arch_vm_ioctl:
> > +#endif
> >  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> >  	}
> >  out:
> > --
> > 2.25.1
> >
> >


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-10-25 15:13 ` [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
  2022-10-26 20:46   ` Isaku Yamahata
@ 2022-11-08 12:08   ` Yuan Yao
  2022-11-09  4:13     ` Chao Peng
  1 sibling, 1 reply; 101+ messages in thread
From: Yuan Yao @ 2022-11-08 12:08 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 11:13:42PM +0800, Chao Peng wrote:
> When private/shared memory is mixed in a large page, the lpage_info may
> not be accurate and should be updated with this mixed info. A large page
> that has mixed pages can't really be mapped as a large page since its
> private/shared pages come from different physical memory.
>
> Update lpage_info when the private/shared memory attribute is changed. If
> both private and shared pages are within a large page region, it can't
> be mapped as a large page. It's a bit of a challenge to track the mixed
> info in a 'count'-like variable, so this patch instead reserves a bit in
> 'disallow_lpage' to indicate a large page has mixed private/shared pages.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   8 +++
>  arch/x86/kvm/mmu/mmu.c          | 112 +++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   2 +
>  include/linux/kvm_host.h        |  19 ++++++
>  virt/kvm/kvm_main.c             |  16 +++--
>  5 files changed, 152 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7551b6f9c31c..db811a54e3fd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -37,6 +37,7 @@
>  #include <asm/hyperv-tlfs.h>
>
>  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
>
>  #define KVM_MAX_VCPUS 1024
>
> @@ -952,6 +953,13 @@ struct kvm_vcpu_arch {
>  #endif
>  };
>
> +/*
> + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> + * level. The remaining bits are used as a reference count.
> + */
> +#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
> +#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
> +
>  struct kvm_lpage_info {
>  	int disallow_lpage;
>  };
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 33b1aec44fb8..67a9823a8c35 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -762,11 +762,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
>  {
>  	struct kvm_lpage_info *linfo;
>  	int i;
> +	int disallow_count;
>
>  	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
>  		linfo = lpage_info_slot(gfn, slot, i);
> +
> +		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> +		WARN_ON(disallow_count + count < 0 ||
> +			disallow_count > KVM_LPAGE_COUNT_MAX - count);
> +
>  		linfo->disallow_lpage += count;
> -		WARN_ON(linfo->disallow_lpage < 0);
>  	}
>  }
>
> @@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  	if (kvm->arch.nx_lpage_recovery_thread)
>  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
>  }
> +
> +static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> +{
> +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> +{
> +	if (mixed)
> +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +	else
> +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
> +				 gfn_t start, gfn_t end)
> +{
> +	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	gfn_t gfn = start;
> +	void *entry;
> +	bool shared = attr == KVM_MEM_ATTR_SHARED;
> +	bool mixed = false;
> +
> +	rcu_read_lock();
> +	entry = xas_load(&xas);
> +	while (gfn < end) {
> +		if (xas_retry(&xas, entry))
> +			continue;
> +
> +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> +
> +		if ((entry && !shared) || (!entry && shared)) {
> +			mixed = true;
> +			goto out;
> +		}
> +
> +		entry = xas_next(&xas);
> +		gfn++;
> +	}
> +out:
> +	rcu_read_unlock();
> +	return mixed;
> +}
> +
> +static bool mem_attr_is_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			      int level, unsigned int attr,
> +			      gfn_t start, gfn_t end)
> +{
> +	unsigned long gfn;
> +	void *entry;
> +
> +	if (level == PG_LEVEL_2M)
> +		return mem_attr_is_mixed_2m(kvm, attr, start, end);
> +
> +	entry = xa_load(&kvm->mem_attr_array, start);
> +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)))
> +			return true;
> +		if (xa_load(&kvm->mem_attr_array, gfn) != entry)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			      unsigned int attr, gfn_t start, gfn_t end)
> +{
> +
> +	unsigned long lpage_start, lpage_end;
> +	unsigned long gfn, pages, mask;
> +	int level;
> +
> +	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> +			"Unsupported mem attribute.\n");
> +
> +	/*
> +	 * The sequence matters here: we update the higher level basing on the
> +	 * lower level's scanning result.
> +	 */
> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +		pages = KVM_PAGES_PER_HPAGE(level);
> +		mask = ~(pages - 1);
> +		lpage_start = max(start & mask, slot->base_gfn);
> +		lpage_end = (end - 1) & mask;
> +
> +		/*
> +		 * We only need to scan the head and tail page, for middle pages
> +		 * we know they are not mixed.
> +		 */
> +		linfo_update_mixed(lpage_info_slot(lpage_start, slot, level),
> +				   mem_attr_is_mixed(kvm, slot, level, attr,
> +						     lpage_start, start));

Looks like querying only [lpage_start, start) is not enough:

A and B are private gfns under the same lpage_start as below, A > B:
lpage_start
       |---------A
       |----B

Converting A to shared makes the upper 2M page MIXED.
Converting B to shared also makes the upper 2M page MIXED.
Converting B back to private then marks the upper 2M page as non-MIXED,
which is incorrect because A is still shared.

Same for the tail case.
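
For what it's worth, the failure mode is easy to reproduce with a tiny
standalone model that re-scans the whole 2M worth of per-page attributes on
every conversion; the names and the 512-page size are illustrative only and
this is not the KVM code:

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_2M 512

static bool page_shared[PAGES_PER_2M];	/* per-gfn attribute in one 2M page */

/*
 * Whole-large-page scan: the head page has to be re-checked over
 * [lpage_start, lpage_start + pages), not just [lpage_start, start).
 */
static bool lpage_is_mixed(void)
{
	bool first = page_shared[0];
	int i;

	for (i = 1; i < PAGES_PER_2M; i++)
		if (page_shared[i] != first)
			return true;
	return false;
}

int main(void)
{
	int A = 100, B = 10;		/* two gfn offsets in the same 2M page */

	page_shared[A] = true;		/* convert A to shared */
	printf("after A shared:  mixed=%d\n", lpage_is_mixed());
	page_shared[B] = true;		/* convert B to shared */
	printf("after B shared:  mixed=%d\n", lpage_is_mixed());
	page_shared[B] = false;		/* convert B back to private */
	printf("after B private: mixed=%d\n", lpage_is_mixed());	/* still 1 */
	return 0;
}

Scanning only up to the converted gfn would miss A in the last step and
clear the MIXED bit by mistake, which is exactly the scenario above.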

> +
> +		if (lpage_start == lpage_end)
> +			return;
> +
> +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> +			linfo_update_mixed(lpage_info_slot(gfn, slot, level),
> +					   false);
> +
> +		linfo_update_mixed(lpage_info_slot(lpage_end, slot, level),
> +				   mem_attr_is_mixed(kvm, slot, level, attr,
> +						     end, lpage_end + pages));
> +	}
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 02ad31f46dd7..4276ca73bd7b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12563,6 +12563,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
>  			linfo[lpages - 1].disallow_lpage = 1;
>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> +		if (kvm_slot_can_be_private(slot))
> +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
>  		/*
>  		 * If the gfn and userspace address are not aligned wrt each
>  		 * other, disable large page support for this slot.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ce98fa0153c..6ce36065532c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2284,4 +2284,23 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> +
> +#define KVM_MEM_ATTR_SHARED	0x0001
> +#define KVM_MEM_ATTR_PRIVATE	0x0002
> +
> +#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
> +void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			      unsigned int attr, gfn_t start, gfn_t end);
> +#else
> +static inline void kvm_arch_update_mem_attr(struct kvm *kvm,
> +					    struct kvm_memory_slot *slot,
> +					    unsigned int attr,
> +					    gfn_t start, gfn_t end)
> +{
> +}
> +#endif
> +
> +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> +
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fc3835826ace..13a37b4d9e97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -939,7 +939,8 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
>
> -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> +				unsigned int attr)
>  {
>  	struct kvm_gfn_range gfn_range;
>  	struct kvm_memory_slot *slot;
> @@ -963,6 +964,7 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
>  			gfn_range.slot = slot;
>
>  			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +			kvm_arch_update_mem_attr(kvm, slot, attr, start, end);
>  		}
>  	}
>
> @@ -970,7 +972,6 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
>  		kvm_flush_remote_tlbs(kvm);
>  }
>
> -#define KVM_MEM_ATTR_SHARED	0x0001
>  static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  				     bool is_private)
>  {
> @@ -979,6 +980,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  	void *entry;
>  	int idx;
>  	int r = 0;
> +	unsigned int attr;
>
>  	if (size == 0 || gpa + size < gpa)
>  		return -EINVAL;
> @@ -992,7 +994,13 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  	 * Guest memory defaults to private, kvm->mem_attr_array only stores
>  	 * shared memory.
>  	 */
> -	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +	if (is_private) {
> +		attr = KVM_MEM_ATTR_PRIVATE;
> +		entry = NULL;
> +	} else {
> +		attr = KVM_MEM_ATTR_SHARED;
> +		entry = xa_mk_value(KVM_MEM_ATTR_SHARED);
> +	}
>
>  	idx = srcu_read_lock(&kvm->srcu);
>  	KVM_MMU_LOCK(kvm);
> @@ -1005,7 +1013,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  			goto err;
>  	}
>
> -	kvm_unmap_mem_range(kvm, start, end);
> +	kvm_unmap_mem_range(kvm, start, end, attr);
>
>  	goto ret;
>  err:
> --
> 2.25.1
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-11-08 12:08   ` Yuan Yao
@ 2022-11-09  4:13     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-09  4:13 UTC (permalink / raw)
  To: Yuan Yao
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Nov 08, 2022 at 08:08:05PM +0800, Yuan Yao wrote:
> On Tue, Oct 25, 2022 at 11:13:42PM +0800, Chao Peng wrote:
> > When private/shared memory is mixed in a large page, the lpage_info may
> > not be accurate and should be updated with this mixed info. A large page
> > that has mixed pages can't really be mapped as a large page since its
> > private/shared pages come from different physical memory.
> >
> > Update lpage_info when the private/shared memory attribute is changed. If
> > both private and shared pages are within a large page region, it can't
> > be mapped as a large page. It's a bit of a challenge to track the mixed
> > info in a 'count'-like variable, so this patch instead reserves a bit in
> > 'disallow_lpage' to indicate a large page has mixed private/shared pages.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |   8 +++
> >  arch/x86/kvm/mmu/mmu.c          | 112 +++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/x86.c              |   2 +
> >  include/linux/kvm_host.h        |  19 ++++++
> >  virt/kvm/kvm_main.c             |  16 +++--
> >  5 files changed, 152 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 7551b6f9c31c..db811a54e3fd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -37,6 +37,7 @@
> >  #include <asm/hyperv-tlfs.h>
> >
> >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > +#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
> >
> >  #define KVM_MAX_VCPUS 1024
> >
> > @@ -952,6 +953,13 @@ struct kvm_vcpu_arch {
> >  #endif
> >  };
> >
> > +/*
> > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> > + * level. The remaining bits are used as a reference count.
> > + */
> > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
> > +#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
> > +
> >  struct kvm_lpage_info {
> >  	int disallow_lpage;
> >  };
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 33b1aec44fb8..67a9823a8c35 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -762,11 +762,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> >  {
> >  	struct kvm_lpage_info *linfo;
> >  	int i;
> > +	int disallow_count;
> >
> >  	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> >  		linfo = lpage_info_slot(gfn, slot, i);
> > +
> > +		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > +		WARN_ON(disallow_count + count < 0 ||
> > +			disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > +
> >  		linfo->disallow_lpage += count;
> > -		WARN_ON(linfo->disallow_lpage < 0);
> >  	}
> >  }
> >
> > @@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> >  	if (kvm->arch.nx_lpage_recovery_thread)
> >  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> >  }
> > +
> > +static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > +{
> > +	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> > +{
> > +	if (mixed)
> > +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +	else
> > +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
> > +				 gfn_t start, gfn_t end)
> > +{
> > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > +	gfn_t gfn = start;
> > +	void *entry;
> > +	bool shared = attr == KVM_MEM_ATTR_SHARED;
> > +	bool mixed = false;
> > +
> > +	rcu_read_lock();
> > +	entry = xas_load(&xas);
> > +	while (gfn < end) {
> > +		if (xas_retry(&xas, entry))
> > +			continue;
> > +
> > +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > +		if ((entry && !shared) || (!entry && shared)) {
> > +			mixed = true;
> > +			goto out;
> > +		}
> > +
> > +		entry = xas_next(&xas);
> > +		gfn++;
> > +	}
> > +out:
> > +	rcu_read_unlock();
> > +	return mixed;
> > +}
> > +
> > +static bool mem_attr_is_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			      int level, unsigned int attr,
> > +			      gfn_t start, gfn_t end)
> > +{
> > +	unsigned long gfn;
> > +	void *entry;
> > +
> > +	if (level == PG_LEVEL_2M)
> > +		return mem_attr_is_mixed_2m(kvm, attr, start, end);
> > +
> > +	entry = xa_load(&kvm->mem_attr_array, start);
> > +	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> > +		if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)))
> > +			return true;
> > +		if (xa_load(&kvm->mem_attr_array, gfn) != entry)
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			      unsigned int attr, gfn_t start, gfn_t end)
> > +{
> > +
> > +	unsigned long lpage_start, lpage_end;
> > +	unsigned long gfn, pages, mask;
> > +	int level;
> > +
> > +	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> > +			"Unsupported mem attribute.\n");
> > +
> > +	/*
> > +	 * The sequence matters here: we update the higher level basing on the
> > +	 * lower level's scanning result.
> > +	 */
> > +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +		pages = KVM_PAGES_PER_HPAGE(level);
> > +		mask = ~(pages - 1);
> > +		lpage_start = max(start & mask, slot->base_gfn);
> > +		lpage_end = (end - 1) & mask;
> > +
> > +		/*
> > +		 * We only need to scan the head and tail page, for middle pages
> > +		 * we know they are not mixed.
> > +		 */
> > +		linfo_update_mixed(lpage_info_slot(lpage_start, slot, level),
> > +				   mem_attr_is_mixed(kvm, slot, level, attr,
> > +						     lpage_start, start));
> 
> Looks like querying only [lpage_start, start) is not enough:
> 
> A and B are private gfns under the same lpage_start as below, A > B:
> lpage_start
>        |---------A
>        |----B
> 
> Converting A to shared makes the upper 2M page MIXED.
> Converting B to shared also makes the upper 2M page MIXED.
> Converting B back to private then marks the upper 2M page as non-MIXED,
> which is incorrect because A is still shared.

In previous versions this was actually "lpage_start, lpage_start +
pages", i.e. it covered the whole large page. While fixing another
issue[*] in v8 this was wrongly changed to "lpage_start, start"; at that
time I assumed that "end > lpage_start + pages", so scanning the rest of
the same large page would be pointless once we know what attribute we are
about to set, but that is definitely not true.

[*]
https://lore.kernel.org/linux-mm/20220930085914.GA2799703@chaop.bj.intel.com/

Thanks,
Chao
> 
> Same for the tail case.
> 
> > +
> > +		if (lpage_start == lpage_end)
> > +			return;
> > +
> > +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> > +			linfo_update_mixed(lpage_info_slot(gfn, slot, level),
> > +					   false);
> > +
> > +		linfo_update_mixed(lpage_info_slot(lpage_end, slot, level),
> > +				   mem_attr_is_mixed(kvm, slot, level, attr,
> > +						     end, lpage_end + pages));
> > +	}
> > +}
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 02ad31f46dd7..4276ca73bd7b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12563,6 +12563,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> >  		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> >  			linfo[lpages - 1].disallow_lpage = 1;
> >  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > +		if (kvm_slot_can_be_private(slot))
> > +			ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> >  		/*
> >  		 * If the gfn and userspace address are not aligned wrt each
> >  		 * other, disable large page support for this slot.
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 4ce98fa0153c..6ce36065532c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2284,4 +2284,23 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +
> > +#define KVM_MEM_ATTR_SHARED	0x0001
> > +#define KVM_MEM_ATTR_PRIVATE	0x0002
> > +
> > +#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
> > +void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			      unsigned int attr, gfn_t start, gfn_t end);
> > +#else
> > +static inline void kvm_arch_update_mem_attr(struct kvm *kvm,
> > +					    struct kvm_memory_slot *slot,
> > +					    unsigned int attr,
> > +					    gfn_t start, gfn_t end)
> > +{
> > +}
> > +#endif
> > +
> > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> > +
> >  #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fc3835826ace..13a37b4d9e97 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -939,7 +939,8 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> >  #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> >
> > -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> > +				unsigned int attr)
> >  {
> >  	struct kvm_gfn_range gfn_range;
> >  	struct kvm_memory_slot *slot;
> > @@ -963,6 +964,7 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> >  			gfn_range.slot = slot;
> >
> >  			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > +			kvm_arch_update_mem_attr(kvm, slot, attr, start, end);
> >  		}
> >  	}
> >
> > @@ -970,7 +972,6 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> >  		kvm_flush_remote_tlbs(kvm);
> >  }
> >
> > -#define KVM_MEM_ATTR_SHARED	0x0001
> >  static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> >  				     bool is_private)
> >  {
> > @@ -979,6 +980,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> >  	void *entry;
> >  	int idx;
> >  	int r = 0;
> > +	unsigned int attr;
> >
> >  	if (size == 0 || gpa + size < gpa)
> >  		return -EINVAL;
> > @@ -992,7 +994,13 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> >  	 * Guest memory defaults to private, kvm->mem_attr_array only stores
> >  	 * shared memory.
> >  	 */
> > -	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +	if (is_private) {
> > +		attr = KVM_MEM_ATTR_PRIVATE;
> > +		entry = NULL;
> > +	} else {
> > +		attr = KVM_MEM_ATTR_SHARED;
> > +		entry = xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +	}
> >
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	KVM_MMU_LOCK(kvm);
> > @@ -1005,7 +1013,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> >  			goto err;
> >  	}
> >
> > -	kvm_unmap_mem_range(kvm, start, end);
> > +	kvm_unmap_mem_range(kvm, start, end, attr);
> >
> >  	goto ret;
> >  err:
> > --
> > 2.25.1
> >


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-11-08  9:41     ` Chao Peng
@ 2022-11-09  5:52       ` Yuan Yao
  0 siblings, 0 replies; 101+ messages in thread
From: Yuan Yao @ 2022-11-09  5:52 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Nov 08, 2022 at 05:41:41PM +0800, Chao Peng wrote:
> On Tue, Nov 08, 2022 at 09:35:06AM +0800, Yuan Yao wrote:
> > On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote:
> > > Introduce generic private memory register/unregister by reusing the existing
> > > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from the SEV case
> > > by treating the address in the region as a gpa instead of an hva. Which case
> > > these ioctls fall into is determined by kvm_arch_has_private_mem();
> > > architectures which support KVM_PRIVATE_MEM should override this function.
> > >
> > > KVM internally defaults all guest memory to private memory and maintains
> > > the shared memory in 'mem_attr_array'. The above ioctls operate on this
> > > field and unmap existing mappings, if any.
> > >
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >  Documentation/virt/kvm/api.rst |  17 ++-
> > >  arch/x86/kvm/Kconfig           |   1 +
> > >  include/linux/kvm_host.h       |  10 +-
> > >  virt/kvm/Kconfig               |   4 +
> > >  virt/kvm/kvm_main.c            | 227 +++++++++++++++++++++++++--------
> > >  5 files changed, 198 insertions(+), 61 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index 975688912b8c..08253cf498d1 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> > >  This ioctl can be used to register a guest memory region which may
> > >  contain encrypted data (e.g. guest RAM, SMRAM etc).
> > >
> > > -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> > > -memory region may contain encrypted data. The SEV memory encryption
> > > -engine uses a tweak such that two identical plaintext pages, each at
> > > -different locations will have differing ciphertexts. So swapping or
> > > +Currently this ioctl supports registering memory regions for two usages:
> > > +private memory and SEV-encrypted memory.
> > > +
> > > +When private memory is enabled, this ioctl is used to register guest private
> > > +memory region and the addr/size of kvm_enc_region represents guest physical
> > > +address (GPA). In this usage, this ioctl zaps the existing guest memory
> > > +mappings in KVM that fallen into the region.
> > > +
> > > +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> > > +memory region which may contain encrypted data for a SEV-enabled guest. The
> > > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> > > +memory encryption engine uses a tweak such that two identical plaintext pages,
> > > +each at different locations will have differing ciphertexts. So swapping or
> > >  moving ciphertext of those pages will not result in plaintext being
> > >  swapped. So relocating (or migrating) physical backing pages for the SEV
> > >  guest will require some additional steps.
> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > index 8d2bd455c0cd..73fdfa429b20 100644
> > > --- a/arch/x86/kvm/Kconfig
> > > +++ b/arch/x86/kvm/Kconfig
> > > @@ -51,6 +51,7 @@ config KVM
> > >  	select HAVE_KVM_PM_NOTIFIER if PM
> > >  	select HAVE_KVM_RESTRICTED_MEM if X86_64
> > >  	select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > +	select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
> > >  	help
> > >  	  Support hosting fully virtualized guest machines using hardware
> > >  	  virtualization extensions.  You will need a fairly recent
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 79e5cbc35fcf..4ce98fa0153c 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > >  #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > +
> > > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
> > >  struct kvm_gfn_range {
> > >  	struct kvm_memory_slot *slot;
> > >  	gfn_t start;
> > > @@ -254,6 +255,9 @@ struct kvm_gfn_range {
> > >  	bool may_block;
> > >  };
> > >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +#endif
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -794,6 +798,9 @@ struct kvm {
> > >  	struct notifier_block pm_notifier;
> > >  #endif
> > >  	char stats_id[KVM_STATS_NAME_SIZE];
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +	struct xarray mem_attr_array;
> > > +#endif
> > >  };
> > >
> > >  #define kvm_err(fmt, ...) \
> > > @@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > >  int kvm_arch_post_init_vm(struct kvm *kvm);
> > >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > >  /*
> > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > index 9ff164c7e0cc..69ca59e82149 100644
> > > --- a/virt/kvm/Kconfig
> > > +++ b/virt/kvm/Kconfig
> > > @@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
> > >
> > >  config HAVE_KVM_RESTRICTED_MEM
> > >         bool
> > > +
> > > +config KVM_GENERIC_PRIVATE_MEM
> > > +       bool
> > > +       depends on HAVE_KVM_RESTRICTED_MEM
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 09c9cdeb773c..fc3835826ace 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > > +							    gfn_t end)
> > > +{
> > > +	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +		kvm->mmu_invalidate_range_start = start;
> > > +		kvm->mmu_invalidate_range_end = end;
> > > +	} else {
> > > +		/*
> > > +		 * Fully tracking multiple concurrent ranges has diminishing
> > > +		 * returns. Keep things simple and just find the minimal range
> > > +		 * which includes the current and new ranges. As there won't be
> > > +		 * enough information to subtract a range after its invalidate
> > > +		 * completes, any ranges invalidated concurrently will
> > > +		 * accumulate and persist until all outstanding invalidates
> > > +		 * complete.
> > > +		 */
> > > +		kvm->mmu_invalidate_range_start =
> > > +			min(kvm->mmu_invalidate_range_start, start);
> > > +		kvm->mmu_invalidate_range_end =
> > > +			max(kvm->mmu_invalidate_range_end, end);
> > > +	}
> > > +}
> > > +
> > > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +	/*
> > > +	 * The count increase must become visible at unlock time as no
> > > +	 * spte can be established without taking the mmu_lock and
> > > +	 * count is also read inside the mmu_lock critical section.
> > > +	 */
> > > +	kvm->mmu_invalidate_in_progress++;
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +	mark_invalidate_in_progress(kvm, start, end);
> > > +	update_invalidate_range(kvm, start, end);
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +	/*
> > > +	 * This sequence increase will notify the kvm page fault that
> > > +	 * the page that is going to be mapped in the spte could have
> > > +	 * been freed.
> > > +	 */
> > > +	kvm->mmu_invalidate_seq++;
> > > +	smp_wmb();
> > > +	/*
> > > +	 * The above sequence increase must be visible before the
> > > +	 * below count decrease, which is ensured by the smp_wmb above
> > > +	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > +	 */
> > > +	kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > >  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > >  {
> > > @@ -715,51 +771,12 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > >  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > >  }
> > >
> > > -static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > > -							    gfn_t end)
> > > -{
> > > -	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > -		kvm->mmu_invalidate_range_start = start;
> > > -		kvm->mmu_invalidate_range_end = end;
> > > -	} else {
> > > -		/*
> > > -		 * Fully tracking multiple concurrent ranges has diminishing
> > > -		 * returns. Keep things simple and just find the minimal range
> > > -		 * which includes the current and new ranges. As there won't be
> > > -		 * enough information to subtract a range after its invalidate
> > > -		 * completes, any ranges invalidated concurrently will
> > > -		 * accumulate and persist until all outstanding invalidates
> > > -		 * complete.
> > > -		 */
> > > -		kvm->mmu_invalidate_range_start =
> > > -			min(kvm->mmu_invalidate_range_start, start);
> > > -		kvm->mmu_invalidate_range_end =
> > > -			max(kvm->mmu_invalidate_range_end, end);
> > > -	}
> > > -}
> > > -
> > > -static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > -	/*
> > > -	 * The count increase must become visible at unlock time as no
> > > -	 * spte can be established without taking the mmu_lock and
> > > -	 * count is also read inside the mmu_lock critical section.
> > > -	 */
> > > -	kvm->mmu_invalidate_in_progress++;
> > > -}
> > > -
> > >  static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > >  {
> > >  	update_invalidate_range(kvm, range->start, range->end);
> > >  	return kvm_unmap_gfn_range(kvm, range);
> > >  }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > -	mark_invalidate_in_progress(kvm, start, end);
> > > -	update_invalidate_range(kvm, start, end);
> > > -}
> > > -
> > >  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >  					const struct mmu_notifier_range *range)
> > >  {
> > > @@ -807,23 +824,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > >  	return 0;
> > >  }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > -	/*
> > > -	 * This sequence increase will notify the kvm page fault that
> > > -	 * the page that is going to be mapped in the spte could have
> > > -	 * been freed.
> > > -	 */
> > > -	kvm->mmu_invalidate_seq++;
> > > -	smp_wmb();
> > > -	/*
> > > -	 * The above sequence increase must be visible before the
> > > -	 * below count decrease, which is ensured by the smp_wmb above
> > > -	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > -	 */
> > > -	kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > >  					const struct mmu_notifier_range *range)
> > >  {
> > > @@ -937,6 +937,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> > >
> > >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> > >
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +	struct kvm_gfn_range gfn_range;
> > > +	struct kvm_memory_slot *slot;
> > > +	struct kvm_memslots *slots;
> > > +	struct kvm_memslot_iter iter;
> > > +	int i;
> > > +	int r = 0;
> > > +
> > > +	gfn_range.pte = __pte(0);
> > > +	gfn_range.may_block = true;
> > > +
> > > +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > +		slots = __kvm_memslots(kvm, i);
> > > +
> > > +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > +			slot = iter.slot;
> > > +			gfn_range.start = max(start, slot->base_gfn);
> > > +			gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > +			if (gfn_range.start >= gfn_range.end)
> > > +				continue;
> > > +			gfn_range.slot = slot;
> > > +
> > > +			r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > +		}
> > > +	}
> > > +
> > > +	if (r)
> > > +		kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > > +#define KVM_MEM_ATTR_SHARED	0x0001
> > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > +				     bool is_private)
> > > +{
> > > +	gfn_t start, end;
> > > +	unsigned long i;
> > > +	void *entry;
> > > +	int idx;
> > > +	int r = 0;
> > > +
> > > +	if (size == 0 || gpa + size < gpa)
> > > +		return -EINVAL;
> > > +	if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > > +		return -EINVAL;
> > > +
> > > +	start = gpa >> PAGE_SHIFT;
> > > +	end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > +
> > > +	/*
> > > +	 * Guest memory defaults to private, kvm->mem_attr_array only stores
> > > +	 * shared memory.
> > > +	 */
> > > +	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > > +
> > > +	idx = srcu_read_lock(&kvm->srcu);
> > > +	KVM_MMU_LOCK(kvm);
> > > +	kvm_mmu_invalidate_begin(kvm, start, end);
> > > +
> > > +	for (i = start; i < end; i++) {
> > > +		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > +				    GFP_KERNEL_ACCOUNT));
> > > +		if (r)
> > > +			goto err;
> > > +	}
> > > +
> > > +	kvm_unmap_mem_range(kvm, start, end);
> >
> > the lock is held by KVM_MMU_LOCK(), so how about doing
> > kvm_mmu_invalidate_begin() after changing the xarray:
> >
> > kvm_mmu_invalidate_begin(kvm, start, end);
> > kvm_unmap_mem_range(kvm, start, end);
> > kvm_mmu_invalidate_end(kvm, start, end);
> >
> > Also, the error handling path wouldn't need to care about it then.
>
> The mem_attr_array is consumed in the page fault handler (i.e.
> kvm_mem_is_private() in patch 08) so it should also be protected by
> kvm_mmu_invalidate_begin/end(). E.g. if we change the mem_attr_array here
> after the page fault handler has read the mem_attr_array, the
> mmu_invalidate_retry_gfn() should return 1 to let the page fault handler
> retry the fault.

You're right!
Even if the changes are undone by the error handling path, we still need
to make sure that users of mem_attr_array retry the fault, because a user
may have read some "stale" data (it's "stale" because the xarray is only
restored afterwards in the error case).

>
> >
> > > +
> > > +	goto ret;
> > > +err:
> > > +	for (; i > start; i--)
> > > +		xa_erase(&kvm->mem_attr_array, i);
> >
> > the entry at start isn't covered; and since i is unsigned long, simply
> > changing the condition to "i >= start" would never terminate when start
> > is 0, so another variable (e.g. j) may be needed for the unwind.
>
> Ah, right!
>
> Thanks,
> Chao
> >
> > > +ret:
> > > +	kvm_mmu_invalidate_end(kvm, start, end);
> > > +	KVM_MMU_UNLOCK(kvm);
> > > +	srcu_read_unlock(&kvm->srcu, idx);
> > > +
> > > +	return r;
> > > +}
> > > +#endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> > > +
> > >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > >  static int kvm_pm_notifier_call(struct notifier_block *bl,
> > >  				unsigned long state,
> > > @@ -1165,6 +1248,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > >  	spin_lock_init(&kvm->mn_invalidate_lock);
> > >  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > >  	xa_init(&kvm->vcpu_array);
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +	xa_init(&kvm->mem_attr_array);
> > > +#endif
> > >
> > >  	INIT_LIST_HEAD(&kvm->gpc_list);
> > >  	spin_lock_init(&kvm->gpc_lock);
> > > @@ -1338,6 +1424,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > >  		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> > >  		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> > >  	}
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +	xa_destroy(&kvm->mem_attr_array);
> > > +#endif
> > >  	cleanup_srcu_struct(&kvm->irq_srcu);
> > >  	cleanup_srcu_struct(&kvm->srcu);
> > >  	kvm_arch_free_vm(kvm);
> > > @@ -1541,6 +1630,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > >  	}
> > >  }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > +	return false;
> > > +}
> > > +
> > >  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > >  {
> > >  	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > >  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > >  		break;
> > >  	}
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > +		struct kvm_enc_region region;
> > > +		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > +
> > > +		if (!kvm_arch_has_private_mem(kvm))
> > > +			goto arch_vm_ioctl;
> > > +
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&region, argp, sizeof(region)))
> > > +			goto out;
> > > +
> > > +		r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > > +					      region.size, set);
> > > +		break;
> > > +	}
> > > +#endif
> > >  	case KVM_GET_DIRTY_LOG: {
> > >  		struct kvm_dirty_log log;
> > >
> > > @@ -4861,6 +4973,9 @@ static long kvm_vm_ioctl(struct file *filp,
> > >  		r = kvm_vm_ioctl_get_stats_fd(kvm);
> > >  		break;
> > >  	default:
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > +arch_vm_ioctl:
> > > +#endif
> > >  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > >  	}
> > >  out:
> > > --
> > > 2.25.1
> > >
> > >


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-11-08  0:41   ` Isaku Yamahata
@ 2022-11-09 15:54     ` Kirill A. Shutemov
  2022-11-15 14:36       ` Kirill A. Shutemov
  0 siblings, 1 reply; 101+ messages in thread
From: Kirill A. Shutemov @ 2022-11-09 15:54 UTC (permalink / raw)
  To: Isaku Yamahata, Hugh Dickins
  Cc: Vishal Annapurve, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
	Muchun Song, wei.w.wang

On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> Vishal Annapurve <vannapurve@google.com> wrote:
> 
> > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > >
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > >   - 01: mm change, target for mm tree
> > >   - 02-08: KVM change, target for KVM tree
> > >
> > > Given KVM is the only current user for the mm part, I have chatted with
> > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > reviewed-by/acked-by is still expected from the mm people.
> > >
> > > The patches have been verified in Intel TDX environment, but Vishal has
> > > done an excellent work on the selftests[4] which are dedicated for this
> > > series, making it possible to test this series without innovative
> > > hardware and fancy steps of building a VM environment. See Test section
> > > below for more info.
> > >
> > >
> > > Introduction
> > > ============
> > > KVM userspace being able to crash the host is horrible. Under current
> > > KVM architecture, all guest memory is inherently accessible from KVM
> > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > series is to provide a solution to align mm and KVM, on a userspace
> > > inaccessible approach of exposing guest memory.
> > >
> > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > table). This requires guest memory being mmaped into KVM userspace, but
> > > this is also the source where the mentioned crash issue can happen. In
> > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > memory doesn't have to be mmaped into KVM userspace.
> > >
> > > This series introduces fd-based guest memory which will not be mmaped
> > > into KVM userspace. KVM populates secondary page table by using a
> > 
> > With no mappings in place for the userspace VMM, IIUC, it looks like the
> > host kernel will not be able to find the culprit userspace process in case
> > of a machine check error on guest private memory. As implemented in
> > hwpoison_user_mappings, the host kernel tries to look at the processes
> > which have mapped the pfns with the hardware error.
> > 
> > Is a modification needed in the MCE handling logic of the host kernel to
> > immediately send a signal to the vcpu thread accessing the faulting pfn
> > backing guest private memory?
> 
> mce_register_decode_chain() can be used.  The MCE physical address
> (p->mce_addr) includes the host key id in addition to the real physical
> address.  By looking up the hkids used by KVM, we can determine whether the
> page is assigned to a guest TD or not.  If yes, send SIGBUS.
> 
> kvm_machine_check() can be enhanced for KVM-specific use.  This is before
> memory_failure() is called, though.
> 
> Any other ideas?

That's too KVM-centric. It will not work for other possible users of
restricted memfd.

I tried to find a way to get it right: we need to get the restricted memfd
code informed about the corrupted page so it can invalidate its users. On
the next request of the page the user will see an error. In the case of
KVM, the error will likely escalate to SIGBUS.

The problem is that the core-mm code that handles memory failure knows
nothing about restricted memfd. It only sees that the page belongs to a
normal memfd.

AFAICS, there's no way to get it intercepted from the shim level. The shmem
code has to be patched: shmem_error_remove_page() has to call into the
restricted memfd code.
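
To illustrate the kind of plumbing being discussed, here is a purely
hypothetical sketch in standalone C of a backing store forwarding a memory
failure to a restricted-memfd-style user so that the next page request
fails. Every name in it (rmem_user, backing_store_error and so on) is an
assumption invented for illustration; it is not the interface of this
series, of shmem, or of any kernel API.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user of a restricted memfd (think: KVM). */
struct rmem_user {
	/* Callback invoked when the page at pgoff is lost to an MCE. */
	void (*memory_failure)(struct rmem_user *u, unsigned long pgoff);
	bool poisoned[16];
};

/* Stand-in for shmem_error_remove_page() handing the failure upward. */
static void backing_store_error(struct rmem_user *u, unsigned long pgoff)
{
	u->memory_failure(u, pgoff);
}

/* The user marks the offset so later accesses see an error. */
static void kvm_like_memory_failure(struct rmem_user *u, unsigned long pgoff)
{
	u->poisoned[pgoff] = true;
}

static int rmem_get_page(struct rmem_user *u, unsigned long pgoff)
{
	/* In the KVM case this error would likely escalate to SIGBUS. */
	return u->poisoned[pgoff] ? -1 : 0;
}

int main(void)
{
	struct rmem_user user = { .memory_failure = kvm_like_memory_failure };

	backing_store_error(&user, 3);
	printf("rmem_get_page(3) -> %d\n", rmem_get_page(&user, 3));
	return 0;
}
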

Hugh, are you okay with this? Or maybe you have a better idea?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-11-08  7:16         ` Chao Peng
@ 2022-11-10 17:53           ` Sean Christopherson
  0 siblings, 0 replies; 101+ messages in thread
From: Sean Christopherson @ 2022-11-10 17:53 UTC (permalink / raw)
  To: Chao Peng
  Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Nov 08, 2022, Chao Peng wrote:
> On Fri, Nov 04, 2022 at 10:29:48PM +0000, Sean Christopherson wrote:
> > The APICv case that this was added for could very well be broken because of
> > this, and the resulting failures would be an absolute nightmare to debug.
> 
> Given that apicv_inhibit should be rare, the change looks good to me.
> Just to be clear, you will send out this fix, right?

Ya, I'll post an official patch.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-10-25 15:13 ` [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
  2022-10-27 10:29   ` Fuad Tabba
@ 2022-11-10 20:06   ` Sean Christopherson
  2022-11-11  8:27     ` Chao Peng
  1 sibling, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-10 20:06 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022, Chao Peng wrote:
> @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end)
> +static inline

Don't tag static functions with "inline" unless they're in headers, in which case
the inline is effectively required.  In pretty much every scenario, the compiler
can do a better job of optimizing inline vs. non-inline, i.e. odds are very good
the compiler would inline this helper anyways, and if not, there would likely be
a good reason not to inline it.

It'll be a moot point in this case (more below), but this would also reduce the
line length and avoid the wrap.

> void update_invalidate_range(struct kvm *kvm, gfn_t start,
> +							    gfn_t end)

I appreciate the effort to make this easier to read, but making such a big divergence
from the kernel's preferred formatting is often counter-productive, e.g. I blinked a
few times when first reading this code.

Again, moot point this time (still below ;-) ), but for future reference, better
options are to either let the line poke out or simply wrap early to get the
bundling of parameters that you want, e.g.

  static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, gfn_t end)

or 

  static inline void update_invalidate_range(struct kvm *kvm,
					     gfn_t start, gfn_t end)

>  {
> -	/*
> -	 * The count increase must become visible at unlock time as no
> -	 * spte can be established without taking the mmu_lock and
> -	 * count is also read inside the mmu_lock critical section.
> -	 */
> -	kvm->mmu_invalidate_in_progress++;
>  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>  		kvm->mmu_invalidate_range_start = start;
>  		kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	}
>  }
>  
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)

Splitting the helpers this way yields a weird API overall, e.g. it's possible
(common, actually) to have an "end" without a "begin".

Taking the range in the "end" is also dangerous/misleading/imbalanced, because _if_
there are multiple ranges in a batch, each range would need to be unwound
independently, e.g. the invocation of the "end" helper in
kvm_mmu_notifier_invalidate_range_end() is flat out wrong, it just doesn't cause
problems because KVM doesn't (currently) try to unwind regions (and probably never
will, but that's beside the point).

Rather than shunt what is effectively the "begin" into a separate helper, provide
three separate APIs, e.g. begin, range_add, end.  That way, begin+end don't take a
range and thus are symmetrical, always paired, and can't screw up unwinding since
they don't have a range to unwind.

It'll require three calls in every case, but that's not the end of the world since
none of these flows are super hot paths.

> +{
> +	/*
> +	 * The count increase must become visible at unlock time as no
> +	 * spte can be established without taking the mmu_lock and
> +	 * count is also read inside the mmu_lock critical section.
> +	 */
> +	kvm->mmu_invalidate_in_progress++;

This should invalidate (ha!) mmu_invalidate_range_{start,end}, and then WARN in
mmu_invalidate_retry() if the range isn't valid.  And the "add" helper should
WARN if mmu_invalidate_in_progress == 0.

> +}
> +
> +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)

"handle" is waaaay too generic.  Just match kvm_unmap_gfn_range() and call it
kvm_mmu_unmap_gfn_range().  This is a local function so it's unlikely to collide
with arch code, now or in the future.

> +{
> +	update_invalidate_range(kvm, range->start, range->end);
> +	return kvm_unmap_gfn_range(kvm, range);
> +}

Overall, this?  Compile tested only...

---
 arch/x86/kvm/mmu/mmu.c   |  8 +++++---
 include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
 virt/kvm/kvm_main.c      | 30 +++++++++++++++++++++---------
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 93c389eaf471..d4b373e3e524 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+	       mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_begin(kvm);
+
+	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
 						   gfn_end - gfn_start);
 
-	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_end(kvm);
 
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e6e66c5e56f2..29aa6d6827cc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -778,8 +778,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
-	unsigned long mmu_invalidate_range_start;
-	unsigned long mmu_invalidate_range_end;
+	gfn_t mmu_invalidate_range_start;
+	gfn_t mmu_invalidate_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1378,10 +1378,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1952,9 +1951,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 					   unsigned long mmu_seq,
-					   unsigned long hva)
+					   gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1963,10 +1962,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	 * that might be being invalidated. Note that it may include some false
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
-	if (unlikely(kvm->mmu_invalidate_in_progress) &&
-	    hva >= kvm->mmu_invalidate_range_start &&
-	    hva < kvm->mmu_invalidate_range_end)
-		return 1;
+	if (unlikely(kvm->mmu_invalidate_in_progress)) {
+		/*
+		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
+		 * but before updating the range is a KVM bug.
+		 */
+		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
+				 kvm->mmu_invalidate_range_end == INVALID_GPA))
+			return 1;
+
+		if (gfn >= kvm->mmu_invalidate_range_start &&
+		    gfn < kvm->mmu_invalidate_range_end)
+			return 1;
+	}
+
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 43bbe4fde078..e9e03b979f77 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
-
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_hva_range {
@@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm);
+
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
@@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -724,6 +722,15 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_invalidate_in_progress++;
+
+	kvm->mmu_invalidate_range_start = INVALID_GPA;
+	kvm->mmu_invalidate_range_end = INVALID_GPA;
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
 		kvm->mmu_invalidate_range_start = start;
 		kvm->mmu_invalidate_range_end = end;
@@ -744,6 +751,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	}
 }
 
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -752,7 +765,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
@@ -791,8 +804,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that

base-commit: d663b8a285986072428a6a145e5994bc275df994
-- 



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-11-10 20:06   ` Sean Christopherson
@ 2022-11-11  8:27     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-11  8:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Nov 10, 2022 at 08:06:33PM +0000, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >  
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -			      unsigned long end)
> > +static inline
> 
> Don't tag static functions with "inline" unless they're in headers, in which case
> the inline is effectively required.  In pretty much every scenario, the compiler
> can do a better job of optimizing inline vs. non-inline, i.e. odds are very good
> the compiler would inline this helper anyways, and if not, there would likely be
> a good reason not to inline it.

Yep, I know the rationale behind it; I made a mistake.

> 
> It'll be a moot point in this case (more below), but this would also reduce the
> line length and avoid the wrap.
> 
> > void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > +							    gfn_t end)
> 
> I appreciate the effort to make this easier to read, but making such a big divergence
> from the kernel's preferred formatting is often counter-productive, e.g. I blinked a
> few times when first reading this code.
> 
> Again, moot point this time (still below ;-) ), but for future reference, better
> options are to either let the line poke out or simply wrap early to get the
> bundling of parameters that you want, e.g.
> 
>   static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, gfn_t end)
> 
> or 
> 
>   static inline void update_invalidate_range(struct kvm *kvm,
> 					     gfn_t start, gfn_t end)

Fully agreed.

> 
> >  {
> > -	/*
> > -	 * The count increase must become visible at unlock time as no
> > -	 * spte can be established without taking the mmu_lock and
> > -	 * count is also read inside the mmu_lock critical section.
> > -	 */
> > -	kvm->mmu_invalidate_in_progress++;
> >  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> >  		kvm->mmu_invalidate_range_start = start;
> >  		kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> >  	}
> >  }
> >  
> > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> 
> Splitting the helpers this way yields a weird API overall, e.g. it's possible
> (common, actually) to have an "end" without a "begin".
> 
> Taking the range in the "end" is also dangerous/misleading/imbalanced, because _if_
> there are multiple ranges in a batch, each range would need to be unwound
> independently, e.g. the invocation of the "end" helper in
> kvm_mmu_notifier_invalidate_range_end() is flat out wrong, it just doesn't cause
> problems because KVM doesn't (currently) try to unwind regions (and probably never
> will, but that's beside the point).

I actually also don't feel good about the existing code (taking the range in
both "start" and "end") but didn't go further to find a better solution.

> 
> Rather than shunt what is effectively the "begin" into a separate helper, provide
> three separate APIs, e.g. begin, range_add, end.  That way, begin+end don't take a
> range and thus are symmetrical, always paired, and can't screw up unwinding since
> they don't have a range to unwind.

This looks much better to me.

> 
> It'll require three calls in every case, but that's not the end of the world since
> none of these flows are super hot paths.
> 
> > +{
> > +	/*
> > +	 * The count increase must become visible at unlock time as no
> > +	 * spte can be established without taking the mmu_lock and
> > +	 * count is also read inside the mmu_lock critical section.
> > +	 */
> > +	kvm->mmu_invalidate_in_progress++;
> 
> This should invalidate (ha!) mmu_invalidate_range_{start,end}, and then WARN in
> mmu_invalidate_retry() if the range isn't valid.  And the "add" helper should
> WARN if mmu_invalidate_in_progress == 0.
> 
> > +}
> > +
> > +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> 
> "handle" is waaaay too generic.  Just match kvm_unmap_gfn_range() and call it
> kvm_mmu_unmap_gfn_range().  This is a local function so it's unlikely to collide
> with arch code, now or in the future.

Agreed.

> 
> > +{
> > +	update_invalidate_range(kvm, range->start, range->end);
> > +	return kvm_unmap_gfn_range(kvm, range);
> > +}
> 
> Overall, this?  Compile tested only...

Thanks!
Chao
> 
> ---
>  arch/x86/kvm/mmu/mmu.c   |  8 +++++---
>  include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
>  virt/kvm/kvm_main.c      | 30 +++++++++++++++++++++---------
>  3 files changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 93c389eaf471..d4b373e3e524 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
>  
>  	return fault->slot &&
> -	       mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +	       mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_begin(kvm);
> +
> +	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
>  						   gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_end(kvm);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e6e66c5e56f2..29aa6d6827cc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -778,8 +778,8 @@ struct kvm {
>  	struct mmu_notifier mmu_notifier;
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
> -	unsigned long mmu_invalidate_range_start;
> -	unsigned long mmu_invalidate_range_end;
> +	gfn_t mmu_invalidate_range_start;
> +	gfn_t mmu_invalidate_range_end;
>  #endif
>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
> @@ -1378,10 +1378,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm);
>  
>  long kvm_arch_dev_ioctl(struct file *filp,
>  			unsigned int ioctl, unsigned long arg);
> @@ -1952,9 +1951,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>  	return 0;
>  }
>  
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>  					   unsigned long mmu_seq,
> -					   unsigned long hva)
> +					   gfn_t gfn)
>  {
>  	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
> @@ -1963,10 +1962,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	 * that might be being invalidated. Note that it may include some false
>  	 * positives, due to shortcuts when handing concurrent invalidations.
>  	 */
> -	if (unlikely(kvm->mmu_invalidate_in_progress) &&
> -	    hva >= kvm->mmu_invalidate_range_start &&
> -	    hva < kvm->mmu_invalidate_range_end)
> -		return 1;
> +	if (unlikely(kvm->mmu_invalidate_in_progress)) {
> +		/*
> +		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> +		 * but before updating the range is a KVM bug.
> +		 */
> +		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> +				 kvm->mmu_invalidate_range_end == INVALID_GPA))
> +			return 1;
> +
> +		if (gfn >= kvm->mmu_invalidate_range_start &&
> +		    gfn < kvm->mmu_invalidate_range_end)
> +			return 1;
> +	}
> +
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 43bbe4fde078..e9e03b979f77 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>  
>  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>  
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -			     unsigned long end);
> -
> +typedef void (*on_lock_fn_t)(struct kvm *kvm);
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>  
>  struct kvm_hva_range {
> @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  				locked = true;
>  				KVM_MMU_LOCK(kvm);
>  				if (!IS_KVM_NULL_FN(range->on_lock))
> -					range->on_lock(kvm, range->start, range->end);
> +					range->on_lock(kvm);
> +
>  				if (IS_KVM_NULL_FN(range->handler))
>  					break;
>  			}
> @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
>  {
>  	/*
>  	 * The count increase must become visible at unlock time as no
> @@ -724,6 +722,15 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	 * count is also read inside the mmu_lock critical section.
>  	 */
>  	kvm->mmu_invalidate_in_progress++;
> +
> +	kvm->mmu_invalidate_range_start = INVALID_GPA;
> +	kvm->mmu_invalidate_range_end = INVALID_GPA;
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
>  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>  		kvm->mmu_invalidate_range_start = start;
>  		kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +751,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	}
>  }
>  
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> +	return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
> @@ -752,7 +765,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  		.start		= range->start,
>  		.end		= range->end,
>  		.pte		= __pte(0),
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= kvm_mmu_unmap_gfn_range,
>  		.on_lock	= kvm_mmu_invalidate_begin,
>  		.on_unlock	= kvm_arch_guest_memory_reclaimed,
>  		.flush_on_ret	= true,
> @@ -791,8 +804,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	return 0;
>  }
>  
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
>  {
>  	/*
>  	 * This sequence increase will notify the kvm page fault that
> 
> base-commit: d663b8a285986072428a6a145e5994bc275df994
> -- 
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (8 preceding siblings ...)
  2022-11-03 12:13 ` [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Vishal Annapurve
@ 2022-11-14 11:43 ` Alex Bennée
  2022-11-16  5:00   ` Chao Peng
  9 siblings, 1 reply; 101+ messages in thread
From: Alex Bennée @ 2022-11-14 11:43 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, Viresh Kumar,
	Mathieu Poirier, AKASHI Takahiro


Chao Peng <chao.p.peng@linux.intel.com> writes:

<snip>
> Introduction
> ============
> KVM userspace being able to crash the host is horrible. Under current
> KVM architecture, all guest memory is inherently accessible from KVM
> userspace and is exposed to the mentioned crash issue. The goal of this
> series is to provide a solution to align mm and KVM, on a userspace
> inaccessible approach of exposing guest memory. 
>
> Normally, KVM populates secondary page table (e.g. EPT) by using a host
> virtual address (hva) from core mm page table (e.g. x86 userspace page
> table). This requires guest memory being mmaped into KVM userspace, but
> this is also the source where the mentioned crash issue can happen. In
> theory, apart from those 'shared' memory for device emulation etc, guest
> memory doesn't have to be mmaped into KVM userspace.
>
> This series introduces fd-based guest memory which will not be mmaped
> into KVM userspace. KVM populates secondary page table by using a
> fd/offset pair backed by a memory file system. The fd can be created
> from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> directly interact with them with newly introduced in-kernel interface,
> therefore remove the KVM userspace from the path of accessing/mmaping
> the guest memory. 
>
> Kirill had a patch [2] to address the same issue in a different way. It
> tracks guest encrypted memory at the 'struct page' level and relies on
> HWPOISON to reject the userspace access. The patch has been discussed in
> several online and offline threads and resulted in a design document [3]
> which is also the original proposal for this series. Later this patch
> series evolved as more comments received in community but the major
> concepts in [3] still hold true so recommend reading.
>
> The patch series may also be useful for other usages; for example, a pure
> software approach may use it to harden itself against unintentional
> access to guest memory. This series is designed with these usages in
> mind but doesn't have code to directly support them, and extensions might
> be needed.

There are a couple of additional use cases where having a consistent
memory interface with the kernel would be useful.

  - Xen DomU guests providing other domains with VirtIO backends

  Xen by default doesn't give other domains special access to a domain's
  memory. The guest can grant access to regions of its memory to other
  domains for this purpose. 

  - pKVM on ARM

  Similar to Xen, pKVM moves the management of the page tables into the
  hypervisor and again doesn't allow those domains to share memory by
  default.

  - VirtIO loopback

  This allows for VirtIO devices for the host kernel to be serviced by
  backends running in userspace. Obviously the memory userspace is
  allowed to access is strictly limited to the buffers and queues
  because giving userspace unrestricted access to the host kernel would
  have consequences.

All of these VirtIO backends work with vhost-user which uses memfds to
pass references to guest memory from the VMM to the backend
implementation.
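
To make the contrast with this series concrete, that flow today looks
roughly like the sketch below (purely illustrative, the names are made up):
the VMM creates a memfd for guest RAM, hands the fd to the backend over the
vhost-user control socket with SCM_RIGHTS, and from then on the backend can
mmap() the whole thing, which is exactly the sort of blanket userspace
access a restricted fd would not allow.

  /* Illustrative sketch only, not from this series. */
  #define _GNU_SOURCE
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <unistd.h>

  static void share_guest_ram(int sock_fd, size_t ram_size)
  {
          int ram_fd = memfd_create("guest-ram", MFD_CLOEXEC);
          char byte = 0;
          char cbuf[CMSG_SPACE(sizeof(int))];
          struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
          struct msghdr msg = { 0 };
          struct cmsghdr *cmsg;

          ftruncate(ram_fd, ram_size);
          /* Anything holding the fd can map all of guest RAM. */
          mmap(NULL, ram_size, PROT_READ | PROT_WRITE, MAP_SHARED, ram_fd, 0);

          /* Pass the fd to the backend over the unix socket. */
          memset(cbuf, 0, sizeof(cbuf));
          msg.msg_iov = &iov;
          msg.msg_iovlen = 1;
          msg.msg_control = cbuf;
          msg.msg_controllen = sizeof(cbuf);
          cmsg = CMSG_FIRSTHDR(&msg);
          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &ram_fd, sizeof(int));
          sendmsg(sock_fd, &msg, 0);   /* backend mmap()s the same fd */
  }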

> mm change
> =========
> Introduces a new memfd_restricted system call which can create memory
> file that is restricted from userspace access via normal MMU operations
> like read(), write() or mmap() etc and the only way to use it is
> passing it to a third kernel module like KVM and relying on it to
> access the fd through the newly added restrictedmem kernel interface.
> The restrictedmem interface bridges the memory file subsystems
> (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> bi-directional communication between them. 
>
>
> KVM change
> ==========
> Extends the KVM memslot to provide guest private (encrypted) memory from
> a fd. With this extension, a single memslot can maintain both private
> memory through private fd (restricted_fd/restricted_offset) and shared
> (unencrypted) memory through userspace mmaped host virtual address
> (userspace_addr). For a particular guest page, the corresponding page in
> KVM memslot can be only either private or shared and only one of the
> shared/private parts of the memslot is visible to guest. For how this
> new extension is used in QEMU, please refer to kvm_set_phys_mem() in
> below TDX-enabled QEMU repo.
>
> Introduces new KVM_EXIT_MEMORY_FAULT exit to allow userspace to get the
> chance on decision-making for shared <-> private memory conversion. The
> exit can be an implicit conversion in KVM page fault handler or an
> explicit conversion from guest OS.
>
> Extends existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
> convert a guest page between private <-> shared. The data maintained in
> these ioctls tells the truth whether a guest page is private or shared
> and this information will be used in KVM page fault handler to decide
> whether the private or the shared part of the memslot is visible to
> guest.
>
<snip>

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-01 15:19       ` Michael Roth
  2022-11-01 19:30         ` Michael Roth
@ 2022-11-14 14:02         ` Vlastimil Babka
  2022-11-14 15:28           ` Kirill A. Shutemov
  2022-11-14 22:16           ` Michael Roth
  1 sibling, 2 replies; 101+ messages in thread
From: Vlastimil Babka @ 2022-11-14 14:02 UTC (permalink / raw)
  To: Michael Roth, Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, mhocko,
	Muchun Song, wei.w.wang

On 11/1/22 16:19, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
>> > 
>> >   1) restoring kernel directmap:
>> > 
>> >      Currently SNP (and I believe TDX) need to either split or remove kernel
>> >      direct mappings for restricted PFNs, since there is no guarantee that
>> >      other PFNs within a 2MB range won't be used for non-restricted
>> >      (which will cause an RMP #PF in the case of SNP since the 2MB
>> >      mapping overlaps with guest-owned pages)
>> 
>> Has the splitting and restoring been a well-discussed direction? I'm
>> just curious whether there is other options to solve this issue.
> 
> For SNP it's been discussed for quite some time, and either splitting or
> removing private entries from directmap are the well-discussed way I'm
> aware of to avoid RMP violations due to some other kernel process using
> a 2MB mapping to access shared memory if there are private pages that
> happen to be within that range.
> 
> In both cases the issue of how to restore directmap as 2M becomes a
> problem.
> 
> I was also under the impression TDX had similar requirements. If so,
> do you know what the plan is for handling this for TDX?
> 
> There are also 2 potential alternatives I'm aware of, but these haven't
> been discussed in much detail AFAIK:
> 
> a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
>    request 2MB THP pages, but I'm not sure how reliably we can guarantee
>    that enough THPs are available, so if we went that route we'd probably
>    be better off requiring the use of hugetlbfs as the backing store. But
>    obviously that's a bit limiting and it would be nice to have the option
>    of using normal pages as well. One nice thing with invalidation
>    scheme proposed here is that this would "Just Work" if implement
>    hugetlbfs support, so an admin that doesn't want any directmap
>    splitting has this option available, otherwise it's done as a
>    best-effort.
> 
> b) Implement general support for restoring directmap as 2M even when
>    subpages might be in use by other kernel threads. This would be the
>    most flexible approach since it requires no special handling during
>    invalidations, but I think it's only possible if all the CPA
>    attributes for the 2M range are the same at the time the mapping is
>    restored/unsplit, so some potential locking issues there and still
>    chance for splitting directmap over time.

I've been hoping that

c) using a mechanism such as [1] [2] where the goal is to group together
these small allocations that need to increase directmap granularity, so that
the maximum number of large mappings is preserved. But I guess that means
knowing at allocation time that this will happen. So I've been wondering how
this would be possible to employ in the SNP/UPM case? I guess it depends on
how we expect the private/shared conversions to happen in practice, and I
don't know the details. I can imagine the following complications:

- a memfd_restricted region is created such that it's 2MB large/aligned,
i.e. like case a) above, we can allocate it normally. Now, what if a 4k page
in the middle is to be temporarily converted to shared for some
communication between host and guest (can such a thing happen?). With the
punch hole approach, I wonder if we end up fragmenting the directmap
unnecessarily? IIUC the now-shared page will become backed by some other
page (as the memslot supports both private and shared pages simultaneously).
But does it make sense to really split the direct mapping (and e.g. the
shmem page)? We could leave the whole 2MB unmapped without splitting if we
didn't free the private 4k subpage.

- a restricted region is created that's below 2MB. If something like [1] is
merged, it could be used for the backing pages to limit directmap
fragmentation. But then in case it's eventually fallocated to become larger
and gain one or more 2MB-aligned ranges, the result is suboptimal. Unless
in that case we migrate the existing pages to a THP-backed shmem, kinda like
khugepaged collapses hugepages. But that would have to be coordinated with
the guest, maybe not even possible?

[1] https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/
[2] https://lwn.net/Articles/894557/

>> 
>> > 
>> >      Previously we were able to restore 2MB mappings to some degree
>> >      since both shared/restricted pages were all pinned, so anything
>> >      backed by a THP (or hugetlb page once that is implemented) at guest
>> >      teardown could be restored as 2MB direct mapping.
>> > 
>> >      Invalidation seems like the most logical time to have this happen,
>> 
>> Currently invalidation only happens at user-initiated fallocate(). It
>> does not cover the VM teardown case where the restoring might also be
>> expected to be handled.
> 
> Right, I forgot to add that in my proposed changes I added invalidations
> for any still-allocated private pages present when the restricted memfd
> notifier is unregistered. This was needed to avoid leaking pages back to
> the kernel that still need directmap or RMP table fixups. I also added
> similar invalidations for memfd->release(), since it seems possible that
> userspace might close() it before shutting down guest, but maybe the
> latter is not needed if KVM takes a reference on the FD during life of
> the guest.
> 
>> 
>> >      but whether or not to restore as 2MB requires the order to be 2MB
>> >      or larger, and for GPA range being invalidated to cover the entire
>> >      2MB (otherwise it means the page was potentially split and some
>> >      subpages free back to host already, in which case it can't be
>> >      restored as 2MB).
>> > 
>> >   2) Potentially less invalidations:
>> >       
>> >      If we pass the entire folio or compound_page as part of
>> >      invalidation, we only needed to issue 1 invalidation per folio.
>> 
>> I'm not sure I agree, the current invalidation covers the whole range
>> that passed from userspace and the invalidation is invoked only once for
>> each usrspace fallocate().
> 
> That's true, it only reduces invalidations if we decide to provide a
> struct page/folio as part of the invalidation callbacks, which isn't
> the case yet. Sorry for the confusion.
> 
>> 
>> > 
>> >   3) Potentially useful for hugetlbfs support:
>> > 
>> >      One issue with hugetlbfs is that we don't support splitting the
>> >      hugepage in such cases, which was a big obstacle prior to UPM. Now
>> >      however, we may have the option of doing "lazy" invalidations where
>> >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
>> >      all the subpages within the 2M range are either hole-punched, or the
>> >      guest is shut down, so in that way we never have to split it. Sean
>> >      was pondering something similar in another thread:
>> > 
>> >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
>> > 
>> >      Issuing invalidations with folio-granularity ties in fairly well
>> >      with this sort of approach if we end up going that route.
>> 
>> There is semantics difference between the current one and the proposed
>> one: The invalidation range is exactly what userspace passed down to the
>> kernel (being fallocated) while the proposed one will be subset of that
>> (if userspace-provided addr/size is not aligned to power of two), I'm
>> not quite confident this difference has no side effect.
> 
> In theory userspace should not be allocating/hole-punching restricted
> pages for GPA ranges that are already mapped as private in the xarray,
> and KVM could potentially fail such requests (though it does currently).
> 
> But if we somehow enforced that, then we could rely on
> KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> which would free up the restricted fd invalidation callbacks to be used
> purely to handle doing things like RMP/directmap fixups prior to returning
> restricted pages back to the host. So that was sort of my thinking why the
> new semantics would still cover all the necessary cases.
> 
> -Mike
> 
>> 
>> > 
>> > I need to rework things for v9, and we'll probably want to use struct
>> > folio instead of struct page now, but as a proof-of-concept of sorts this
>> > is what I'd added on top of v8 of your patchset to implement 1) and 2):
>> > 
>> >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
>> > 
>> > Does an approach like this seem reasonable? Should be work this into the
>> > base restricted memslot support?
>> 
>> If the above mentioned semantics difference is not a problem, I don't
>> have strong objection on this.
>> 
>> Sean, since you have much better understanding on this, what is your
>> take on this?
>> 
>> Chao
>> > 
>> > Thanks,
>> > 
>> > Mike



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-14 14:02         ` Vlastimil Babka
@ 2022-11-14 15:28           ` Kirill A. Shutemov
  2022-11-14 22:16             ` Michael Roth
  2022-11-14 22:16           ` Michael Roth
  1 sibling, 1 reply; 101+ messages in thread
From: Kirill A. Shutemov @ 2022-11-14 15:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michael Roth, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, mhocko,
	Muchun Song, wei.w.wang

On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> On 11/1/22 16:19, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> >> > 
> >> >   1) restoring kernel directmap:
> >> > 
> >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >> >      direct mappings for restricted PFNs, since there is no guarantee that
> >> >      other PFNs within a 2MB range won't be used for non-restricted
> >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >> >      mapping overlaps with guest-owned pages)
> >> 
> >> Has the splitting and restoring been a well-discussed direction? I'm
> >> just curious whether there is other options to solve this issue.
> > 
> > For SNP it's been discussed for quite some time, and either splitting or
> > removing private entries from directmap are the well-discussed way I'm
> > aware of to avoid RMP violations due to some other kernel process using
> > a 2MB mapping to access shared memory if there are private pages that
> > happen to be within that range.
> > 
> > In both cases the issue of how to restore directmap as 2M becomes a
> > problem.
> > 
> > I was also under the impression TDX had similar requirements. If so,
> > do you know what the plan is for handling this for TDX?
> > 
> > There are also 2 potential alternatives I'm aware of, but these haven't
> > been discussed in much detail AFAIK:
> > 
> > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> >    that enough THPs are available, so if we went that route we'd probably
> >    be better off requiring the use of hugetlbfs as the backing store. But
> >    obviously that's a bit limiting and it would be nice to have the option
> >    of using normal pages as well. One nice thing with invalidation
> >    scheme proposed here is that this would "Just Work" if implement
> >    hugetlbfs support, so an admin that doesn't want any directmap
> >    splitting has this option available, otherwise it's done as a
> >    best-effort.
> > 
> > b) Implement general support for restoring directmap as 2M even when
> >    subpages might be in use by other kernel threads. This would be the
> >    most flexible approach since it requires no special handling during
> >    invalidations, but I think it's only possible if all the CPA
> >    attributes for the 2M range are the same at the time the mapping is
> >    restored/unsplit, so some potential locking issues there and still
> >    chance for splitting directmap over time.
> 
> I've been hoping that
> 
> c) using a mechanism such as [1] [2] where the goal is to group together
> these small allocations that need to increase directmap granularity so
> maximum number of large mappings are preserved.

As I mentioned in the other thread, the restricted memfd can be backed by
secretmem instead of a plain memfd. It already handles the directmap with care.

But I don't think it has to be part of the initial restricted memfd
implementation. It is an SEV-specific requirement and AMD folks can extend
the implementation as needed later.
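
(For anyone not familiar with it: secretmem is exposed to userspace as
memfd_secret(2), and pages faulted into such a mapping are removed from the
kernel direct map until they are freed again. A minimal usage sketch,
assuming a 5.14+ kernel with CONFIG_SECRETMEM and SYS_memfd_secret defined
in your headers; nothing here is specific to this series:)

	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		size_t sz = 2 * 1024 * 1024;
		/* Pages faulted into this mapping are dropped from the
		 * kernel direct map until they are freed. */
		int fd = syscall(SYS_memfd_secret, 0);
		void *p;

		if (fd < 0)
			return EXIT_FAILURE;
		ftruncate(fd, sz);
		p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p != MAP_FAILED)
			((char *)p)[0] = 1;	/* fault a page in */
		return 0;
	}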

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-25 15:13 ` [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-10-27 10:25   ` Fuad Tabba
  2022-10-28  7:04   ` Xiaoyao Li
@ 2022-11-14 16:04   ` Alex Bennée
  2022-11-15  9:29     ` Chao Peng
  2 siblings, 1 reply; 101+ messages in thread
From: Alex Bennée @ 2022-11-14 16:04 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang


Chao Peng <chao.p.peng@linux.intel.com> writes:

> In memory encryption usage, guest memory may be encrypted with a special
> key and can be accessed only by the guest itself. We call such memory
> private memory. Allowing userspace to access guest private memory has no
> value and can sometimes cause problems. This new KVM memslot extension
> allows guest private memory to be provided through a restrictedmem-backed
> file descriptor (fd), and userspace is restricted from accessing the
> bookmarked memory in the fd.
>
<snip>
> To make code maintenance easy, internally we use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0d5d4419139a..f1ae45c10c94 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>  
> +struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 restricted_offset;
> +	__u32 restricted_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/*
> + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> + * all fields from the top-level "extended" region.
> + */
> +struct kvm_user_mem_region {
> +	__u32 slot;
> +	__u32 flags;
> +	__u64 guest_phys_addr;
> +	__u64 memory_size;
> +	__u64 userspace_addr;
> +	__u64 restricted_offset;
> +	__u32 restricted_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +};
> +#endif

I'm not sure I buy the argument that this makes code maintenance easier,
because you now have multiple places to update if you add or extend a field.
Was this simply to avoid changing:

  foo->slot to foo->region.slot

in the underlying code?

> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>  #define KVM_MEM_READONLY	(1UL << 1)
> +#define KVM_MEM_PRIVATE		(1UL << 2)
>  
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> @@ -1178,6 +1206,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_ZPCI_OP 221
>  #define KVM_CAP_S390_CPU_TOPOLOGY 222
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> +#define KVM_CAP_PRIVATE_MEM 224
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..9ff164c7e0cc 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -86,3 +86,6 @@ config KVM_XFER_TO_GUEST_WORK
>  
>  config HAVE_KVM_PM_NOTIFIER
>         bool
> +
> +config HAVE_KVM_RESTRICTED_MEM
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e30f1b4ecfa5..8dace78a0278 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>  	}
>  }
>  
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>  	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>  
> @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   * Must be called holding kvm->slots_lock for write.
>   */
>  int __kvm_set_memory_region(struct kvm *kvm,
> -			    const struct kvm_userspace_memory_region *mem)
> +			    const struct kvm_user_mem_region *mem)
>  {
>  	struct kvm_memory_slot *old, *new;
>  	struct kvm_memslots *slots;
> @@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>  
>  int kvm_set_memory_region(struct kvm *kvm,
> -			  const struct kvm_userspace_memory_region *mem)
> +			  const struct kvm_user_mem_region *mem)
>  {
>  	int r;
>  
> @@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>  
>  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -					  struct kvm_userspace_memory_region *mem)
> +					  struct kvm_user_mem_region *mem)
>  {
>  	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>  		return -EINVAL;
> @@ -4627,6 +4627,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>  	return fd;
>  }
>  
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
> +do {										\
> +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
> +		     offsetof(struct kvm_userspace_memory_region, field));	\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
> +		     sizeof_field(struct kvm_userspace_memory_region, field));	\
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
> +do {											\
> +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
> +		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
> +		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +	SANITY_CHECK_MEM_REGION_FIELD(slot);
> +	SANITY_CHECK_MEM_REGION_FIELD(flags);
> +	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> +	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> +}

Do we have other examples in the kernel that jump through these hoops?

>  static long kvm_vm_ioctl(struct file *filp,
>  			   unsigned int ioctl, unsigned long arg)
>  {
> @@ -4650,14 +4677,20 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  	case KVM_SET_USER_MEMORY_REGION: {
> -		struct kvm_userspace_memory_region kvm_userspace_mem;
> +		struct kvm_user_mem_region mem;
> +		unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> +		kvm_sanity_check_user_mem_region_alias();
>  
>  		r = -EFAULT;
> -		if (copy_from_user(&kvm_userspace_mem, argp,
> -						sizeof(kvm_userspace_mem)))
> +		if (copy_from_user(&mem, argp, size))
> +			goto out;
> +
> +		r = -EINVAL;
> +		if (mem.flags & KVM_MEM_PRIVATE)
>  			goto out;

Hmm I can see in the later code you explicitly check for the
KVM_MEM_PRIVATE flag with:

		if (get_user(flags, (u32 __user *)(argp + flags_offset)))
			goto out;

		if (flags & KVM_MEM_PRIVATE)
			size = sizeof(struct kvm_userspace_memory_region_ext);
		else
			size = sizeof(struct kvm_userspace_memory_region);

I think it would make sense to bring that sanity checking forward into
this patch to avoid the validation logic working in two different ways
over the series.
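
Something along these lines already in this patch would keep the flow
consistent from the start (rough sketch only, untested; "flags_offset" is
borrowed from the later patch):

	case KVM_SET_USER_MEMORY_REGION: {
		struct kvm_user_mem_region mem;
		unsigned long size;
		u32 flags;

		kvm_sanity_check_user_mem_region_alias();

		r = -EFAULT;
		if (get_user(flags, (u32 __user *)(argp + flags_offset)))
			goto out;

		if (flags & KVM_MEM_PRIVATE)
			size = sizeof(struct kvm_userspace_memory_region_ext);
		else
			size = sizeof(struct kvm_userspace_memory_region);

		if (copy_from_user(&mem, argp, size))
			goto out;

		/* KVM_MEM_PRIVATE is still rejected until a later patch
		 * actually wires it up. */
		r = -EINVAL;
		if (mem.flags & KVM_MEM_PRIVATE)
			goto out;

		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
		break;
	}
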

>  
> -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
>  	case KVM_GET_DIRTY_LOG: {


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-14 14:02         ` Vlastimil Babka
  2022-11-14 15:28           ` Kirill A. Shutemov
@ 2022-11-14 22:16           ` Michael Roth
  1 sibling, 0 replies; 101+ messages in thread
From: Michael Roth @ 2022-11-14 22:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> On 11/1/22 16:19, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> >> > 
> >> >   1) restoring kernel directmap:
> >> > 
> >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >> >      direct mappings for restricted PFNs, since there is no guarantee that
> >> >      other PFNs within a 2MB range won't be used for non-restricted
> >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >> >      mapping overlaps with guest-owned pages)
> >> 
> >> Has the splitting and restoring been a well-discussed direction? I'm
> >> just curious whether there is other options to solve this issue.
> > 
> > For SNP it's been discussed for quite some time, and either splitting or
> > removing private entries from directmap are the well-discussed way I'm
> > aware of to avoid RMP violations due to some other kernel process using
> > a 2MB mapping to access shared memory if there are private pages that
> > happen to be within that range.
> > 
> > In both cases the issue of how to restore directmap as 2M becomes a
> > problem.
> > 
> > I was also under the impression TDX had similar requirements. If so,
> > do you know what the plan is for handling this for TDX?
> > 
> > There are also 2 potential alternatives I'm aware of, but these haven't
> > been discussed in much detail AFAIK:
> > 
> > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> >    that enough THPs are available, so if we went that route we'd probably
> >    be better off requiring the use of hugetlbfs as the backing store. But
> >    obviously that's a bit limiting and it would be nice to have the option
> >    of using normal pages as well. One nice thing with invalidation
> >    scheme proposed here is that this would "Just Work" if implement
> >    hugetlbfs support, so an admin that doesn't want any directmap
> >    splitting has this option available, otherwise it's done as a
> >    best-effort.
> > 
> > b) Implement general support for restoring directmap as 2M even when
> >    subpages might be in use by other kernel threads. This would be the
> >    most flexible approach since it requires no special handling during
> >    invalidations, but I think it's only possible if all the CPA
> >    attributes for the 2M range are the same at the time the mapping is
> >    restored/unsplit, so some potential locking issues there and still
> >    chance for splitting directmap over time.
> 
> I've been hoping that
> 
> c) using a mechanism such as [1] [2] where the goal is to group together
> these small allocations that need to increase directmap granularity so
> maximum number of large mappings are preserved. But I guess that means

Thanks for the references. I wasn't aware there was work in this area;
this opens up some possibilities for how to approach this.

> maximum number of large mappings are preserved. But I guess that means
> knowing at allocation time that this will happen. So I've been wondering how
> this would be possible to employ in the SNP/UPM case? I guess it depends on
> how we expect the private/shared conversions to happen in practice, and I
> don't know the details. I can imagine the following complications:
> 
> - a memfd_restricted region is created such that it's 2MB large/aligned,
> i.e. like case a) above, we can allocate it normally. Now, what if a 4k page
> in the middle is to be temporarily converted to shared for some
> communication between host and guest (can such thing happen?). With the
> punch hole approach, I wonder if we end up fragmenting directmap
> unnecessarily? IIUC the now shared page will become backed by some other

Yes, we end up fragmenting in cases where a guest converts a sub-page to a
shared page because the fallocate(PUNCH_HOLE) gets forwarded through to shmem
which will then split it. At that point the subpage might get used elsewhere,
so we no longer have the ability to restore the mapping as 2M after
invalidation/shutdown. We could potentially just intercept those
fallocate()s and only issue the invalidation once all the subpages have
been PUNCH_HOLE'd. We'd still need to ensure KVM MMU invalidations
happen immediately though, but since we rely on a KVM ioctl to do the
conversion in advance, we can rely on the KVM MMU invalidation that
happens at that point and simply make fallocate(PUNCH_HOLE) fail if
someone attempts it on a page that hasn't been converted to shared yet.
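
A very rough sketch of the kind of deferred handling I have in mind, with
all names below made up purely for illustration (none of this is in the
series):

	/* Hypothetical: track hole-punched 4K subpages of a 2M-backed
	 * restricted range and only fire the invalidation (and thus the
	 * directmap/RMP fixup and the actual free) once the whole 2M
	 * range has been punched. */
	struct restrictedmem_2m_span {
		pgoff_t base_pgoff;			/* 2M-aligned offset */
		DECLARE_BITMAP(punched, PTRS_PER_PMD);	/* punched 4K subpages */
		struct restrictedmem_data *data;
	};

	static void restrictedmem_punch_subrange(struct restrictedmem_2m_span *span,
						 pgoff_t start, pgoff_t end)
	{
		pgoff_t off;

		for (off = start; off < end; off++)
			set_bit(off - span->base_pgoff, span->punched);

		/* Defer the notifier callback until the entire 2M range
		 * is gone; until then nothing is freed back to the host. */
		if (bitmap_full(span->punched, PTRS_PER_PMD))
			restrictedmem_notifier_invalidate(span->data,
							  span->base_pgoff,
							  span->base_pgoff +
							  PTRS_PER_PMD);
	}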

Otherwise we could end up splitting a good chunk of pages, depending on how
the guest allocates shared pages, but I'm slightly less concerned about that
seeing as there are some general solutions to directmap fragmentation
being considered. I need to think more about how these hooks would tie in to
that though.

And since we'd only really be able to avoid unrecoverable splits if
the restrictedmem is hugepage-backed (if we get a bunch of 4K pages
to begin with there's no handling that would avoid fragmentation), it seems
like we'd end up relying on hugetlbfs support for instances where a host
really wants to avoid splitting, and maybe in the case of hugetlbfs
fallocate(PUNCH_HOLE) is already a no-op of sorts? Either way maybe it's
better to explore this aspect in the context of hugetlbfs support.

> page (as the memslot supports both private and shared pages simultaneously).
> But does it make sense to really split the direct mapping (and e.g. the
> shmem page?) We could leave the whole 2MB unmapped without splitting if we
> didn't free the private 4k subpage.
> 
> - a restricted region is created that's below 2MB. If something like [1] is
> merged, it could be used for the backing pages to limit directmap
> fragmentation. But then in case it's eventually fallocated to become larger
> and gain one more more 2MB aligned ranges, the result is suboptimal. Unless
> in that case we migrate the existing pages to a THP-backed shmem, kinda like
> khugepaged collapses hugepages. But that would have to be coordinated with
> the guest, maybe not even possible?

Any migrations would need to be coordinated with SNP firmware at least. I
think it's possible, but that support is probably a ways out. Near-term
I think it might be more straightforward to say: if you don't want
directmap fragmentation (for SNP anyway), you need to ensure restricted
ranges are backed by THPs or hugetlbfs, and make that the basis for
avoiding directmap splitting for now. Otherwise, it's simply done as a
best-effort, and then maybe over time, with things like [1] and migration
support in place, this restriction can go away, or become less impactful
at least.

Thanks,

Mike

> 
> [1] https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/
> [2] https://lwn.net/Articles/894557/
> 
> >> 
> >> > 
> >> >      Previously we were able to restore 2MB mappings to some degree
> >> >      since both shared/restricted pages were all pinned, so anything
> >> >      backed by a THP (or hugetlb page once that is implemented) at guest
> >> >      teardown could be restored as 2MB direct mapping.
> >> > 
> >> >      Invalidation seems like the most logical time to have this happen,
> >> 
> >> Currently invalidation only happens at user-initiated fallocate(). It
> >> does not cover the VM teardown case where the restoring might also be
> >> expected to be handled.
> > 
> > Right, I forgot to add that in my proposed changes I added invalidations
> > for any still-allocated private pages present when the restricted memfd
> > notifier is unregistered. This was needed to avoid leaking pages back to
> > the kernel that still need directmap or RMP table fixups. I also added
> > similar invalidations for memfd->release(), since it seems possible that
> > userspace might close() it before shutting down guest, but maybe the
> > latter is not needed if KVM takes a reference on the FD during life of
> > the guest.
> > 
> >> 
> >> >      but whether or not to restore as 2MB requires the order to be 2MB
> >> >      or larger, and for GPA range being invalidated to cover the entire
> >> >      2MB (otherwise it means the page was potentially split and some
> >> >      subpages free back to host already, in which case it can't be
> >> >      restored as 2MB).
> >> > 
> >> >   2) Potentially less invalidations:
> >> >       
> >> >      If we pass the entire folio or compound_page as part of
> >> >      invalidation, we only needed to issue 1 invalidation per folio.
> >> 
> >> I'm not sure I agree, the current invalidation covers the whole range
> >> that passed from userspace and the invalidation is invoked only once for
> >> each usrspace fallocate().
> > 
> > That's true, it only reduces invalidations if we decide to provide a
> > struct page/folio as part of the invalidation callbacks, which isn't
> > the case yet. Sorry for the confusion.
> > 
> >> 
> >> > 
> >> >   3) Potentially useful for hugetlbfs support:
> >> > 
> >> >      One issue with hugetlbfs is that we don't support splitting the
> >> >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> >> >      however, we may have the option of doing "lazy" invalidations where
> >> >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> >> >      all the subpages within the 2M range are either hole-punched, or the
> >> >      guest is shut down, so in that way we never have to split it. Sean
> >> >      was pondering something similar in another thread:
> >> > 
> >> >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> >> > 
> >> >      Issuing invalidations with folio-granularity ties in fairly well
> >> >      with this sort of approach if we end up going that route.
> >> 
> >> There is semantics difference between the current one and the proposed
> >> one: The invalidation range is exactly what userspace passed down to the
> >> kernel (being fallocated) while the proposed one will be subset of that
> >> (if userspace-provided addr/size is not aligned to power of two), I'm
> >> not quite confident this difference has no side effect.
> > 
> > In theory userspace should not be allocating/hole-punching restricted
> > pages for GPA ranges that are already mapped as private in the xarray,
> > and KVM could potentially fail such requests (though it does currently).
> > 
> > But if we somehow enforced that, then we could rely on
> > KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> > which would free up the restricted fd invalidation callbacks to be used
> > purely to handle doing things like RMP/directmap fixups prior to returning
> > restricted pages back to the host. So that was sort of my thinking why the
> > new semantics would still cover all the necessary cases.
> > 
> > -Mike
> > 
> >> 
> >> > 
> >> > I need to rework things for v9, and we'll probably want to use struct
> >> > folio instead of struct page now, but as a proof-of-concept of sorts this
> >> > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> >> > 
> >> >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> >> > 
> >> > Does an approach like this seem reasonable? Should we work this into the
> >> > base restricted memslot support?
> >> 
> >> If the above mentioned semantics difference is not a problem, I don't
> >> have strong objection on this.
> >> 
> >> Sean, since you have much better understanding on this, what is your
> >> take on this?
> >> 
> >> Chao
> >> > 
> >> > Thanks,
> >> > 
> >> > Mike
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-14 15:28           ` Kirill A. Shutemov
@ 2022-11-14 22:16             ` Michael Roth
  2022-11-15  9:48               ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Michael Roth @ 2022-11-14 22:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, mhocko,
	Muchun Song, wei.w.wang

On Mon, Nov 14, 2022 at 06:28:43PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> > On 11/1/22 16:19, Michael Roth wrote:
> > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > >> > 
> > >> >   1) restoring kernel directmap:
> > >> > 
> > >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> > >> >      direct mappings for restricted PFNs, since there is no guarantee that
> > >> >      other PFNs within a 2MB range won't be used for non-restricted
> > >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> > >> >      mapping overlaps with guest-owned pages)
> > >> 
> > >> Has the splitting and restoring been a well-discussed direction? I'm
> > >> just curious whether there is other options to solve this issue.
> > > 
> > > For SNP it's been discussed for quite some time, and either splitting or
> > > removing private entries from directmap are the well-discussed way I'm
> > > aware of to avoid RMP violations due to some other kernel process using
> > > a 2MB mapping to access shared memory if there are private pages that
> > > happen to be within that range.
> > > 
> > > In both cases the issue of how to restore directmap as 2M becomes a
> > > problem.
> > > 
> > > I was also under the impression TDX had similar requirements. If so,
> > > do you know what the plan is for handling this for TDX?
> > > 
> > > There are also 2 potential alternatives I'm aware of, but these haven't
> > > been discussed in much detail AFAIK:
> > > 
> > > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> > >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> > >    that enough THPs are available, so if we went that route we'd probably
> > >    be better off requiring the use of hugetlbfs as the backing store. But
> > >    obviously that's a bit limiting and it would be nice to have the option
> > >    of using normal pages as well. One nice thing with the invalidation
> > >    scheme proposed here is that this would "Just Work" if we implement
> > >    hugetlbfs support, so an admin that doesn't want any directmap
> > >    splitting has this option available, otherwise it's done as a
> > >    best-effort.
> > > 
> > > b) Implement general support for restoring directmap as 2M even when
> > >    subpages might be in use by other kernel threads. This would be the
> > >    most flexible approach since it requires no special handling during
> > >    invalidations, but I think it's only possible if all the CPA
> > >    attributes for the 2M range are the same at the time the mapping is
> > >    restored/unsplit, so some potential locking issues there and still
> > >    chance for splitting directmap over time.
> > 
> > I've been hoping that
> > 
> > c) using a mechanism such as [1] [2] where the goal is to group together
> > these small allocations that need to increase directmap granularity so
> > maximum number of large mappings are preserved.
> 
> As I mentioned in the other thread the restricted memfd can be backed by
> secretmem instead of plain memfd. It already handles directmap with care.

It looks like it would handle direct unmapping/cleanup nicely, but it
seems to lack fallocate(PUNCH_HOLE) support, which we'd probably want in
order to avoid additional memory requirements. I think once we added that we'd
still end up needing some sort of handling for the invalidations.

Also, I know Chao has been considering hugetlbfs support, I assume by
leveraging the support that already exists in shmem. Ideally SNP would
be able to make use of that support as well, but relying on a separate
backend seems likely to result in more complications getting there
later.

> 
> But I don't think it has to be part of initial restricted memfd
> implementation. It is SEV-specific requirement and AMD folks can extend
> implementation as needed later.

Admittedly the suggested changes to the invalidation mechanism made a
lot more sense to me when I was under the impression that TDX would have
similar requirements and we might end up with a common hook. Since that
doesn't actually seem to be the case, it makes sense to try to do it as
a platform-specific hook for SNP.

I think, given a memslot, a GFN range, and kvm_restricted_mem_get_pfn(),
we should be able to get the same information needed to figure out whether
the range is backed by huge pages or not. I'll see how that works out
instead.

Thanks,

Mike

> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-11-14 16:04   ` Alex Bennée
@ 2022-11-15  9:29     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-15  9:29 UTC (permalink / raw)
  To: Alex Bennée
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Nov 14, 2022 at 04:04:59PM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problems to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> >
> <snip>
> > To make code maintenance easy, internally we use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> 
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 0d5d4419139a..f1ae45c10c94 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >  };
> >  
> > +struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 restricted_offset;
> > +	__u32 restricted_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/*
> > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > + * all fields from the top-level "extended" region.
> > + */
> > +struct kvm_user_mem_region {
> > +	__u32 slot;
> > +	__u32 flags;
> > +	__u64 guest_phys_addr;
> > +	__u64 memory_size;
> > +	__u64 userspace_addr;
> > +	__u64 restricted_offset;
> > +	__u32 restricted_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +};
> > +#endif
> 
> I'm not sure I buy the argument this makes the code maintenance easier
> because you now have multiple places to update if you extend the field.
> Was this simply to avoid changing:
> 
>   foo->slot to foo->region.slot
> 
> in the underlying code?

That is one of the reasons. By doing this we can also avoid the
confusion of dealing with the '_ext' and the 'base' struct in different
functions spread across KVM code. No doubt I now need to update every
place where the 'base' struct is being used, but that makes future
maintenance easier, e.g. adding another new field, or even extending the
memslot structure again, would just require changes to the flat struct
here and the places where the new field is actually used.
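
For example (purely an illustration, kvm_dump_region() is not part of
the series), common KVM code can keep using flat field accesses no
matter which userspace layout was copied in:

static void kvm_dump_region(const struct kvm_user_mem_region *mem)
{
	pr_debug("slot %u gpa 0x%llx size 0x%llx\n",
		 mem->slot, mem->guest_phys_addr, mem->memory_size);

	/* ... rather than mem->region.guest_phys_addr etc. if the nested
	 * '_ext' wrapper were passed around instead */
	if (mem->flags & KVM_MEM_PRIVATE)
		pr_debug("restricted_fd %u offset 0x%llx\n",
			 mem->restricted_fd, mem->restricted_offset);
}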

> 
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >  #define KVM_MEM_READONLY	(1UL << 1)
> > +#define KVM_MEM_PRIVATE		(1UL << 2)
> >  
> >  /* for KVM_IRQ_LINE */
> >  struct kvm_irq_level {
> > @@ -1178,6 +1206,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_S390_ZPCI_OP 221
> >  #define KVM_CAP_S390_CPU_TOPOLOGY 222
> >  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> > +#define KVM_CAP_PRIVATE_MEM 224
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 800f9470e36b..9ff164c7e0cc 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -86,3 +86,6 @@ config KVM_XFER_TO_GUEST_WORK
> >  
> >  config HAVE_KVM_PM_NOTIFIER
> >         bool
> > +
> > +config HAVE_KVM_RESTRICTED_MEM
> > +       bool
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index e30f1b4ecfa5..8dace78a0278 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> >  	}
> >  }
> >  
> > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >  {
> >  	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >  
> > @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> >   * Must be called holding kvm->slots_lock for write.
> >   */
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -			    const struct kvm_userspace_memory_region *mem)
> > +			    const struct kvm_user_mem_region *mem)
> >  {
> >  	struct kvm_memory_slot *old, *new;
> >  	struct kvm_memslots *slots;
> > @@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> >  
> >  int kvm_set_memory_region(struct kvm *kvm,
> > -			  const struct kvm_userspace_memory_region *mem)
> > +			  const struct kvm_user_mem_region *mem)
> >  {
> >  	int r;
> >  
> > @@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> >  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> >  
> >  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > -					  struct kvm_userspace_memory_region *mem)
> > +					  struct kvm_user_mem_region *mem)
> >  {
> >  	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> >  		return -EINVAL;
> > @@ -4627,6 +4627,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> >  	return fd;
> >  }
> >  
> > +#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
> > +do {										\
> > +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
> > +		     offsetof(struct kvm_userspace_memory_region, field));	\
> > +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
> > +		     sizeof_field(struct kvm_userspace_memory_region, field));	\
> > +} while (0)
> > +
> > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
> > +do {											\
> > +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
> > +		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
> > +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
> > +		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
> > +} while (0)
> > +
> > +static void kvm_sanity_check_user_mem_region_alias(void)
> > +{
> > +	SANITY_CHECK_MEM_REGION_FIELD(slot);
> > +	SANITY_CHECK_MEM_REGION_FIELD(flags);
> > +	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > +	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > +	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > +	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > +	SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > +}
> 
> Do we have other examples in the kernel that jump these hoops?

grep -rn 'BUILD_BUG_ON(offsetof' can give you some hints about other
usages in the kernel. But for a quick check you can look at:
  siginfo_buildtime_checks()

> 
> >  static long kvm_vm_ioctl(struct file *filp,
> >  			   unsigned int ioctl, unsigned long arg)
> >  {
> > @@ -4650,14 +4677,20 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		break;
> >  	}
> >  	case KVM_SET_USER_MEMORY_REGION: {
> > -		struct kvm_userspace_memory_region kvm_userspace_mem;
> > +		struct kvm_user_mem_region mem;
> > +		unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +		kvm_sanity_check_user_mem_region_alias();
> >  
> >  		r = -EFAULT;
> > -		if (copy_from_user(&kvm_userspace_mem, argp,
> > -						sizeof(kvm_userspace_mem)))
> > +		if (copy_from_user(&mem, argp, size))
> > +			goto out;
> > +
> > +		r = -EINVAL;
> > +		if (mem.flags & KVM_MEM_PRIVATE)
> >  			goto out;
> 
> Hmm I can see in the later code you explicitly check for the
> KVM_MEM_PRIVATE flag with:
> 
> 		if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> 			goto out;
> 
> 		if (flags & KVM_MEM_PRIVATE)
> 			size = sizeof(struct kvm_userspace_memory_region_ext);
> 		else
> 			size = sizeof(struct kvm_userspace_memory_region);
> 
> I think it would make sense to bring that sanity checking forward into
> this patch to avoid the validation logic working in two different ways
> over the series.

That was actually my original code; Sean then suggested changing it to
the current code[*]. The reason is that these two patches serve
different purposes: this patch introduces the data structures, while the
later patch actually makes use of the '_ext' variant.

[*] https://lkml.kernel.org/kvm/YuQ6QWcdZLdStkWl@google.com/

Chao
> 
> >  
> > -		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > +		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >  		break;
> >  	}
> >  	case KVM_GET_DIRTY_LOG: {
> 
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-14 22:16             ` Michael Roth
@ 2022-11-15  9:48               ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-15  9:48 UTC (permalink / raw)
  To: Michael Roth
  Cc: Kirill A. Shutemov, Vlastimil Babka, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, mhocko,
	Muchun Song, wei.w.wang

On Mon, Nov 14, 2022 at 04:16:32PM -0600, Michael Roth wrote:
> On Mon, Nov 14, 2022 at 06:28:43PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> > > On 11/1/22 16:19, Michael Roth wrote:
> > > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > >> > 
> > > >> >   1) restoring kernel directmap:
> > > >> > 
> > > >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> > > >> >      direct mappings for restricted PFNs, since there is no guarantee that
> > > >> >      other PFNs within a 2MB range won't be used for non-restricted
> > > >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> > > >> >      mapping overlaps with guest-owned pages)
> > > >> 
> > > >> Has the splitting and restoring been a well-discussed direction? I'm
> > > >> just curious whether there is other options to solve this issue.
> > > > 
> > > > For SNP it's been discussed for quite some time, and either splitting or
> > > > removing private entries from directmap are the well-discussed way I'm
> > > > aware of to avoid RMP violations due to some other kernel process using
> > > > a 2MB mapping to access shared memory if there are private pages that
> > > > happen to be within that range.
> > > > 
> > > > In both cases the issue of how to restore directmap as 2M becomes a
> > > > problem.
> > > > 
> > > > I was also under the impression TDX had similar requirements. If so,
> > > > do you know what the plan is for handling this for TDX?
> > > > 
> > > > There are also 2 potential alternatives I'm aware of, but these haven't
> > > > been discussed in much detail AFAIK:
> > > > 
> > > > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> > > >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> > > >    that enough THPs are available, so if we went that route we'd probably
> > > >    be better off requiring the use of hugetlbfs as the backing store. But
> > > >    obviously that's a bit limiting and it would be nice to have the option
> > > >    of using normal pages as well. One nice thing with the invalidation
> > > >    scheme proposed here is that this would "Just Work" if we implement
> > > >    hugetlbfs support, so an admin that doesn't want any directmap
> > > >    splitting has this option available, otherwise it's done as a
> > > >    best-effort.
> > > > 
> > > > b) Implement general support for restoring directmap as 2M even when
> > > >    subpages might be in use by other kernel threads. This would be the
> > > >    most flexible approach since it requires no special handling during
> > > >    invalidations, but I think it's only possible if all the CPA
> > > >    attributes for the 2M range are the same at the time the mapping is
> > > >    restored/unsplit, so some potential locking issues there and still
> > > >    chance for splitting directmap over time.
> > > 
> > > I've been hoping that
> > > 
> > > c) using a mechanism such as [1] [2] where the goal is to group together
> > > these small allocations that need to increase directmap granularity so
> > > maximum number of large mappings are preserved.
> > 
> > As I mentioned in the other thread the restricted memfd can be backed by
> > secretmem instead of plain memfd. It already handles directmap with care.
> 
> It looks like it would handle direct unmapping/cleanup nicely, but it
> seems to lack fallocate(PUNCH_HOLE) support, which we'd probably want in
> order to avoid additional memory requirements. I think once we added that we'd
> still end up needing some sort of handling for the invalidations.
> 
> Also, I know Chao has been considering hugetlbfs support, I assume by
> leveraging the support that already exists in shmem. Ideally SNP would
> be able to make use of that support as well, but relying on a separate
> backend seems likely to result in more complications getting there
> later.
> 
> > 
> > But I don't think it has to be part of initial restricted memfd
> > implementation. It is SEV-specific requirement and AMD folks can extend
> > implementation as needed later.
> 
> Admittedly the suggested changes to the invalidation mechanism made a
> lot more sense to me when I was under the impression that TDX would have
> similar requirements and we might end up with a common hook. Since that
> doesn't actually seem to be the case, it makes sense to try to do it as
> a platform-specific hook for SNP.
> 
> I think, given a memslot, a GFN range, and kvm_restricted_mem_get_pfn(),
> we should be able to get the same information needed to figure out whether
> the range is backed by huge pages or not. I'll see how that works out
> instead.

Sounds like a viable solution; just note that kvm_restricted_mem_get_pfn()
only gives you the ability to check a single page, not a range. But you
can still call it multiple times, I think.

The invalidation callback will still be needed; it gives you the chance
to do the restoring.
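
Something like the below rough sketch is what I would expect (just an
illustration: it assumes the helper keeps its current per-page pfn/order
semantics and takes a page reference, and range_is_2m_backed() is a
made-up name):

static bool range_is_2m_backed(struct kvm_memory_slot *slot,
			       gfn_t start, gfn_t end)
{
	const int want_order = PMD_SHIFT - PAGE_SHIFT;	/* 2M on x86 */
	gfn_t gfn = start;

	while (gfn < end) {
		kvm_pfn_t pfn;
		int order;

		if (kvm_restricted_mem_get_pfn(slot, gfn, &pfn, &order))
			return false;
		/* drop the reference the helper is expected to take */
		put_page(pfn_to_page(pfn));

		if (order < want_order)
			return false;

		/* jump to the first gfn of the next backing page */
		gfn = ALIGN_DOWN(gfn, (gfn_t)1 << order) + ((gfn_t)1 << order);
	}
	return true;
}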

Chao
> 
> Thanks,
> 
> Mike
> 
> > 
> > -- 
> >   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-11-09 15:54     ` Kirill A. Shutemov
@ 2022-11-15 14:36       ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2022-11-15 14:36 UTC (permalink / raw)
  To: Isaku Yamahata, Chao Peng, Hugh Dickins
  Cc: Vishal Annapurve, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Wed, Nov 09, 2022 at 06:54:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> > On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> > Vishal Annapurve <vannapurve@google.com> wrote:
> > 
> > > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >   - 01: mm change, target for mm tree
> > > >   - 02-08: KVM change, target for KVM tree
> > > >
> > > > Given KVM is the only current user for the mm part, I have chatted with
> > > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > > reviewed-by/acked-by is still expected from the mm people.
> > > >
> > > > The patches have been verified in Intel TDX environment, but Vishal has
> > > > done an excellent work on the selftests[4] which are dedicated for this
> > > > series, making it possible to test this series without innovative
> > > > hardware and fancy steps of building a VM environment. See Test section
> > > > below for more info.
> > > >
> > > >
> > > > Introduction
> > > > ============
> > > > KVM userspace being able to crash the host is horrible. Under current
> > > > KVM architecture, all guest memory is inherently accessible from KVM
> > > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > > series is to provide a solution to align mm and KVM, on a userspace
> > > > inaccessible approach of exposing guest memory.
> > > >
> > > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > > table). This requires guest memory being mmaped into KVM userspace, but
> > > > this is also the source where the mentioned crash issue can happen. In
> > > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > > memory doesn't have to be mmaped into KVM userspace.
> > > >
> > > > This series introduces fd-based guest memory which will not be mmaped
> > > > into KVM userspace. KVM populates secondary page table by using a
> > > 
> > > With no mappings in place for userspace VMM, IIUC, looks like the host
> > > kernel will not be able to find the culprit userspace process in case
> > > of Machine check error on guest private memory. As implemented in
> > > hwpoison_user_mappings, host kernel tries to look at the processes
> > > which have mapped the pfns with hardware error.
> > > 
> > > Is there a modification needed in mce handling logic of the host
> > > kernel to immediately send a signal to the vcpu thread accessing
> > > faulting pfn backing guest private memory?
> > 
> > mce_register_decode_chain() can be used.  The MCE physical address
> > (p->mce_addr) includes the host key id in addition to the real physical
> > address.  By searching the hkids used by KVM, we can determine whether the
> > page is assigned to a guest TD or not. If
> > yes, send SIGBUS.
> > 
> > kvm_machine_check() can be enhanced for KVM specific use.  This is before
> > memory_failure() is called, though.
> > 
> > any other ideas?
> 
> That's too KVM-centric. It will not work for other possible users of
> restricted memfd.
> 
> I tried to find a way to get it right: we need to get restricted memfd
> code info about corrupted page so it can invalidate its users. On the next
> request of the page the user will see an error. In case of KVM, the error
> will likely escalate to SIGBUS.
> 
> The problem is that core-mm code that handles memory failure knows nothing
> about restricted memfd. It only sees that the page belongs to a normal
> memfd.
> 
> AFAICS, there's no way to get it intercepted from the shim level. shmem
> code has to be patched. shmem_error_remove_page() has to call into
> restricted memfd code.
> 
> Hugh, are you okay with this? Or maybe you have a better idea?

Okay, here is what I've come up with. It doesn't touch shmem code, but
hooks directly into memory-failure.c. It is still ugly, but should be
tolerable.

restrictedmem_error_page() loops over all restrictedmem inodes. It is
slow, but memory failure is not a hot path (I hope).

Only build-tested. Chao, could you hook up ->error for KVM and get it
tested?

diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
index 9c37c3ea3180..c2700c5daa43 100644
--- a/include/linux/restrictedmem.h
+++ b/include/linux/restrictedmem.h
@@ -12,6 +12,8 @@ struct restrictedmem_notifier_ops {
 				 pgoff_t start, pgoff_t end);
 	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
 			       pgoff_t start, pgoff_t end);
+	void (*error)(struct restrictedmem_notifier *notifier,
+			       pgoff_t start, pgoff_t end);
 };
 
 struct restrictedmem_notifier {
@@ -34,6 +36,8 @@ static inline bool file_is_restrictedmem(struct file *file)
 	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
 }
 
+void restrictedmem_error_page(struct page *page, struct address_space *mapping);
+
 #else
 
 static inline void restrictedmem_register_notifier(struct file *file,
@@ -57,6 +61,11 @@ static inline bool file_is_restrictedmem(struct file *file)
 	return false;
 }
 
+static inline void restrictedmem_error_page(struct page *page,
+					    struct address_space *mapping)
+{
+}
+
 #endif /* CONFIG_RESTRICTEDMEM */
 
 #endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e7ac570dda75..ee85e46c6992 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -62,6 +62,7 @@
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/restrictedmem.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -939,6 +940,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 		goto out;
 	}
 
+	restrictedmem_error_page(p, mapping);
+
 	/*
 	 * The shmem page is kept in page cache instead of truncating
 	 * so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index e5bf8907e0f8..0dcdff0d8055 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -29,6 +29,18 @@ static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
 	mutex_unlock(&data->lock);
 }
 
+static void restrictedmem_notifier_error(struct restrictedmem_data *data,
+				 pgoff_t start, pgoff_t end)
+{
+	struct restrictedmem_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+			notifier->ops->error(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
 static int restrictedmem_release(struct inode *inode, struct file *file)
 {
 	struct restrictedmem_data *data = inode->i_mapping->private_data;
@@ -248,3 +260,30 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping)
+{
+	struct super_block *sb = restrictedmem_mnt->mnt_sb;
+	struct inode *inode, *next;
+
+	if (!shmem_mapping(mapping))
+		return;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+		struct restrictedmem_data *data = inode->i_mapping->private_data;
+		struct file *memfd = data->memfd;
+
+		if (memfd->f_mapping == mapping) {
+			pgoff_t start, end;
+
+			spin_unlock(&sb->s_inode_list_lock);
+
+			start = page->index;
+			end = start + thp_nr_pages(page);
+			restrictedmem_notifier_error(data, start, end);
+			return;
+		}
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+}
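
On the KVM side the hookup could then look roughly like the below
(sketch only: the error handler body is a placeholder and I'm going from
memory on the exact names of KVM's existing invalidate callbacks):

static void kvm_restrictedmem_error(struct restrictedmem_notifier *notifier,
				    pgoff_t start, pgoff_t end)
{
	/* e.g. zap the affected GFN range and remember it as poisoned so
	 * the next guest access can be escalated to userspace/SIGBUS */
}

static const struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
	.invalidate_start	= kvm_restrictedmem_invalidate_begin,
	.invalidate_end		= kvm_restrictedmem_invalidate_end,
	.error			= kvm_restrictedmem_error,
};
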
-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
  2022-10-25 15:26   ` Peter Maydell
  2022-10-27 10:27   ` Fuad Tabba
@ 2022-11-15 16:56   ` Alex Bennée
  2022-11-16  3:14     ` Chao Peng
  2022-11-16 18:15   ` Andy Lutomirski
  3 siblings, 1 reply; 101+ messages in thread
From: Alex Bennée @ 2022-11-15 16:56 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang


Chao Peng <chao.p.peng@linux.intel.com> writes:

> This new KVM exit allows userspace to handle memory-related errors. It
> indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> The flags includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory' where '1'
> indicates error happens due to private memory access and '0' indicates
> error happens due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kind of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
>     to map a range (as private or shared), KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> +   private memory access when the bit is set. Otherwise the memory error is
> +   caused by shared memory access when the bit is clear.

What does a shared memory access failure entail?

If you envision any other failure modes it might be worth making it
explicit with additional flags. I also wonder if a bitmask makes sense if
there can only be one reason for a failure? Maybe all that is needed is
a reason enum?

> +
> +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> +may handle the error and return to KVM to retry the previous memory access.
> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index f1ae45c10c94..fa60b032a405 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -538,6 +539,14 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-15 16:56   ` Alex Bennée
@ 2022-11-16  3:14     ` Chao Peng
  2022-11-16 19:03       ` Alex Bennée
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-16  3:14 UTC (permalink / raw)
  To: Alex Bennée
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kind of memory
> > conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM
> >     to map a range (as private or shared), KVM then exits to userspace
> >     to perform the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler where KVM
> >     exits to userspace for an implicit conversion when the page is in a
> >     different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >  
> > +::
> > +
> > +		/* KVM_EXIT_MEMORY_FAULT */
> > +		struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > +			__u32 flags;
> > +			__u32 padding;
> > +			__u64 gpa;
> > +			__u64 size;
> > +		} memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> 
> What does a shared memory access failure entail?

In the context of confidential computing usages, the guest can issue a
shared memory access while the memory is actually private from the host
point of view. This exit, with bit 0 cleared, gives userspace a chance
to convert the private memory to shared memory on the host.
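
E.g. the userspace handling can be as simple as the below sketch (error
handling omitted; vm_fd is the VM fd, 'run' is the vcpu's struct kvm_run,
and the range is converted with the repurposed
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls described in the cover letter):

static void handle_memory_fault(int vm_fd, struct kvm_run *run)
{
	struct kvm_enc_region region = {
		.addr = run->memory.gpa,
		.size = run->memory.size,
	};

	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		/* fault on a private access: convert the range to private */
		ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
	else
		/* fault on a shared access: convert the range back to shared */
		ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);
}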

> 
> If you envision any other failure modes it might be worth making it
> explicit with additional flags.

Sean mentioned some more usages[1][2] other than the memory conversion
for confidential usage. But I would leave those flags to be added in the
future, after those usages have been well discussed.

[1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
[2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com

> I also wonder if a bitmask makes sense if
> there can only be one reason for a failure? Maybe all that is needed is
> a reason enum?

Though we only have one reason right now, we still want to leave room
for future extension. An enum can express a single value at a time well,
but a bitmask makes it possible to express multiple orthogonal flags.

Chao
> 
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >  
> >      /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f1ae45c10c94..fa60b032a405 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI        35
> >  #define KVM_EXIT_RISCV_CSR        36
> >  #define KVM_EXIT_NOTIFY           37
> > +#define KVM_EXIT_MEMORY_FAULT     38
> >  
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -538,6 +539,14 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
> >  			__u32 flags;
> >  		} notify;
> > +		/* KVM_EXIT_MEMORY_FAULT */
> > +		struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > +			__u32 flags;
> > +			__u32 padding;
> > +			__u64 gpa;
> > +			__u64 size;
> > +		} memory;
> >  		/* Fix the size of the union. */
> >  		char padding[256];
> >  	};
> 
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-11-14 11:43 ` Alex Bennée
@ 2022-11-16  5:00   ` Chao Peng
  2022-11-16  9:40     ` Alex Bennée
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-16  5:00 UTC (permalink / raw)
  To: Alex Bennée
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, Viresh Kumar,
	Mathieu Poirier, AKASHI Takahiro

On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> <snip>
> > Introduction
> > ============
> > KVM userspace being able to crash the host is horrible. Under current
> > KVM architecture, all guest memory is inherently accessible from KVM
> > userspace and is exposed to the mentioned crash issue. The goal of this
> > series is to provide a solution to align mm and KVM, on a userspace
> > inaccessible approach of exposing guest memory. 
> >
> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > table). This requires guest memory being mmaped into KVM userspace, but
> > this is also the source where the mentioned crash issue can happen. In
> > theory, apart from those 'shared' memory for device emulation etc, guest
> > memory doesn't have to be mmaped into KVM userspace.
> >
> > This series introduces fd-based guest memory which will not be mmaped
> > into KVM userspace. KVM populates secondary page table by using a
> > fd/offset pair backed by a memory file system. The fd can be created
> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> > directly interact with them with newly introduced in-kernel interface,
> > therefore remove the KVM userspace from the path of accessing/mmaping
> > the guest memory. 
> >
> > Kirill had a patch [2] to address the same issue in a different way. It
> > tracks guest encrypted memory at the 'struct page' level and relies on
> > HWPOISON to reject the userspace access. The patch has been discussed in
> > several online and offline threads and resulted in a design document [3]
> > which is also the original proposal for this series. Later this patch
> > series evolved as more comments received in community but the major
> > concepts in [3] still hold true so recommend reading.
> >
> > The patch series may also be useful for other usages, for example, pure
> > software approach may use it to harden itself against unintentional
> > access to guest memory. This series is designed with these usages in
> > mind but doesn't have code directly support them and extension might be
> > needed.
> 
> There are a couple of additional use cases where having a consistent
> memory interface with the kernel would be useful.

Thanks very much for the info. But I'm not so confident that the current
memfd_restricted() implementation can be useful for all these usages. 

> 
>   - Xen DomU guests providing other domains with VirtIO backends
> 
>   Xen by default doesn't give other domains special access to a domains
>   memory. The guest can grant access to regions of its memory to other
>   domains for this purpose. 

I'm trying to form my understanding of how this could work and what the
benefit is for a DomU guest to provide memory through memfd_restricted().
AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
but I assume VirtIO backends are still in DomU userspace and need to
access that memory, right?

> 
>   - pKVM on ARM
> 
>   Similar to Xen, pKVM moves the management of the page tables into the
>   hypervisor and again doesn't allow those domains to share memory by
>   default.

Right, we already had some discussions on this in the past versions.

> 
>   - VirtIO loopback
> 
>   This allows for VirtIO devices for the host kernel to be serviced by
>   backends running in userspace. Obviously the memory userspace is
>   allowed to access is strictly limited to the buffers and queues
>   because giving userspace unrestricted access to the host kernel would
>   have consequences.

Okay, but normal memfd_create() should work for it, right? And
memfd_restricted() may not work, as it unmaps the memory from
userspace.

> 
> All of these VirtIO backends work with vhost-user which uses memfds to
> pass references to guest memory from the VMM to the backend
> implementation.

Sounds to me like these are places where normal memfd_create() fits.
VirtIO backends work on mmap-ed memory, which is currently not the case
for memfd_restricted(). memfd_restricted() has a different design
purpose: it unmaps the memory from userspace and employs kernel
callbacks so that other kernel modules can make use of the memory
through these callbacks instead of a userspace virtual address.
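
As a rough userspace-side contrast (sketch only: it assumes the
single-flags-argument form of the syscall in patch 01, with
__NR_memfd_restricted wired up by that patch and no libc wrapper yet;
error handling omitted):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void make_guest_fds(size_t size, int *shared_fd, int *priv_fd)
{
	/* Shared part: the VMM (and e.g. vhost-user backends) can mmap it. */
	*shared_fd = memfd_create("guest-shared", MFD_CLOEXEC);
	ftruncate(*shared_fd, size);
	mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, *shared_fd, 0);

	/* Private part: mmap()/read()/write() are refused; the fd is only
	 * handed to KVM via restricted_fd/restricted_offset in struct
	 * kvm_userspace_memory_region_ext and consumed in-kernel. */
	*priv_fd = syscall(__NR_memfd_restricted, 0);
}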

Chao
> 
> > mm change
> > =========
> > Introduces a new memfd_restricted system call which can create memory
> > file that is restricted from userspace access via normal MMU operations
> > like read(), write() or mmap() etc and the only way to use it is
> > passing it to a third kernel module like KVM and relying on it to
> > access the fd through the newly added restrictedmem kernel interface.
> > The restrictedmem interface bridges the memory file subsystems
> > (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> > bi-directional communication between them. 
> >
> >
> > KVM change
> > ==========
> > Extends the KVM memslot to provide guest private (encrypted) memory from
> > a fd. With this extension, a single memslot can maintain both private
> > memory through private fd (restricted_fd/restricted_offset) and shared
> > (unencrypted) memory through userspace mmaped host virtual address
> > (userspace_addr). For a particular guest page, the corresponding page in
> > KVM memslot can be only either private or shared and only one of the
> > shared/private parts of the memslot is visible to guest. For how this
> > new extension is used in QEMU, please refer to kvm_set_phys_mem() in
> > below TDX-enabled QEMU repo.
> >
> > Introduces new KVM_EXIT_MEMORY_FAULT exit to allow userspace to get the
> > chance on decision-making for shared <-> private memory conversion. The
> > exit can be an implicit conversion in KVM page fault handler or an
> > explicit conversion from guest OS.
> >
> > Extends existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
> > convert a guest page between private <-> shared. The data maintained in
> > these ioctls tells the truth whether a guest page is private or shared
> > and this information will be used in KVM page fault handler to decide
> > whether the private or the shared part of the memslot is visible to
> > guest.
> >
> <snip>
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-11-16  5:00   ` Chao Peng
@ 2022-11-16  9:40     ` Alex Bennée
  2022-11-17 14:16       ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Alex Bennée @ 2022-11-16  9:40 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, Viresh Kumar,
	Mathieu Poirier, AKASHI Takahiro


Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
>> 
>> Chao Peng <chao.p.peng@linux.intel.com> writes:
>> 
>> <snip>
>> > Introduction
>> > ============
>> > KVM userspace being able to crash the host is horrible. Under current
>> > KVM architecture, all guest memory is inherently accessible from KVM
>> > userspace and is exposed to the mentioned crash issue. The goal of this
>> > series is to provide a solution to align mm and KVM, on a userspace
>> > inaccessible approach of exposing guest memory. 
>> >
>> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
>> > virtual address (hva) from core mm page table (e.g. x86 userspace page
>> > table). This requires guest memory being mmaped into KVM userspace, but
>> > this is also the source where the mentioned crash issue can happen. In
>> > theory, apart from those 'shared' memory for device emulation etc, guest
>> > memory doesn't have to be mmaped into KVM userspace.
>> >
>> > This series introduces fd-based guest memory which will not be mmaped
>> > into KVM userspace. KVM populates secondary page table by using a
>> > fd/offset pair backed by a memory file system. The fd can be created
>> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
>> > directly interact with them with newly introduced in-kernel interface,
>> > therefore remove the KVM userspace from the path of accessing/mmaping
>> > the guest memory. 
>> >
>> > Kirill had a patch [2] to address the same issue in a different way. It
>> > tracks guest encrypted memory at the 'struct page' level and relies on
>> > HWPOISON to reject the userspace access. The patch has been discussed in
>> > several online and offline threads and resulted in a design document [3]
>> > which is also the original proposal for this series. Later this patch
>> > series evolved as more comments received in community but the major
>> > concepts in [3] still hold true so recommend reading.
>> >
>> > The patch series may also be useful for other usages, for example, pure
>> > software approach may use it to harden itself against unintentional
>> > access to guest memory. This series is designed with these usages in
>> > mind but doesn't have code directly support them and extension might be
>> > needed.
>> 
>> There are a couple of additional use cases where having a consistent
>> memory interface with the kernel would be useful.
>
> Thanks very much for the info. But I'm not so confident that the current
> memfd_restricted() implementation can be useful for all these usages. 
>
>> 
>>   - Xen DomU guests providing other domains with VirtIO backends
>> 
>>   Xen by default doesn't give other domains special access to a domains
>>   memory. The guest can grant access to regions of its memory to other
>>   domains for this purpose. 
>
> I'm trying to form my understanding on how this could work and what's
> the benefit for a DomU guest to provide memory through memfd_restricted().
> AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> but I assume VirtIO backends are still in DomU uerspace and need access
> that memory, right?

They need access to parts of the memory. At the moment you run your
VirtIO domains in Dom0 and give them access to the whole of a DomU's
address space - however, in the Xen model a guest's memory is by default
inaccessible to other domains on the system. The DomU guest uses the Xen
grant model to expose portions of its address space to other domains -
namely for the VirtIO queues themselves and any pages containing buffers
involved in the VirtIO transaction. My thought was that this looks like
a guest memory interface which is mostly inaccessible (private) with
some holes in it where memory is being explicitly shared with other
domains.

What I want to achieve is a common userspace API with defined semantics
for what happens when private and shared regions are accessed, because
having each hypervisor/confidential computing architecture define its
own special API for accessing this memory is just a recipe for
fragmentation and makes sharing common VirtIO backends impossible.

>
>> 
>>   - pKVM on ARM
>> 
>>   Similar to Xen, pKVM moves the management of the page tables into the
>>   hypervisor and again doesn't allow those domains to share memory by
>>   default.
>
> Right, we already had some discussions on this in the past versions.
>
>> 
>>   - VirtIO loopback
>> 
>>   This allows for VirtIO devices for the host kernel to be serviced by
>>   backends running in userspace. Obviously the memory userspace is
>>   allowed to access is strictly limited to the buffers and queues
>>   because giving userspace unrestricted access to the host kernel would
>>   have consequences.
>
> Okay, but normal memfd_create() should work for it, right? And
> memfd_restricted() instead may not work as it unmaps the memory from
> userspace.
>
>> 
>> All of these VirtIO backends work with vhost-user which uses memfds to
>> pass references to guest memory from the VMM to the backend
>> implementation.
>
> Sounds to me these are the places where normal memfd_create() can act on.
> VirtIO backends work on the mmap-ed memory which currently is not the
> case for memfd_restricted(). memfd_restricted() has different design
> purpose that unmaps the memory from userspace and employs some kernel
> callbacks so other kernel modules can make use of the memory with these
> callbacks instead of userspace virtual address.

Maybe my understanding is backwards then. Are you saying a guest starts
with all its memory exposed and then selectively unmaps the private
regions? Is this driven by the VMM or the guest itself?

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
                     ` (2 preceding siblings ...)
  2022-11-15 16:56   ` Alex Bennée
@ 2022-11-16 18:15   ` Andy Lutomirski
  2022-11-16 18:48     ` Sean Christopherson
  3 siblings, 1 reply; 101+ messages in thread
From: Andy Lutomirski @ 2022-11-16 18:15 UTC (permalink / raw)
  To: Chao Peng, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, linux-arch, Linux API, linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Fuad Tabba,
	Michael Roth, Michal Hocko, Muchun Song, Wei W Wang



On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates an error happened in KVM at guest memory range [gpa, gpa+size).
> The flags field includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory', where '1'
> indicates the error happened due to a private memory access and '0'
> indicates it happened due to a shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kinds of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
>     to map a range (as private or shared), KVM then exits to userspace
>     to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
>     exits to userspace for an implicit conversion when the page is in a
>     different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst 
> b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The 
> userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on 
> RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
> 
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +

Would it make sense to also have a field for the access type (read, write,
execute, etc)?  I realize that shared <-> private conversion doesn't strictly
need this, but it seems like it could be useful for logging failures and also
for avoiding a second immediate fault if the type gets converted but doesn't
have the right protection yet.

(Obviously, if this were changed, KVM would need the ability to report that
it doesn't actually know the mode.)

--Andy


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-16 18:15   ` Andy Lutomirski
@ 2022-11-16 18:48     ` Sean Christopherson
  2022-11-17 13:42       ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-16 18:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chao Peng, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, linux-arch, Linux API, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, the arch/x86 maintainers, H. Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Fuad Tabba,
	Michael Roth, Michal Hocko, Muchun Song, Wei W Wang

On Wed, Nov 16, 2022, Andy Lutomirski wrote:
> 
> 
> On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> > diff --git a/Documentation/virt/kvm/api.rst 
> > b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> > userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on 
> > RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > 
> > +::
> > +
> > +		/* KVM_EXIT_MEMORY_FAULT */
> > +		struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > +			__u32 flags;
> > +			__u32 padding;
> > +			__u64 gpa;
> > +			__u64 size;
> > +		} memory;
> > +
> 
> Would it make sense to also have a field for the access type (read, write,
> execute, etc)?  I realize that shared <-> private conversion doesn't strictly
> need this, but it seems like it could be useful for logging failures and also
> for avoiding a second immediate fault if the type gets converted but doesn't
> have the right protection yet.

I don't think a separate field is necessary, that info can be conveyed via flags.
Though maybe we should go straight to a u64 for flags.  Hmm, and maybe avoid bits
0-3 so that if/when RWX info is conveyed the flags can align with
PROT_{READ,WRITE,EXEC} and the EPT flags, e.g.

	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)

> (Obviously, if this were changed, KVM would need the ability to report that
> it doesn't actually know the mode.)
> 
> --Andy


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-16  3:14     ` Chao Peng
@ 2022-11-16 19:03       ` Alex Bennée
  2022-11-17 13:45         ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Alex Bennée @ 2022-11-16 19:03 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang


Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
>> 
>> Chao Peng <chao.p.peng@linux.intel.com> writes:
>> 
>> > This new KVM exit allows userspace to handle memory-related errors. It
>> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
>> > The flags includes additional information for userspace to handle the
>> > error. Currently bit 0 is defined as 'private memory' where '1'
>> > indicates error happens due to private memory access and '0' indicates
>> > error happens due to shared memory access.
>> >
>> > When private memory is enabled, this new exit will be used for KVM to
>> > exit to userspace for shared <-> private memory conversion in memory
>> > encryption usage. In such usage, typically there are two kind of memory
>> > conversions:
>> >   - explicit conversion: happens when guest explicitly calls into KVM
>> >     to map a range (as private or shared), KVM then exits to userspace
>> >     to perform the map/unmap operations.
>> >   - implicit conversion: happens in KVM page fault handler where KVM
>> >     exits to userspace for an implicit conversion when the page is in a
>> >     different state than requested (private or shared).
>> >
>> > Suggested-by: Sean Christopherson <seanjc@google.com>
>> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> > ---
>> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>> >  include/uapi/linux/kvm.h       |  9 +++++++++
>> >  2 files changed, 32 insertions(+)
>> >
>> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> > index f3fa75649a78..975688912b8c 100644
>> > --- a/Documentation/virt/kvm/api.rst
>> > +++ b/Documentation/virt/kvm/api.rst
>> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
>> >  
>> > +::
>> > +
>> > +		/* KVM_EXIT_MEMORY_FAULT */
>> > +		struct {
>> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> > +			__u32 flags;
>> > +			__u32 padding;
>> > +			__u64 gpa;
>> > +			__u64 size;
>> > +		} memory;
>> > +
>> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
>> > +encountered a memory error which is not handled by KVM kernel module and
>> > +userspace may choose to handle it. The 'flags' field indicates the memory
>> > +properties of the exit.
>> > +
>> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
>> > +   private memory access when the bit is set. Otherwise the memory error is
>> > +   caused by shared memory access when the bit is clear.
>> 
>> What does a shared memory access failure entail?
>
> In the context of confidential computing usages, a guest can issue a
> shared memory access while the memory is actually private from the host's
> point of view. This exit with bit 0 cleared gives userspace a chance to
> convert the private memory to shared memory on the host.

I think this should be explicit rather than implied by the absence of
another flag. Sean suggested you might want flags for RWX failures so
maybe something like:

	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
        KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)

which would allow you to signal the various failure modes of the shared
region, or that you had accessed private memory.
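
A VMM exit handler could then do something like the following (sketch; the
flag values are the strawman above and the convert_*() helpers are made up):

	/* Hypothetical userspace handling of KVM_EXIT_MEMORY_FAULT. */
	static void handle_memory_fault(int vm_fd, struct kvm_run *run)
	{
		__u64 gpa  = run->memory.gpa;
		__u64 size = run->memory.size;

		if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
			convert_to_private(vm_fd, gpa, size);	/* made up */
		else
			convert_to_shared(vm_fd, gpa, size);	/* made up */

		/* With separate R/W/X bits the handler could also log which
		 * kind of access hit the wrong state before retrying. */
	}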

>
>> 
>> If you envision any other failure modes it might be worth making it
>> explicit with additional flags.
>
> Sean mentioned some more usages[1][2] other than the memory conversion
> for confidential usage. But I would leave those flags to be added in the
> future after those usages are well discussed.
>
> [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
>
>> I also wonder if a bitmask makes sense if
>> there can only be one reason for a failure? Maybe all that is needed is
>> a reason enum?
>
> Though we only have one reason right now, we still want to leave room
> for future extension. An enum can express a single value at once well, but
> a bitmask makes it possible to express multiple orthogonal flags.

I agree if multiple orthogonal failures can occur at once a bitmask is
the right choice.

>
> Chao
>> 
>> > +
>> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
>> > +may handle the error and return to KVM to retry the previous memory access.
>> > +
>> >  ::
>> >  
>> >      /* KVM_EXIT_NOTIFY */
>> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> > index f1ae45c10c94..fa60b032a405 100644
>> > --- a/include/uapi/linux/kvm.h
>> > +++ b/include/uapi/linux/kvm.h
>> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>> >  #define KVM_EXIT_RISCV_SBI        35
>> >  #define KVM_EXIT_RISCV_CSR        36
>> >  #define KVM_EXIT_NOTIFY           37
>> > +#define KVM_EXIT_MEMORY_FAULT     38
>> >  
>> >  /* For KVM_EXIT_INTERNAL_ERROR */
>> >  /* Emulate instruction failed. */
>> > @@ -538,6 +539,14 @@ struct kvm_run {
>> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>> >  			__u32 flags;
>> >  		} notify;
>> > +		/* KVM_EXIT_MEMORY_FAULT */
>> > +		struct {
>> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> > +			__u32 flags;
>> > +			__u32 padding;
>> > +			__u64 gpa;
>> > +			__u64 size;
>> > +		} memory;
>> >  		/* Fix the size of the union. */
>> >  		char padding[256];
>> >  	};
>> 
>> 
>> -- 
>> Alex Bennée


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-10-25 15:13 ` [PATCH v9 7/8] KVM: Handle page fault for private memory Chao Peng
  2022-10-26 21:54   ` Isaku Yamahata
@ 2022-11-16 20:50   ` Ackerley Tng
  2022-11-16 22:13     ` Sean Christopherson
  1 sibling, 1 reply; 101+ messages in thread
From: Ackerley Tng @ 2022-11-16 20:50 UTC (permalink / raw)
  To: chao.p.peng
  Cc: aarcange, ak, akpm, bfields, bp, corbet, dave.hansen, david,
	ddutile, dhildenb, hpa, hughd, jlayton, jmattson, joro,
	jun.nakajima, kirill.shutemov, kvm, linux-api, linux-arch,
	linux-doc, linux-fsdevel, linux-kernel, linux-mm, luto, mail,
	mhocko, michael.roth, mingo, pbonzini, qemu-devel, qperret, rppt,
	seanjc, shuah, songmuchun, steven.price, tabba, tglx, vannapurve,
	vbabka, vkuznets, wanpengli, wei.w.wang, x86, yu.c.zhang

> A memslot with KVM_MEM_PRIVATE being set can include both fd-based
> private memory and hva-based shared memory. Architecture code (like TDX
> code) can tell whether the ongoing fault is private or not. This patch
> adds an 'is_private' field to kvm_page_fault to indicate this, and
> architecture code is expected to set it.
>
> To handle page fault for such memslot, the handling logic is different
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
>   - For a successful match, the private pfn is obtained with
>     restrictedmem_get_page() from the private fd and the shared pfn is
>     obtained with the existing get_user_pages().
>   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
>     userspace. Userspace then can convert memory between private/shared
>     in host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 56 +++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu_internal.h | 14 ++++++++-
>  arch/x86/kvm/mmu/mmutrace.h     |  1 +
>  arch/x86/kvm/mmu/spte.h         |  6 ++++
>  arch/x86/kvm/mmu/tdp_mmu.c      |  3 +-
>  include/linux/kvm_host.h        | 28 +++++++++++++++++
>  6 files changed, 103 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 67a9823a8c35..10017a9f26ee 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>
>  int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			      const struct kvm_memory_slot *slot, gfn_t gfn,
> -			      int max_level)
> +			      int max_level, bool is_private)
>  {
>  	struct kvm_lpage_info *linfo;
>  	int host_level;
> @@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			break;
>  	}
>
> +	if (is_private)
> +		return max_level;
> +
>  	if (max_level == PG_LEVEL_4K)
>  		return PG_LEVEL_4K;
>
> @@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	 * level, which will be used to do precise, accurate accounting.
>  	 */
>  	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -						     fault->gfn, fault->max_level);
> +						     fault->gfn, fault->max_level,
> +						     fault->is_private);
>  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>  		return;
>
> @@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
>  }
>
> +static inline u8 order_to_level(int order)
> +{
> +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> +		return PG_LEVEL_1G;
> +
> +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> +		return PG_LEVEL_2M;
> +
> +	return PG_LEVEL_4K;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> +{
> +	int order;
> +	struct kvm_memory_slot *slot = fault->slot;
> +
> +	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> +		return RET_PF_RETRY;
> +
> +	fault->max_level = min(order_to_level(order), fault->max_level);
> +	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> +	return RET_PF_CONTINUE;
> +}
>+
> static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> 	struct kvm_memory_slot *slot = fault->slot;
>@@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> 			return RET_PF_EMULATE;
> 	}
>
>+	if (kvm_slot_can_be_private(slot) &&
>+	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
>+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>+		if (fault->is_private)
>+			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
>+		else
>+			vcpu->run->memory.flags = 0;
>+		vcpu->run->memory.padding = 0;
>+		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
>+		vcpu->run->memory.size = PAGE_SIZE;
>+		return RET_PF_USER;
>+	}
>+
>+	if (fault->is_private)
>+		return kvm_faultin_pfn_private(fault);
>+

Since each memslot may also not be backed by restricted memory, we
should also check if the memslot has been set up for private memory
with

	if (fault->is_private && kvm_slot_can_be_private(slot))
		return kvm_faultin_pfn_private(fault);

Without this check, restrictedmem_get_page will get called with NULL
in slot->restricted_file, which causes a NULL pointer dereference.

> 	async = false;
> 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> 					  fault->write, &fault->map_writable,
>@@ -5557,6 +5603,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> 			return -EIO;
> 	}
>
>+	if (r == RET_PF_USER)
>+		return 0;
>+
> 	if (r < 0)
> 		return r;
> 	if (r != RET_PF_EMULATE)
>@@ -6408,7 +6457,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> 		 */
> 		if (sp->role.direct &&
> 		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
>-							       PG_LEVEL_NUM)) {
>+							       PG_LEVEL_NUM,
>+							       false)) {
> 			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
> 			if (kvm_available_flush_tlb_with_range())


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-11-16 20:50   ` Ackerley Tng
@ 2022-11-16 22:13     ` Sean Christopherson
  2022-11-17 13:25       ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-16 22:13 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: chao.p.peng, aarcange, ak, akpm, bfields, bp, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, kvm, linux-api,
	linux-arch, linux-doc, linux-fsdevel, linux-kernel, linux-mm,
	luto, mail, mhocko, michael.roth, mingo, pbonzini, qemu-devel,
	qperret, rppt, shuah, songmuchun, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang

On Wed, Nov 16, 2022, Ackerley Tng wrote:
> >@@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > 			return RET_PF_EMULATE;
> > 	}
> >
> >+	if (kvm_slot_can_be_private(slot) &&
> >+	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> >+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >+		if (fault->is_private)
> >+			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> >+		else
> >+			vcpu->run->memory.flags = 0;
> >+		vcpu->run->memory.padding = 0;
> >+		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> >+		vcpu->run->memory.size = PAGE_SIZE;
> >+		return RET_PF_USER;
> >+	}
> >+
> >+	if (fault->is_private)
> >+		return kvm_faultin_pfn_private(fault);
> >+
> 
> Since each memslot may also not be backed by restricted memory, we
> should also check if the memslot has been set up for private memory
> with
> 
> 	if (fault->is_private && kvm_slot_can_be_private(slot))
> 		return kvm_faultin_pfn_private(fault);
> 
> Without this check, restrictedmem_get_page will get called with NULL
> in slot->restricted_file, which causes a NULL pointer dereference.

Hmm, silently skipping the faultin would result in KVM faulting in the shared
portion of the memslot, and I believe would end up mapping that pfn as private,
i.e. would map a non-UPM PFN as a private mapping.  For TDX and SNP, that would
be double ungood as it would let the host access memory that is mapped private,
i.e. lead to #MC or #PF(RMP) in the host.

I believe the correct solution is to drop the "can be private" check from the
above check, and instead handle that in kvm_faultin_pfn_private().  That would fix
another bug, e.g. if the fault is shared, the slot can't be private, but for
whatever reason userspace marked the gfn as private.  Even though KVM might be
able to service the fault, the correct thing to do in that case is to exit to userspace.

E.g.

---
 arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 10017a9f26ee..e2ac8873938e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4158,11 +4158,29 @@ static inline u8 order_to_level(int order)
 	return PG_LEVEL_4K;
 }
 
-static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
+static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
+					struct kvm_page_fault *fault)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	if (fault->is_private)
+		vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+	else
+		vcpu->run->memory.flags = 0;
+	vcpu->run->memory.padding = 0;
+	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+	vcpu->run->memory.size = PAGE_SIZE;
+	return RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				   struct kvm_page_fault *fault)
 {
 	int order;
 	struct kvm_memory_slot *slot = fault->slot;
 
+	if (!kvm_slot_can_be_private(slot))
+		return kvm_do_memory_fault_exit(vcpu, fault);
+
 	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
 		return RET_PF_RETRY;
 
@@ -4203,21 +4221,11 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			return RET_PF_EMULATE;
 	}
 
-	if (kvm_slot_can_be_private(slot) &&
-	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
-		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
-		if (fault->is_private)
-			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
-		else
-			vcpu->run->memory.flags = 0;
-		vcpu->run->memory.padding = 0;
-		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
-		vcpu->run->memory.size = PAGE_SIZE;
-		return RET_PF_USER;
-	}
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
+		return kvm_do_memory_fault_exit(vcpu, fault);
 
 	if (fault->is_private)
-		return kvm_faultin_pfn_private(fault);
+		return kvm_faultin_pfn_private(vcpu, fault);
 
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,

base-commit: 969d761bb7b8654605937f31ae76123dcb7f15a3
-- 



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
                     ` (2 preceding siblings ...)
  2022-11-08  1:35   ` Yuan Yao
@ 2022-11-16 22:24   ` Sean Christopherson
  2022-11-17 13:20     ` Chao Peng
  3 siblings, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-16 22:24 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022, Chao Peng wrote:
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> +				     bool is_private)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	int idx;
> +	int r = 0;
> +
> +	if (size == 0 || gpa + size < gpa)
> +		return -EINVAL;
> +	if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> +		return -EINVAL;
> +
> +	start = gpa >> PAGE_SHIFT;
> +	end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	/*
> +	 * Guest memory defaults to private, kvm->mem_attr_array only stores
> +	 * shared memory.
> +	 */
> +	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	KVM_MMU_LOCK(kvm);
> +	kvm_mmu_invalidate_begin(kvm, start, end);
> +
> +	for (i = start; i < end; i++) {
> +		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT));
> +		if (r)
> +			goto err;
> +	}
> +
> +	kvm_unmap_mem_range(kvm, start, end);
> +
> +	goto ret;
> +err:
> +	for (; i > start; i--)
> +		xa_erase(&kvm->mem_attr_array, i);

I don't think deleting previous entries is correct.  To unwind, the correct thing
to do is restore the original values.  E.g. if userspace space is mapping a large
range as shared, and some of the previous entries were shared, deleting them would
incorrectly "convert" those entries to private.

Tracking the previous state likely isn't the best approach, e.g. it would require
speculatively allocating extra memory for a rare condition that is likely going to
lead to OOM anyways.

Instead of trying to unwind, what about updating the ioctl() params such that
retrying with the updated addr+size would Just Work?  E.g.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 55b07aae67cc..f1de592a1a06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 
        kvm_unmap_mem_range(kvm, start, end, attr);
 
-       goto ret;
-err:
-       for (; i > start; i--)
-               xa_erase(&kvm->mem_attr_array, i);
-ret:
        kvm_mmu_invalidate_end(kvm, start, end);
        KVM_MMU_UNLOCK(kvm);
        srcu_read_unlock(&kvm->srcu, idx);
 
+       <update gpa and size>
+
        return r;
 }
 #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
@@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp,
 
                r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
                                              region.size, set);
+               if (copy_to_user(argp, &region, sizeof(region)) && !r)
+                       r = -EFAULT;
                break;
        }
 #endif
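
Userspace would then just loop until the whole range is converted, e.g.
(sketch; assumes the series keeps reusing KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION
with struct kvm_enc_region, and that the only failure worth retrying is a
transient -ENOMEM from the xarray store):

	#include <errno.h>
	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int set_mem_attr(int vm_fd, __u64 gpa, __u64 size, bool to_private)
	{
		struct kvm_enc_region region = { .addr = gpa, .size = size };
		unsigned long req = to_private ? KVM_MEMORY_ENCRYPT_REG_REGION
					       : KVM_MEMORY_ENCRYPT_UNREG_REGION;
		int r;

		do {
			r = ioctl(vm_fd, req, &region);
			/* On failure the kernel wrote back region.addr/size
			 * as the not-yet-converted remainder, so just retry
			 * with what it handed back. */
		} while (r && errno == ENOMEM && region.size);

		return r;
	}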


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions
  2022-11-16 22:24   ` Sean Christopherson
@ 2022-11-17 13:20     ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-17 13:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Nov 16, 2022 at 10:24:11PM +0000, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > +				     bool is_private)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	int idx;
> > +	int r = 0;
> > +
> > +	if (size == 0 || gpa + size < gpa)
> > +		return -EINVAL;
> > +	if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > +		return -EINVAL;
> > +
> > +	start = gpa >> PAGE_SHIFT;
> > +	end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	/*
> > +	 * Guest memory defaults to private, kvm->mem_attr_array only stores
> > +	 * shared memory.
> > +	 */
> > +	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +
> > +	idx = srcu_read_lock(&kvm->srcu);
> > +	KVM_MMU_LOCK(kvm);
> > +	kvm_mmu_invalidate_begin(kvm, start, end);
> > +
> > +	for (i = start; i < end; i++) {
> > +		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT));
> > +		if (r)
> > +			goto err;
> > +	}
> > +
> > +	kvm_unmap_mem_range(kvm, start, end);
> > +
> > +	goto ret;
> > +err:
> > +	for (; i > start; i--)
> > +		xa_erase(&kvm->mem_attr_array, i);
> 
> I don't think deleting previous entries is correct.  To unwind, the correct thing
> to do is restore the original values.  E.g. if userspace is mapping a large
> range as shared, and some of the previous entries were shared, deleting them would
> incorrectly "convert" those entries to private.

Ah, right!

> 
> Tracking the previous state likely isn't the best approach, e.g. it would require
> speculatively allocating extra memory for a rare condition that is likely going to
> lead to OOM anyways.

Agree.

> 
> Instead of trying to unwind, what about updating the ioctl() params such that
> retrying with the updated addr+size would Just Work?  E.g.

Looks good to me. Thanks!

Chao
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 55b07aae67cc..f1de592a1a06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  
>         kvm_unmap_mem_range(kvm, start, end, attr);
>  
> -       goto ret;
> -err:
> -       for (; i > start; i--)
> -               xa_erase(&kvm->mem_attr_array, i);
> -ret:
>         kvm_mmu_invalidate_end(kvm, start, end);
>         KVM_MMU_UNLOCK(kvm);
>         srcu_read_unlock(&kvm->srcu, idx);
>  
> +       <update gpa and size>
> +
>         return r;
>  }
>  #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> @@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp,
>  
>                 r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
>                                               region.size, set);
> +               if (copy_to_user(argp, &region, sizeof(region)) && !r)
> +                       r = -EFAULT;
>                 break;
>         }
>  #endif


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 7/8] KVM: Handle page fault for private memory
  2022-11-16 22:13     ` Sean Christopherson
@ 2022-11-17 13:25       ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-17 13:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, aarcange, ak, akpm, bfields, bp, corbet,
	dave.hansen, david, ddutile, dhildenb, hpa, hughd, jlayton,
	jmattson, joro, jun.nakajima, kirill.shutemov, kvm, linux-api,
	linux-arch, linux-doc, linux-fsdevel, linux-kernel, linux-mm,
	luto, mail, mhocko, michael.roth, mingo, pbonzini, qemu-devel,
	qperret, rppt, shuah, songmuchun, steven.price, tabba, tglx,
	vannapurve, vbabka, vkuznets, wanpengli, wei.w.wang, x86,
	yu.c.zhang

On Wed, Nov 16, 2022 at 10:13:07PM +0000, Sean Christopherson wrote:
> On Wed, Nov 16, 2022, Ackerley Tng wrote:
> > >@@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > 			return RET_PF_EMULATE;
> > > 	}
> > >
> > >+	if (kvm_slot_can_be_private(slot) &&
> > >+	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > >+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >+		if (fault->is_private)
> > >+			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > >+		else
> > >+			vcpu->run->memory.flags = 0;
> > >+		vcpu->run->memory.padding = 0;
> > >+		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > >+		vcpu->run->memory.size = PAGE_SIZE;
> > >+		return RET_PF_USER;
> > >+	}
> > >+
> > >+	if (fault->is_private)
> > >+		return kvm_faultin_pfn_private(fault);
> > >+
> > 
> > Since each memslot may also not be backed by restricted memory, we
> > should also check if the memslot has been set up for private memory
> > with
> > 
> > 	if (fault->is_private && kvm_slot_can_be_private(slot))
> > 		return kvm_faultin_pfn_private(fault);
> > 
> > Without this check, restrictedmem_get_page will get called with NULL
> > in slot->restricted_file, which causes a NULL pointer dereference.
> 
> Hmm, silently skipping the faultin would result in KVM faulting in the shared
> portion of the memslot, and I believe would end up mapping that pfn as private,
> i.e. would map a non-UPM PFN as a private mapping.  For TDX and SNP, that would
> be double ungood as it would let the host access memory that is mapped private,
> i.e. lead to #MC or #PF(RMP) in the host.

That's correct.

> 
> I believe the correct solution is to drop the "can be private" check from the
> above check, and instead handle that in kvm_faultin_pfn_private().  That would fix
> another bug, e.g. if the fault is shared, the slot can't be private, but for
> whatever reason userspace marked the gfn as private.  Even though KVM might be
> able to service the fault, the correct thing to do in that case is to exit to userspace.

It makes sense to me.

Chao
> 
> E.g.
> 
> ---
>  arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 10017a9f26ee..e2ac8873938e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4158,11 +4158,29 @@ static inline u8 order_to_level(int order)
>  	return PG_LEVEL_4K;
>  }
>  
> -static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> +					struct kvm_page_fault *fault)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	if (fault->is_private)
> +		vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +	else
> +		vcpu->run->memory.flags = 0;
> +	vcpu->run->memory.padding = 0;
> +	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> +	vcpu->run->memory.size = PAGE_SIZE;
> +	return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +				   struct kvm_page_fault *fault)
>  {
>  	int order;
>  	struct kvm_memory_slot *slot = fault->slot;
>  
> +	if (!kvm_slot_can_be_private(slot))
> +		return kvm_do_memory_fault_exit(vcpu, fault);
> +
>  	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
>  		return RET_PF_RETRY;
>  
> @@ -4203,21 +4221,11 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			return RET_PF_EMULATE;
>  	}
>  
> -	if (kvm_slot_can_be_private(slot) &&
> -	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> -		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> -		if (fault->is_private)
> -			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> -		else
> -			vcpu->run->memory.flags = 0;
> -		vcpu->run->memory.padding = 0;
> -		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> -		vcpu->run->memory.size = PAGE_SIZE;
> -		return RET_PF_USER;
> -	}
> +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> +		return kvm_do_memory_fault_exit(vcpu, fault);
>  
>  	if (fault->is_private)
> -		return kvm_faultin_pfn_private(fault);
> +		return kvm_faultin_pfn_private(vcpu, fault);
>  
>  	async = false;
>  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> 
> base-commit: 969d761bb7b8654605937f31ae76123dcb7f15a3
> -- 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-16 18:48     ` Sean Christopherson
@ 2022-11-17 13:42       ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-17 13:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, linux-arch, Linux API, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, the arch/x86 maintainers, H. Peter Anvin,
	Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
	Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, David Hildenbrand,
	aarcange, ddutile, dhildenb, Quentin Perret, Fuad Tabba,
	Michael Roth, Michal Hocko, Muchun Song, Wei W Wang

On Wed, Nov 16, 2022 at 06:48:43PM +0000, Sean Christopherson wrote:
> On Wed, Nov 16, 2022, Andy Lutomirski wrote:
> > 
> > 
> > On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> > > diff --git a/Documentation/virt/kvm/api.rst 
> > > b/Documentation/virt/kvm/api.rst
> > > index f3fa75649a78..975688912b8c 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> > > userspace should update the return
> > >  values of SBI call before resuming the VCPU. For more details on 
> > > RISC-V SBI
> > >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > > 
> > > +::
> > > +
> > > +		/* KVM_EXIT_MEMORY_FAULT */
> > > +		struct {
> > > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> > > +			__u32 flags;
> > > +			__u32 padding;
> > > +			__u64 gpa;
> > > +			__u64 size;
> > > +		} memory;
> > > +
> > 
> > Would it make sense to also have a field for the access type (read, write,
> > execute, etc)?  I realize that shared <-> private conversion doesn't strictly
> > need this, but it seems like it could be useful for logging failures and also
> > for avoiding a second immediate fault if the type gets converted but doesn't
> > have the right protection yet.
> 
> I don't think a separate field is necessary, that info can be conveyed via flags.
> Though maybe we should go straight to a u64 for flags.

Yeah, I can do that.

> Hmm, and maybe avoid bits
> 0-3 so that if/when RWX info is conveyed the flags can align with
> PROT_{READ,WRITE,EXEC} and the EPT flags, e.g.

You mean avoiding bits 0-2, right? Bit 3 is not so special and can be used
for KVM_MEMORY_EXIT_FLAG_PRIVATE.
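
I.e. something like this (illustrative only, nothing below is merged):

	/* flags becomes __u64; bits 0-2 kept free so they can line up with
	 * PROT_{READ,WRITE,EXEC} and the EPT permission bits later on. */
	#define KVM_MEMORY_EXIT_FLAG_READ	(1ULL << 0)
	#define KVM_MEMORY_EXIT_FLAG_WRITE	(1ULL << 1)
	#define KVM_MEMORY_EXIT_FLAG_EXECUTE	(1ULL << 2)
	#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)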

Chao
> 
> 	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
> 	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
> 	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
> 
> > (Obviously, if this were changed, KVM would need the ability to report that
> > it doesn't actually know the mode.)
> > 
> > --Andy


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-16 19:03       ` Alex Bennée
@ 2022-11-17 13:45         ` Chao Peng
  2022-11-17 15:08           ` Alex Bennée
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-17 13:45 UTC (permalink / raw)
  To: Alex Bennée
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Nov 16, 2022 at 07:03:49PM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
> >> 
> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
> >> 
> >> > This new KVM exit allows userspace to handle memory-related errors. It
> >> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> >> > The flags includes additional information for userspace to handle the
> >> > error. Currently bit 0 is defined as 'private memory' where '1'
> >> > indicates error happens due to private memory access and '0' indicates
> >> > error happens due to shared memory access.
> >> >
> >> > When private memory is enabled, this new exit will be used for KVM to
> >> > exit to userspace for shared <-> private memory conversion in memory
> >> > encryption usage. In such usage, typically there are two kind of memory
> >> > conversions:
> >> >   - explicit conversion: happens when guest explicitly calls into KVM
> >> >     to map a range (as private or shared), KVM then exits to userspace
> >> >     to perform the map/unmap operations.
> >> >   - implicit conversion: happens in KVM page fault handler where KVM
> >> >     exits to userspace for an implicit conversion when the page is in a
> >> >     different state than requested (private or shared).
> >> >
> >> > Suggested-by: Sean Christopherson <seanjc@google.com>
> >> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> > ---
> >> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >> >  2 files changed, 32 insertions(+)
> >> >
> >> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> > index f3fa75649a78..975688912b8c 100644
> >> > --- a/Documentation/virt/kvm/api.rst
> >> > +++ b/Documentation/virt/kvm/api.rst
> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >> >  
> >> > +::
> >> > +
> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> > +		struct {
> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> > +			__u32 flags;
> >> > +			__u32 padding;
> >> > +			__u64 gpa;
> >> > +			__u64 size;
> >> > +		} memory;
> >> > +
> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> >> > +encountered a memory error which is not handled by KVM kernel module and
> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
> >> > +properties of the exit.
> >> > +
> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> >> > +   private memory access when the bit is set. Otherwise the memory error is
> >> > +   caused by shared memory access when the bit is clear.
> >> 
> >> What does a shared memory access failure entail?
> >
> > In the context of confidential computing usages, guest can issue a
> > shared memory access while the memory is actually private from the host
> > point of view. This exit with bit 0 cleared gives userspace a chance to
> > convert the private memory to shared memory on host.
> 
> I think this should be explicit rather than implied by the absence of
> another flag. Sean suggested you might want flags for RWX failures so
> maybe something like:
> 
> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)

Yes, but I would not add 'SHARED' to RWX; they are not shared-memory
specific, and private memory can also set them once introduced.

Thanks,
Chao
> 
> which would allow you to signal the various failure modes of the shared
> region, or that you had accessed private memory.
> 
> >
> >> 
> >> If you envision any other failure modes it might be worth making it
> >> explicit with additional flags.
> >
> > Sean mentioned some more usages[1][]2] other than the memory conversion
> > for confidential usage. But I would leave those flags being added in the
> > future after those usages being well discussed.
> >
> > [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> > [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
> >
> >> I also wonder if a bitmask makes sense if
> >> there can only be one reason for a failure? Maybe all that is needed is
> >> a reason enum?
> >
> > Tough we only have one reason right now but we still want to leave room
> > for future extension. Enum can express a single value at once well but
> > bitmask makes it possible to express multiple orthogonal flags.
> 
> I agree if multiple orthogonal failures can occur at once a bitmask is
> the right choice.
> 
> >
> > Chao
> >> 
> >> > +
> >> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> >> > +may handle the error and return to KVM to retry the previous memory access.
> >> > +
> >> >  ::
> >> >  
> >> >      /* KVM_EXIT_NOTIFY */
> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> > index f1ae45c10c94..fa60b032a405 100644
> >> > --- a/include/uapi/linux/kvm.h
> >> > +++ b/include/uapi/linux/kvm.h
> >> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >> >  #define KVM_EXIT_RISCV_SBI        35
> >> >  #define KVM_EXIT_RISCV_CSR        36
> >> >  #define KVM_EXIT_NOTIFY           37
> >> > +#define KVM_EXIT_MEMORY_FAULT     38
> >> >  
> >> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >> >  /* Emulate instruction failed. */
> >> > @@ -538,6 +539,14 @@ struct kvm_run {
> >> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
> >> >  			__u32 flags;
> >> >  		} notify;
> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> > +		struct {
> >> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> > +			__u32 flags;
> >> > +			__u32 padding;
> >> > +			__u64 gpa;
> >> > +			__u64 size;
> >> > +		} memory;
> >> >  		/* Fix the size of the union. */
> >> >  		char padding[256];
> >> >  	};
> >> 
> >> 
> >> -- 
> >> Alex Bennée
> 
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
  2022-11-16  9:40     ` Alex Bennée
@ 2022-11-17 14:16       ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-17 14:16 UTC (permalink / raw)
  To: Alex Bennée
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, Viresh Kumar,
	Mathieu Poirier, AKASHI Takahiro

On Wed, Nov 16, 2022 at 09:40:23AM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
> >> 
> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
> >> 
> >> <snip>
> >> > Introduction
> >> > ============
> >> > KVM userspace being able to crash the host is horrible. Under current
> >> > KVM architecture, all guest memory is inherently accessible from KVM
> >> > userspace and is exposed to the mentioned crash issue. The goal of this
> >> > series is to provide a solution to align mm and KVM, on a userspace
> >> > inaccessible approach of exposing guest memory. 
> >> >
> >> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> >> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> >> > table). This requires guest memory being mmaped into KVM userspace, but
> >> > this is also the source where the mentioned crash issue can happen. In
> >> > theory, apart from those 'shared' memory for device emulation etc, guest
> >> > memory doesn't have to be mmaped into KVM userspace.
> >> >
> >> > This series introduces fd-based guest memory which will not be mmaped
> >> > into KVM userspace. KVM populates secondary page table by using a
> >> > fd/offset pair backed by a memory file system. The fd can be created
> >> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> >> > directly interact with them with newly introduced in-kernel interface,
> >> > therefore remove the KVM userspace from the path of accessing/mmaping
> >> > the guest memory. 
> >> >
> >> > Kirill had a patch [2] to address the same issue in a different way. It
> >> > tracks guest encrypted memory at the 'struct page' level and relies on
> >> > HWPOISON to reject the userspace access. The patch has been discussed in
> >> > several online and offline threads and resulted in a design document [3]
> >> > which is also the original proposal for this series. Later this patch
> >> > series evolved as more comments received in community but the major
> >> > concepts in [3] still hold true so recommend reading.
> >> >
> >> > The patch series may also be useful for other usages, for example, pure
> >> > software approach may use it to harden itself against unintentional
> >> > access to guest memory. This series is designed with these usages in
> >> > mind but doesn't have code directly support them and extension might be
> >> > needed.
> >> 
> >> There are a couple of additional use cases where having a consistent
> >> memory interface with the kernel would be useful.
> >
> > Thanks very much for the info. But I'm not so confident that the current
> > memfd_restricted() implementation can be useful for all these usages. 
> >
> >> 
> >>   - Xen DomU guests providing other domains with VirtIO backends
> >> 
> >>   Xen by default doesn't give other domains special access to a domains
> >>   memory. The guest can grant access to regions of its memory to other
> >>   domains for this purpose. 
> >
> > I'm trying to form my understanding on how this could work and what's
> > the benefit for a DomU guest to provide memory through memfd_restricted().
> > AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> > but I assume VirtIO backends are still in DomU userspace and need to access
> > that memory, right?
> 
> They need access to parts of the memory. At the moment you run your
> VirtIO domains in the Dom0 and give them access to the whole of a DomU's
> address space - however in the Xen model the guest's memory is by default
> inaccessible to other domains on the system. The DomU guest uses the Xen
> grant model to expose portions of its address space to other domains -
> namely for the VirtIO queues themselves and any pages containing buffers
> involved in the VirtIO transaction. My thought was that looks like a
> guest memory interface which is mostly inaccessible (private) with some
> holes in it where memory is being explicitly shared with other domains.

Yes, similar in conception. For KVM, memfd_restricted() is used by the host
OS, and the guest issues conversions between private and shared for its
memory ranges. This is similar to a Xen DomU guest granting its memory to
other domains. Similarly, I guess that to make memfd_restricted() really
useful for Xen, it should run in the VirtIO backend domain (i.e. the
equivalent of the host position for KVM).

> 
> What I want to achieve is a common userspace API with defined semantics
> for what happens when private and shared regions are accessed. Because
> having each hypervisor/confidential computing architecture define its
> own special API for accessing this memory is just a recipe for
> fragmentation and makes sharing common VirtIO backends impossible.

Yes, I agree. That's interesting to explore.

> 
> >
> >> 
> >>   - pKVM on ARM
> >> 
> >>   Similar to Xen, pKVM moves the management of the page tables into the
> >>   hypervisor and again doesn't allow those domains to share memory by
> >>   default.
> >
> > Right, we already had some discussions on this in the past versions.
> >
> >> 
> >>   - VirtIO loopback
> >> 
> >>   This allows for VirtIO devices for the host kernel to be serviced by
> >>   backends running in userspace. Obviously the memory userspace is
> >>   allowed to access is strictly limited to the buffers and queues
> >>   because giving userspace unrestricted access to the host kernel would
> >>   have consequences.
> >
> > Okay, but normal memfd_create() should work for it, right? And
> > memfd_restricted() instead may not work as it unmaps the memory from
> > userspace.
> >
> >> 
> >> All of these VirtIO backends work with vhost-user which uses memfds to
> >> pass references to guest memory from the VMM to the backend
> >> implementation.
> >
> > Sounds to me these are the places where normal memfd_create() can act on.
> > VirtIO backends work on the mmap-ed memory which currently is not the
> > case for memfd_restricted(). memfd_restricted() has different design
> > purpose that unmaps the memory from userspace and employs some kernel
> > callbacks so other kernel modules can make use of the memory with these
> > callbacks instead of userspace virtual address.
> 
> Maybe my understanding is backwards then. Are you saying a guest starts
> with all its memory exposed and then selectively unmaps the private
> regions? Is this driven by the VMM or the guest itself?

For confidential computing usages, the guest normally starts with all guest
memory being private, i.e. it cannot be accessed by the host. The memory
lives in memfd_restricted() memory and is not exposed to the host userspace
VMM like QEMU. The guest can then selectively map its private sub-regions
(e.g. the VirtIO queues in the guest VirtIO frontend driver) as shared so
the host backend driver in QEMU can see them. When this happens, a new
shared mapping is established in KVM and the new memory is provided from
normal mmap-able memory; QEMU can then do whatever it needs to do for
the device emulation.
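
A rough sketch of the QEMU side of such a shared conversion, using the
interfaces from this series (fragment only; gpa_to_restricted_offset() is a
made-up helper that maps the gpa to the offset in the memslot's restricted
fd):

	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT &&
	    !(run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)) {
		struct kvm_enc_region r = {
			.addr = run->memory.gpa,
			.size = run->memory.size,
		};

		/* Tell KVM the range is now shared ... */
		ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &r);
		/* ... and drop the private backing so the pages are freed. */
		fallocate(restricted_fd,
			  FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  gpa_to_restricted_offset(r.addr), r.size);
		/* Re-enter the guest; the access is retried against the
		 * normal mmap-able (shared) memory for this range. */
	}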

Thanks,
Chao
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-17 13:45         ` Chao Peng
@ 2022-11-17 15:08           ` Alex Bennée
  2022-11-18  1:32             ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Alex Bennée @ 2022-11-17 15:08 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang


Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Wed, Nov 16, 2022 at 07:03:49PM +0000, Alex Bennée wrote:
>> 
>> Chao Peng <chao.p.peng@linux.intel.com> writes:
>> 
>> > On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
>> >> 
>> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
>> >> 
>> >> > This new KVM exit allows userspace to handle memory-related errors. It
>> >> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
>> >> > The flags includes additional information for userspace to handle the
>> >> > error. Currently bit 0 is defined as 'private memory' where '1'
>> >> > indicates error happens due to private memory access and '0' indicates
>> >> > error happens due to shared memory access.
>> >> >
>> >> > When private memory is enabled, this new exit will be used for KVM to
>> >> > exit to userspace for shared <-> private memory conversion in memory
>> >> > encryption usage. In such usage, typically there are two kind of memory
>> >> > conversions:
>> >> >   - explicit conversion: happens when guest explicitly calls into KVM
>> >> >     to map a range (as private or shared), KVM then exits to userspace
>> >> >     to perform the map/unmap operations.
>> >> >   - implicit conversion: happens in KVM page fault handler where KVM
>> >> >     exits to userspace for an implicit conversion when the page is in a
>> >> >     different state than requested (private or shared).
>> >> >
>> >> > Suggested-by: Sean Christopherson <seanjc@google.com>
>> >> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> >> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> >> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> >> > ---
>> >> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>> >> >  include/uapi/linux/kvm.h       |  9 +++++++++
>> >> >  2 files changed, 32 insertions(+)
>> >> >
>> >> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> >> > index f3fa75649a78..975688912b8c 100644
>> >> > --- a/Documentation/virt/kvm/api.rst
>> >> > +++ b/Documentation/virt/kvm/api.rst
>> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>> >> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
>> >> >  
>> >> > +::
>> >> > +
>> >> > +		/* KVM_EXIT_MEMORY_FAULT */
>> >> > +		struct {
>> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> >> > +			__u32 flags;
>> >> > +			__u32 padding;
>> >> > +			__u64 gpa;
>> >> > +			__u64 size;
>> >> > +		} memory;
>> >> > +
>> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
>> >> > +encountered a memory error which is not handled by KVM kernel module and
>> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
>> >> > +properties of the exit.
>> >> > +
>> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
>> >> > +   private memory access when the bit is set. Otherwise the memory error is
>> >> > +   caused by shared memory access when the bit is clear.
>> >> 
>> >> What does a shared memory access failure entail?
>> >
>> > In the context of confidential computing usages, guest can issue a
>> > shared memory access while the memory is actually private from the host
>> > point of view. This exit with bit 0 cleared gives userspace a chance to
>> > convert the private memory to shared memory on host.
>> 
>> I think this should be explicit rather than implied by the absence of
>> another flag. Sean suggested you might want flags for RWX failures so
>> maybe something like:
>> 
>> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
>> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
>> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
>>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
>
> > Yes, but I would not add 'SHARED' to RWX, they are not shared memory
> specific, private memory can also set them once introduced.

OK so how about:

 	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
 	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
 	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
        KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
        KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)

>
> Thanks,
> Chao
>> 
>> which would allow you to signal the various failure modes of the shared
>> region, or that you had accessed private memory.
>> 
>> >
>> >> 
>> >> If you envision any other failure modes it might be worth making it
>> >> explicit with additional flags.
>> >
>> > Sean mentioned some more usages[1][2] other than the memory conversion
>> > for confidential usage. But I would leave those flags being added in the
>> > future after those usages being well discussed.
>> >
>> > [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
>> > [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
>> >
>> >> I also wonder if a bitmask makes sense if
>> >> there can only be one reason for a failure? Maybe all that is needed is
>> >> a reason enum?
>> >
>> > Though we only have one reason right now, we still want to leave room
>> > for future extension. Enum can express a single value at once well but
>> > bitmask makes it possible to express multiple orthogonal flags.
>> 
>> I agree if multiple orthogonal failures can occur at once a bitmask is
>> the right choice.
>> 
>> >
>> > Chao
>> >> 
>> >> > +
>> >> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
>> >> > +may handle the error and return to KVM to retry the previous memory access.
>> >> > +
>> >> >  ::
>> >> >  
>> >> >      /* KVM_EXIT_NOTIFY */
>> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> >> > index f1ae45c10c94..fa60b032a405 100644
>> >> > --- a/include/uapi/linux/kvm.h
>> >> > +++ b/include/uapi/linux/kvm.h
>> >> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
>> >> >  #define KVM_EXIT_RISCV_SBI        35
>> >> >  #define KVM_EXIT_RISCV_CSR        36
>> >> >  #define KVM_EXIT_NOTIFY           37
>> >> > +#define KVM_EXIT_MEMORY_FAULT     38
>> >> >  
>> >> >  /* For KVM_EXIT_INTERNAL_ERROR */
>> >> >  /* Emulate instruction failed. */
>> >> > @@ -538,6 +539,14 @@ struct kvm_run {
>> >> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>> >> >  			__u32 flags;
>> >> >  		} notify;
>> >> > +		/* KVM_EXIT_MEMORY_FAULT */
>> >> > +		struct {
>> >> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> >> > +			__u32 flags;
>> >> > +			__u32 padding;
>> >> > +			__u64 gpa;
>> >> > +			__u64 size;
>> >> > +		} memory;
>> >> >  		/* Fix the size of the union. */
>> >> >  		char padding[256];
>> >> >  	};
>> >> 
>> >> 
>> >> -- 
>> >> Alex Bennée
>> 
>> 
>> -- 
>> Alex Bennée


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-17 15:08           ` Alex Bennée
@ 2022-11-18  1:32             ` Chao Peng
  2022-11-18 13:23               ` Alex Bennée
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-18  1:32 UTC (permalink / raw)
  To: Alex Bennée
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Wed, Nov 16, 2022 at 07:03:49PM +0000, Alex Bennée wrote:
> >> 
> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
> >> 
> >> > On Tue, Nov 15, 2022 at 04:56:12PM +0000, Alex Bennée wrote:
> >> >> 
> >> >> Chao Peng <chao.p.peng@linux.intel.com> writes:
> >> >> 
> >> >> > This new KVM exit allows userspace to handle memory-related errors. It
> >> >> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> >> >> > The flags includes additional information for userspace to handle the
> >> >> > error. Currently bit 0 is defined as 'private memory' where '1'
> >> >> > indicates error happens due to private memory access and '0' indicates
> >> >> > error happens due to shared memory access.
> >> >> >
> >> >> > When private memory is enabled, this new exit will be used for KVM to
> >> >> > exit to userspace for shared <-> private memory conversion in memory
> >> >> > encryption usage. In such usage, typically there are two kind of memory
> >> >> > conversions:
> >> >> >   - explicit conversion: happens when guest explicitly calls into KVM
> >> >> >     to map a range (as private or shared), KVM then exits to userspace
> >> >> >     to perform the map/unmap operations.
> >> >> >   - implicit conversion: happens in KVM page fault handler where KVM
> >> >> >     exits to userspace for an implicit conversion when the page is in a
> >> >> >     different state than requested (private or shared).
> >> >> >
> >> >> > Suggested-by: Sean Christopherson <seanjc@google.com>
> >> >> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> >> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> >> >> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> >> > ---
> >> >> >  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> >> >> >  include/uapi/linux/kvm.h       |  9 +++++++++
> >> >> >  2 files changed, 32 insertions(+)
> >> >> >
> >> >> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> >> > index f3fa75649a78..975688912b8c 100644
> >> >> > --- a/Documentation/virt/kvm/api.rst
> >> >> > +++ b/Documentation/virt/kvm/api.rst
> >> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
> >> >> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >> >> >  
> >> >> > +::
> >> >> > +
> >> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> >> > +		struct {
> >> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> >> > +			__u32 flags;
> >> >> > +			__u32 padding;
> >> >> > +			__u64 gpa;
> >> >> > +			__u64 size;
> >> >> > +		} memory;
> >> >> > +
> >> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> >> >> > +encountered a memory error which is not handled by KVM kernel module and
> >> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
> >> >> > +properties of the exit.
> >> >> > +
> >> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> >> >> > +   private memory access when the bit is set. Otherwise the memory error is
> >> >> > +   caused by shared memory access when the bit is clear.
> >> >> 
> >> >> What does a shared memory access failure entail?
> >> >
> >> > In the context of confidential computing usages, guest can issue a
> >> > shared memory access while the memory is actually private from the host
> >> > point of view. This exit with bit 0 cleared gives userspace a chance to
> >> > convert the private memory to shared memory on host.
> >> 
> >> I think this should be explicit rather than implied by the absence of
> >> another flag. Sean suggested you might want flags for RWX failures so
> >> maybe something like:
> >> 
> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
> >
> > > Yes, but I would not add 'SHARED' to RWX, they are not shared memory
> > specific, private memory can also set them once introduced.
> 
> OK so how about:
> 
>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)

We don't actually need a new bit, the opposite side of private is
shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
'shared'.

Chao
> 
> >
> > Thanks,
> > Chao
> >> 
> >> which would allow you to signal the various failure modes of the shared
> >> region, or that you had accessed private memory.
> >> 
> >> >
> >> >> 
> >> >> If you envision any other failure modes it might be worth making it
> >> >> explicit with additional flags.
> >> >
> >> > Sean mentioned some more usages[1][2] other than the memory conversion
> >> > for confidential usage. But I would leave those flags being added in the
> >> > future after those usages being well discussed.
> >> >
> >> > [1] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com
> >> > [2] https://lore.kernel.org/all/YKxJLcg%2FWomPE422@google.com
> >> >
> >> >> I also wonder if a bitmask makes sense if
> >> >> there can only be one reason for a failure? Maybe all that is needed is
> >> >> a reason enum?
> >> >
> >> > Though we only have one reason right now, we still want to leave room
> >> > for future extension. Enum can express a single value at once well but
> >> > bitmask makes it possible to express multiple orthogonal flags.
> >> 
> >> I agree if multiple orthogonal failures can occur at once a bitmask is
> >> the right choice.
> >> 
> >> >
> >> > Chao
> >> >> 
> >> >> > +
> >> >> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> >> >> > +may handle the error and return to KVM to retry the previous memory access.
> >> >> > +
> >> >> >  ::
> >> >> >  
> >> >> >      /* KVM_EXIT_NOTIFY */
> >> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> >> > index f1ae45c10c94..fa60b032a405 100644
> >> >> > --- a/include/uapi/linux/kvm.h
> >> >> > +++ b/include/uapi/linux/kvm.h
> >> >> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >> >> >  #define KVM_EXIT_RISCV_SBI        35
> >> >> >  #define KVM_EXIT_RISCV_CSR        36
> >> >> >  #define KVM_EXIT_NOTIFY           37
> >> >> > +#define KVM_EXIT_MEMORY_FAULT     38
> >> >> >  
> >> >> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >> >> >  /* Emulate instruction failed. */
> >> >> > @@ -538,6 +539,14 @@ struct kvm_run {
> >> >> >  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
> >> >> >  			__u32 flags;
> >> >> >  		} notify;
> >> >> > +		/* KVM_EXIT_MEMORY_FAULT */
> >> >> > +		struct {
> >> >> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> >> >> > +			__u32 flags;
> >> >> > +			__u32 padding;
> >> >> > +			__u64 gpa;
> >> >> > +			__u64 size;
> >> >> > +		} memory;
> >> >> >  		/* Fix the size of the union. */
> >> >> >  		char padding[256];
> >> >> >  	};
> >> >> 
> >> >> 
> >> >> -- 
> >> >> Alex Bennée
> >> 
> >> 
> >> -- 
> >> Alex Bennée
> 
> 
> -- 
> Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-18  1:32             ` Chao Peng
@ 2022-11-18 13:23               ` Alex Bennée
  2022-11-18 15:59                 ` Sean Christopherson
  0 siblings, 1 reply; 101+ messages in thread
From: Alex Bennée @ 2022-11-18 13:23 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang


Chao Peng <chao.p.peng@linux.intel.com> writes:

> On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Bennée wrote:
>> 
<snip>
>> >> >> > +
>> >> >> > +		/* KVM_EXIT_MEMORY_FAULT */
>> >> >> > +		struct {
>> >> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
>> >> >> > +			__u32 flags;
>> >> >> > +			__u32 padding;
>> >> >> > +			__u64 gpa;
>> >> >> > +			__u64 size;
>> >> >> > +		} memory;
>> >> >> > +
>> >> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
>> >> >> > +encountered a memory error which is not handled by KVM kernel module and
>> >> >> > +userspace may choose to handle it. The 'flags' field indicates the memory
>> >> >> > +properties of the exit.
>> >> >> > +
>> >> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
>> >> >> > +   private memory access when the bit is set. Otherwise the memory error is
>> >> >> > +   caused by shared memory access when the bit is clear.
>> >> >> 
>> >> >> What does a shared memory access failure entail?
>> >> >
>> >> > In the context of confidential computing usages, guest can issue a
>> >> > shared memory access while the memory is actually private from the host
>> >> > point of view. This exit with bit 0 cleared gives userspace a chance to
>> >> > convert the private memory to shared memory on host.
>> >> 
>> >> I think this should be explicit rather than implied by the absence of
>> >> another flag. Sean suggested you might want flags for RWX failures so
>> >> maybe something like:
>> >> 
>> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
>> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
>> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
>> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
>> >
>> > Yes, but I would not add 'SHARED' to RWX, they are not shared memory
>> > specific, private memory can also set them once introduced.
>> 
>> OK so how about:
>> 
>>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
>>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
>>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
>>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
>>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)
>
> We don't actually need a new bit, the opposite side of private is
> shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> 'shared'.

If that is always true and we never expect a 3rd type of memory that is
fine. But given we are leaving room for expansion having an explicit bit
allows for that as well as making cases of forgetting to set the flags
more obvious.

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-18 13:23               ` Alex Bennée
@ 2022-11-18 15:59                 ` Sean Christopherson
  2022-11-22  9:50                   ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Sean Christopherson @ 2022-11-18 15:59 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Nov 18, 2022, Alex Bennée wrote:
> 
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Bennée wrote:
> >> >> I think this should be explicit rather than implied by the absence of
> >> >> another flag. Sean suggested you might want flags for RWX failures so
> >> >> maybe something like:
> >> >> 
> >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
> >> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
> >> >
> >> > Yes, but I would not add 'SHARED' to RWX, they are not shared memory
> >> > specific, private memory can also set them once introduced.
> >> 
> >> OK so how about:
> >> 
> >>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
> >>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
> >>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
> >>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)
> >
> > We don't actually need a new bit, the opposite side of private is
> > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > 'shared'.
> 
> If that is always true and we never expect a 3rd type of memory that is
> fine. But given we are leaving room for expansion having an explicit bit
> allows for that as well as making cases of forgetting to set the flags
> more obvious.

Hrm, I'm on the fence.

A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types in
this category, the baseline could always be "private".

I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
to implement and maintain, but eventually we'll end up with flags that are paired with
an implicit state, e.g. see the many #PF error codes in x86.  In other words,
inevitably KVM will need to define the default/base state of the access, at which
point the base state for SHARED vs. PRIVATE is "undefined".  

The RWX bits are in the same boat, e.g. the READ flag isn't strictly necessary.
I was thinking more of the KVM_SET_MEMORY_ATTRIBUTES ioctl(), which does need
the full RWX gamut, when I typed out that response.

So I would say if we add an explicit READ flag, then we might as well add an explicit
PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-18 15:59                 ` Sean Christopherson
@ 2022-11-22  9:50                   ` Chao Peng
  2022-11-23 18:02                     ` Sean Christopherson
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-22  9:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Alex Bennée, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Nov 18, 2022 at 03:59:12PM +0000, Sean Christopherson wrote:
> On Fri, Nov 18, 2022, Alex Bennée wrote:
> > 
> > Chao Peng <chao.p.peng@linux.intel.com> writes:
> > 
> > > On Thu, Nov 17, 2022 at 03:08:17PM +0000, Alex Benn?e wrote:
> > >> >> I think this should be explicit rather than implied by the absence of
> > >> >> another flag. Sean suggested you might want flags for RWX failures so
> > >> >> maybe something like:
> > >> >> 
> > >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_READ	(1 << 0)
> > >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_WRITE	(1 << 1)
> > >> >> 	KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE	(1 << 2)
> > >> >>         KVM_MEMORY_EXIT_FLAG_PRIVATE            (1 << 3)
> > >> >
> > >> > Yes, but I would not add 'SHARED' to RWX, they are not shared memory
> > >> > specific, private memory can also set them once introduced.
> > >> 
> > >> OK so how about:
> > >> 
> > >>  	KVM_MEMORY_EXIT_FLAG_READ	(1 << 0)
> > >>  	KVM_MEMORY_EXIT_FLAG_WRITE	(1 << 1)
> > >>  	KVM_MEMORY_EXIT_FLAG_EXECUTE	(1 << 2)
> > >>         KVM_MEMORY_EXIT_FLAG_SHARED     (1 << 3)
> > >>         KVM_MEMORY_EXIT_FLAG_PRIVATE    (1 << 4)
> > >
> > > We don't actually need a new bit, the opposite side of private is
> > > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > > 'shared'.
> > 
> > If that is always true and we never expect a 3rd type of memory that is
> > fine. But given we are leaving room for expansion having an explicit bit
> > allows for that as well as making cases of forgetting to set the flags
> > more obvious.
> 
> Hrm, I'm on the fence.
> 
> A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types in
> this category, the baseline could always be "private".

The baseline for the current code is actually "shared".

> 
> I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
> > to implement and maintain, but eventually we'll end up with flags that are paired with
> an implicit state, e.g. see the many #PF error codes in x86.  In other words,
> inevitably KVM will need to define the default/base state of the access, at which
> point the base state for SHARED vs. PRIVATE is "undefined".  

Current memory conversion for confidential usage is bi-directional, so we
already need both the private and shared states. If we use one bit for
both "shared" and "private" then we have to define the default state;
e.g. currently the default state is "shared" when we define

	KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)

> 
> The RWX bits are in the same boat, e.g. the READ flag isn't strictly necessary.
> I was thinking more of the KVM_SET_MEMORY_ATTRIBUTES ioctl(), which does need
> the full RWX gamut, when I typed out that response.

For KVM_SET_MEMORY_ATTRIBUTES it's reasonable to add RWX bits and match
them to the permission bits defined in the EPT entry.

> 
> So I would say if we add an explicit READ flag, then we might as well add an explicit
> PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.

Since we assume the default state is shared, we actually only need a
PRIVATE flag, i.e. there is no SHARED flag, and we will ignore the RWX bits for now.

Chao


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-11-22  9:50                   ` Chao Peng
@ 2022-11-23 18:02                     ` Sean Christopherson
  0 siblings, 0 replies; 101+ messages in thread
From: Sean Christopherson @ 2022-11-23 18:02 UTC (permalink / raw)
  To: Chao Peng
  Cc: Alex Bennée, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Nov 22, 2022, Chao Peng wrote:
> On Fri, Nov 18, 2022 at 03:59:12PM +0000, Sean Christopherson wrote:
> > On Fri, Nov 18, 2022, Alex Bennée wrote:
> > > > We don't actually need a new bit, the opposite side of private is
> > > > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > > > 'shared'.
> > > 
> > > If that is always true and we never expect a 3rd type of memory that is
> > > fine. But given we are leaving room for expansion having an explicit bit
> > > allows for that as well as making cases of forgetting to set the flags
> > > more obvious.
> > 
> > Hrm, I'm on the fence.
> > 
> > A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types in
> > this category, the baseline could always be "private".
> 
> The baseline for the current code is actually "shared".

Ah, right, the baseline needs to be "shared" so that legacy code doesn't end up
with impossible states.

> > I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
> > to implement and maintain, but eventually we'll end up with flags that are paired with
> > an implicit state, e.g. see the many #PF error codes in x86.  In other words,
> > inevitably KVM will need to define the default/base state of the access, at which
> > point the base state for SHARED vs. PRIVATE is "undefined".  
> 
> Current memory conversion for confidential usage is bi-directional, so we
> already need both the private and shared states. If we use one bit for
> both "shared" and "private" then we have to define the default state;
> e.g. currently the default state is "shared" when we define
> 
> 	KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)

...

> > So I would say if we add an explicit READ flag, then we might as well add an explicit
> > PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.
> 
> Since we assume the default state is shared, we actually only need a
> PRIVATE flag, i.e. there is no SHARED flag, and we will ignore the RWX bits for now.

Yeah, I'm leaning towards "shared" being the implied default state.  Ditto for
"read" if/when we need to communicate write/execute information.  E.g. for VMs
that don't support guest private memory, the "shared" flag is in some ways
nonsensical.  Worst case scenario, e.g. if we end up with variations of "shared",
we'll need something like KVM_MEMORY_EXIT_FLAG_SHARED_RESTRICTIVE or whatever,
but the basic "shared" default will still work.
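
For what it's worth, the practical difference for userspace boils down to
something like the below (illustration only; KVM_MEMORY_EXIT_FLAG_SHARED is
the hypothetical explicit flag from earlier in the thread, not something the
series defines):

	/* implied baseline: no PRIVATE bit means the access was shared */
	bool is_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

	/* explicit scheme: both states are spelled out, so a zero flags
	 * field can be caught as "KVM forgot to set a state bit" */
	if (!(run->memory.flags &
	      (KVM_MEMORY_EXIT_FLAG_SHARED | KVM_MEMORY_EXIT_FLAG_PRIVATE)))
		return -EINVAL;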


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                     ` (2 preceding siblings ...)
  2022-10-31 17:47   ` Michael Roth
@ 2022-11-29  0:06   ` Michael Roth
  2022-11-29 11:21     ` Kirill A. Shutemov
  2022-11-29  0:37   ` Michael Roth
  2022-12-02  2:16   ` Vishal Annapurve
  5 siblings, 1 reply; 101+ messages in thread
From: Michael Roth @ 2022-11-29  0:06 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 

<snip>

> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +	struct restrictedmem_data *data;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	struct file *file;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return ERR_PTR(-ENOMEM);
> +
> +	data->memfd = memfd;
> +	mutex_init(&data->lock);
> +	INIT_LIST_HEAD(&data->notifiers);
> +
> +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +	if (IS_ERR(inode)) {
> +		kfree(data);
> +		return ERR_CAST(inode);
> +	}
> +
> +	inode->i_mode |= S_IFREG;
> +	inode->i_op = &restrictedmem_iops;
> +	inode->i_mapping->private_data = data;
> +
> +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +				 "restrictedmem", O_RDWR,
> +				 &restrictedmem_fops);
> +	if (IS_ERR(file)) {
> +		iput(inode);
> +		kfree(data);
> +		return ERR_CAST(file);
> +	}
> +
> +	file->f_flags |= O_LARGEFILE;
> +
> +	mapping = memfd->f_mapping;
> +	mapping_set_unevictable(mapping);
> +	mapping_set_gfp_mask(mapping,
> +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);

Is this supposed to prevent migration of pages being used for
restrictedmem/shmem backend?

In my case I've been testing SNP support based on UPM v9, and for
large guests (128GB+), if I force 2M THPs via:

  echo always >/sys/kernel/mm/transparent_hugepage/shmem_enabled

it will in some cases trigger the below trace, which suggests that
kcompactd is trying to call migrate_folio() on a PFN that was/is
still allocated for guest private memory (and so has been removed from
the directmap as part of shared->private conversion via the REG_REGION kvm
ioctl, leading to the crash). This trace seems to occur during early
OVMF boot while the guest is in the middle of pre-accepting private
memory (no lazy accept in this case).

Is this expected behavior? What else needs to be done to ensure
migrations aren't attempted in this case?

Thanks!

-Mike


# Host logs with debug info for crash during SNP boot

...
[  904.373632] kvm_restricted_mem_get_pfn: GFN: 0x1caced1, PFN: 0x156b7f, page: ffffea0006b197b0, ref_count: 2
[  904.373634] kvm_restricted_mem_get_pfn: GFN: 0x1caced2, PFN: 0x156840, page: ffffea0006b09400, ref_count: 2
[  904.373637] kvm_restricted_mem_get_pfn: GFN: 0x1caced3, PFN: 0x156841, page: ffffea0006b09450, ref_count: 2
[  904.373639] kvm_restricted_mem_get_pfn: GFN: 0x1caced4, PFN: 0x156842, page: ffffea0006b094a0, ref_count: 2
[  904.373641] kvm_restricted_mem_get_pfn: GFN: 0x1caced5, PFN: 0x156843, page: ffffea0006b094f0, ref_count: 2
[  904.373645] kvm_restricted_mem_get_pfn: GFN: 0x1caced6, PFN: 0x156844, page: ffffea0006b09540, ref_count: 2
[  904.373647] kvm_restricted_mem_get_pfn: GFN: 0x1caced7, PFN: 0x156845, page: ffffea0006b09590, ref_count: 2
[  904.373649] kvm_restricted_mem_get_pfn: GFN: 0x1caced8, PFN: 0x156846, page: ffffea0006b095e0, ref_count: 2
[  904.373652] kvm_restricted_mem_get_pfn: GFN: 0x1caced9, PFN: 0x156847, page: ffffea0006b09630, ref_count: 2
[  904.373654] kvm_restricted_mem_get_pfn: GFN: 0x1caceda, PFN: 0x156848, page: ffffea0006b09680, ref_count: 2
[  904.373656] kvm_restricted_mem_get_pfn: GFN: 0x1cacedb, PFN: 0x156849, page: ffffea0006b096d0, ref_count: 2
[  904.373661] kvm_restricted_mem_get_pfn: GFN: 0x1cacedc, PFN: 0x15684a, page: ffffea0006b09720, ref_count: 2
[  904.373663] kvm_restricted_mem_get_pfn: GFN: 0x1cacedd, PFN: 0x15684b, page: ffffea0006b09770, ref_count: 2

# PFN 0x15684c is allocated for guest private memory, will have been removed from directmap as part of RMP requirements

[  904.373665] kvm_restricted_mem_get_pfn: GFN: 0x1cacede, PFN: 0x15684c, page: ffffea0006b097c0, ref_count: 2
...

# kcompactd crashes trying to copy PFN 0x15684c to a new folio, crashes trying to access PFN via directmap

[  904.470135] Migrating restricted page, SRC pfn: 0x15684c, folio_ref_count: 2, folio_order: 0
[  904.470154] BUG: unable to handle page fault for address: ffff88815684c000
[  904.470314] kvm_restricted_mem_get_pfn: GFN: 0x1cafe00, PFN: 0x19f6d0, page: ffffea00081d2100, ref_count: 2
[  904.477828] #PF: supervisor read access in kernel mode
[  904.477831] #PF: error_code(0x0000) - not-present page
[  904.477833] PGD 6601067 P4D 6601067 PUD 1569ad063 PMD 1569af063 PTE 800ffffea97b3060
[  904.508806] Oops: 0000 [#1] SMP NOPTI
[  904.512892] CPU: 52 PID: 1563 Comm: kcompactd0 Tainted: G            E      6.0.0-rc7-hsnp-v7pfdv9d+ #10
[  904.523473] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1006B 08/20/2021
[  904.532499] RIP: 0010:copy_page+0x7/0x10
[  904.536877] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[  904.557831] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[  904.563661] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[  904.571622] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[  904.579581] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[  904.587541] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[  904.595502] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[  904.603462] FS:  0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[  904.612489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  904.618897] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[  904.626855] PKRU: 55555554
[  904.629870] Call Trace:
[  904.632594]  <TASK>
[  904.634928]  ? folio_copy+0x8c/0xe0
[  904.638818]  migrate_folio+0x5b/0x110
[  904.642901]  move_to_new_folio+0x5b/0x150
[  904.647371]  migrate_pages+0x11bb/0x1830
[  904.651743]  ? move_freelist_tail+0xc0/0xc0
[  904.656406]  ? isolate_freepages_block+0x470/0x470
[  904.661749]  compact_zone+0x681/0xda0
[  904.665832]  kcompactd_do_work+0x1b3/0x2c0
[  904.670400]  kcompactd+0x257/0x330
[  904.674190]  ? prepare_to_wait_event+0x120/0x120
[  904.679338]  ? kcompactd_do_work+0x2c0/0x2c0
[  904.684098]  kthread+0xcf/0xf0
[  904.687501]  ? kthread_complete_and_exit+0x20/0x20
[  904.692844]  ret_from_fork+0x22/0x30
[  904.696830]  </TASK>
[  904.699262] Modules linked in: nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) ip6table_mangle(E) ip6table_nat(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) bpfilter(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) bridge(E) stp(E) llc(E) kvm_amd(E) overlay(E) nls_iso8859_1(E) kvm(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) ipmi_si(E) ipmi_devintf(E) wmi_bmof(E) ipmi_msghandler(E) efi_pstore(E) binfmt_misc(E) ast(E) drm_vram_helper(E) joydev(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) input_leds(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ccp(E) k10temp(E) mac_hid(E) sch_fq_codel(E) parport_pc(E) ppdev(E) lp(E) parport(E) drm(E) ip_tables(E)
[  904.699316]  x_tables(E) autofs4(E) btrfs(E) blake2b_generic(E) zstd_compress(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) multipath(E) linear(E) crc32_pclmul(E) hid_generic(E) usbhid(E) hid(E) e1000e(E) i2c_piix4(E) wmi(E)
[  904.828498] CR2: ffff88815684c000
[  904.832193] ---[ end trace 0000000000000000 ]---
[  904.937159] RIP: 0010:copy_page+0x7/0x10
[  904.941524] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[  904.962478] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[  904.968305] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[  904.976265] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[  904.984227] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[  904.992187] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[  905.000145] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[  905.008105] FS:  0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[  905.017132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  905.023540] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[  905.031501] PKRU: 55555554
[  905.034558] kvm_restricted_mem_get_pfn: GFN: 0x1cafe01, PFN: 0x19f6d1, page: ffffea00081d2150, ref_count: 2
[  905.045455] kvm_restricted_mem_get_pfn: GFN: 0x1cafe02, PFN: 0x19f6d2, page: ffffea00081d21a0, ref_count: 2
...


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                     ` (3 preceding siblings ...)
  2022-11-29  0:06   ` Michael Roth
@ 2022-11-29  0:37   ` Michael Roth
  2022-11-29 14:06     ` Chao Peng
  2022-11-29 18:01     ` Vishal Annapurve
  2022-12-02  2:16   ` Vishal Annapurve
  5 siblings, 2 replies; 101+ messages in thread
From: Michael Roth @ 2022-11-29  0:37 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
> 
> memfd_restricted() is useful for scenarios where a file descriptor(fd)
> can be used as an interface into mm but want to restrict userspace's
> ability on the fd. Initially it is designed to provide protections for
> KVM encrypted guest memory.
> 
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with special key for special software domain (e.g. KVM guest) and is not
> expected to be directly accessed by userspace. Precisely, userspace
> access to such encrypted memory may lead to host crash so should be
> prevented.
> 
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memory content into KVM userspace.
> 
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
> 
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier, it then gets chance to remove any
> mapped entries of the range in the secondary page tables.
> 
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
> 
> The system call is currently wired up for x86 arch.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448	i386	process_mrelease	sys_process_mrelease
>  449	i386	futex_waitv		sys_futex_waitv
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	i386	memfd_restricted	sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	memfd_restricted	sys_memfd_restricted
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +				 pgoff_t start, pgoff_t end);
> +	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +			       pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +	struct list_head list;
> +	const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +					 struct page **pagep, int *order)
> +{
> +	return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>  
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
>  
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>  
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>  
> +config RESTRICTEDMEM
> +	bool
> +	depends on TMPFS
> +
>  config ANON_VMA_NAME
>  	bool "Anonymous VMA name support"
>  	depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +	struct mutex lock;
> +	struct file *memfd;
> +	struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +				 pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);
> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		if (notify_start)
> +			notifier->ops->invalidate_start(notifier, start, end);
> +		else
> +			notifier->ops->invalidate_end(notifier, start, end);
> +	}
> +	mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +	fput(data->memfd);
> +	kfree(data);
> +	return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +				    loff_t offset, loff_t len)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +			return -EINVAL;
> +	}
> +
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);

The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
loff_t. For SNP we've made this change as part of the following patch
and it seems to produce the expected behavior:

  https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
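
In effect that converts the byte range into page offsets before notifying,
something along these lines (a sketch of the idea, not the exact patch):

	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + len) >> PAGE_SHIFT;

	restrictedmem_notifier_invalidate(data, start, end, true);
	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
	restrictedmem_notifier_invalidate(data, start, end, false);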

> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +	return ret;
> +}
> +

<snip>

> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);

This will result in KVM allocating pages that userspace hasn't necessarily
fallocate()'d. In the case of SNP we need to get the PFN so we can clean
up the RMP entries when restrictedmem invalidations are issued for a GFN
range.

If the guest supports lazy-acceptance however, these pages may not have
been faulted in yet, and if the VMM defers actually fallocate()'ing space
until the guest actually tries to issue a shared->private conversion for that GFN
(to support lazy-pinning), then there may never be a need to allocate
pages for these backends.

However, the restrictedmem invalidations are for GFN ranges so there's
no way to know in advance whether it's been allocated yet or not. The
xarray is one option but currently it defaults to 'private' so that
doesn't help us here. It might if we introduced an 'uninitialized' state
or something along those lines instead of just the binary
'shared'/'private' though...

But for now we added a restrictedmem_get_page_noalloc() that uses
SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
of memory as part of guest shutdown, and a
kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
default, and we just propagate an error to userspace if they didn't
fallocate() in advance?
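
For reference, a minimal sketch of what such a variant could look like,
mirroring restrictedmem_get_page() above but using shmem's SGP_NOALLOC so a
hole makes the lookup fail instead of allocating backing pages (the name and
exact error handling are illustrative only, not what we actually carry in
our tree):

int restrictedmem_get_page_noalloc(struct file *file, pgoff_t offset,
				   struct page **pagep, int *order)
{
	struct restrictedmem_data *data = file->f_mapping->private_data;
	struct file *memfd = data->memfd;
	struct page *page;
	int ret;

	/* look the page up, but fail rather than allocate on a hole */
	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_NOALLOC);
	if (ret)
		return ret;

	*pagep = page;
	if (order)
		*order = thp_order(compound_head(page));

	SetPageUptodate(page);
	unlock_page(page);
	return 0;
}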

-Mike

> +	if (ret)
> +		return ret;
> +
> +	*pagep = page;
> +	if (order)
> +		*order = thp_order(compound_head(page));
> +
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29  0:06   ` Michael Roth
@ 2022-11-29 11:21     ` Kirill A. Shutemov
  2022-11-29 11:39       ` David Hildenbrand
  2022-11-29 13:58       ` Chao Peng
  0 siblings, 2 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2022-11-29 11:21 UTC (permalink / raw)
  To: Michael Roth
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> 
> <snip>
> 
> > +static struct file *restrictedmem_file_create(struct file *memfd)
> > +{
> > +	struct restrictedmem_data *data;
> > +	struct address_space *mapping;
> > +	struct inode *inode;
> > +	struct file *file;
> > +
> > +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +	if (!data)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	data->memfd = memfd;
> > +	mutex_init(&data->lock);
> > +	INIT_LIST_HEAD(&data->notifiers);
> > +
> > +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > +	if (IS_ERR(inode)) {
> > +		kfree(data);
> > +		return ERR_CAST(inode);
> > +	}
> > +
> > +	inode->i_mode |= S_IFREG;
> > +	inode->i_op = &restrictedmem_iops;
> > +	inode->i_mapping->private_data = data;
> > +
> > +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > +				 "restrictedmem", O_RDWR,
> > +				 &restrictedmem_fops);
> > +	if (IS_ERR(file)) {
> > +		iput(inode);
> > +		kfree(data);
> > +		return ERR_CAST(file);
> > +	}
> > +
> > +	file->f_flags |= O_LARGEFILE;
> > +
> > +	mapping = memfd->f_mapping;
> > +	mapping_set_unevictable(mapping);
> > +	mapping_set_gfp_mask(mapping,
> > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> 
> Is this supposed to prevent migration of pages being used for
> restrictedmem/shmem backend?

Yes, my bad. I expected it to prevent migration, but it is not true.

Looks like we need to bump the refcount in restrictedmem_get_page() and drop
it again once KVM no longer uses the page.

Chao, could you adjust it?
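
The idea would be to hold an extra reference for as long as KVM maps the
page, so compaction/migration sees an unexpected refcount and leaves the
folio alone. Roughly (sketch only, restrictedmem_put_page() is a made-up
name for the release side):

	/* in restrictedmem_get_page(), once shmem_getpage() succeeded */
	get_page(page);		/* held until KVM unmaps/invalidates the pfn */

	/* and a counterpart for KVM to drop the pin when it is done */
	void restrictedmem_put_page(struct page *page)
	{
		put_page(page);
	}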

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29 11:21     ` Kirill A. Shutemov
@ 2022-11-29 11:39       ` David Hildenbrand
  2022-11-29 13:59         ` Chao Peng
  2022-11-29 13:58       ` Chao Peng
  1 sibling, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2022-11-29 11:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, Michael Roth
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On 29.11.22 12:21, Kirill A. Shutemov wrote:
> On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
>> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>
>> <snip>
>>
>>> +static struct file *restrictedmem_file_create(struct file *memfd)
>>> +{
>>> +	struct restrictedmem_data *data;
>>> +	struct address_space *mapping;
>>> +	struct inode *inode;
>>> +	struct file *file;
>>> +
>>> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
>>> +	if (!data)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	data->memfd = memfd;
>>> +	mutex_init(&data->lock);
>>> +	INIT_LIST_HEAD(&data->notifiers);
>>> +
>>> +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
>>> +	if (IS_ERR(inode)) {
>>> +		kfree(data);
>>> +		return ERR_CAST(inode);
>>> +	}
>>> +
>>> +	inode->i_mode |= S_IFREG;
>>> +	inode->i_op = &restrictedmem_iops;
>>> +	inode->i_mapping->private_data = data;
>>> +
>>> +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
>>> +				 "restrictedmem", O_RDWR,
>>> +				 &restrictedmem_fops);
>>> +	if (IS_ERR(file)) {
>>> +		iput(inode);
>>> +		kfree(data);
>>> +		return ERR_CAST(file);
>>> +	}
>>> +
>>> +	file->f_flags |= O_LARGEFILE;
>>> +
>>> +	mapping = memfd->f_mapping;
>>> +	mapping_set_unevictable(mapping);
>>> +	mapping_set_gfp_mask(mapping,
>>> +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
>>
>> Is this supposed to prevent migration of pages being used for
>> restrictedmem/shmem backend?
> 
> Yes, my bad. I expected it to prevent migration, but it is not true.

Maybe add a comment that these pages are not movable and we don't want 
to place them into movable pageblocks (including CMA and ZONE_MOVABLE). 
That's the primary purpose of the GFP mask here.
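
Something along those lines could read as below (wording illustrative
only, not from a posted patch):

	/*
	 * These pages are unmovable: don't place them into movable
	 * pageblocks, including CMA and ZONE_MOVABLE. That is the primary
	 * purpose of clearing __GFP_MOVABLE here; it does not prevent
	 * migration by itself.
	 */
	mapping_set_gfp_mask(mapping,
			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);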

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29 11:21     ` Kirill A. Shutemov
  2022-11-29 11:39       ` David Hildenbrand
@ 2022-11-29 13:58       ` Chao Peng
  1 sibling, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-29 13:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Michael Roth, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Tue, Nov 29, 2022 at 02:21:39PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > 
> > <snip>
> > 
> > > +static struct file *restrictedmem_file_create(struct file *memfd)
> > > +{
> > > +	struct restrictedmem_data *data;
> > > +	struct address_space *mapping;
> > > +	struct inode *inode;
> > > +	struct file *file;
> > > +
> > > +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > > +	if (!data)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	data->memfd = memfd;
> > > +	mutex_init(&data->lock);
> > > +	INIT_LIST_HEAD(&data->notifiers);
> > > +
> > > +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > > +	if (IS_ERR(inode)) {
> > > +		kfree(data);
> > > +		return ERR_CAST(inode);
> > > +	}
> > > +
> > > +	inode->i_mode |= S_IFREG;
> > > +	inode->i_op = &restrictedmem_iops;
> > > +	inode->i_mapping->private_data = data;
> > > +
> > > +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > > +				 "restrictedmem", O_RDWR,
> > > +				 &restrictedmem_fops);
> > > +	if (IS_ERR(file)) {
> > > +		iput(inode);
> > > +		kfree(data);
> > > +		return ERR_CAST(file);
> > > +	}
> > > +
> > > +	file->f_flags |= O_LARGEFILE;
> > > +
> > > +	mapping = memfd->f_mapping;
> > > +	mapping_set_unevictable(mapping);
> > > +	mapping_set_gfp_mask(mapping,
> > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > 
> > Is this supposed to prevent migration of pages being used for
> > restrictedmem/shmem backend?
> 
> Yes, my bad. I expected it to prevent migration, but it is not true.
> 
> Looks like we need to bump refcount in restrictedmem_get_page() and reduce
> it back when KVM is no longer use it.

restrictedmem_get_page() has already taken a reference, but KVM later
drops it with put_page() (via kvm_release_pfn_clean()) after populating
the secondary page table entry. One option would be to let the user
feature (e.g. TDX/SEV) do get_page()/put_page() while populating the
secondary page table entry; AFAICS this requirement also comes from
those features.
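
A rough sketch of that option, with hypothetical helper names used
purely for illustration; the feature code would hold its own reference
for as long as the PFN is mapped in the secondary page table:

static void feature_map_private_page(struct page *page)
{
	/* Hold a reference for the lifetime of the secondary page table
	 * entry, so the page stays pinned even after KVM's generic code
	 * calls kvm_release_pfn_clean(). */
	get_page(page);
	/* ... install the secondary page table entry ... */
}

static void feature_unmap_private_page(struct page *page)
{
	/* ... zap the secondary page table entry ... */
	put_page(page);
}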

Chao
> 
> Chao, could you adjust it?
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29 11:39       ` David Hildenbrand
@ 2022-11-29 13:59         ` Chao Peng
  0 siblings, 0 replies; 101+ messages in thread
From: Chao Peng @ 2022-11-29 13:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Michael Roth, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, tabba, mhocko,
	Muchun Song, wei.w.wang

On Tue, Nov 29, 2022 at 12:39:06PM +0100, David Hildenbrand wrote:
> On 29.11.22 12:21, Kirill A. Shutemov wrote:
> > On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > 
> > > 
> > > <snip>
> > > 
> > > > +static struct file *restrictedmem_file_create(struct file *memfd)
> > > > +{
> > > > +	struct restrictedmem_data *data;
> > > > +	struct address_space *mapping;
> > > > +	struct inode *inode;
> > > > +	struct file *file;
> > > > +
> > > > +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > > > +	if (!data)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	data->memfd = memfd;
> > > > +	mutex_init(&data->lock);
> > > > +	INIT_LIST_HEAD(&data->notifiers);
> > > > +
> > > > +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > > > +	if (IS_ERR(inode)) {
> > > > +		kfree(data);
> > > > +		return ERR_CAST(inode);
> > > > +	}
> > > > +
> > > > +	inode->i_mode |= S_IFREG;
> > > > +	inode->i_op = &restrictedmem_iops;
> > > > +	inode->i_mapping->private_data = data;
> > > > +
> > > > +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > > > +				 "restrictedmem", O_RDWR,
> > > > +				 &restrictedmem_fops);
> > > > +	if (IS_ERR(file)) {
> > > > +		iput(inode);
> > > > +		kfree(data);
> > > > +		return ERR_CAST(file);
> > > > +	}
> > > > +
> > > > +	file->f_flags |= O_LARGEFILE;
> > > > +
> > > > +	mapping = memfd->f_mapping;
> > > > +	mapping_set_unevictable(mapping);
> > > > +	mapping_set_gfp_mask(mapping,
> > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > 
> > > Is this supposed to prevent migration of pages being used for
> > > restrictedmem/shmem backend?
> > 
> > Yes, my bad. I expected it to prevent migration, but it is not true.
> 
> Maybe add a comment that these pages are not movable and we don't want to
> place them into movable pageblocks (including CMA and ZONE_MOVABLE). That's
> the primary purpose of the GFP mask here.

Yes I can do that.

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29  0:37   ` Michael Roth
@ 2022-11-29 14:06     ` Chao Peng
  2022-11-29 19:06       ` Michael Roth
  2022-11-29 18:01     ` Vishal Annapurve
  1 sibling, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-29 14:06 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
...
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +				    loff_t offset, loff_t len)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	int ret;
> > +
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +			return -EINVAL;
> > +	}
> > +
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> 
> The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> loff_t. For SNP we've made this strange as part of the following patch
> and it seems to produce the expected behavior:

That's correct. Thanks.
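
For illustration, one way the byte range could be converted before
notifying (a sketch only, not the actual patch):

	/* The notifier ops take pgoff_t while fallocate() works in bytes;
	 * convert once and pass page offsets to both callbacks. */
	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + len) >> PAGE_SHIFT;

	restrictedmem_notifier_invalidate(data, start, end, true);
	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
	restrictedmem_notifier_invalidate(data, start, end, false);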

> 
>   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> 
> > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > +	return ret;
> > +}
> > +
> 
> <snip>
> 
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +			   struct page **pagep, int *order)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	struct page *page;
> > +	int ret;
> > +
> > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> 
> This will result in KVM allocating pages that userspace hasn't necessary
> fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> up the RMP entries when restrictedmem invalidations are issued for a GFN
> range.

Yes, fallocate() is unnecessary unless someone wants to reserve some
space (e.g. for determinism or performance purposes); this matches its
semantics at:
https://www.man7.org/linux/man-pages/man2/fallocate.2.html

> 
> If the guest supports lazy-acceptance however, these pages may not have
> been faulted in yet, and if the VMM defers actually fallocate()'ing space
> until the guest actually tries to issue a shared->private for that GFN
> (to support lazy-pinning), then there may never be a need to allocate
> pages for these backends.
> 
> However, the restrictedmem invalidations are for GFN ranges so there's
> no way to know inadvance whether it's been allocated yet or not. The
> xarray is one option but currently it defaults to 'private' so that
> doesn't help us here. It might if we introduced a 'uninitialized' state
> or something along that line instead of just the binary
> 'shared'/'private' though...

How about if we change the default to 'shared' as we discussed at
https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
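
For illustration, with 'shared' as the default the lookup could treat a
missing entry as "never converted"; the xarray parameter and flag below
are placeholders, not necessarily the names used in this series:

#define MEM_ATTR_PRIVATE	0x1UL	/* placeholder attribute flag */

static bool mem_attr_is_private(struct xarray *attr_array, pgoff_t index)
{
	void *entry = xa_load(attr_array, index);

	/* No entry means the range was never converted, i.e. it is still
	 * shared, so e.g. an invalidation can skip it without allocating. */
	return entry && (xa_to_value(entry) & MEM_ATTR_PRIVATE);
}
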
> 
> But for now we added a restrictedmem_get_page_noalloc() that uses
> SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> of memory as part of guest shutdown, and a
> kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> default, and we just propagate an error to userspace if they didn't
> fallocate() in advance?

This (making fallocate() a hard requirement) not only complicates
userspace but also forces lazy-faulting to go through a long path of
exiting to userspace. Unless we have no other option, I would not go
this way.

Chao
> 
> -Mike
> 
> > +	if (ret)
> > +		return ret;
> > +
> > +	*pagep = page;
> > +	if (order)
> > +		*order = thp_order(compound_head(page));
> > +
> > +	SetPageUptodate(page);
> > +	unlock_page(page);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > -- 
> > 2.25.1
> > 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29  0:37   ` Michael Roth
  2022-11-29 14:06     ` Chao Peng
@ 2022-11-29 18:01     ` Vishal Annapurve
  1 sibling, 0 replies; 101+ messages in thread
From: Vishal Annapurve @ 2022-11-29 18:01 UTC (permalink / raw)
  To: Michael Roth
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, tabba, mhocko, Muchun Song, wei.w.wang

On Mon, Nov 28, 2022 at 4:37 PM Michael Roth <michael.roth@amd.com> wrote:
>
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through a new in-kernel interface by a third kernel module.
> >
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> >
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> >
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memory content into KVM userspace.
> >
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> >
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When these operations happen, KVM can get notified
> > through restrictedmem_notifier, it then gets chance to remove any
> > mapped entries of the range in the secondary page tables.
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> >
> > The system call is currently wired up for x86 arch.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h          |  62 ++++++
> >  include/linux/syscalls.h               |   1 +
> >  include/uapi/asm-generic/unistd.h      |   5 +-
> >  include/uapi/linux/magic.h             |   1 +
> >  kernel/sys_ni.c                        |   3 +
> >  mm/Kconfig                             |   4 +
> >  mm/Makefile                            |   1 +
> >  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
> >  10 files changed, 328 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448  i386    process_mrelease        sys_process_mrelease
> >  449  i386    futex_waitv             sys_futex_waitv
> >  450  i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> > +451  i386    memfd_restricted        sys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448  common  process_mrelease        sys_process_mrelease
> >  449  common  futex_waitv             sys_futex_waitv
> >  450  common  set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451  common  memfd_restricted        sys_memfd_restricted
> >
> >  #
> >  # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..9c37c3ea3180
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> > +
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> > +
> > +struct restrictedmem_notifier;
> > +
> > +struct restrictedmem_notifier_ops {
> > +     void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> > +                              pgoff_t start, pgoff_t end);
> > +     void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> > +                            pgoff_t start, pgoff_t end);
> > +};
> > +
> > +struct restrictedmem_notifier {
> > +     struct list_head list;
> > +     const struct restrictedmem_notifier_ops *ops;
> > +};
> > +
> > +#ifdef CONFIG_RESTRICTEDMEM
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                  struct restrictedmem_notifier *notifier);
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                        struct page **pagep, int *order);
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +     return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> > +}
> > +
> > +#else
> > +
> > +static inline void restrictedmem_register_notifier(struct file *file,
> > +                                  struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline void restrictedmem_unregister_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                                      struct page **pagep, int *order)
> > +{
> > +     return -1;
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +     return false;
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index a34b0f9a9972..f9e9e0c820c5 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> >  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> >                                           unsigned long home_node,
> >                                           unsigned long flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags);
> >
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 45fa180cc56a..e93cd35e46d0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> >  #define __NR_set_mempolicy_home_node 450
> >  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> >
> > +#define __NR_memfd_restricted 451
> > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> > +
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 451
> > +#define __NR_syscalls 452
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..8aa38324b90a 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >  #define DMA_BUF_MAGIC                0x444d4142      /* "DMAB" */
> >  #define DEVMEM_MAGIC         0x454d444d      /* "DMEM" */
> >  #define SECRETMEM_MAGIC              0x5345434d      /* "SECM" */
> > +#define RESTRICTEDMEM_MAGIC  0x5245534d      /* "RESM" */
> >
> >  #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 860b2dcf3ac4..7c4a32cbd2e7 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> >  /* memfd_secret */
> >  COND_SYSCALL(memfd_secret);
> >
> > +/* memfd_restricted */
> > +COND_SYSCALL(memfd_restricted);
> > +
> >  /*
> >   * Architecture specific weak syscall entries.
> >   */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 0331f1461f81..0177d53676c7 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1076,6 +1076,10 @@ config IO_MAPPING
> >  config SECRETMEM
> >       def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> >
> > +config RESTRICTEDMEM
> > +     bool
> > +     depends on TMPFS
> > +
> >  config ANON_VMA_NAME
> >       bool "Anonymous VMA name support"
> >       depends on PROC_FS && ADVISE_SYSCALLS && MMU
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 9a564f836403..6cb6403ffd40 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> >  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >  obj-$(CONFIG_SECRETMEM) += secretmem.o
> > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> >  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> >  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..e5bf8907e0f8
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,250 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> > +     struct mutex lock;
> > +     struct file *memfd;
> > +     struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> > +                              pgoff_t start, pgoff_t end, bool notify_start)
> > +{
> > +     struct restrictedmem_notifier *notifier;
> > +
> > +     mutex_lock(&data->lock);
> > +     list_for_each_entry(notifier, &data->notifiers, list) {
> > +             if (notify_start)
> > +                     notifier->ops->invalidate_start(notifier, start, end);
> > +             else
> > +                     notifier->ops->invalidate_end(notifier, start, end);
> > +     }
> > +     mutex_unlock(&data->lock);
> > +}
> > +
> > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > +{
> > +     struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +
> > +     fput(data->memfd);
> > +     kfree(data);
> > +     return 0;
> > +}
> > +
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +                                 loff_t offset, loff_t len)
> > +{
> > +     struct restrictedmem_data *data = file->f_mapping->private_data;
> > +     struct file *memfd = data->memfd;
> > +     int ret;
> > +
> > +     if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +             if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +                     return -EINVAL;
> > +     }
> > +
> > +     restrictedmem_notifier_invalidate(data, offset, offset + len, true);
>
> The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> loff_t. For SNP we've made this strange as part of the following patch
> and it seems to produce the expected behavior:
>
>   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
>
> > +     ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +     restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > +     return ret;
> > +}
> > +
>
> <snip>
>
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                        struct page **pagep, int *order)
> > +{
> > +     struct restrictedmem_data *data = file->f_mapping->private_data;
> > +     struct file *memfd = data->memfd;
> > +     struct page *page;
> > +     int ret;
> > +
> > +     ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>
> This will result in KVM allocating pages that userspace hasn't necessary
> fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> up the RMP entries when restrictedmem invalidations are issued for a GFN
> range.
>
> If the guest supports lazy-acceptance however, these pages may not have
> been faulted in yet, and if the VMM defers actually fallocate()'ing space
> until the guest actually tries to issue a shared->private for that GFN
> (to support lazy-pinning), then there may never be a need to allocate
> pages for these backends.
>
> However, the restrictedmem invalidations are for GFN ranges so there's
> no way to know inadvance whether it's been allocated yet or not. The
> xarray is one option but currently it defaults to 'private' so that
> doesn't help us here. It might if we introduced a 'uninitialized' state
> or something along that line instead of just the binary
> 'shared'/'private' though...
>
> But for now we added a restrictedmem_get_page_noalloc() that uses
> SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> of memory as part of guest shutdown, and a
> kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> default, and we just propagate an error to userspace if they didn't
> fallocate() in advance?
>

One caveat with SGP_NOALLOC being the default: for performance reasons
(to avoid frequent userspace exits), the VMM would have to always
preallocate all the guest restricted memory. In general this will
prevent the VMM from overcommitting memory.
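
Purely as a userspace illustration (not from this series' selftests), a
VMM forced to preallocate would end up doing something like the sketch
below for the whole guest range, where __NR_memfd_restricted is the
number wired up by this series on x86:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451
#endif

static int prealloc_restricted_memory(size_t guest_size)
{
	int fd = syscall(__NR_memfd_restricted, 0);

	if (fd < 0)
		return -1;

	/* Back the entire guest range up front; this is exactly the
	 * allocation pattern that defeats overcommit. */
	if (fallocate(fd, 0, 0, guest_size) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}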


> -Mike
>
> > +     if (ret)
> > +             return ret;
> > +
> > +     *pagep = page;
> > +     if (order)
> > +             *order = thp_order(compound_head(page));
> > +
> > +     SetPageUptodate(page);
> > +     unlock_page(page);
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > --
> > 2.25.1
> >


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29 14:06     ` Chao Peng
@ 2022-11-29 19:06       ` Michael Roth
  2022-11-29 19:18         ` Michael Roth
  0 siblings, 1 reply; 101+ messages in thread
From: Michael Roth @ 2022-11-29 19:06 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> ...
> > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > +				    loff_t offset, loff_t len)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	int ret;
> > > +
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > 
> > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > loff_t. For SNP we've made this strange as part of the following patch
> > and it seems to produce the expected behavior:
> 
> That's correct. Thanks.
> 
> > 
> >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > 
> > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > +	return ret;
> > > +}
> > > +
> > 
> > <snip>
> > 
> > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > +			   struct page **pagep, int *order)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	struct page *page;
> > > +	int ret;
> > > +
> > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > 
> > This will result in KVM allocating pages that userspace hasn't necessary
> > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > range.
> 
> Yes fallocate() is unnecessary unless someone wants to reserve some
> space (e.g. for determination or performance purpose), this matches its
> semantics perfectly at:
> https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> 
> > 
> > If the guest supports lazy-acceptance however, these pages may not have
> > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > until the guest actually tries to issue a shared->private for that GFN
> > (to support lazy-pinning), then there may never be a need to allocate
> > pages for these backends.
> > 
> > However, the restrictedmem invalidations are for GFN ranges so there's
> > no way to know inadvance whether it's been allocated yet or not. The
> > xarray is one option but currently it defaults to 'private' so that
> > doesn't help us here. It might if we introduced a 'uninitialized' state
> > or something along that line instead of just the binary
> > 'shared'/'private' though...
> 
> How about if we change the default to 'shared' as we discussed at
> https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?

Need to look at this a bit more, but I think that could work as well.

> > 
> > But for now we added a restrictedmem_get_page_noalloc() that uses
> > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > of memory as part of guest shutdown, and a
> > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > default, and we just propagate an error to userspace if they didn't
> > fallocate() in advance?
> 
> This (making fallocate() a hard requirement) not only complicates the
> userspace but also forces the lazy-faulting going through a long path of
> exiting to userspace. Unless we don't have other options I would not go
> this way.

Unless I'm missing something, it's already the case that userspace is
responsible for handling all the shared->private transitions in response
to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
places the additional requirement on the VMM that if it *doesn't*
preallocate, then it will need to issue the fallocate() prior to issuing
the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.
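
For illustration, the ordering described above could look roughly like
this on the VMM side (a sketch only, not QEMU code; the registration
step is left as a comment since the exact ioctl plumbing is still under
discussion):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

static int convert_to_private(int restricted_fd, off_t offset, off_t len)
{
	/* Back only the range being converted, so nothing is allocated
	 * for memory the guest never touches. */
	if (fallocate(restricted_fd, 0, offset, len) < 0)
		return -errno;

	/* ... then issue KVM_MEM_ENCRYPT_REG_REGION (or the equivalent
	 * conversion ioctl) for the corresponding GFN range. */
	return 0;
}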

QEMU for example already has a separate 'prealloc' option for cases
where they want to prefault all the guest memory, so it makes sense to
continue making that an optional thing with regard to UPM.

-Mike

> 
> Chao
> > 
> > -Mike
> > 
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	*pagep = page;
> > > +	if (order)
> > > +		*order = thp_order(compound_head(page));
> > > +
> > > +	SetPageUptodate(page);
> > > +	unlock_page(page);
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > -- 
> > > 2.25.1
> > > 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29 19:06       ` Michael Roth
@ 2022-11-29 19:18         ` Michael Roth
  2022-11-30  9:39           ` Chao Peng
  0 siblings, 1 reply; 101+ messages in thread
From: Michael Roth @ 2022-11-29 19:18 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > ...
> > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > +				    loff_t offset, loff_t len)
> > > > +{
> > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	int ret;
> > > > +
> > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > 
> > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > loff_t. For SNP we've made this strange as part of the following patch
> > > and it seems to produce the expected behavior:
> > 
> > That's correct. Thanks.
> > 
> > > 
> > >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > > 
> > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > > +	return ret;
> > > > +}
> > > > +
> > > 
> > > <snip>
> > > 
> > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > +			   struct page **pagep, int *order)
> > > > +{
> > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	struct page *page;
> > > > +	int ret;
> > > > +
> > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > 
> > > This will result in KVM allocating pages that userspace hasn't necessary
> > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > range.
> > 
> > Yes fallocate() is unnecessary unless someone wants to reserve some
> > space (e.g. for determination or performance purpose), this matches its
> > semantics perfectly at:
> > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> > 
> > > 
> > > If the guest supports lazy-acceptance however, these pages may not have
> > > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > > until the guest actually tries to issue a shared->private for that GFN
> > > (to support lazy-pinning), then there may never be a need to allocate
> > > pages for these backends.
> > > 
> > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > no way to know inadvance whether it's been allocated yet or not. The
> > > xarray is one option but currently it defaults to 'private' so that
> > > doesn't help us here. It might if we introduced a 'uninitialized' state
> > > or something along that line instead of just the binary
> > > 'shared'/'private' though...
> > 
> > How about if we change the default to 'shared' as we discussed at
> > https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
> 
> Need to look at this a bit more, but I think that could work as well.
> 
> > > 
> > > But for now we added a restrictedmem_get_page_noalloc() that uses
> > > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > > of memory as part of guest shutdown, and a
> > > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > > default, and we just propagate an error to userspace if they didn't
> > > fallocate() in advance?
> > 
> > This (making fallocate() a hard requirement) not only complicates the
> > userspace but also forces the lazy-faulting going through a long path of
> > exiting to userspace. Unless we don't have other options I would not go
> > this way.
> 
> Unless I'm missing something, it's already the case that userspace is
> responsible for handling all the shared->private transitions in response
> to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
> places the additional requirements on the VMM that if they *don't*
> preallocate, then they'll need to issue the fallocate() prior to issuing
> the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.
> 
> QEMU for example already has a separate 'prealloc' option for cases
> where they want to prefault all the guest memory, so it makes sense to
> continue making that an optional thing with regard to UPM.

Although I guess what you're suggesting doesn't stop userspace from
deciding whether it wants to prefault or not. I know the Google folks
had some concerns over unexpected allocations causing 2x memory usage,
though, so giving userspace full control of what is/isn't allocated in
the restrictedmem backend seems to make it easier to guard against this.
But I think checking the xarray and defaulting to 'shared' would work
for us if that's the direction we end up going.

-Mike

> 
> -Mike
> 
> > 
> > Chao
> > > 
> > > -Mike
> > > 
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	*pagep = page;
> > > > +	if (order)
> > > > +		*order = thp_order(compound_head(page));
> > > > +
> > > > +	SetPageUptodate(page);
> > > > +	unlock_page(page);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > > -- 
> > > > 2.25.1
> > > > 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-29 19:18         ` Michael Roth
@ 2022-11-30  9:39           ` Chao Peng
  2022-11-30 14:31             ` Michael Roth
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-11-30  9:39 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Tue, Nov 29, 2022 at 01:18:15PM -0600, Michael Roth wrote:
> On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> > On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > ...
> > > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > > +				    loff_t offset, loff_t len)
> > > > > +{
> > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > +	struct file *memfd = data->memfd;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > +			return -EINVAL;
> > > > > +	}
> > > > > +
> > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > > 
> > > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > > loff_t. For SNP we've made this strange as part of the following patch
> > > > and it seems to produce the expected behavior:
> > > 
> > > That's correct. Thanks.
> > > 
> > > > 
> > > >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > > > 
> > > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > 
> > > > <snip>
> > > > 
> > > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > > +			   struct page **pagep, int *order)
> > > > > +{
> > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > +	struct file *memfd = data->memfd;
> > > > > +	struct page *page;
> > > > > +	int ret;
> > > > > +
> > > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > 
> > > > This will result in KVM allocating pages that userspace hasn't necessary
> > > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > > range.
> > > 
> > > Yes fallocate() is unnecessary unless someone wants to reserve some
> > > space (e.g. for determination or performance purpose), this matches its
> > > semantics perfectly at:
> > > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> > > 
> > > > 
> > > > If the guest supports lazy-acceptance however, these pages may not have
> > > > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > > > until the guest actually tries to issue a shared->private for that GFN
> > > > (to support lazy-pinning), then there may never be a need to allocate
> > > > pages for these backends.
> > > > 
> > > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > > no way to know inadvance whether it's been allocated yet or not. The
> > > > xarray is one option but currently it defaults to 'private' so that
> > > > doesn't help us here. It might if we introduced a 'uninitialized' state
> > > > or something along that line instead of just the binary
> > > > 'shared'/'private' though...
> > > 
> > > How about if we change the default to 'shared' as we discussed at
> > > https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
> > 
> > Need to look at this a bit more, but I think that could work as well.
> > 
> > > > 
> > > > But for now we added a restrictedmem_get_page_noalloc() that uses
> > > > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > > > of memory as part of guest shutdown, and a
> > > > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > > > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > > > default, and we just propagate an error to userspace if they didn't
> > > > fallocate() in advance?
> > > 
> > > This (making fallocate() a hard requirement) not only complicates the
> > > userspace but also forces the lazy-faulting going through a long path of
> > > exiting to userspace. Unless we don't have other options I would not go
> > > this way.
> > 
> > Unless I'm missing something, it's already the case that userspace is
> > responsible for handling all the shared->private transitions in response
> > to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
> > places the additional requirements on the VMM that if they *don't*
> > preallocate, then they'll need to issue the fallocate() prior to issuing
> > the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.

Preallocation and memory conversion between shared<->private are two
different things. No doubt fallocate() and conversion can be called
together in response to KVM_EXIT_MEMORY_FAULT, but they don't have to be
paired. And the fallocate() does not have to operate on the same memory
range as the memory conversion does.

> > 
> > QEMU for example already has a separate 'prealloc' option for cases
> > where they want to prefault all the guest memory, so it makes sense to
> > continue making that an optional thing with regard to UPM.

Making 'prealloc' work for UPM in QEMU does sound reasonable. Anyway,
it's just an option, so it doesn't change the assumption here.

> 
> Although I guess what you're suggesting doesn't stop userspace from
> deciding whether they want to prefault or not. I know the Google folks
> had some concerns over unexpected allocations causing 2x memory usage
> though so giving userspace full control of what is/isn't allocated in
> the restrictedmem backend seems to make it easier to guard against this,
> but I think checking the xarray and defaulting to 'shared' would work
> for us if that's the direction we end up going.

Yeah, that looks very likely the direction satisfying all people here.

Chao
> 
> -Mike
> 
> > 
> > -Mike
> > 
> > > 
> > > Chao
> > > > 
> > > > -Mike
> > > > 
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > > +
> > > > > +	*pagep = page;
> > > > > +	if (order)
> > > > > +		*order = thp_order(compound_head(page));
> > > > > +
> > > > > +	SetPageUptodate(page);
> > > > > +	unlock_page(page);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > > > -- 
> > > > > 2.25.1
> > > > > 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-11-30  9:39           ` Chao Peng
@ 2022-11-30 14:31             ` Michael Roth
  0 siblings, 0 replies; 101+ messages in thread
From: Michael Roth @ 2022-11-30 14:31 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	mhocko, Muchun Song, wei.w.wang

On Wed, Nov 30, 2022 at 05:39:31PM +0800, Chao Peng wrote:
> On Tue, Nov 29, 2022 at 01:18:15PM -0600, Michael Roth wrote:
> > On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> > > On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > > > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > > ...
> > > > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > > > +				    loff_t offset, loff_t len)
> > > > > > +{
> > > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > > +	struct file *memfd = data->memfd;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > +			return -EINVAL;
> > > > > > +	}
> > > > > > +
> > > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > > > 
> > > > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > > > loff_t. For SNP we've made this strange as part of the following patch
> > > > > and it seems to produce the expected behavior:
> > > > 
> > > > That's correct. Thanks.
> > > > 
> > > > > 
> > > > >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > > > > 
> > > > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > > > +			   struct page **pagep, int *order)
> > > > > > +{
> > > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > > +	struct file *memfd = data->memfd;
> > > > > > +	struct page *page;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > > 
> > > > > This will result in KVM allocating pages that userspace hasn't necessary
> > > > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > > > range.
> > > > 
> > > > Yes fallocate() is unnecessary unless someone wants to reserve some
> > > > space (e.g. for determination or performance purpose), this matches its
> > > > semantics perfectly at:
> > > > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> > > > 
> > > > > 
> > > > > If the guest supports lazy-acceptance however, these pages may not have
> > > > > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > > > > until the guest actually tries to issue a shared->private conversion
> > > > > for that GFN (to support lazy-pinning), then there may never be a need
> > > > > to allocate pages for these backends.
> > > > > 
> > > > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > > > no way to know in advance whether it's been allocated yet or not. The
> > > > > xarray is one option but currently it defaults to 'private' so that
> > > > > doesn't help us here. It might if we introduced an 'uninitialized' state
> > > > > or something along that line instead of just the binary
> > > > > 'shared'/'private' though...
> > > > 
> > > > How about if we change the default to 'shared' as we discussed at
> > > > https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
> > > 
> > > Need to look at this a bit more, but I think that could work as well.
> > > 
> > > > > 
> > > > > But for now we added a restrictedmem_get_page_noalloc() that uses
> > > > > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > > > > of memory as part of guest shutdown, and a
> > > > > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > > > > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > > > > default, and we just propagate an error to userspace if they didn't
> > > > > fallocate() in advance?
> > > > 
> > > > This (making fallocate() a hard requirement) not only complicates
> > > > userspace but also forces the lazy-faulting to go through a long path
> > > > of exiting to userspace. Unless we have no other options I would not
> > > > go this way.
> > > 
> > > Unless I'm missing something, it's already the case that userspace is
> > > responsible for handling all the shared->private transitions in response
> > > to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
> > > places the additional requirement on the VMM that if they *don't*
> > > preallocate, then they'll need to issue the fallocate() prior to issuing
> > > the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.
> 
> Preallocating and memory conversion between shared<->private are two
> different things. No doubt fallocate() and conversion can be called

I just mean that we don't actually have additional userspace exits for
doing lazy-faulting in this manner: prior to mapping a restricted page
into the TDP we will have gotten a KVM_EXIT_MEMORY_FAULT anyway so that
userspace can handle the conversion. So if you do the fallocate() prior
to KVM_MEM_ENCRYPT_REG_REGION, there are no additional KVM exits (unless
you count the fallocate() syscall itself, but that seems negligible
compared to the memory allocation).

For instance on QEMU side we do the fallocate() as part of
kvm_convert_memory() helper.
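
Concretely, the ordering I'm describing is roughly the sketch below. It
is not the actual QEMU code: the helper name and the restricted
fd/offset plumbing are made up, and the ioctl shown is the existing
SEV-style KVM_MEMORY_ENCRYPT_REG_REGION, standing in for whatever the
SNP/UPM patches end up using (including whether .addr carries a GPA or
a userspace address there).

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch of a VMM-side shared->private conversion handler. */
static int convert_to_private(int vm_fd, int restricted_fd,
                              uint64_t offset, uint64_t gpa, uint64_t size)
{
        struct kvm_enc_region region = {
                .addr = gpa,    /* illustrative only, see note above */
                .size = size,
        };

        /* back the range in the restricted memfd first ... */
        if (fallocate(restricted_fd, 0, offset, size))
                return -errno;

        /* ... then register it as private, with no further KVM exit */
        return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
}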

But thinking about it more, the main upside to this approach (giving the
VMM control/accounting over restrictedmem allocations) doesn't actually
work out. For instance, if the VMM fallocate()'s memory for a single 4K
page prior to a shared->private conversion, shmem might still allocate a
THP for that whole 2M range, and userspace doesn't have a good way to
account for this. So what I'm proposing probably isn't feasible anyway.

> different things. No doubt fallocate() and conversion can be called
> together in response to KVM_EXIT_MEMORY_FAULT, but they don't have to be
> paired. And the fallocate() does not have to operate on the same memory
> range as memory conversion does.
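
To make the decoupling concrete: a VMM that wants full preallocation can
still do a single up-front fallocate() over the whole restricted region
at setup and convert page-sized chunks later, e.g. something like the
minimal sketch below (names are made up):

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdint.h>

/* Preallocate the whole restricted region once at VM creation; later
 * per-page conversions then never need to allocate anything.
 */
static void prealloc_restricted(int restricted_fd, uint64_t guest_mem_size)
{
        if (fallocate(restricted_fd, 0, 0, guest_mem_size))
                err(1, "fallocate restricted memfd");
}
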
> 
> > > 
> > > QEMU for example already has a separate 'prealloc' option for cases
> > > where they want to prefault all the guest memory, so it makes sense to
> > > continue making that an optional thing with regard to UPM.
> 
> Making 'prealloc' work for UPM in QEMU does sound reasonable. Anyway,
> it's just an option, so it doesn't change the assumption here.
> 
> > 
> > Although I guess what you're suggesting doesn't stop userspace from
> > deciding whether they want to prefault or not. I know the Google folks
> > had some concerns over unexpected allocations causing 2x memory usage,
> > though, so giving userspace full control of what is/isn't allocated in
> > the restrictedmem backend seems to make it easier to guard against this,
> > but I think checking the xarray and defaulting to 'shared' would work
> > for us if that's the direction we end up going.
> 
> Yeah, that looks very likely to be the direction that satisfies everyone here.

Ok, yeah, after some more thought this probably is the more feasible
approach. Thanks for your input on this.
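
For concreteness, the check I'm picturing on the KVM side is roughly the
below. It's only a sketch: the 'mem_attrs' xarray and the
KVM_MEM_ATTR_PRIVATE encoding are placeholders for whatever the series
ends up storing per gfn, not actual field names.

#include <linux/kvm_host.h>
#include <linux/xarray.h>

/* A gfn is private only if an entry was explicitly recorded for it;
 * anything absent from the xarray defaults to shared.
 */
static bool kvm_gfn_is_private(struct kvm *kvm, gfn_t gfn)
{
        void *entry = xa_load(&kvm->mem_attrs, gfn);

        return entry && xa_to_value(entry) == KVM_MEM_ATTR_PRIVATE;
}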

-Mike

> 
> Chao
> > 
> > -Mike
> > 
> > > 
> > > -Mike
> > > 
> > > > 
> > > > Chao
> > > > > 
> > > > > -Mike
> > > > > 
> > > > > > +	if (ret)
> > > > > > +		return ret;
> > > > > > +
> > > > > > +	*pagep = page;
> > > > > > +	if (order)
> > > > > > +		*order = thp_order(compound_head(page));
> > > > > > +
> > > > > > +	SetPageUptodate(page);
> > > > > > +	unlock_page(page);
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > > > > -- 
> > > > > > 2.25.1
> > > > > > 



* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
                     ` (4 preceding siblings ...)
  2022-11-29  0:37   ` Michael Roth
@ 2022-12-02  2:16   ` Vishal Annapurve
  2022-12-02  6:49     ` Chao Peng
  5 siblings, 1 reply; 101+ messages in thread
From: Vishal Annapurve @ 2022-12-02  2:16 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Tue, Oct 25, 2022 at 8:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by another kernel module.
>
> memfd_restricted() is useful for scenarios where a file descriptor (fd)
> can be used as an interface into mm but userspace's ability to operate
> on the fd needs to be restricted. Initially it is designed to provide
> protections for KVM encrypted guest memory.
>
> Normally KVM uses memfd memory by mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmapped virtual address to set up the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with a special key for a special software domain (e.g. a KVM guest) and
> is not expected to be directly accessed by userspace. More precisely,
> userspace access to such encrypted memory may lead to a host crash, so
> it should be prevented.
>
> memfd_restricted() provides the semantics required for KVM guest
> encrypted memory support: an fd created with memfd_restricted() is going
> to be used as the source of guest memory in a confidential computing
> environment, and KVM can directly interact with core-mm without the need
> to expose the memory content to KVM userspace.
>
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
>
> The restricted memfd can be fallocate()-ed or hole-punched from
> userspace. When these operations happen, KVM gets notified through
> restrictedmem_notifier; it then gets a chance to remove any mapped
> entries of the range in the secondary page tables.
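
(As an aside for readers following along, wiring a consumer up to the
two interfaces described above looks roughly like the sketch below.
Everything with a 'my_' prefix is a placeholder; the real KVM-side
wiring is in the later patches of this series.)

#include <linux/mm.h>
#include <linux/restrictedmem.h>

static void my_invalidate_start(struct restrictedmem_notifier *notifier,
                                pgoff_t start, pgoff_t end)
{
        /* zap secondary page table entries backed by [start, end) */
}

static void my_invalidate_end(struct restrictedmem_notifier *notifier,
                              pgoff_t start, pgoff_t end)
{
        /* allow new mappings of the range again */
}

static const struct restrictedmem_notifier_ops my_notifier_ops = {
        .invalidate_start       = my_invalidate_start,
        .invalidate_end         = my_invalidate_end,
};

static struct restrictedmem_notifier my_notifier = {
        .ops = &my_notifier_ops,
};

/* called once when userspace hands the restricted memfd over */
static void my_attach(struct file *file)
{
        restrictedmem_register_notifier(file, &my_notifier);
}

/* fault path: resolve one offset of the restricted memfd to a page */
static int my_map_offset(struct file *file, pgoff_t offset)
{
        struct page *page;
        int order, ret;

        ret = restrictedmem_get_page(file, offset, &page, &order);
        if (ret)
                return ret;

        /* ... install page_to_pfn(page) into the secondary page table ... */

        /* drop the ref; invalidate_start fires before a hole punch frees it */
        put_page(page);
        return 0;
}
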
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable; this is required for the current
> confidential usage, but in the future this might change.
>
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining a new bit in 'flags', it can be extended to support other
> restricted semantics in the future.
>
> The system call is currently wired up for x86 arch.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448    i386    process_mrelease        sys_process_mrelease
>  449    i386    futex_waitv             sys_futex_waitv
>  450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> +451    i386    memfd_restricted        sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
>  450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
> +451    common  memfd_restricted        sys_memfd_restricted
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +       void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +                                pgoff_t start, pgoff_t end);
> +       void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +                              pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +       struct list_head list;
> +       const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                                        struct page **pagep, int *order)
> +{
> +       return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>                                             unsigned long home_node,
>                                             unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
>  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
>  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> +#define RESTRICTEDMEM_MAGIC    0x5245534d      /* "RESM" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>         def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config RESTRICTEDMEM
> +       bool
> +       depends on TMPFS
> +
>  config ANON_VMA_NAME
>         bool "Anonymous VMA name support"
>         depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +                                pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +       struct restrictedmem_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               if (notify_start)
> +                       notifier->ops->invalidate_start(notifier, start, end);
> +               else
> +                       notifier->ops->invalidate_end(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +                                   loff_t offset, loff_t len)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +       return ret;
> +}
> +
> +static const struct file_operations restrictedmem_fops = {
> +       .release = restrictedmem_release,
> +       .fallocate = restrictedmem_fallocate,
> +};
> +
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> +                                const struct path *path, struct kstat *stat,
> +                                u32 request_mask, unsigned int query_flags)
> +{
> +       struct inode *inode = d_inode(path->dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +                                            request_mask, query_flags);
> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> +                                struct dentry *dentry, struct iattr *attr)
> +{
> +       struct inode *inode = d_inode(dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (attr->ia_valid & ATTR_SIZE) {
> +               if (memfd->f_inode->i_size)
> +                       return -EPERM;
> +
> +               if (!PAGE_ALIGNED(attr->ia_size))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +                                           file_dentry(memfd), attr);
> +       return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> +       .getattr = restrictedmem_getattr,
> +       .setattr = restrictedmem_setattr,
> +};
> +
> +static int restrictedmem_init_fs_context(struct fs_context *fc)
> +{
> +       if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> +               return -ENOMEM;
> +
> +       fc->s_iflags |= SB_I_NOEXEC;
> +       return 0;
> +}
> +
> +static struct file_system_type restrictedmem_fs = {
> +       .owner          = THIS_MODULE,
> +       .name           = "memfd:restrictedmem",
> +       .init_fs_context = restrictedmem_init_fs_context,
> +       .kill_sb        = kill_anon_super,
> +};
> +
> +static struct vfsmount *restrictedmem_mnt;
> +
> +static __init int restrictedmem_init(void)
> +{
> +       restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> +       if (IS_ERR(restrictedmem_mnt))
> +               return PTR_ERR(restrictedmem_mnt);
> +       return 0;
> +}
> +fs_initcall(restrictedmem_init);
> +
> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +       struct restrictedmem_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &restrictedmem_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +                                "restrictedmem", O_RDWR,
> +                                &restrictedmem_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);
> +               return ERR_CAST(file);
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}
> +
> +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +{

Looking at the underlying shmem implementation, there seems to be no
way to enable transparent huge pages specifically for restricted memfd
files.

Michael discussed earlier tweaking the
/sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
hugepages to be used while backing restricted memfds. Such a change
will affect the rest of the shmem use cases as well. Even setting the
shmem_enabled policy to "advise" wouldn't help unless file-based
advice for hugepage allocation is implemented.

Does it make sense to provide a flag here to allow creating restricted
memfds possibly backed by huge pages, to give more granular control?
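
Something as small as the below is what I have in mind from the
userspace side. The flag name, value and hugepage semantics are purely
illustrative, not a concrete proposal; the syscall number is the one
this patch wires up for x86.

#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted   451
#endif

/* hypothetical flag asking for THP-backed restricted memory */
#define MEMFD_RSTD_HUGEPAGE     (1U << 0)

static int memfd_restricted_huge(void)
{
        return syscall(__NR_memfd_restricted, MEMFD_RSTD_HUGEPAGE);
}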

> +       struct file *file, *restricted_file;
> +       int fd, err;
> +
> +       if (flags)
> +               return -EINVAL;
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0)
> +               return fd;
> +
> +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +       if (IS_ERR(file)) {
> +               err = PTR_ERR(file);
> +               goto err_fd;
> +       }
> +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +       file->f_flags |= O_LARGEFILE;
> +
> +       restricted_file = restrictedmem_file_create(file);
> +       if (IS_ERR(restricted_file)) {
> +               err = PTR_ERR(restricted_file);
> +               fput(file);
> +               goto err_fd;
> +       }
> +
> +       fd_install(fd, restricted_file);
> +       return fd;
> +err_fd:
> +       put_unused_fd(fd);
> +       return err;
> +}
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> +
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pagep = page;
> +       if (order)
> +               *order = thp_order(compound_head(page));
> +
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> --
> 2.25.1
>



* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  2:16   ` Vishal Annapurve
@ 2022-12-02  6:49     ` Chao Peng
  2022-12-02 13:44       ` Kirill A . Shutemov
  0 siblings, 1 reply; 101+ messages in thread
From: Chao Peng @ 2022-12-02  6:49 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Thu, Dec 01, 2022 at 06:16:46PM -0800, Vishal Annapurve wrote:
> On Tue, Oct 25, 2022 at 8:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
...
> > +}
> > +
> > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > +{
> 
> Looking at the underlying shmem implementation, there seems to be no
> way to enable transparent huge pages specifically for restricted memfd
> files.
> 
> Michael discussed earlier tweaking the
> /sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
> hugepages to be used while backing restricted memfds. Such a change
> will affect the rest of the shmem use cases as well. Even setting the
> shmem_enabled policy to "advise" wouldn't help unless file-based
> advice for hugepage allocation is implemented.

Had a look at fadvise(), and it looks like it does not support HUGEPAGE
advice for any filesystem yet.

> 
> Does it make sense to provide a flag here to allow creating restricted
> memfds possibly backed by huge pages, to give more granular control?

We do have an unused 'flags' argument that can be extended for such
usage, but I would let Kirill take a further look; this perhaps needs
more discussion.

Chao
> 
> > +       struct file *file, *restricted_file;
> > +       int fd, err;
> > +
> > +       if (flags)
> > +               return -EINVAL;
> > +
> > +       fd = get_unused_fd_flags(0);
> > +       if (fd < 0)
> > +               return fd;
> > +
> > +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > +       if (IS_ERR(file)) {
> > +               err = PTR_ERR(file);
> > +               goto err_fd;
> > +       }
> > +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> > +       file->f_flags |= O_LARGEFILE;
> > +
> > +       restricted_file = restrictedmem_file_create(file);
> > +       if (IS_ERR(restricted_file)) {
> > +               err = PTR_ERR(restricted_file);
> > +               fput(file);
> > +               goto err_fd;
> > +       }
> > +
> > +       fd_install(fd, restricted_file);
> > +       return fd;
> > +err_fd:
> > +       put_unused_fd(fd);
> > +       return err;
> > +}
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_add(&notifier->list, &data->notifiers);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> > +
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_del(&notifier->list);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                          struct page **pagep, int *order)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       struct page *page;
> > +       int ret;
> > +
> > +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > +       if (ret)
> > +               return ret;
> > +
> > +       *pagep = page;
> > +       if (order)
> > +               *order = thp_order(compound_head(page));
> > +
> > +       SetPageUptodate(page);
> > +       unlock_page(page);
> > +
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > --
> > 2.25.1
> >



* Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory
  2022-12-02  6:49     ` Chao Peng
@ 2022-12-02 13:44       ` Kirill A . Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A . Shutemov @ 2022-12-02 13:44 UTC (permalink / raw)
  To: Chao Peng
  Cc: Vishal Annapurve, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, tabba,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Dec 02, 2022 at 02:49:09PM +0800, Chao Peng wrote:
> On Thu, Dec 01, 2022 at 06:16:46PM -0800, Vishal Annapurve wrote:
> > On Tue, Oct 25, 2022 at 8:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> ...
> > > +}
> > > +
> > > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > > +{
> > 
> > Looking at the underlying shmem implementation, there seems to be no
> > way to enable transparent huge pages specifically for restricted memfd
> > files.
> > 
> > Michael discussed earlier tweaking the
> > /sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
> > hugepages to be used while backing restricted memfds. Such a change
> > will affect the rest of the shmem use cases as well. Even setting the
> > shmem_enabled policy to "advise" wouldn't help unless file-based
> > advice for hugepage allocation is implemented.
> 
> Had a look at fadvise(), and it looks like it does not support HUGEPAGE
> advice for any filesystem yet.

Yes, I think fadvise() is the right direction here. The problem is similar
to NUMA policy, where the existing APIs are focused around virtual memory
addresses. We need to extend the ABI to take fd+offset as input instead.
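
I.e. from the VMM's point of view something shaped like the below. The
advice value does not exist today; it is only meant to illustrate the
fd+offset+len form of the interface.

#include <fcntl.h>

#ifndef FADV_HUGEPAGE
#define FADV_HUGEPAGE   8       /* hypothetical */
#endif

/* ask for THP backing of one range of the restricted memfd */
static int advise_hugepage(int restricted_fd, off_t offset, off_t len)
{
        return posix_fadvise(restricted_fd, offset, len, FADV_HUGEPAGE);
}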

-- 
  Kiryl Shutsemau / Kirill A. Shutemov




Thread overview: 101+ messages
2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
2022-10-26 17:31   ` Isaku Yamahata
2022-10-28  6:12     ` Chao Peng
2022-10-27 10:20   ` Fuad Tabba
2022-10-31 17:47   ` Michael Roth
2022-11-01 11:37     ` Chao Peng
2022-11-01 15:19       ` Michael Roth
2022-11-01 19:30         ` Michael Roth
2022-11-02 14:53           ` Chao Peng
2022-11-02 21:19             ` Michael Roth
2022-11-14 14:02         ` Vlastimil Babka
2022-11-14 15:28           ` Kirill A. Shutemov
2022-11-14 22:16             ` Michael Roth
2022-11-15  9:48               ` Chao Peng
2022-11-14 22:16           ` Michael Roth
2022-11-02 21:14     ` Kirill A. Shutemov
2022-11-02 21:26       ` Michael Roth
2022-11-02 22:07       ` Michael Roth
2022-11-03 16:30         ` Kirill A. Shutemov
2022-11-29  0:06   ` Michael Roth
2022-11-29 11:21     ` Kirill A. Shutemov
2022-11-29 11:39       ` David Hildenbrand
2022-11-29 13:59         ` Chao Peng
2022-11-29 13:58       ` Chao Peng
2022-11-29  0:37   ` Michael Roth
2022-11-29 14:06     ` Chao Peng
2022-11-29 19:06       ` Michael Roth
2022-11-29 19:18         ` Michael Roth
2022-11-30  9:39           ` Chao Peng
2022-11-30 14:31             ` Michael Roth
2022-11-29 18:01     ` Vishal Annapurve
2022-12-02  2:16   ` Vishal Annapurve
2022-12-02  6:49     ` Chao Peng
2022-12-02 13:44       ` Kirill A . Shutemov
2022-10-25 15:13 ` [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-10-27 10:25   ` Fuad Tabba
2022-10-28  7:04   ` Xiaoyao Li
2022-10-31 14:14     ` Chao Peng
2022-11-14 16:04   ` Alex Bennée
2022-11-15  9:29     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
2022-10-25 15:26   ` Peter Maydell
2022-10-25 16:17     ` Sean Christopherson
2022-10-27 10:27   ` Fuad Tabba
2022-10-28  6:14     ` Chao Peng
2022-11-15 16:56   ` Alex Bennée
2022-11-16  3:14     ` Chao Peng
2022-11-16 19:03       ` Alex Bennée
2022-11-17 13:45         ` Chao Peng
2022-11-17 15:08           ` Alex Bennée
2022-11-18  1:32             ` Chao Peng
2022-11-18 13:23               ` Alex Bennée
2022-11-18 15:59                 ` Sean Christopherson
2022-11-22  9:50                   ` Chao Peng
2022-11-23 18:02                     ` Sean Christopherson
2022-11-16 18:15   ` Andy Lutomirski
2022-11-16 18:48     ` Sean Christopherson
2022-11-17 13:42       ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
2022-10-27 10:29   ` Fuad Tabba
2022-11-04  2:28     ` Chao Peng
2022-11-04 22:29       ` Sean Christopherson
2022-11-08  7:16         ` Chao Peng
2022-11-10 17:53           ` Sean Christopherson
2022-11-10 20:06   ` Sean Christopherson
2022-11-11  8:27     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
2022-10-27 10:31   ` Fuad Tabba
2022-11-03 23:04   ` Sean Christopherson
2022-11-04  8:28     ` Chao Peng
2022-11-04 21:19       ` Sean Christopherson
2022-11-08  8:24         ` Chao Peng
2022-11-08  1:35   ` Yuan Yao
2022-11-08  9:41     ` Chao Peng
2022-11-09  5:52       ` Yuan Yao
2022-11-16 22:24   ` Sean Christopherson
2022-11-17 13:20     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
2022-10-26 20:46   ` Isaku Yamahata
2022-10-28  6:38     ` Chao Peng
2022-11-08 12:08   ` Yuan Yao
2022-11-09  4:13     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 7/8] KVM: Handle page fault for private memory Chao Peng
2022-10-26 21:54   ` Isaku Yamahata
2022-10-28  6:55     ` Chao Peng
2022-11-01  0:02       ` Isaku Yamahata
2022-11-01 11:38         ` Chao Peng
2022-11-16 20:50   ` Ackerley Tng
2022-11-16 22:13     ` Sean Christopherson
2022-11-17 13:25       ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-10-27 10:31   ` Fuad Tabba
2022-11-03 12:13 ` [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Vishal Annapurve
2022-11-08  0:41   ` Isaku Yamahata
2022-11-09 15:54     ` Kirill A. Shutemov
2022-11-15 14:36       ` Kirill A. Shutemov
2022-11-14 11:43 ` Alex Bennée
2022-11-16  5:00   ` Chao Peng
2022-11-16  9:40     ` Alex Bennée
2022-11-17 14:16       ` Chao Peng
