* [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
@ 2022-09-15 14:29 Chao Peng
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
                   ` (7 more replies)
  0 siblings, 8 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can happen, which can
further crash the running host system; this is terrible for multi-tenant
configurations. The host accesses include those from KVM userspace like
QEMU. This series addresses KVM userspace-induced crashes by introducing
new mm and KVM interfaces so KVM userspace can still manage guest memory
via a fd-based approach, but it can never access the guest memory
content.

The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick up these patches.
Any other reviews are always welcome.
  - 01: mm change, target for mm tree
  - 02-08: KVM change, target for KVM tree

Given that KVM is currently the only user of the mm part, I have chatted
with Paolo and he is OK with merging the mm change through the KVM tree,
but Reviewed-by/Acked-by tags are still expected from the mm people.

The patches have been verified in an Intel TDX environment, but Vishal
has done excellent work on the selftests[4] dedicated to this series,
making it possible to test the series without special hardware and
without the fancy steps of building a VM environment. See the Test
section below for more info.

Compared to the previous version, this version redesigns the mm part and
excludes the F_SEAL_AUTO_ALLOCATE and man page changes from this series.
See the Changelog section below for more info.


Introduction
============
KVM userspace being able to crash the host is horrible. Under the
current KVM architecture, all guest memory is inherently accessible from
KVM userspace and is exposed to the mentioned crash issue. The goal of
this series is to align mm and KVM on an approach that exposes guest
memory without making it accessible to userspace.

Normally, KVM populates the secondary page table (e.g. EPT) by using a
host virtual address (hva) from the core mm page table (e.g. the x86
userspace page table). This requires guest memory to be mmaped into KVM
userspace, but that is also the source of the mentioned crash issue. In
theory, apart from the 'shared' memory used for device emulation etc.,
guest memory doesn't have to be mmaped into KVM userspace.

This series introduces fd-based guest memory which will not be mmaped
into KVM userspace. KVM populates the secondary page table by using a
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
interact with it directly through a newly introduced in-kernel
interface, thereby removing KVM userspace from the path of
accessing/mmaping the guest memory.

Kirill had a patch [2] to address the same issue in a different way. It
tracks guest encrypted memory at the 'struct page' level and relies on
HWPOISON to reject userspace access. The patch has been discussed in
several online and offline threads and resulted in a design document [3]
which is also the original proposal for this series. This patch series
has since evolved as more comments were received from the community, but
the major concepts in [3] still hold true, so it is recommended reading.

The patch series may also be useful for other usages; for example, a
pure software approach may use it to harden itself against unintentional
access to guest memory. This series is designed with these usages in
mind but doesn't have code to directly support them, so extensions might
be needed.


mm change
=========
Introduces a new userspace MFD_INACCESSIBLE flag for memfd_create() so
that a memory fd created with this flag cannot be read(), written or
mmap()-ed etc. via normal MMU operations; the only way to use it is to
pass it to an in-kernel consumer like KVM, which accesses the fd through
the newly added inaccessible_memfd kernel interface. The
inaccessible_memfd interface bridges the memory file subsystems
(e.g. tmpfs/hugetlbfs) and their users (KVM in this case) and provides
bi-directional communication between them.
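
As a rough illustration, creating such a fd from userspace could look
like the sketch below (assuming the MFD_INACCESSIBLE value proposed in
this series on a kernel with these patches applied):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MFD_INACCESSIBLE
    #define MFD_INACCESSIBLE 0x0008U  /* value proposed in this series */
    #endif

    int create_guest_mem(size_t size)
    {
            /* MFD_ALLOW_SEALING and MFD_HUGETLB are rejected for now. */
            int fd = memfd_create("guest-mem", MFD_INACCESSIBLE);

            if (fd < 0)
                    return -1;

            /* Size must be page aligned and can only be set once. */
            if (ftruncate(fd, size) < 0)
                    return -1;

            /*
             * read()/write()/mmap() on this fd fail; the fd is only
             * useful when passed to an in-kernel consumer such as KVM.
             */
            return fd;
    }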


KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory from
a fd. With this extension, a single memslot can maintain both private
memory through a private fd (private_fd/private_offset) and shared
(unencrypted) memory through a userspace-mmaped host virtual address
(userspace_addr). For a particular guest page, the corresponding page in
the KVM memslot can only be either private or shared, and only one of
the shared/private parts of the memslot is visible to the guest.
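
As an illustration, userspace might configure such a memslot roughly as
below (a sketch; vm_fd, shared_va, mem_size and inaccessible_fd are
assumed to be set up elsewhere):

    struct kvm_userspace_memory_region_ext region_ext = {
            .region = {
                    .slot            = 0,
                    .flags           = KVM_MEM_PRIVATE,
                    .guest_phys_addr = 0,
                    .memory_size     = mem_size,
                    /* shared part, mmaped into userspace */
                    .userspace_addr  = (__u64)shared_va,
            },
            /* private part, from an MFD_INACCESSIBLE memfd */
            .private_fd     = inaccessible_fd,
            .private_offset = 0,
    };

    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);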

Introduces a new KVM_EXIT_MEMORY_FAULT exit to give userspace a chance
to make decisions on shared <-> private memory conversion. The exit can
be triggered by an implicit conversion in the KVM page fault handler or
an explicit conversion request from the guest OS.

Extends the existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
convert a guest page between private <-> shared. The data recorded by
these ioctls is the source of truth for whether a guest page is private
or shared, and this information is used in the KVM page fault handler to
decide whether the private or the shared part of the memslot is visible
to the guest.
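
For example, a VMM could convert a page range with a sketch like the
following (note that addr carries a GPA in this usage, unlike the SEV
usage where it carries an HVA):

    struct kvm_enc_region region = {
            .addr = gpa,        /* guest physical address */
            .size = len,        /* page-aligned length */
    };

    /* mark [gpa, gpa + len) private ... */
    ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
    /* ... or shared */
    ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &region);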


Test
====
Ran two kinds of tests:
  - Selftests [4] from Vishal and VM boot tests in a non-TDX environment
    The code is also in the repo below: https://github.com/chao-p/linux/tree/privmem-v8

  - Functional tests in TDX capable environment
    Tested the new functionalities in TDX environment. Code repos:
    Linux: https://github.com/chao-p/linux/tree/privmem-v8-tdx
    QEMU: https://github.com/chao-p/qemu/tree/privmem-v8

    An example QEMU command line for TDX test:
    -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
    -machine confidential-guest-support=tdx \
    -object memory-backend-memfd-private,id=ram1,size=${mem} \
    -machine memory-backend=ram1


TODO
====
  - Page accounting and limiting for encrypted memory
  - hugetlbfs support


Changelog
=========
v8:
  - mm: redesign the mm part by introducing a shim layer
    (inaccessible_memfd) in memfd to avoid touching the memory file
    systems directly.
  - mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
    causes confusion in this series; it will be sent out separately.
  - doc: exclude the man page change; it's not a kernel patch and will
    be sent out separately.
  - KVM: adapt to use the new mm inaccessible_memfd interface.
  - KVM: update lpage_info when setting mem_attr_array to support
    large pages.
  - KVM: change from xa_store_range to xa_store for mem_attr_array
    because xa_store_range overrides all entries, which is not the
    intended behavior for us.
  - KVM: refine the mmu_invalidate_retry_gfn mechanism for private pages.
  - KVM: reorganize the KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private
    page handling code as suggested by Sean.
v7:
  - mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
  - KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
    private/shared info.
  - KVM: use similar sync mechanism between zap/page fault paths as
    mmu_notifier for memfile_notifier based invalidation.
v6:
  - mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
    feature consistency among all memfile_notifier users and get rid of
    internal flags like SHM_F_INACCESSIBLE.
  - mm: make pfn_ops callbacks being members of memfile_backing_store
    and then refer to it directly in memfile_notifier.
  - mm: remove backing store unregister.
  - mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
  - KVM: reorganize patch sequence for page fault handling and private
    memory enabling.
v5:
  - Add a man page for the MFD_INACCESSIBLE flag and improve the KVM API
    doc for the new memslot extensions.
  - mm: introduce memfile_{un}register_backing_store to allow memory
    backing store to register/unregister it from memfile_notifier.
  - mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
    (SHM_F_INACCESSIBLE for shmem) instead. 
  - mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
    MFD_INACCESSIBLE memory.
  - KVM: remove the overlap check for mapping the same file+offset into
    multiple gfns due to performance considerations; this is instead
    warned about in the documentation.
v4:
  - mm: rename memfd_ops to memfile_notifier and separate it from
    memfd.c to standalone memfile-notifier.c.
  - KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
    registering multiple memslots to the same memory backing store.
  - KVM: add a 'kvm' reference in memslot so that we can recover kvm in
    memfile_notifier handlers.
  - KVM: add 'private_' prefix for the new fields in memslot.
  - KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
  - Remove 'RFC' prefix.
  - Fix race condition between memfile_notifier handlers and kvm destroy.
  - mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
    setting F_SEAL_INACCESSIBLE when the fd is created.
  - KVM: add the shared part of the memslot back to make private/shared
    pages live in one memslot.

Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com/T/ 
[3] Original design proposal:
https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com/  
[4] Selftest:
https://lore.kernel.org/all/20220819174659.2427983-1-vannapurve@google.com/ 


Chao Peng (7):
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Register/unregister the guest private memory regions
  KVM: Update lpage info when private/shared memory are mixed
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm/memfd: Introduce userspace inaccessible memfd

 Documentation/virt/kvm/api.rst  |  78 +++++++--
 arch/x86/include/asm/kvm_host.h |   9 +
 arch/x86/kvm/Kconfig            |   1 +
 arch/x86/kvm/mmu.h              |   2 -
 arch/x86/kvm/mmu/mmu.c          | 175 +++++++++++++++++++-
 arch/x86/kvm/mmu/mmu_internal.h |  18 ++
 arch/x86/kvm/mmu/mmutrace.h     |   1 +
 arch/x86/kvm/x86.c              |   4 +-
 include/linux/kvm_host.h        |  86 ++++++++--
 include/linux/memfd.h           |  24 +++
 include/uapi/linux/kvm.h        |  37 +++++
 include/uapi/linux/magic.h      |   1 +
 include/uapi/linux/memfd.h      |   1 +
 mm/Makefile                     |   2 +-
 mm/memfd.c                      |  25 ++-
 mm/memfd_inaccessible.c         | 219 +++++++++++++++++++++++++
 virt/kvm/Kconfig                |   3 +
 virt/kvm/kvm_main.c             | 282 +++++++++++++++++++++++++++++---
 18 files changed, 912 insertions(+), 56 deletions(-)
 create mode 100644 mm/memfd_inaccessible.c

base-commit: 372d07084593dc7a399bf9bee815711b1fb1bcf2
-- 
2.25.1



* [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-09-19  9:12   ` David Hildenbrand
                     ` (5 more replies)
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
                   ` (6 subsequent siblings)
  7 siblings, 6 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM can use memfd-provided memory for guest memory. For normal,
userspace-accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd
into its virtual address space and then tells KVM to use the virtual
address to set up the mapping in the secondary page table (e.g. EPT).

With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with a special key for a specific
software domain (e.g. a KVM guest) and is not expected to be directly
accessed by userspace. More precisely, userspace access to such
encrypted memory may lead to a host crash, so it should be prevented.

This patch introduces the userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
ordinary MMU access (e.g. read/write/mmap) but can be accessed via an
in-kernel interface, so KVM can directly interact with core mm without
the need to map the memory into KVM userspace.

It provides the semantics required for KVM guest private (encrypted)
memory support: a file descriptor with this flag set is going to be used
as the source of guest memory in confidential computing environments
such as Intel TDX and AMD SEV.

KVM userspace is still in charge of the lifecycle of the memfd. It
should pass the opened fd to KVM. KVM uses the kernel APIs newly added
in this patch to obtain the physical memory address and then populate
the secondary page table entries.

The userspace inaccessible memfd can be fallocate-ed and hole-punched
from userspace. When hole-punching happens, KVM gets notified through
the inaccessible_notifier and then gets a chance to remove any mapped
entries for the range from the secondary page tables.
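
For instance, a consumer could hook the invalidation and resolve pfns
along these lines (a sketch; the consumer_* names are made up for
illustration, given a struct file *file for the inaccessible memfd):

    static void consumer_invalidate(struct inaccessible_notifier *notifier,
                                    pgoff_t start, pgoff_t end)
    {
            /*
             * Zap secondary page table entries covering page offsets
             * [start, end) of the file.
             */
    }

    static const struct inaccessible_notifier_ops consumer_ops = {
            .invalidate = consumer_invalidate,
    };

    static struct inaccessible_notifier consumer_notifier = {
            .ops = &consumer_ops,
    };

    /* registered once via:
     *   inaccessible_register_notifier(file, &consumer_notifier);
     */

    static int consumer_map_page(struct file *file, pgoff_t offset)
    {
            pfn_t pfn;
            int order, ret;

            ret = inaccessible_get_pfn(file, offset, &pfn, &order);
            if (ret)
                    return ret;

            /* ... install pfn into the secondary page table ... */

            /* drop the reference taken by inaccessible_get_pfn() */
            inaccessible_put_pfn(file, pfn);
            return 0;
    }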

The userspace inaccessible memfd itself is implemented as a shim layer
on top of real memory file systems like tmpfs/hugetlbfs, but this patch
only implements tmpfs. The allocated memory is currently marked
unmovable and unevictable; this is required for the current confidential
usage, but it might be changed in the future.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfd.h      |  24 ++++
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/memfd.h |   1 +
 mm/Makefile                |   2 +-
 mm/memfd.c                 |  25 ++++-
 mm/memfd_inaccessible.c    | 219 +++++++++++++++++++++++++++++++++++++
 6 files changed, 270 insertions(+), 2 deletions(-)
 create mode 100644 mm/memfd_inaccessible.c

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..334ddff08377 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H
 
 #include <linux/file.h>
+#include <linux/pfn_t.h>
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+	void (*invalidate)(struct inaccessible_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+	struct list_head list;
+	const struct inaccessible_notifier_ops *ops;
+};
+
+void inaccessible_register_notifier(struct file *file,
+				    struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+		return -EINVAL;
+
+	/* TODO: add hugetlb support */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
 		*file_seals &= ~F_SEAL_SEAL;
 	}
 
+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
+
+		file = inaccessible_file;
+	}
+
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
 
+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
new file mode 100644
index 000000000000..2d33cbdd9282
--- /dev/null
+++ b/mm/memfd_inaccessible.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/sbitmap.h"
+#include <linux/memfd.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+
+struct inaccessible_data {
+	struct mutex lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
+				 pgoff_t start, pgoff_t end)
+{
+	struct inaccessible_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static int inaccessible_release(struct inode *inode, struct file *file)
+{
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long inaccessible_fallocate(struct file *file, int mode,
+				   loff_t offset, loff_t len)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+			return -EINVAL;
+	}
+
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	inaccessible_notifier_invalidate(data, offset >> PAGE_SHIFT,
+					 (offset + len) >> PAGE_SHIFT);
+	return ret;
+}
+
+static const struct file_operations inaccessible_fops = {
+	.release = inaccessible_release,
+	.fallocate = inaccessible_fallocate,
+};
+
+static int inaccessible_getattr(struct user_namespace *mnt_userns,
+				const struct path *path, struct kstat *stat,
+				u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int inaccessible_setattr(struct user_namespace *mnt_userns,
+				struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size)
+			return -EPERM;
+
+		if (!PAGE_ALIGNED(attr->ia_size))
+			return -EINVAL;
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+	return ret;
+}
+
+static const struct inode_operations inaccessible_iops = {
+	.getattr = inaccessible_getattr,
+	.setattr = inaccessible_setattr,
+};
+
+static int inaccessible_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type inaccessible_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "[inaccessible]",
+	.init_fs_context = inaccessible_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *inaccessible_mnt;
+
+static __init int inaccessible_init(void)
+{
+	inaccessible_mnt = kern_mount(&inaccessible_fs);
+	if (IS_ERR(inaccessible_mnt))
+		return PTR_ERR(inaccessible_mnt);
+	return 0;
+}
+fs_initcall(inaccessible_init);
+
+struct file *memfd_mkinaccessible(struct file *memfd)
+{
+	struct inaccessible_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	mutex_init(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &inaccessible_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, inaccessible_mnt,
+				 "[memfd:inaccessible]", O_RDWR,
+				 &inaccessible_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+		return ERR_CAST(file);
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+void inaccessible_register_notifier(struct file *file,
+				    struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_del(&notifier->list);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	SetPageUptodate(page);
+	unlock_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+
+	if (WARN_ON_ONCE(!page))
+		return;
+
+	put_page(page);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
-- 
2.25.1



* [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-09-16  9:14   ` Bagas Sanjaya
                     ` (5 more replies)
  2022-09-15 14:29 ` [PATCH v8 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
                   ` (5 subsequent siblings)
  7 siblings, 6 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

In memory encryption usage, guest memory may be encrypted with a special
key and can be accessed only by the VM itself. We call such memory
private memory. Allowing userspace to access guest private memory has
little value and can sometimes cause problems. This patch extends the
KVM memslot definition so that guest private memory can be provided
through an inaccessible_notifier-enlightened file descriptor (fd),
without being mmaped into userspace.

This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields, private_fd/private_offset, to allow
userspace to specify that guest private memory is provided from the
private_fd, with guest_phys_addr mapped at private_offset of the
private_fd and spanning a range of memory_size.

The extended memslot can still have the userspace_addr (hva). When used,
a single memslot can maintain both private memory through the private
fd (private_fd/private_offset) and shared memory through the
hva (userspace_addr). Whether the private or the shared part is visible
to the guest is maintained by other KVM code.

Since there is no userspace mapping for the private fd, we cannot use
get_user_pages() to get the pfn in KVM; instead we add a new
inaccessible_notifier in the internal memslot structure and rely on it
to get the pfn by interacting with the memory file systems.
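
Conceptually, the private fault path then resolves a gfn to a pfn
roughly as follows (a sketch; the helper name and exact plumbing are
illustrative rather than the final code):

    static int private_mem_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
                                   kvm_pfn_t *pfn, int *order)
    {
            pgoff_t index = gfn - slot->base_gfn +
                            (slot->private_offset >> PAGE_SHIFT);
            pfn_t pfnt;
            int ret;

            /* ask the backing store instead of get_user_pages() */
            ret = inaccessible_get_pfn(slot->private_file, index, &pfnt,
                                       order);
            *pfn = pfn_t_to_pfn(pfnt);
            return ret;
    }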

Together with the change, a new config HAVE_KVM_PRIVATE_MEM is added and
right now it is selected on X86_64 for Intel TDX usage.

To make code maintenance easy, internally we use a binary compatible
alias struct kvm_user_mem_region to handle both the normal and the
'_ext' variants.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 38 +++++++++++++++++++++-----
 arch/x86/kvm/Kconfig           |  1 +
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       | 13 +++++++--
 include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
 7 files changed, 116 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..c1fac1e9f820 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
 	__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_PRIVATE		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields. It also includes additional fields for some specific features. See
+the description of the flags field below for more information. It's
+recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports the following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only.  In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
+  a file descriptor (fd) and the content of the private memory is invisible to
+  userspace. In this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failure
+  to do so may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..31db64ec0b33 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,7 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_PRIVATE_MEM if X86_64
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7374d768296..081f62ccc9a1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12183,7 +12183,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_user_mem_region m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f4519d3689e1..eac1787b899b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 
 #include <asm/kvm_host.h>
 #include <linux/kvm_dirty_ring.h>
+#include <linux/memfd.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -576,8 +577,16 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *private_file;
+	loff_t private_offset;
+	struct inaccessible_notifier notifier;
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -1104,9 +1113,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_user_mem_region *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_user_mem_region *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..3ef462fb3b2a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	struct kvm_userspace_memory_region region;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 private_offset;
+	__u32 private_fd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+#endif
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index a8c5c9f06b3c..ccaff13cc5b8 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK
 
 config HAVE_KVM_PM_NOTIFIER
        bool
+
+config HAVE_KVM_PRIVATE_MEM
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..12dc0dc57b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_user_mem_region *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_user_mem_region *mem)
 {
 	int r;
 
@@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+					  struct kvm_user_mem_region *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
 	return fd;
 }
 
+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
+		     sizeof_field(struct kvm_userspace_memory_region, field));	\
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
+do {											\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
+		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
+		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
+	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_user_mem_region mem;
+		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+
+		kvm_sanity_check_user_mem_region_alias();
 
 		r = -EFAULT;
-		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+		if (copy_from_user(&mem, argp, size))
+			goto out;
+
+		r = -EINVAL;
+		if (mem.flags & KVM_MEM_PRIVATE)
 			goto out;
 
-		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
 	case KVM_GET_DIRTY_LOG: {
-- 
2.25.1



* [PATCH v8 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-09-16  9:17   ` Bagas Sanjaya
  2022-09-15 14:29 ` [PATCH v8 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error occurred in KVM at the guest memory range
[gpa, gpa+size). The flags field includes additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates it happened due to a shared memory access.

When private memory is enabled, this new exit will be used for KVM to
exit to userspace for shared <-> private memory conversion in memory
encryption usage. In such usage, there are typically two kinds of memory
conversion:
  - explicit conversion: happens when the guest explicitly calls into
    KVM to map a range (as private or shared); KVM then exits to
    userspace to do the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler where KVM
    exits to userspace for an implicit conversion when the page is in a
    different state than requested (private or shared).
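
A userspace VMM might handle this exit roughly as below (a sketch;
vm_fd/vcpu_fd and the mmaped kvm_run structure are assumed to be set up
elsewhere, and the flag-to-ioctl mapping follows the convention used in
this series):

    ioctl(vcpu_fd, KVM_RUN, 0);
    if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
            struct kvm_enc_region region = {
                    .addr = run->memory.gpa,
                    .size = run->memory.size,
            };

            /*
             * Flip the range to the state the guest needs, then
             * re-enter the guest to retry the access.
             */
            ioctl(vm_fd,
                  (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                          KVM_MEMORY_ENCRYPT_REG_REGION :
                          KVM_MEMORY_ENCRYPT_UNREG_REGION,
                  &region);
    }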

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++++
 2 files changed, 32 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c1fac1e9f820..1a6c003b2a0b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6638,6 +6638,29 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+If the exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU
+has encountered a memory error which is not handled by the KVM kernel module
+and which userspace may choose to handle. The 'flags' field indicates the
+memory properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - when set, indicates the memory error was
+   caused by a private memory access; when clear, the memory error was
+   caused by a shared memory access.
+
+'gpa' and 'size' indicate the memory range where the error occurred. Userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3ef462fb3b2a..0c8db7b7c138 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -538,6 +539,14 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1



* [PATCH v8 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (2 preceding siblings ...)
  2022-09-15 14:29 ` [PATCH v8 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-09-15 14:29 ` [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

Currently, in the mmu_notifier invalidate path, the hva range is
recorded and then checked against in mmu_invalidate_retry_hva() in the
page fault path. However, for the to-be-introduced private memory, a
page fault may not have an associated hva; checking the gfn (gpa) makes
more sense.

For the existing non-private memory case, gfn is expected to continue to
work. The only downside is that when aliasing multiple gfns to a single
hva, the current algorithm of checking multiple ranges could result in a
much larger range being rejected. Such aliasing should be uncommon, so
the impact is expected to be small.

The patch also fixes a bug in kvm_zap_gfn_range(), which already passes
gfns when calling kvm_mmu_invalidate_begin/end() while these functions
accept hvas in the current code.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 18 +++++++---------
 virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++--------------
 3 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e418ef3ecfcb..08abad4f3e6f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4203,7 +4203,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+	       mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eac1787b899b..2125b50f6345 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -776,8 +776,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
-	unsigned long mmu_invalidate_range_start;
-	unsigned long mmu_invalidate_range_end;
+	gfn_t mmu_invalidate_range_start;
+	gfn_t mmu_invalidate_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1366,10 +1366,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1938,9 +1936,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 					   unsigned long mmu_seq,
-					   unsigned long hva)
+					   gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1950,8 +1948,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
 	if (unlikely(kvm->mmu_invalidate_in_progress) &&
-	    hva >= kvm->mmu_invalidate_range_start &&
-	    hva < kvm->mmu_invalidate_range_end)
+	    gfn >= kvm->mmu_invalidate_range_start &&
+	    gfn < kvm->mmu_invalidate_range_end)
 		return 1;
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 12dc0dc57b06..fa9dd2d2c001 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
+typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
@@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm, gfn_range.start,
+							    gfn_range.end);
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
@@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
+							    gfn_t end)
 {
-	/*
-	 * The count increase must become visible at unlock time as no
-	 * spte can be established without taking the mmu_lock and
-	 * count is also read inside the mmu_lock critical section.
-	 */
-	kvm->mmu_invalidate_in_progress++;
 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
 		kvm->mmu_invalidate_range_start = start;
 		kvm->mmu_invalidate_range_end = end;
@@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	}
 }
 
+static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	/*
+	 * The count increase must become visible at unlock time as no
+	 * spte can be established without taking the mmu_lock and
+	 * count is also read inside the mmu_lock critical section.
+	 */
+	kvm->mmu_invalidate_in_progress++;
+}
+
+static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	update_invalidate_range(kvm, range->start, range->end);
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	mark_invalidate_in_progress(kvm, start, end);
+	update_invalidate_range(kvm, start, end);
+}
+
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -752,8 +768,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
-		.on_lock	= kvm_mmu_invalidate_begin,
+		.handler	= kvm_mmu_handle_gfn_range,
+		.on_lock	= mark_invalidate_in_progress,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
@@ -791,8 +807,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that
-- 
2.25.1



* [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
                   ` (3 preceding siblings ...)
  2022-09-15 14:29 ` [PATCH v8 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-09-26 10:36   ` Fuad Tabba
  2022-10-11  9:48   ` Fuad Tabba
  2022-09-15 14:29 ` [PATCH v8 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
guest private memory regions through the
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls. The patch reuses the existing
SEV ioctl numbers but differs in that the address in the region is a gpa
for the KVM_PRIVATE_MEM case while it's an hva for the SEV case. Which
usage an ioctl call goes to is determined by the newly added
kvm_arch_has_private_mem(). Architectures which support KVM_PRIVATE_MEM
should override this function.

The current implementation defaults all memory to private. The shared
memory regions are stored in an xarray for memory efficiency, and
zapping existing memory mappings is also a side effect of these two
ioctls when defined.
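
With that layout, a private/shared lookup conceptually reduces to an
xarray query (a sketch, not necessarily the exact helper added later in
this series):

    /* Only shared gfns have an entry; everything else is private. */
    static bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
    {
            return !xa_load(&kvm->mem_attr_array, gfn);
    }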

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 Documentation/virt/kvm/api.rst  | 17 ++++++--
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.h              |  2 -
 include/linux/kvm_host.h        | 13 ++++++
 virt/kvm/kvm_main.c             | 73 +++++++++++++++++++++++++++++++++
 5 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1a6c003b2a0b..c0f800d04ffc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
 This ioctl can be used to register a guest memory region which may
 contain encrypted data (e.g. guest RAM, SMRAM etc).
 
-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
+Currently this ioctl supports registering memory regions for two usages:
+private memory and SEV-encrypted memory.
+
+When private memory is enabled, this ioctl is used to register a guest
+private memory region and the addr/size of kvm_enc_region represent a guest
+physical address (GPA) range. In this usage, this ioctl zaps the existing
+guest memory mappings in KVM that fall into the region.
+
+When SEV-encrypted memory is enabled, this ioctl is used to register guest
+memory region which may contain encrypted data for a SEV-enabled guest. The
+addr/size of kvm_enc_region represents userspace address (HVA). The SEV
+memory encryption engine uses a tweak such that two identical plaintext pages,
+each at different locations will have differing ciphertexts. So swapping or
 moving ciphertext of those pages will not result in plaintext being
 swapped. So relocating (or migrating) physical backing pages for the SEV
 guest will require some additional steps.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c96c43c313a..cfad6ba1a70a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 #include <asm/hyperv-tlfs.h>
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ZAP_GFN_RANGE
 
 #define KVM_MAX_VCPUS 1024
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6bdaacb6faa0..c94b620bf94b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return -(u32)fault & errcode;
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2125b50f6345..d65690cae80b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 #endif
 
+#ifdef __KVM_HAVE_ZAP_GFN_RANGE
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+#else
+static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
+						      gfn_t gfn_end)
+{
+}
+#endif
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
@@ -795,6 +804,9 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	struct xarray mem_attr_array;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
@@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa9dd2d2c001..de5cce8c82c7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_SHARED	0x0001
+static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
+				     bool is_private)
+{
+	gfn_t start, end;
+	unsigned long index;
+	void *entry;
+	int r;
+
+	if (size == 0 || gpa + size < gpa)
+		return -EINVAL;
+	if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
+		return -EINVAL;
+
+	start = gpa >> PAGE_SHIFT;
+	end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
+
+	/*
+	 * Guest memory defaults to private, kvm->mem_attr_array only stores
+	 * shared memory.
+	 */
+	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
+
+	for (index = start; index < end; index++) {
+		r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
+				    GFP_KERNEL_ACCOUNT));
+		if (r)
+			goto err;
+	}
+
+	kvm_zap_gfn_range(kvm, start, end);
+
+	return r;
+err:
+	while (index-- > start)
+		xa_erase(&kvm->mem_attr_array, index);
+	return r;
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	xa_init(&kvm->mem_attr_array);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	xa_destroy(&kvm->mem_attr_array);
+#endif
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_free_vm(kvm);
@@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
+bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
+{
+	return false;
+}
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
+
+		if (!kvm_arch_has_private_mem(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
+					      region.size, set);
+		break;
+	}
+#endif
 	case KVM_GET_DIRTY_LOG: {
 		struct kvm_dirty_log log;
 
@@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
 	default:
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+arch_vm_ioctl:
+#endif
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
 out:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH v8 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (4 preceding siblings ...)
  2022-09-15 14:29 ` [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-09-29 16:52   ` Isaku Yamahata
  2022-09-15 14:29 ` [PATCH v8 7/8] KVM: Handle page fault for private memory Chao Peng
  2022-09-15 14:29 ` [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
  7 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

When private and shared memory are mixed in a large page, the
lpage_info may not be accurate and should be updated with this mixed
info. A large page that contains mixed pages can't really be mapped as
a large page since its private/shared pages come from different
physical memory.

This patch updates lpage_info when the private/shared memory attribute
is changed. If both private and shared pages are within a large page
region, it can't be mapped as a large page. It's a bit of a challenge
to track the mixed info in a 'count'-like variable, so this patch
instead reserves a bit in disallow_lpage to indicate that a large page
includes mixed private/shared pages.
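
For clarity, a hedged sketch of how the overloaded field decodes,
mirroring the macros this patch adds (the helper names here are made up
for illustration and are not part of the patch):

  /*
   * Bit 31 flags a mixed private/shared range; bits 0-30 keep the
   * existing reference count, so any non-zero value still disallows
   * a large page.
   */
  static inline bool lpage_mixed(const struct kvm_lpage_info *linfo)
  {
	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
  }

  static inline int lpage_disallow_count(const struct kvm_lpage_info *linfo)
  {
	return linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
  }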

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |   8 +++
 arch/x86/kvm/mmu/mmu.c          | 119 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   2 +
 include/linux/kvm_host.h        |  17 +++++
 virt/kvm/kvm_main.c             |  11 ++-
 5 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cfad6ba1a70a..85119ed9527a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
 #define __KVM_HAVE_ZAP_GFN_RANGE
+#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
 
 #define KVM_MAX_VCPUS 1024
 
@@ -945,6 +946,13 @@ struct kvm_vcpu_arch {
 #endif
 };
 
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits will be used as a reference count for other users.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
+#define KVM_LPAGE_COUNT_MAX			((1U << 31) - 1)
+
 struct kvm_lpage_info {
 	int disallow_lpage;
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 08abad4f3e6f..a0f198cede3d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -762,11 +762,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 {
 	struct kvm_lpage_info *linfo;
 	int i;
+	int disallow_count;
 
 	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 		linfo = lpage_info_slot(gfn, slot, i);
+
+		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+		WARN_ON(disallow_count + count < 0 ||
+			disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
 		linfo->disallow_lpage += count;
-		WARN_ON(linfo->disallow_lpage < 0);
 	}
 }
 
@@ -6894,3 +6899,115 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_lpage_recovery_thread)
 		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
 }
+
+static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
+			      gfn_t start, gfn_t end)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	gfn_t gfn = start;
+	void *entry;
+	bool shared, private;
+	bool mixed = false;
+
+	if (attr == KVM_MEM_ATTR_SHARED) {
+		shared = true;
+		private = false;
+	} else {
+		shared = false;
+		private = true;
+	}
+
+	rcu_read_lock();
+	entry = xas_load(&xas);
+	while (gfn < end) {
+		if (xas_retry(&xas, entry))
+			continue;
+
+		KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+		if (entry)
+			shared = true;
+		else
+			private = true;
+
+		if (private && shared) {
+			mixed = true;
+			goto out;
+		}
+
+		entry = xas_next(&xas);
+		gfn++;
+	}
+out:
+	rcu_read_unlock();
+	return mixed;
+}
+
+static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
+{
+	if (mixed)
+		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+	else
+		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void update_mem_lpage_info(struct kvm *kvm,
+				  struct kvm_memory_slot *slot,
+				  unsigned int attr,
+				  gfn_t start, gfn_t end)
+{
+	unsigned long lpage_start, lpage_end;
+	unsigned long gfn, pages, mask;
+	int level;
+
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		pages = KVM_PAGES_PER_HPAGE(level);
+		mask = ~(pages - 1);
+		lpage_start = start & mask;
+		lpage_end = (end - 1) & mask;
+
+		/*
+		 * We only need to scan the head and tail page, for middle pages
+		 * we know they are not mixed.
+		 */
+		update_mixed(lpage_info_slot(lpage_start, slot, level),
+			     mem_attr_is_mixed(kvm, attr, lpage_start,
+							  lpage_start + pages));
+
+		if (lpage_start == lpage_end)
+			return;
+
+		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
+			update_mixed(lpage_info_slot(gfn, slot, level), false);
+
+		update_mixed(lpage_info_slot(lpage_end, slot, level),
+			     mem_attr_is_mixed(kvm, attr, lpage_end,
+							  lpage_end + pages));
+	}
+}
+
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+			      gfn_t start, gfn_t end)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	int i;
+
+	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
+			"Unsupported mem attribute.\n");
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+			slot = iter.slot;
+			start = max(start, slot->base_gfn);
+			end = min(end, slot->base_gfn + slot->npages);
+			if (WARN_ON_ONCE(start >= end))
+				continue;
+
+			update_mem_lpage_info(kvm, slot, attr, start, end);
+		}
+	}
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 081f62ccc9a1..ef11cda6f13f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12321,6 +12321,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
 			linfo[lpages - 1].disallow_lpage = 1;
 		ugfn = slot->userspace_addr >> PAGE_SHIFT;
+		if (kvm_slot_can_be_private(slot))
+			ugfn |= slot->private_offset >> PAGE_SHIFT;
 		/*
 		 * If the gfn and userspace address are not aligned wrt each
 		 * other, disable large page support for this slot.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d65690cae80b..fd36ce6597ad 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2277,4 +2277,21 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+
+#define KVM_MEM_ATTR_SHARED	0x0001
+#define KVM_MEM_ATTR_PRIVATE	0x0002
+
+#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+			      gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+					    gfn_t start, gfn_t end)
+{
+}
+#endif
+
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de5cce8c82c7..97d893f7482c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -938,13 +938,13 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
 #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
-#define KVM_MEM_ATTR_SHARED	0x0001
 static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 				     bool is_private)
 {
 	gfn_t start, end;
 	unsigned long index;
 	void *entry;
+	int attr;
 	int r;
 
 	if (size == 0 || gpa + size < gpa)
@@ -959,7 +959,13 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 	 * Guest memory defaults to private, kvm->mem_attr_array only stores
 	 * shared memory.
 	 */
-	entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
+	if (is_private) {
+		attr = KVM_MEM_ATTR_PRIVATE;
+		entry = NULL;
+	} else {
+		attr = KVM_MEM_ATTR_SHARED;
+		entry = xa_mk_value(KVM_MEM_ATTR_SHARED);
+	}
 
 	for (index = start; index < end; index++) {
 		r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
@@ -969,6 +975,7 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 	}
 
 	kvm_zap_gfn_range(kvm, start, end);
+	kvm_arch_update_mem_attr(kvm, attr, start, end);
 
 	return r;
 err:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH v8 7/8] KVM: Handle page fault for private memory
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (5 preceding siblings ...)
  2022-09-15 14:29 ` [PATCH v8 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-10-14 18:57   ` Sean Christopherson
  2022-09-15 14:29 ` [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
  7 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

A memslot with KVM_MEM_PRIVATE set can include both fd-based private
memory and hva-based shared memory. Architecture code (like TDX code)
can tell whether the ongoing fault is private or not. This patch adds
an 'is_private' field to kvm_page_fault to indicate this, and
architecture code is expected to set it.

To handle page faults for such a memslot, the handling logic differs
depending on whether the fault is private or shared. KVM checks if
'is_private' matches the host's view of the page (this is maintained
in mem_attr_array).
  - For a successful match, the private pfn is obtained with
    inaccessible_get_pfn() from the private fd and the shared pfn is
    obtained with the existing get_user_pages().
  - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
    userspace. Userspace can then convert the memory between
    private/shared in the host's view and retry the access, as in the
    sketch below.
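
A hedged userspace sketch of handling the new exit follows; it is not
part of the patch, and 'vm_fd' plus the vcpu's mmap'ed 'run' structure
are assumed to already exist:

  if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
	struct kvm_enc_region region = {
		.addr = run->memory.gpa,
		.size = run->memory.size,
	};
	/* Flip the range to whichever view the guest actually used. */
	int req = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
		  KVM_MEMORY_ENCRYPT_REG_REGION :	/* to private */
		  KVM_MEMORY_ENCRYPT_UNREG_REGION;	/* to shared */

	ioctl(vm_fd, req, &region);
	/* ... then re-enter the guest with KVM_RUN. */
  }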

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c          | 54 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/mmu_internal.h | 18 +++++++++++
 arch/x86/kvm/mmu/mmutrace.h     |  1 +
 include/linux/kvm_host.h        | 24 +++++++++++++++
 4 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a0f198cede3d..81ab20003824 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3028,6 +3028,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			break;
 	}
 
+	if (kvm_mem_is_private(kvm, gfn))
+		return max_level;
+
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
@@ -4127,6 +4130,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
+static inline u8 order_to_level(int order)
+{
+	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+		return PG_LEVEL_1G;
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+		return PG_LEVEL_2M;
+
+	return PG_LEVEL_4K;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
+{
+	int order;
+	struct kvm_memory_slot *slot = fault->slot;
+
+	if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+		return RET_PF_RETRY;
+
+	fault->max_level = min(order_to_level(order), fault->max_level);
+	fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+	return RET_PF_CONTINUE;
+}
+
 static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_memory_slot *slot = fault->slot;
@@ -4159,6 +4188,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			return RET_PF_EMULATE;
 	}
 
+	if (kvm_slot_can_be_private(slot) &&
+	    fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+		if (fault->is_private)
+			vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+		else
+			vcpu->run->memory.flags = 0;
+		vcpu->run->memory.padding = 0;
+		vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+		vcpu->run->memory.size = PAGE_SIZE;
+		return RET_PF_USER;
+	}
+
+	if (fault->is_private)
+		return kvm_faultin_pfn_private(fault);
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
 					  fault->write, &fault->map_writable,
@@ -4267,7 +4312,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+
+	if (fault->is_private)
+		kvm_private_mem_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
@@ -5543,6 +5592,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 			return -EIO;
 	}
 
+	if (r == RET_PF_USER)
+		return 0;
+
 	if (r < 0)
 		return r;
 	if (r != RET_PF_EMULATE)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 582def531d4d..a55e352246a7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -188,6 +188,7 @@ struct kvm_page_fault {
 
 	/* Derived from mmu and global state.  */
 	const bool is_tdp;
+	const bool is_private;
 	const bool nx_huge_page_workaround_enabled;
 
 	/*
@@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
@@ -252,6 +254,7 @@ enum {
 	RET_PF_RETRY,
 	RET_PF_EMULATE,
 	RET_PF_INVALID,
+	RET_PF_USER,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
 };
@@ -318,4 +321,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#ifndef CONFIG_HAVE_KVM_PRIVATE_MEM
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+					   kvm_pfn_t pfn)
+{
+	WARN_ON_ONCE(1);
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
 TRACE_DEFINE_ENUM(RET_PF_RETRY);
 TRACE_DEFINE_ENUM(RET_PF_EMULATE);
 TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
 TRACE_DEFINE_ENUM(RET_PF_FIXED);
 TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fd36ce6597ad..b9906cdf468b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2292,6 +2292,30 @@ static inline void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
 }
 #endif
 
+static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+	int ret;
+	pfn_t pfnt;
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->private_offset >> PAGE_SHIFT);
+
+	ret = inaccessible_get_pfn(slot->private_file, index, &pfnt, order);
+	*pfn = pfn_t_to_pfn(pfnt);
+	return ret;
+}
+
+static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot,
+					   kvm_pfn_t pfn)
+{
+	inaccessible_put_pfn(slot->private_file, pfn_to_pfn_t(pfn));
+}
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return !xa_load(&kvm->mem_attr_array, gfn);
+}
+
 #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
 
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
                   ` (6 preceding siblings ...)
  2022-09-15 14:29 ` [PATCH v8 7/8] KVM: Handle page fault for private memory Chao Peng
@ 2022-09-15 14:29 ` Chao Peng
  2022-10-04 14:55   ` Jarkko Sakkinen
  2022-10-06  8:55   ` Fuad Tabba
  7 siblings, 2 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-15 14:29 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Chao Peng, Kirill A . Shutemov, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

Expose KVM_MEM_PRIVATE and the memslot fields private_fd/offset to
userspace. KVM will register/unregister the private memslot with the
fd-based memory backing store and respond to invalidation events from
inaccessible_notifier to zap the existing memory mappings in the
secondary page table.

Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
by architecture code, which can turn it on by overriding the default
kvm_arch_has_private_mem().

A 'kvm' reference is added to the memslot structure since in the
inaccessible_notifier callback we can only obtain a memslot reference,
but 'kvm' is needed to do the zapping.
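
A hedged userspace sketch of creating such a memslot (not part of the
patch; MFD_INACCESSIBLE comes from patch 1 of this series, while
'vm_fd', 'shared_va' and 'mem_size' are assumed placeholders and error
handling is elided):

  int fd = memfd_create("guest-private-mem", MFD_INACCESSIBLE);

  struct kvm_userspace_memory_region_ext region_ext = {
	.region = {
		.slot = 0,
		.flags = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0,
		.memory_size = mem_size,
		/* Still used for the shared parts of the slot. */
		.userspace_addr = (__u64)shared_va,
	},
	.private_fd = fd,
	.private_offset = 0,
  };

  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region_ext);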

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |   1 +
 virt/kvm/kvm_main.c      | 116 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 111 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b9906cdf468b..cb4eefac709c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -589,6 +589,7 @@ struct kvm_memory_slot {
 	struct file *private_file;
 	loff_t private_offset;
 	struct inaccessible_notifier notifier;
+	struct kvm *kvm;
 };
 
 static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 97d893f7482c..87e239d35b96 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -983,6 +983,57 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
 		xa_erase(&kvm->mem_attr_array, index);
 	return r;
 }
+
+static void kvm_private_notifier_invalidate(struct inaccessible_notifier *notifier,
+					    pgoff_t start, pgoff_t end)
+{
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
+	gfn_t start_gfn = slot->base_gfn;
+	gfn_t end_gfn = slot->base_gfn + slot->npages;
+
+
+	if (start > base_pgoff)
+		start_gfn = slot->base_gfn + start - base_pgoff;
+
+	if (end < base_pgoff + slot->npages)
+		end_gfn = slot->base_gfn + end - base_pgoff;
+
+	if (start_gfn >= end_gfn)
+		return;
+
+	kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
+}
+
+static struct inaccessible_notifier_ops kvm_private_notifier_ops = {
+	.invalidate = kvm_private_notifier_invalidate,
+};
+
+static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+	slot->notifier.ops = &kvm_private_notifier_ops;
+	inaccessible_register_notifier(slot->private_file, &slot->notifier);
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+	inaccessible_unregister_notifier(slot->private_file, &slot->notifier);
+}
+
+#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
+
+static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
+{
+	WARN_ON_ONCE(1);
+}
+
 #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
@@ -1029,6 +1080,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE) {
+		kvm_private_mem_unregister(slot);
+		fput(slot->private_file);
+	}
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1600,10 +1656,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
 	return false;
 }
 
-static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	if (kvm_arch_has_private_mem(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+#endif
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1679,6 +1741,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 {
 	int r;
 
+	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+		kvm_private_mem_register(new);
+
 	/*
 	 * If dirty logging is disabled, nullify the bitmap; the old bitmap
 	 * will be freed on "commit".  If logging is enabled in both old and
@@ -1707,6 +1772,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
 	if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
 		kvm_destroy_dirty_bitmap(new);
 
+	if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+		kvm_private_mem_unregister(new);
+
 	return r;
 }
 
@@ -2004,7 +2072,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -2023,6 +2091,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
+	if (mem->flags & KVM_MEM_PRIVATE &&
+		(mem->private_offset & (PAGE_SIZE - 1) ||
+		 mem->private_offset > U64_MAX - mem->memory_size))
+		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
 		return -EINVAL;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2061,6 +2133,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2089,10 +2164,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		new->private_file = fget(mem->private_fd);
+		if (!new->private_file) {
+			r = -EINVAL;
+			goto out;
+		}
+		new->private_offset = mem->private_offset;
+	}
+
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (r)
-		kfree(new);
+		goto out;
+
+	return 0;
+
+out:
+	if (new->private_file)
+		fput(new->private_file);
+	kfree(new);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -4747,16 +4839,28 @@ static long kvm_vm_ioctl(struct file *filp,
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
 		struct kvm_user_mem_region mem;
-		unsigned long size = sizeof(struct kvm_userspace_memory_region);
+		unsigned int flags_offset = offsetof(typeof(mem), flags);
+		unsigned long size;
+		u32 flags;
 
 		kvm_sanity_check_user_mem_region_alias();
 
+		memset(&mem, 0, sizeof(mem));
+
 		r = -EFAULT;
-		if (copy_from_user(&mem, argp, size))
+		if (get_user(flags, (u32 __user *)(argp + flags_offset)))
+			goto out;
+
+		if (flags & KVM_MEM_PRIVATE)
+			size = sizeof(struct kvm_userspace_memory_region_ext);
+		else
+			size = sizeof(struct kvm_userspace_memory_region);
+
+		if (copy_from_user(&mem, argp, size))
 			goto out;
 
 		r = -EINVAL;
-		if (mem.flags & KVM_MEM_PRIVATE)
+		if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
 			goto out;
 
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-09-16  9:14   ` Bagas Sanjaya
  2022-09-16  9:53     ` Chao Peng
  2022-09-26 10:26   ` Fuad Tabba
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 97+ messages in thread
From: Bagas Sanjaya @ 2022-09-16  9:14 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

[-- Attachment #1: Type: text/plain, Size: 3627 bytes --]

On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index abd7c32126ce..c1fac1e9f820 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>  
>  ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>  	__u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>  
> +  struct kvm_userspace_memory_region_ext {
> +	struct kvm_userspace_memory_region region;
> +	__u64 private_offset;
> +	__u32 private_fd;
> +	__u32 pad1;
> +	__u64 pad2[14];
> +  };
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
> +  #define KVM_MEM_PRIVATE		(1UL << 2)
>  
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>  
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
> +fields. It also includes additional fields for some specific features. See
> +below description of flags field for more information. It's recommended to use
> +kvm_userspace_memory_region_ext in new userspace code.

Better say "kvm_userspace_memory_region_ext includes all fields of
kvm_userspace_memory_region struct, while also adds additional fields ..."

> +
> +The flags field supports below flags:

s/below/following/

> +
> +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
> +  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
> +

Better say "... For more details, see KVM_GET_DIRTY_LOG."

> +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
> +  make a new slot read-only.  In this case, writes to this memory will be posted
> +  to userspace as KVM_EXIT_MMIO exits.
> +

Better say "if KVM_CAP_READONLY_MEM allows, KVM_MEM_READONLY makes a new
slot read-only ..."

> +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
> +  a file descriptor (fd) and the content of the private memory is invisible to
> +  userspace. In this case, userspace should use private_fd/private_offset in
> +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
> +  guest. Userspace should guarantee not to map the same pfn indicated by
> +  private_fd/private_offset to different gfns with multiple memslots. Failure
> +  to do this may result in undefined behavior.
>  

For the lists above,
s/can be set/

Thanks. 

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-09-15 14:29 ` [PATCH v8 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-09-16  9:17   ` Bagas Sanjaya
  2022-09-16  9:54     ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Bagas Sanjaya @ 2022-09-16  9:17 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

[-- Attachment #1: Type: text/plain, Size: 386 bytes --]

On Thu, Sep 15, 2022 at 10:29:08PM +0800, Chao Peng wrote:
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> +   private memory access when the bit is set otherwise the memory error is
> +   caused by shared memory access when the bit is clear.

s/set otherwise/set. Otherwise,

Thanks.

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-16  9:14   ` Bagas Sanjaya
@ 2022-09-16  9:53     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-16  9:53 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Sep 16, 2022 at 04:14:29PM +0700, Bagas Sanjaya wrote:
> On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index abd7c32126ce..c1fac1e9f820 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >  
> >  ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> >  	__u64 userspace_addr; /* start of the userspace allocated memory */
> >    };
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +	struct kvm_userspace_memory_region region;
> > +	__u64 private_offset;
> > +	__u32 private_fd;
> > +	__u32 pad1;
> > +	__u64 pad2[14];
> > +  };
> > +
> >    /* for kvm_memory_region::flags */
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> > +  #define KVM_MEM_PRIVATE		(1UL << 2)
> >  
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> >  be identical.  This allows large pages in the guest to be backed by large
> >  pages in the host.
> >  
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only.  In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
> > +fields. It also includes additional fields for some specific features. See
> > +below description of flags field for more information. It's recommended to use
> > +kvm_userspace_memory_region_ext in new userspace code.
> 
> Better say "kvm_userspace_memory_region_ext includes all fields of
> kvm_userspace_memory_region struct, while also adds additional fields ..."
> 
> > +
> > +The flags field supports below flags:
> 
> s/below/following/
> 
> > +
> > +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
> > +  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
> > +
> 
> Better say "... For more details, see KVM_GET_DIRTY_LOG."
> 
> > +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
> > +  make a new slot read-only.  In this case, writes to this memory will be posted
> > +  to userspace as KVM_EXIT_MMIO exits.
> > +
> 
> Better say "if KVM_CAP_READONLY_MEM allows, KVM_MEM_READONLY makes a new
> slot read-only ..."
> 
> > +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
> > +  a file descriptor (fd) and the content of the private memory is invisible to
> > +  userspace. In this case, userspace should use private_fd/private_offset in
> > +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
> > +  guest. Userspace should guarantee not to map the same pfn indicated by
> > +  private_fd/private_offset to different gfns with multiple memslots. Failure
> > +  to do this may result in undefined behavior.
> >  
> 
> For the lists above,
> s/can be set/

It all looks good, thanks!

> 
> Thanks. 
> 
> -- 
> An old man doll... just what I always wanted! - Clara



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
  2022-09-16  9:17   ` Bagas Sanjaya
@ 2022-09-16  9:54     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-16  9:54 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Sep 16, 2022 at 04:17:48PM +0700, Bagas Sanjaya wrote:
> On Thu, Sep 15, 2022 at 10:29:08PM +0800, Chao Peng wrote:
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> 
> s/set otherwise/set. Otherwise,

Thanks.

> 
> Thanks.
> 
> -- 
> An old man doll... just what I always wanted! - Clara



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
@ 2022-09-19  9:12   ` David Hildenbrand
  2022-09-19 19:10     ` Sean Christopherson
  2022-09-23  0:58     ` Kirill A . Shutemov
  2022-09-22 13:26   ` Wang, Wei W
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 97+ messages in thread
From: David Hildenbrand @ 2022-09-19  9:12 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On 15.09.22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
> 
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.

Initially my thought was that this whole inaccessible thing is TDX 
specific and there is no need to force that on other mechanisms. That's 
why I suggested to not expose this to user space but handle the notifier 
requirements internally.

IIUC now, protected KVM has similar demands. Either access (read/write) 
of guest RAM would result in a fault and possibly crash the hypervisor 
(at least not the whole machine IIUC).

> 
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.

With secretmem we decided to not add such "concept switch" flags and 
instead use a dedicated syscall.

What about memfd_inaccessible()? Especially, sealing and hugetlb are not 
even supported and it might take a while to support either.
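
Purely hypothetical sketch for comparison -- the syscall name and
semantics below are invented here, nothing of the sort exists today:

  /* Hypothetical dedicated syscall instead of an MFD_ flag. */
  int fd = memfd_inaccessible(0);

  ftruncate(fd, size);	/* size the guest memory */
  /* mmap(), read() and write() on fd would all fail. */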


> 
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
> 
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
> 
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier; it then gets a chance to remove any mapped entries
> of the range in the secondary page tables.
> 
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implements tmpfs. The allocated memory is currently marked as
> unmovable and unevictable; this is required for current confidential
> usage. But in future this might be changed.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>   include/linux/memfd.h      |  24 ++++
>   include/uapi/linux/magic.h |   1 +
>   include/uapi/linux/memfd.h |   1 +
>   mm/Makefile                |   2 +-
>   mm/memfd.c                 |  25 ++++-
>   mm/memfd_inaccessible.c    | 219 +++++++++++++++++++++++++++++++++++++
>   6 files changed, 270 insertions(+), 2 deletions(-)
>   create mode 100644 mm/memfd_inaccessible.c
> 
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
>   #define __LINUX_MEMFD_H
>   
>   #include <linux/file.h>
> +#include <linux/pfn_t.h>
>   
>   #ifdef CONFIG_MEMFD_CREATE
>   extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
>   }
>   #endif
>   
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> +	void (*invalidate)(struct inaccessible_notifier *notifier,
> +			   pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> +	struct list_head list;
> +	const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> +				    struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> +				      struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
>   #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>   #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>   #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>   #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */


[...]

> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order)
> +{
> +	struct inaccessible_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	*pfn = page_to_pfn_t(page);
> +	*order = thp_order(compound_head(page));
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +	struct page *page = pfn_t_to_page(pfn);
> +
> +	if (WARN_ON_ONCE(!page))
> +		return;
> +
> +	put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);

Sorry, I missed your reply regarding get/put interface.

https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/

"We have a design assumption that somedays this can even support 
non-page based backing stores."

As long as there is no such user in sight (especially how to get the 
memfd from even allocating such memory which will require bigger 
changes), I prefer to keep it simple here and work on pages/folios. No 
need to over-complicate it for now.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-19  9:12   ` David Hildenbrand
@ 2022-09-19 19:10     ` Sean Christopherson
  2022-09-21 21:10       ` Andy Lutomirski
  2022-09-23 15:19       ` Fuad Tabba
  2022-09-23  0:58     ` Kirill A . Shutemov
  1 sibling, 2 replies; 97+ messages in thread
From: Sean Christopherson @ 2022-09-19 19:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier,
	Fuad Tabba

+Will, Marc and Fuad (apologies if I missed other pKVM folks)

On Mon, Sep 19, 2022, David Hildenbrand wrote:
> On 15.09.22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> > 
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> 
> Initially my thought was that this whole inaccessible thing is TDX specific
> and there is no need to force that on other mechanisms. That's why I
> suggested to not expose this to user space but handle the notifier
> requirements internally.
> 
> IIUC now, protected KVM has similar demands. Either access (read/write) of
> guest RAM would result in a fault and possibly crash the hypervisor (at
> least not the whole machine IIUC).

Yep.  The missing piece for pKVM is the ability to convert from shared to private
while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
for processing in the protected VM.  Thoughts on this at the bottom.

> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> 
> With secretmem we decided to not add such "concept switch" flags and instead
> use a dedicated syscall.
>

I have no personal preference whatsoever between a flag and a dedicated syscall,
but a dedicated syscall does seem like it would give the kernel a bit more
flexibility.

> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> even supported and it might take a while to support either.

Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
come sooner than later.  "inaccessible" in quotes because we might want to choose
a less binary name, e.g. "restricted"?.

Regarding pKVM's use case, with the shim approach I believe this can be done by
allowing userspace to mmap() the "hidden" memfd, but with a ton of restrictions
piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM could
tightly control usage without taking on too much complexity in the kernel, but
working through things, routing the behavior through the shim itself might not be
all that horrific.

IIRC, we discarded the idea of allowing userspace to map the "private" fd because
things got too complex, but with the shim it doesn't seem _that_ bad.

E.g. on the memfd side:

  1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
     mapping is all or nothing.

  2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
     the restricted memfd.

  3. Add notifier hooks to allow downstream users to further restrict things.

  4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
     one shot.

  5. Require that there are no outstanding references at munmap().  Or if this
     can't be guaranteed by userspace, maybe add some way for userspace to wait
     until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
     to do an expensive check every time.
     
  static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
  {
	if (vma->vm_pgoff)
		return -EINVAL;

	if ((vma->vm_end - vma->vm_start) != <file size>)
		return -EINVAL;

	mutex_lock(&data->lock);

	if (data->has_mapping) {
		r = -EINVAL;
		goto err;
	}
	list_for_each_entry(notifier, &data->notifiers, list) {
		r = notifier->ops->mmap_start(notifier, ...);
		if (r)
			goto abort;
	}

	notifier->ops->mmap_end(notifier, ...);
	mutex_unlock(&data->lock);
	return 0;

  abort:
	list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
		notifier->ops->mmap_abort(notifier, ...);
  err:
	mutex_unlock(&data->lock);
	return r;
  }

  static void memfd_restricted_close(struct vm_area_struct *vma)
  {
	mutex_lock(...);

	/*
	 * Destroy the memfd and disable all future accesses if there are
	 * outstanding refcounts (or other unsatisfied restrictions?).
	 */
	if (<outstanding references> || ???)
		memfd_restricted_destroy(...);
	else
		data->has_mapping = false;

	mutex_unlock(...);
  }

  static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
  {
	return -EINVAL;
  }

  static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
  {
	return -EINVAL;
  }

Then on the KVM side, its mmap_start() + mmap_end() sequence would:

  1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
     memory into the guest (after pre-boot phase).

  2. Be mutually exclusive with shared<=>private conversions, and be allowed if
     and only if the entire gfn range of the associated memslot is shared.
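
A very rough sketch of KVM's side of that (everything below is hypothetical:
slot_to_kvm(), kvm_arch_private_mem_mappable() and kvm_slot_all_shared() are
made-up helpers, the latter presumably walking the shared ranges tracked by
patch 5; the container_of() works because patch 2 embeds the notifier in the
memslot):

  static int kvm_private_mem_mmap_start(struct inaccessible_notifier *notifier)
  {
	struct kvm_memory_slot *slot = container_of(notifier,
						    struct kvm_memory_slot,
						    notifier);
	struct kvm *kvm = slot_to_kvm(slot);		/* hypothetical */

	/* #1: no mmap() once the guest can no longer accept non-zero memory. */
	if (!kvm_arch_private_mem_mappable(kvm))	/* hypothetical arch hook */
		return -EOPNOTSUPP;

	/* #2: allow the mapping iff the entire memslot is currently shared. */
	if (!kvm_slot_all_shared(kvm, slot))		/* hypothetical helper */
		return -EBUSY;

	/* Also #2: block shared<=>private conversions until mmap_end()/munmap(). */
	slot->mapping_active = true;			/* hypothetical flag */
	return 0;
  }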

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-19 19:10     ` Sean Christopherson
@ 2022-09-21 21:10       ` Andy Lutomirski
  2022-09-22 13:23         ` Wang, Wei W
  2022-09-23 15:20         ` Fuad Tabba
  2022-09-23 15:19       ` Fuad Tabba
  1 sibling, 2 replies; 97+ messages in thread
From: Andy Lutomirski @ 2022-09-21 21:10 UTC (permalink / raw)
  To: Sean Christopherson, David Hildenbrand
  Cc: Chao Peng, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, Michal Hocko,
	Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier, Fuad Tabba

(Please excuse any formatting disasters.  My internet went out as I was composing this, and I did my best to rescue it.)

On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
>> On 15.09.22 16:29, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> > 
>> > KVM can use memfd-provided memory for guest memory. For normal userspace
>> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
>> > virtual address space and then tells KVM to use the virtual address to
>> > setup the mapping in the secondary page table (e.g. EPT).
>> > 
>> > With confidential computing technologies like Intel TDX, the
>> > memfd-provided memory may be encrypted with special key for special
>> > software domain (e.g. KVM guest) and is not expected to be directly
>> > accessed by userspace. Precisely, userspace access to such encrypted
>> > memory may lead to host crash so it should be prevented.
>> 
>> Initially my thought was that this whole inaccessible thing is TDX specific
>> and there is no need to force that on other mechanisms. That's why I
>> suggested to not expose this to user space but handle the notifier
>> requirements internally.
>> 
>> IIUC now, protected KVM has similar demands. Either access (read/write) of
>> guest RAM would result in a fault and possibly crash the hypervisor (at
>> least not the whole machine IIUC).
>
> Yep.  The missing piece for pKVM is the ability to convert from shared 
> to private
> while preserving the contents, e.g. to hand off a large buffer 
> (hundreds of MiB)
> for processing in the protected VM.  Thoughts on this at the bottom.
>
>> > This patch introduces userspace inaccessible memfd (created with
>> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
>> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
>> > in-kernel interface so KVM can directly interact with core-mm without
>> > the need to map the memory into KVM userspace.
>> 
>> With secretmem we decided to not add such "concept switch" flags and instead
>> use a dedicated syscall.
>>
>
> I have no personal preference whatsoever between a flag and a dedicated syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.

The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem or similar.  But if we need flags or other details in the future, maybe this isn't ideal.

>
>> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
>> even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory 
> needs to
> come sooner than later.  "inaccessible" in quotes because we might want 
> to choose
> a less binary name, e.g. "restricted"?
>
> Regarding pKVM's use case, with the shim approach I believe this can be done by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM 
> could tightly control usage without taking on too much complexity in the 
> kernel, but
> working through things, routing the behavior through the shim itself 
> might not be
> all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" 
> fd because
> things got too complex, but with the shim it doesn't seem _that_ bad.

What's the exact use case?  Is it just to pre-populate the memory?

>
> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
>      mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
>      the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
>      one shot.
>
>   5. Require that there are no outstanding references at munmap().  Or if this
>      can't be guaranteed by userspace, maybe add some way for userspace to wait
>      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
>      to do an expensive check every time.

Hmm.  I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure.  When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning).  Then have an ioctl to switch to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range.

As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
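
If it helps, a rough sketch of that mode-switch ioctl's guts (names and the
faults_disabled flag are invented; this assumes the shim keeps its state in
f_mapping->private_data as in the current patch, with a data->lock as in
Sean's sketch):

  static long restricted_memfd_switch_mode(struct file *file)
  {
	struct inaccessible_data *data = file->f_mapping->private_data;

	mutex_lock(&data->lock);
	/* the shim's vm_ops->fault handler would check this and refuse service */
	data->faults_disabled = true;
	mutex_unlock(&data->lock);

	/* zap every existing userspace PTE for the whole file */
	unmap_mapping_range(file->f_mapping, 0, 0, 1);

	return 0;
  }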

^ permalink raw reply	[flat|nested] 97+ messages in thread

* RE: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-21 21:10       ` Andy Lutomirski
@ 2022-09-22 13:23         ` Wang, Wei W
  2022-09-23 15:20         ` Fuad Tabba
  1 sibling, 0 replies; 97+ messages in thread
From: Wang, Wei W @ 2022-09-22 13:23 UTC (permalink / raw)
  To: Lutomirski, Andy, Christopherson,, Sean, David Hildenbrand
  Cc: Chao Peng, kvm list, Linux Kernel Mailing List, linux-mm,
	linux-fsdevel, Linux API, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Hansen, Dave, Andi Kleen, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, Hocko, Michal,
	Muchun Song, Will Deacon, Marc Zyngier, Fuad Tabba

On Thursday, September 22, 2022 5:11 AM, Andy Lutomirski wrote:
> 
> (Please excuse any formatting disasters.  My internet went out as I was
> composing this, and I did my best to rescue it.)
> 
> On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> > +Will, Marc and Fuad (apologies if I missed other pKVM folks)
> >
> > On Mon, Sep 19, 2022, David Hildenbrand wrote:
> >> On 15.09.22 16:29, Chao Peng wrote:
> >> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >> >
> >> > KVM can use memfd-provided memory for guest memory. For normal
> >> > userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the
> >> > memfd into its virtual address space and then tells KVM to use the
> >> > virtual address to setup the mapping in the secondary page table (e.g.
> EPT).
> >> >
> >> > With confidential computing technologies like Intel TDX, the
> >> > memfd-provided memory may be encrypted with special key for special
> >> > software domain (e.g. KVM guest) and is not expected to be directly
> >> > accessed by userspace. Precisely, userspace access to such
> >> > encrypted memory may lead to host crash so it should be prevented.
> >>
> >> Initially my thought was that this whole inaccessible thing is TDX
> >> specific and there is no need to force that on other mechanisms.
> >> That's why I suggested to not expose this to user space but handle
> >> the notifier requirements internally.
> >>
> >> IIUC now, protected KVM has similar demands. Either access
> >> (read/write) of guest RAM would result in a fault and possibly crash
> >> the hypervisor (at least not the whole machine IIUC).
> >
> > Yep.  The missing piece for pKVM is the ability to convert from shared
> > to private while preserving the contents, e.g. to hand off a large
> > buffer (hundreds of MiB) for processing in the protected VM.  Thoughts
> > on this at the bottom.
> >
> >> > This patch introduces userspace inaccessible memfd (created with
> >> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace
> >> > through ordinary MMU access (e.g. read/write/mmap) but can be
> >> > accessed via in-kernel interface so KVM can directly interact with
> >> > core-mm without the need to map the memory into KVM userspace.
> >>
> >> With secretmem we decided to not add such "concept switch" flags and
> >> instead use a dedicated syscall.
> >>
> >
> > I have no personal preference whatsoever between a flag and a
> > dedicated syscall, but a dedicated syscall does seem like it would
> > give the kernel a bit more flexibility.
> 
> The third option is a device node, e.g. /dev/kvm_secretmem or
> /dev/kvm_tdxmem or similar.  But if we need flags or other details in the
> future, maybe this isn't ideal.
> 
> >
> >> What about memfd_inaccessible()? Especially, sealing and hugetlb are
> >> not even supported and it might take a while to support either.
> >
> > Don't know about sealing, but hugetlb support for "inaccessible"
> > memory needs to come sooner than later.  "inaccessible" in quotes
> > because we might want to choose a less binary name, e.g.
> > "restricted"?.
> >
> > Regarding pKVM's use case, with the shim approach I believe this can
> > be done by allowing userspace mmap() the "hidden" memfd, but with a
> > ton of restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly control usage without taking on too much
> > complexity in the kernel, but working through things, routing the
> > behavior through the shim itself might not be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
> 
> What's the exact use case?  Is it just to pre-populate the memory?

Add one more use case here. For TDX live migration support, on the destination side,
we map the private fd during migration to store the encrypted private memory data sent
from the source, and at the end of migration, we unmap it and make it inaccessible
before resuming the TD.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* RE: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
  2022-09-19  9:12   ` David Hildenbrand
@ 2022-09-22 13:26   ` Wang, Wei W
  2022-09-22 19:49     ` Sean Christopherson
  2022-09-30 16:14   ` Fuad Tabba
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 97+ messages in thread
From: Wang, Wei W @ 2022-09-22 13:26 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Christopherson,,
	Sean, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, Lutomirski, Andy, Nakajima, Jun,
	Hansen, Dave, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, Hocko, Michal, Muchun Song

On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order)

Better to remove "order" from this interface?
Some callers only need to get the pfn and shouldn't have to bother with
defining and passing something unused. Callers who need the "order" can
easily get it via thp_order(pfn_to_page(pfn)) on their own.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-22 13:26   ` Wang, Wei W
@ 2022-09-22 19:49     ` Sean Christopherson
  2022-09-23  0:53       ` Kirill A . Shutemov
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-09-22 19:49 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, Lutomirski, Andy, Nakajima, Jun,
	Hansen, Dave, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, Hocko, Michal, Muchun Song

On Thu, Sep 22, 2022, Wang, Wei W wrote:
> On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +			 int *order)
> 
> Better to remove "order" from this interface?

Hard 'no'.

> Some callers only need to get pfn, and no need to bother with
> defining and inputting something unused. For callers who need the "order",
> can easily get it via thp_order(pfn_to_page(pfn)) on their own.

That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
struct page is a transparent huge page.  That might be true for the current
implementation, but it most certainly will not always be true.

KVM originally did things like this, where there was dedicated code for THP vs.
HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
KVM to rediscover the info on its own.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-22 19:49     ` Sean Christopherson
@ 2022-09-23  0:53       ` Kirill A . Shutemov
  2022-09-23 15:20         ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Kirill A . Shutemov @ 2022-09-23  0:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Wang, Wei W, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Lutomirski, Andy, Nakajima, Jun, Hansen, Dave, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	Hocko, Michal, Muchun Song

On Thu, Sep 22, 2022 at 07:49:18PM +0000, Sean Christopherson wrote:
> On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > +			 int *order)
> > 
> > Better to remove "order" from this interface?
> 
> Hard 'no'.
> 
> > Some callers only need to get pfn, and no need to bother with
> > defining and inputting something unused. For callers who need the "order",
> > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> 
> That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
> struct page is a transparent huge page.  That might be true for the current
> implementation, but it most certainly will not always be true.
> 
> KVM originally did things like this, where there was dedicated code for THP vs.
> HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
> mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
> KVM to rediscover the info on its own.

I guess we can allow the order pointer to be NULL to cover callers that don't
need to know the order. Is it useful?
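
Something like the below (untested) on top of this patch, i.e. just guard the
write:

  int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
			   int *order)
  {
	struct inaccessible_data *data = file->f_mapping->private_data;
	struct file *memfd = data->memfd;
	struct page *page;
	int ret;

	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
	if (ret)
		return ret;

	*pfn = page_to_pfn_t(page);
	/* callers that don't care about the mapping size can pass NULL */
	if (order)
		*order = thp_order(compound_head(page));
	SetPageUptodate(page);
	unlock_page(page);

	return 0;
  }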

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-19  9:12   ` David Hildenbrand
  2022-09-19 19:10     ` Sean Christopherson
@ 2022-09-23  0:58     ` Kirill A . Shutemov
  2022-09-26 10:35       ` David Hildenbrand
  1 sibling, 1 reply; 97+ messages in thread
From: Kirill A . Shutemov @ 2022-09-23  0:58 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, David Hildenbrand
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..9d066be3d7e8 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >   #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
> >   #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
> >   #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> > +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
> 
> 
> [...]
> 
> > +
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +			 int *order)
> > +{
> > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	struct page *page;
> > +	int ret;
> > +
> > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > +	if (ret)
> > +		return ret;
> > +
> > +	*pfn = page_to_pfn_t(page);
> > +	*order = thp_order(compound_head(page));
> > +	SetPageUptodate(page);
> > +	unlock_page(page);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > +
> > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > +{
> > +	struct page *page = pfn_t_to_page(pfn);
> > +
> > +	if (WARN_ON_ONCE(!page))
> > +		return;
> > +
> > +	put_page(page);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> 
> Sorry, I missed your reply regarding get/put interface.
> 
> https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
> 
> "We have a design assumption that somedays this can even support non-page
> based backing stores."
> 
> As long as there is no such user in sight (especially how to get the memfd
> from even allocating such memory which will require bigger changes), I
> prefer to keep it simple here and work on pages/folios. No need to
> over-complicate it for now.

Sean, Paolo, what is your take on this? Do you have a concrete use case of
a pageless backend for the mechanism in sight? Maybe DAX?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-19 19:10     ` Sean Christopherson
  2022-09-21 21:10       ` Andy Lutomirski
@ 2022-09-23 15:19       ` Fuad Tabba
  2022-09-26 14:23         ` Chao Peng
  1 sibling, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-09-23 15:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier

Hi,

On Mon, Sep 19, 2022 at 8:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
> > On 15.09.22 16:29, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > >
> > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > virtual address space and then tells KVM to use the virtual address to
> > > setup the mapping in the secondary page table (e.g. EPT).
> > >
> > > With confidential computing technologies like Intel TDX, the
> > > memfd-provided memory may be encrypted with special key for special
> > > software domain (e.g. KVM guest) and is not expected to be directly
> > > accessed by userspace. Precisely, userspace access to such encrypted
> > > memory may lead to host crash so it should be prevented.
> >
> > Initially my thought was that this whole inaccessible thing is TDX specific
> > and there is no need to force that on other mechanisms. That's why I
> > suggested to not expose this to user space but handle the notifier
> > requirements internally.
> >
> > IIUC now, protected KVM has similar demands. Either access (read/write) of
> > guest RAM would result in a fault and possibly crash the hypervisor (at
> > least not the whole machine IIUC).
>
> Yep.  The missing piece for pKVM is the ability to convert from shared to private
> while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
> for processing in the protected VM.  Thoughts on this at the bottom.

Just wanted to mention that for pKVM (arm64), this wouldn't crash the
hypervisor. A userspace access would crash the userspace process since
the hypervisor would inject a fault back. Because of that, making it
inaccessible from userspace is good to have, but not really vital for
pKVM. What is important for pKVM is that the guest private memory is
not GUP'able by the host. This is because if it were, it might be
possible for a malicious userspace process (e.g., a malicious vmm) to
trick the host kernel into accessing guest private memory in a context
where it isn’t prepared to handle the fault injected by the
hypervisor. This of course might crash the host.

> > > This patch introduces userspace inaccessible memfd (created with
> > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > in-kernel interface so KVM can directly interact with core-mm without
> > > the need to map the memory into KVM userspace.
> >
> > With secretmem we decided to not add such "concept switch" flags and instead
> > use a dedicated syscall.
> >
>
> I have no personal preference whatsoever between a flag and a dedicated syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.
>
> > What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> > even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
> come sooner than later.  "inaccessible" in quotes because we might want to choose
> a less binary name, e.g. "restricted"?
>
> Regarding pKVM's use case, with the shim approach I believe this can be done by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM could
> tightly control usage without taking on too much complexity in the kernel, but
> working through things, routing the behavior through the shim itself might not be
> all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> things got too complex, but with the shim it doesn't seem _that_ bad.
>
> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
>      mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
>      the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
>      one shot.
>
>   5. Require that there are no outstanding references at munmap().  Or if this
>      can't be guaranteed by userspace, maybe add some way for userspace to wait
>      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
>      to do an expensive check every time.
>
>   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
>   {
>         if (vma->vm_pgoff)
>                 return -EINVAL;
>
>         if ((vma->vm_end - vma->vm_start) != <file size>)
>                 return -EINVAL;
>
>         mutex_lock(&data->lock);
>
>         if (data->has_mapping) {
>                 r = -EINVAL;
>                 goto err;
>         }
>         list_for_each_entry(notifier, &data->notifiers, list) {
>                 r = notifier->ops->mmap_start(notifier, ...);
>                 if (r)
>                         goto abort;
>         }
>
>         notifier->ops->mmap_end(notifier, ...);
>         mutex_unlock(&data->lock);
>         return 0;
>
>   abort:
>         list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
>                 notifier->ops->mmap_abort(notifier, ...);
>   err:
>         mutex_unlock(&data->lock);
>         return r;
>   }
>
>   static void memfd_restricted_close(struct vm_area_struct *vma)
>   {
>         mutex_lock(...);
>
>         /*
>          * Destroy the memfd and disable all future accesses if there are
>          * outstanding refcounts (or other unsatisfied restrictions?).
>          */
>         if (<outstanding references> || ???)
>                 memfd_restricted_destroy(...);
>         else
>                 data->has_mapping = false;
>
>         mutex_unlock(...);
>   }
>
>   static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
>   {
>         return -EINVAL;
>   }
>
>   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
>   {
>         return -EINVAL;
>   }
>
> Then on the KVM side, its mmap_start() + mmap_end() sequence would:
>
>   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
>      memory into the guest (after pre-boot phase).
>
>   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
>      and only if the entire gfn range of the associated memslot is shared.

In general I think that this would work with pKVM. However, limiting
private<->shared conversions to the granularity of a whole memslot
might be difficult to handle in pKVM, since the guest doesn't have the
concept of memslots. For example, in pKVM right now, when a guest
shares back its restricted DMA pool with the host it does so at the
page-level. pKVM would also need a way to make an fd accessible again
when shared back, which I think isn't possible with this patch.

You were initially considering a KVM ioctl for mapping, which might be
better suited for this since KVM knows which pages are shared and
which ones are private. So routing things through KVM might simplify
things and allow it to enforce all the necessary restrictions (e.g.,
private memory cannot be mapped). What do you think?

Thanks,
/fuad

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-21 21:10       ` Andy Lutomirski
  2022-09-22 13:23         ` Wang, Wei W
@ 2022-09-23 15:20         ` Fuad Tabba
  1 sibling, 0 replies; 97+ messages in thread
From: Fuad Tabba @ 2022-09-23 15:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, David Hildenbrand, Chao Peng, kvm list,
	Linux Kernel Mailing List, linux-mm, linux-fsdevel, Linux API,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	the arch/x86 maintainers, H. Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A. Shutemov,
	Nakajima, Jun, Dave Hansen, Andi Kleen, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, Michal Hocko,
	Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier

Hi,

<...>

> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly control usage without taking on too much complexity in the
> > kernel, but
> > working through things, routing the behavior through the shim itself
> > might not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
>
> What's the exact use case?  Is it just to pre-populate the memory?

Prepopulate memory and access memory that could go back and forth from
being shared to being private.

Cheers,
/fuad



> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> >      mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> >      the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> >      one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or if this
> >      can't be guaranteed by userspace, maybe add some way for userspace to wait
> >      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
> >      to do an expensive check every time.
>
> Hmm.  I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure.  When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning).  Then have an ioctl to switch to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-23  0:53       ` Kirill A . Shutemov
@ 2022-09-23 15:20         ` Fuad Tabba
  0 siblings, 0 replies; 97+ messages in thread
From: Fuad Tabba @ 2022-09-23 15:20 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Sean Christopherson, Wang, Wei W, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Lutomirski, Andy, Nakajima, Jun,
	Hansen, Dave, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, Hocko, Michal, Muchun Song

Hi,

On Fri, Sep 23, 2022 at 1:53 AM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Thu, Sep 22, 2022 at 07:49:18PM +0000, Sean Christopherson wrote:
> > On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > +                  int *order)
> > >
> > > Better to remove "order" from this interface?
> >
> > Hard 'no'.
> >
> > > Some callers only need to get pfn, and no need to bother with
> > > defining and inputting something unused. For callers who need the "order",
> > > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> >
> > That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
> > struct page is a transparent huge page.  That might be true for the current
> > implementation, but it most certainly will not always be true.
> >
> > KVM originally did things like this, where there was dedicated code for THP vs.
> > HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
> > mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
> > KVM to rediscover the info on its own.
>
> I guess we can allow the order pointer to be NULL to cover callers that don't
> need to know the order. Is it useful?

I think that would be useful. In pKVM we don't need to know the order,
and I had to use a dummy variable when porting V7.
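
E.g. the pKVM call site would go from

  pfn_t pfn;
  int order;	/* dummy, pKVM ignores it */
  int ret;

  ret = inaccessible_get_pfn(file, offset, &pfn, &order);

to simply

  ret = inaccessible_get_pfn(file, offset, &pfn, NULL);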

Cheers,
/fuad


> --
>   Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-09-16  9:14   ` Bagas Sanjaya
@ 2022-09-26 10:26   ` Fuad Tabba
  2022-09-26 14:04     ` Chao Peng
  2022-09-29 22:45   ` Isaku Yamahata
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-09-26 10:26 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi Chao,

On Thu, Sep 15, 2022 at 3:35 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the VM itself. We call such memory
> private memory. It's valueless and sometimes can cause problems to allow
> userspace to access guest private memory. This patch extends the KVM
> memslot definition so that guest private memory can be provided through
> an inaccessible_notifier enlightened file descriptor (fd), without being
> mmaped into userspace.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields private_fd/private_offset to allow
> userspace to specify that guest private memory provided from the
> private_fd and guest_phys_addr mapped at the private_offset of the
> private_fd, spanning a range of memory_size.
>
> The extended memslot can still have the userspace_addr(hva). When in use, a
> single memslot can maintain both private memory through private
> fd(private_fd/private_offset) and shared memory through
> hva(userspace_addr). Whether the private or shared part is visible to
> guest is maintained by other KVM code.
>
> Since there is no userspace mapping for the private fd, we cannot
> get_user_pages() to get the pfn in KVM, instead we add a new
> inaccessible_notifier in the internal memslot structure and rely on it
> to get pfn by interacting with the memory file systems.
>
> Together with the change, a new config HAVE_KVM_PRIVATE_MEM is added and
> right now it is selected on X86_64 for Intel TDX usage.
>
> To make code maintenance easy, internally we use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 38 +++++++++++++++++++++-----
>  arch/x86/kvm/Kconfig           |  1 +
>  arch/x86/kvm/x86.c             |  2 +-
>  include/linux/kvm_host.h       | 13 +++++++--
>  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
>  virt/kvm/Kconfig               |  3 +++
>  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
>  7 files changed, 116 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index abd7c32126ce..c1fac1e9f820 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>
>  ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>
> +  struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 private_offset;
> +       __u32 private_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +  };
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
>    #define KVM_MEM_READONLY     (1UL << 1)
> +  #define KVM_MEM_PRIVATE              (1UL << 2)
>
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
> +fields. It also includes additional fields for some specific features. See
> +below description of flags field for more information. It's recommended to use
> +kvm_userspace_memory_region_ext in new userspace code.
> +
> +The flags field supports below flags:
> +
> +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
> +  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
> +
> +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
> +  make a new slot read-only.  In this case, writes to this memory will be posted
> +  to userspace as KVM_EXIT_MMIO exits.
> +
> +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
> +  a file descirptor(fd) and the content of the private memory is invisible to

s/descirptor/descriptor

> +  userspace. In this case, userspace should use private_fd/private_offset in
> +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
> +  guest. Userspace should guarantee not to map the same pfn indicated by
> +  private_fd/private_offset to different gfns with multiple memslots. Failure
> +  to do so may result in undefined behavior.
>
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index e3cbd7706136..31db64ec0b33 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -48,6 +48,7 @@ config KVM
>         select SRCU
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
> +       select HAVE_KVM_PRIVATE_MEM if X86_64
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7374d768296..081f62ccc9a1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12183,7 +12183,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>         }
>
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -               struct kvm_userspace_memory_region m;
> +               struct kvm_user_mem_region m;
>
>                 m.slot = id | (i << 16);
>                 m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f4519d3689e1..eac1787b899b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -44,6 +44,7 @@
>
>  #include <asm/kvm_host.h>
>  #include <linux/kvm_dirty_ring.h>
> +#include <linux/memfd.h>
>
>  #ifndef KVM_MAX_VCPU_IDS
>  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> @@ -576,8 +577,16 @@ struct kvm_memory_slot {
>         u32 flags;
>         short id;
>         u16 as_id;
> +       struct file *private_file;
> +       loff_t private_offset;
> +       struct inaccessible_notifier notifier;
>  };
>
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> +       return slot && (slot->flags & KVM_MEM_PRIVATE);
> +}
> +
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
>  {
>         return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -1104,9 +1113,9 @@ enum kvm_mr_change {
>  };
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem);
> +                         const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem);
> +                           const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index eed0315a77a6..3ef462fb3b2a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>
> +struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 private_offset;
> +       __u32 private_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/*
> + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> + * all fields from the top-level "extended" region.
> + */
> +struct kvm_user_mem_region {
> +       __u32 slot;
> +       __u32 flags;
> +       __u64 guest_phys_addr;
> +       __u64 memory_size;
> +       __u64 userspace_addr;
> +       __u64 private_offset;
> +       __u32 private_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +#endif
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
>  #define KVM_MEM_READONLY       (1UL << 1)
> +#define KVM_MEM_PRIVATE                (1UL << 2)
>
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index a8c5c9f06b3c..ccaff13cc5b8 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK
>
>  config HAVE_KVM_PM_NOTIFIER
>         bool
> +
> +config HAVE_KVM_PRIVATE_MEM
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 584a5bab3af3..12dc0dc57b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   * Must be called holding kvm->slots_lock for write.
>   */
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem)
> +                           const struct kvm_user_mem_region *mem)
>  {
>         struct kvm_memory_slot *old, *new;
>         struct kvm_memslots *slots;
> @@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem)
> +                         const struct kvm_user_mem_region *mem)
>  {
>         int r;
>
> @@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
>  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -                                         struct kvm_userspace_memory_region *mem)
> +                                         struct kvm_user_mem_region *mem)
>  {
>         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
> @@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region, field));      \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> +do {                                                                                   \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
> -               struct kvm_userspace_memory_region kvm_userspace_mem;
> +               struct kvm_user_mem_region mem;
> +               unsigned long size = sizeof(struct kvm_userspace_memory_region);

nit: should this be sizeof(mem)? That's more similar to the
existing code and makes it dependent on the size of mem regardless of
possible changes to its type in the future.

> +
> +               kvm_sanity_check_user_mem_region_alias();
>
>                 r = -EFAULT;
> -               if (copy_from_user(&kvm_userspace_mem, argp,
> -                                               sizeof(kvm_userspace_mem)))
> +               if (copy_from_user(&mem, argp, size);

It gets fixed in a future patch, but the ; should be a ).

Cheers,
/fuad

> +                       goto out;
> +
> +               r = -EINVAL;
> +               if (mem.flags & KVM_MEM_PRIVATE)
>                         goto out;
>
> -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
>         case KVM_GET_DIRTY_LOG: {
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-23  0:58     ` Kirill A . Shutemov
@ 2022-09-26 10:35       ` David Hildenbrand
  2022-09-26 14:48         ` Kirill A. Shutemov
  0 siblings, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2022-09-26 10:35 UTC (permalink / raw)
  To: Kirill A . Shutemov, Paolo Bonzini, Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On 23.09.22 02:58, Kirill A . Shutemov wrote:
> On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>> index 6325d1d0e90f..9d066be3d7e8 100644
>>> --- a/include/uapi/linux/magic.h
>>> +++ b/include/uapi/linux/magic.h
>>> @@ -101,5 +101,6 @@
>>>    #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>>>    #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>>>    #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
>>> +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
>>
>>
>> [...]
>>
>>> +
>>> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
>>> +			 int *order)
>>> +{
>>> +	struct inaccessible_data *data = file->f_mapping->private_data;
>>> +	struct file *memfd = data->memfd;
>>> +	struct page *page;
>>> +	int ret;
>>> +
>>> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	*pfn = page_to_pfn_t(page);
>>> +	*order = thp_order(compound_head(page));
>>> +	SetPageUptodate(page);
>>> +	unlock_page(page);
>>> +
>>> +	return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
>>> +
>>> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
>>> +{
>>> +	struct page *page = pfn_t_to_page(pfn);
>>> +
>>> +	if (WARN_ON_ONCE(!page))
>>> +		return;
>>> +
>>> +	put_page(page);
>>> +}
>>> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
>>
>> Sorry, I missed your reply regarding get/put interface.
>>
>> https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
>>
>> "We have a design assumption that somedays this can even support non-page
>> based backing stores."
>>
>> As long as there is no such user in sight (especially how to get the memfd
>> from even allocating such memory which will require bigger changes), I
>> prefer to keep it simple here and work on pages/folios. No need to
>> over-complicate it for now.
> 
> Sean, Paolo, what is your take on this? Do you have a concrete use case of
> a pageless backend for the mechanism in sight? Maybe DAX?

The problem I'm having with this is how to actually get such memory into 
the memory backend (that triggers notifiers) and what the semantics are 
at all with memory that is not managed by the buddy.

memfd with fixed PFNs doesn't make too much sense.

When using DAX, what happens with the shared <->private conversion? 
Which "type" is supposed to use dax, which not?

In other words, I'm missing too many details on the bigger picture of how 
this would work at all to see why it makes sense right now to prepare 
for that.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-09-15 14:29 ` [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
@ 2022-09-26 10:36   ` Fuad Tabba
  2022-09-26 14:07     ` Chao Peng
  2022-10-11  9:48   ` Fuad Tabba
  1 sibling, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-09-26 10:36 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi Chao,

On Thu, Sep 15, 2022 at 3:38 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> guest private memory regions through the KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> ioctls. The patch reuses the existing SEV ioctl numbers but differs in that
> the address in the region for the KVM_PRIVATE_MEM case is a gpa, while for
> the SEV case it's an hva. Which usage the ioctls serve is determined by the
> newly added kvm_arch_has_private_mem(). Architectures which support
> KVM_PRIVATE_MEM should override this function.
>
> The current implementation defaults all memory to private. The shared
> memory regions are stored in an xarray for memory efficiency, and zapping
> existing memory mappings is also a side effect of these two ioctls when
> defined.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst  | 17 ++++++--
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu.h              |  2 -
>  include/linux/kvm_host.h        | 13 ++++++
>  virt/kvm/kvm_main.c             | 73 +++++++++++++++++++++++++++++++++
>  5 files changed, 100 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 1a6c003b2a0b..c0f800d04ffc 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
>  This ioctl can be used to register a guest memory region which may
>  contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register a guest private
> +memory region and the addr/size of kvm_enc_region represents a guest physical
> +address (GPA). In this usage, this ioctl zaps the existing guest memory
> +mappings in KVM that fall into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register a guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents a userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
>  moving ciphertext of those pages will not result in plaintext being
>  swapped. So relocating (or migrating) physical backing pages for the SEV
>  guest will require some additional steps.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2c96c43c313a..cfad6ba1a70a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -37,6 +37,7 @@
>  #include <asm/hyperv-tlfs.h>
>
>  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ZAP_GFN_RANGE
>
>  #define KVM_MAX_VCPUS 1024
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 6bdaacb6faa0..c94b620bf94b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>         return -(u32)fault & errcode;
>  }
>
> -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> -
>  int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>
>  int kvm_mmu_post_init_vm(struct kvm *kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2125b50f6345..d65690cae80b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  #endif
>
> +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> +#else
> +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start
> +                                                     gfn_t gfn_end)

Missing a comma after gfn_start.

Cheers,
/fuad



> +{
> +}
> +#endif
> +
>  enum {
>         OUTSIDE_GUEST_MODE,
>         IN_GUEST_MODE,
> @@ -795,6 +804,9 @@ struct kvm {
>         struct notifier_block pm_notifier;
>  #endif
>         char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       struct xarray mem_attr_array;
> +#endif
>  };
>
>  #define kvm_err(fmt, ...) \
> @@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fa9dd2d2c001..de5cce8c82c7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_SHARED    0x0001
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> +                                    bool is_private)
> +{
> +       gfn_t start, end;
> +       unsigned long index;
> +       void *entry;
> +       int r;
> +
> +       if (size == 0 || gpa + size < gpa)
> +               return -EINVAL;
> +       if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> +               return -EINVAL;
> +
> +       start = gpa >> PAGE_SHIFT;
> +       end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +       /*
> +        * Guest memory defaults to private, kvm->mem_attr_array only stores
> +        * shared memory.
> +        */
> +       entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> +       for (index = start; index < end; index++) {
> +               r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> +                                   GFP_KERNEL_ACCOUNT));
> +               if (r)
> +                       goto err;
> +       }
> +
> +       kvm_zap_gfn_range(kvm, start, end);
> +
> +       return r;
> +err:
> +       for (; index > start; index--)
> +               xa_erase(&kvm->mem_attr_array, index);
> +       return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>                                 unsigned long state,
> @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>         spin_lock_init(&kvm->mn_invalidate_lock);
>         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>         xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       xa_init(&kvm->mem_attr_array);
> +#endif
>
>         INIT_LIST_HEAD(&kvm->gpc_list);
>         spin_lock_init(&kvm->gpc_lock);
> @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>         }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       xa_destroy(&kvm->mem_attr_array);
> +#endif
>         cleanup_srcu_struct(&kvm->irq_srcu);
>         cleanup_srcu_struct(&kvm->srcu);
>         kvm_arch_free_vm(kvm);
> @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +       return false;
> +}
> +
>  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       case KVM_MEMORY_ENCRYPT_REG_REGION:
> +       case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +               struct kvm_enc_region region;
> +               bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> +               if (!kvm_arch_has_private_mem(kvm))
> +                       goto arch_vm_ioctl;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&region, argp, sizeof(region)))
> +                       goto out;
> +
> +               r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> +                                             region.size, set);
> +               break;
> +       }
> +#endif
>         case KVM_GET_DIRTY_LOG: {
>                 struct kvm_dirty_log log;
>
> @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_get_stats_fd(kvm);
>                 break;
>         default:
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +arch_vm_ioctl:
> +#endif
>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>         }
>  out:
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-26 10:26   ` Fuad Tabba
@ 2022-09-26 14:04     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-26 14:04 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Sep 26, 2022 at 11:26:45AM +0100, Fuad Tabba wrote:
...

> > +
> > +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
> > +  a file descirptor(fd) and the content of the private memory is invisible to
> 
> s/descirptor/descriptor

Thanks.

...

> >  static long kvm_vm_ioctl(struct file *filp,
> >                            unsigned int ioctl, unsigned long arg)
> >  {
> > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 break;
> >         }
> >         case KVM_SET_USER_MEMORY_REGION: {
> > -               struct kvm_userspace_memory_region kvm_userspace_mem;
> > +               struct kvm_user_mem_region mem;
> > +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> 
> nit: should this be sizeof(mem)? That's more similar to the
> existing code and makes it dependent on the size of mem regardless of
> possible changes to its type in the future.

Unluckily no, the size we need for copy_from_user() depends on the flags;
e.g. without KVM_MEM_PRIVATE, we can't safely copy the bigger size since
the 'extended' part may not even exist.
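
Roughly the shape I have in mind (a sketch only, not the final code; it
relies on the kvm_user_mem_region alias of kvm_userspace_memory_region_ext
introduced by this series):

        struct kvm_user_mem_region mem;
        unsigned long size = sizeof(struct kvm_userspace_memory_region);

        /* First copy only the base layout, which always exists. */
        if (copy_from_user(&mem, argp, size))
                goto out;

        if (mem.flags & KVM_MEM_PRIVATE) {
                /* The flag guarantees the extended layout is present. */
                size = sizeof(struct kvm_userspace_memory_region_ext);
                if (copy_from_user(&mem, argp, size))
                        goto out;
        }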

> 
> > +
> > +               kvm_sanity_check_user_mem_region_alias();
> >
> >                 r = -EFAULT;
> > -               if (copy_from_user(&kvm_userspace_mem, argp,
> > -                                               sizeof(kvm_userspace_mem)))
> > +               if (copy_from_user(&mem, argp, size);
> 
> It gets fixed in a future patch, but the ; should be a ).

Good catch, thanks!

Chao
> 
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-09-26 10:36   ` Fuad Tabba
@ 2022-09-26 14:07     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-26 14:07 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Sep 26, 2022 at 11:36:34AM +0100, Fuad Tabba wrote:
...

> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 2125b50f6345..d65690cae80b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  #endif
> >
> > +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> > +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > +#else
> > +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start
> > +                                                     gfn_t gfn_end)
> 
> Missing a comma after gfn_start.

Good catch, thanks!
Chao
> 
> Cheers,
> /fuad


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-23 15:19       ` Fuad Tabba
@ 2022-09-26 14:23         ` Chao Peng
  2022-09-26 15:51           ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-09-26 14:23 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang,
	Will Deacon, Marc Zyngier

On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace to mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM could
> > tightly control usage without taking on too much complexity in the kernel, but
> > working through things, routing the behavior through the shim itself might not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> >      mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> >      the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> >      one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or if this
> >      can't be guaranteed by userspace, maybe add some way for userspace to wait
> >      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
> >      to do an expensive check every time.
> >
> >   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> >   {
> >         if (vma->vm_pgoff)
> >                 return -EINVAL;
> >
> >         if ((vma->vm_end - vma->vm_start) != <file size>)
> >                 return -EINVAL;
> >
> >         mutex_lock(&data->lock);
> >
> >         if (data->has_mapping) {
> >                 r = -EINVAL;
> >                 goto err;
> >         }
> >         list_for_each_entry(notifier, &data->notifiers, list) {
> >                 r = notifier->ops->mmap_start(notifier, ...);
> >                 if (r)
> >                         goto abort;
> >         }
> >
> >         notifier->ops->mmap_end(notifier, ...);
> >         mutex_unlock(&data->lock);
> >         return 0;
> >
> >   abort:
> >         list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
> >                 notifier->ops->mmap_abort(notifier, ...);
> >   err:
> >         mutex_unlock(&data->lock);
> >         return r;
> >   }
> >
> >   static void memfd_restricted_close(struct vm_area_struct *vma)
> >   {
> >         mutex_lock(...);
> >
> >         /*
> >          * Destroy the memfd and disable all future accesses if there are
> >          * outstanding refcounts (or other unsatisfied restrictions?).
> >          */
> >         if (<outstanding references> || ???)
> >                 memfd_restricted_destroy(...);
> >         else
> >                 data->has_mapping = false;
> >
> >         mutex_unlock(...);
> >   }
> >
> >   static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> >   {
> >         return -EINVAL;
> >   }
> >
> >   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> >   {
> >         return -EINVAL;
> >   }
> >
> > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> >
> >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> >      memory into the guest (after pre-boot phase).
> >
> >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> >      and only if the entire gfn range of the associated memslot is shared.
> 
> In general I think that this would work with pKVM. However, limiting
> private<->shared conversions to the granularity of a whole memslot
> might be difficult to handle in pKVM, since the guest doesn't have the
> concept of memslots. For example, in pKVM right now, when a guest
> shares back its restricted DMA pool with the host it does so at the
> page-level. pKVM would also need a way to make an fd accessible again
> when shared back, which I think isn't possible with this patch.

But does pKVM really want to mmap/munmap a new region at the page level?
That can cause VMA fragmentation if the conversion is frequent, as far as
I can see. Even with a KVM ioctl for mapping as mentioned below, I think
there will be the same issue.
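
To make the fragmentation concern concrete, a minimal userspace
illustration (hypothetical; plain anonymous memory standing in for the
memfd):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 2UL * 1024 * 1024;
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                /*
                 * Punching one 4K hole forces the kernel to split the VMA;
                 * /proc/self/maps now shows two VMAs where there was one.
                 */
                munmap(p + len / 2, 4096);
                getchar();
                return 0;
        }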

> 
> You were initially considering a KVM ioctl for mapping, which might be
> better suited for this since KVM knows which pages are shared and
> which ones are private. So routing things through KVM might simplify
> things and allow it to enforce all the necessary restrictions (e.g.,
> private memory cannot be mapped). What do you think?
> 
> Thanks,
> /fuad

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-26 10:35       ` David Hildenbrand
@ 2022-09-26 14:48         ` Kirill A. Shutemov
  2022-09-26 14:53           ` David Hildenbrand
  0 siblings, 1 reply; 97+ messages in thread
From: Kirill A. Shutemov @ 2022-09-26 14:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A . Shutemov, Paolo Bonzini, Sean Christopherson,
	Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> On 23.09.22 02:58, Kirill A . Shutemov wrote:
> > On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > > > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > > > index 6325d1d0e90f..9d066be3d7e8 100644
> > > > --- a/include/uapi/linux/magic.h
> > > > +++ b/include/uapi/linux/magic.h
> > > > @@ -101,5 +101,6 @@
> > > >    #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
> > > >    #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
> > > >    #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> > > > +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
> > > 
> > > 
> > > [...]
> > > 
> > > > +
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > +			 int *order)
> > > > +{
> > > > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	struct page *page;
> > > > +	int ret;
> > > > +
> > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	*pfn = page_to_pfn_t(page);
> > > > +	*order = thp_order(compound_head(page));
> > > > +	SetPageUptodate(page);
> > > > +	unlock_page(page);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > > > +
> > > > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > > > +{
> > > > +	struct page *page = pfn_t_to_page(pfn);
> > > > +
> > > > +	if (WARN_ON_ONCE(!page))
> > > > +		return;
> > > > +
> > > > +	put_page(page);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> > > 
> > > Sorry, I missed your reply regarding get/put interface.
> > > 
> > > https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
> > > 
> > > "We have a design assumption that somedays this can even support non-page
> > > based backing stores."
> > > 
> > > As long as there is no such user in sight (especially how to get the memfd
> > > from even allocating such memory which will require bigger changes), I
> > > prefer to keep it simple here and work on pages/folios. No need to
> > > over-complicate it for now.
> > 
> > Sean, Paolo, what is your take on this? Do you have a concrete use case of
> > a pageless backend for the mechanism in sight? Maybe DAX?
> 
> The problem I'm having with this is how to actually get such memory into the
> memory backend (that triggers notifiers) and what the semantics are at all
> with memory that is not managed by the buddy.
> 
> memfd with fixed PFNs doesn't make too much sense.

What do you mean by "fixed PFN"? It is as fixed as struct page/folio, no?
PFN covers more possible backends.

> When using DAX, what happens with the shared <->private conversion? Which
> "type" is supposed to use dax, which not?
> 
> In other words, I'm missing too many details on the bigger picture of how
> this would work at all to see why it makes sense right now to prepare for
> that.

IIUC, KVM doesn't really care about pages or folios. It needs a PFN to
populate the SEPT. Returning a page/folio would make KVM do additional
steps to extract the PFN and add one more place to have a bug.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-26 14:48         ` Kirill A. Shutemov
@ 2022-09-26 14:53           ` David Hildenbrand
  2022-09-27 23:23             ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2022-09-26 14:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A . Shutemov, Paolo Bonzini, Sean Christopherson,
	Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On 26.09.22 16:48, Kirill A. Shutemov wrote:
> On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
>> On 23.09.22 02:58, Kirill A . Shutemov wrote:
>>> On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
>>>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>>>> index 6325d1d0e90f..9d066be3d7e8 100644
>>>>> --- a/include/uapi/linux/magic.h
>>>>> +++ b/include/uapi/linux/magic.h
>>>>> @@ -101,5 +101,6 @@
>>>>>     #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>>>>>     #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>>>>>     #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
>>>>> +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */
>>>>
>>>>
>>>> [...]
>>>>
>>>>> +
>>>>> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
>>>>> +			 int *order)
>>>>> +{
>>>>> +	struct inaccessible_data *data = file->f_mapping->private_data;
>>>>> +	struct file *memfd = data->memfd;
>>>>> +	struct page *page;
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>>>>> +	if (ret)
>>>>> +		return ret;
>>>>> +
>>>>> +	*pfn = page_to_pfn_t(page);
>>>>> +	*order = thp_order(compound_head(page));
>>>>> +	SetPageUptodate(page);
>>>>> +	unlock_page(page);
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
>>>>> +
>>>>> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
>>>>> +{
>>>>> +	struct page *page = pfn_t_to_page(pfn);
>>>>> +
>>>>> +	if (WARN_ON_ONCE(!page))
>>>>> +		return;
>>>>> +
>>>>> +	put_page(page);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
>>>>
>>>> Sorry, I missed your reply regarding get/put interface.
>>>>
>>>> https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/
>>>>
>>>> "We have a design assumption that somedays this can even support non-page
>>>> based backing stores."
>>>>
>>>> As long as there is no such user in sight (especially how to get the memfd
>>>> from even allocating such memory which will require bigger changes), I
>>>> prefer to keep it simple here and work on pages/folios. No need to
>>>> over-complicate it for now.
>>>
>>> Sean, Paolo, what is your take on this? Do you have a concrete use case of
>>> a pageless backend for the mechanism in sight? Maybe DAX?
>>
>> The problem I'm having with this is how to actually get such memory into the
>> memory backend (that triggers notifiers) and what the semantics are at all
>> with memory that is not managed by the buddy.
>>
>> memfd with fixed PFNs doesn't make too much sense.
> 
> What do you mean by "fixed PFN". It is as fixed as struct page/folio, no?
> PFN covers more possible backends.

For DAX, you usually bypass the buddy and map /dev/mem or a devdax, in
contrast to an ordinary memfd that allocates memory via the buddy. That's
the difference I see -- and I wonder how it could work.

> 
>> When using DAX, what happens with the shared <->private conversion? Which
>> "type" is supposed to use dax, which not?
>>
>> In other words, I'm missing too many details on the bigger picture of how
>> this would work at all to see why it makes sense right now to prepare for
>> that.
> 
> IIUC, KVM doesn't really care about pages or folios. It needs a PFN to
> populate the SEPT. Returning a page/folio would make KVM do additional
> steps to extract the PFN and add one more place to have a bug.

Fair enough. Smells KVM specific, though.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-26 14:23         ` Chao Peng
@ 2022-09-26 15:51           ` Fuad Tabba
  2022-09-27 22:47             ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-09-26 15:51 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang,
	Will Deacon, Marc Zyngier

Hi,

On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > > allowing userspace to mmap() the "hidden" memfd, but with a ton of restrictions
> > > piled on top.
> > >
> > > My first thought was to make the uAPI a set of KVM ioctls so that KVM could
> > > tightly control usage without taking on too much complexity in the kernel, but
> > > working through things, routing the behavior through the shim itself might not be
> > > all that horrific.
> > >
> > > IIRC, we discarded the idea of allowing userspace to map the "private" fd because
> > > things got too complex, but with the shim it doesn't seem _that_ bad.
> > >
> > > E.g. on the memfd side:
> > >
> > >   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> > >      mapping is all or nothing.
> > >
> > >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> > >      the restricted memfd.
> > >
> > >   3. Add notifier hooks to allow downstream users to further restrict things.
> > >
> > >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> > >      one shot.
> > >
> > >   5. Require that there are no outstanding references at munmap().  Or if this
> > >      can't be guaranteed by userspace, maybe add some way for userspace to wait
> > >      until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
> > >      to do an expensive check every time.
> > >
> > >   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> > >   {
> > >         if (vma->vm_pgoff)
> > >                 return -EINVAL;
> > >
> > >         if ((vma->vm_end - vma->vm_start) != <file size>)
> > >                 return -EINVAL;
> > >
> > >         mutex_lock(&data->lock);
> > >
> > >         if (data->has_mapping) {
> > >                 r = -EINVAL;
> > >                 goto err;
> > >         }
> > >         list_for_each_entry(notifier, &data->notifiers, list) {
> > >                 r = notifier->ops->mmap_start(notifier, ...);
> > >                 if (r)
> > >                         goto abort;
> > >         }
> > >
> > >         notifier->ops->mmap_end(notifier, ...);
> > >         mutex_unlock(&data->lock);
> > >         return 0;
> > >
> > >   abort:
> > >         list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
> > >                 notifier->ops->mmap_abort(notifier, ...);
> > >   err:
> > >         mutex_unlock(&data->lock);
> > >         return r;
> > >   }
> > >
> > >   static void memfd_restricted_close(struct vm_area_struct *vma)
> > >   {
> > >         mutex_lock(...);
> > >
> > >         /*
> > >          * Destroy the memfd and disable all future accesses if there are
> > >          * outstanding refcounts (or other unsatisfied restrictions?).
> > >          */
> > >         if (<outstanding references> || ???)
> > >                 memfd_restricted_destroy(...);
> > >         else
> > >                 data->has_mapping = false;
> > >
> > >         mutex_unlock(...);
> > >   }
> > >
> > >   static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> > >   {
> > >         return -EINVAL;
> > >   }
> > >
> > >   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> > >   {
> > >         return -EINVAL;
> > >   }
> > >
> > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > >
> > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > >      memory into the guest (after pre-boot phase).
> > >
> > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > >      and only if the entire gfn range of the associated memslot is shared.
> >
> > In general I think that this would work with pKVM. However, limiting
> > private<->shared conversions to the granularity of a whole memslot
> > might be difficult to handle in pKVM, since the guest doesn't have the
> > concept of memslots. For example, in pKVM right now, when a guest
> > shares back its restricted DMA pool with the host it does so at the
> > page-level. pKVM would also need a way to make an fd accessible again
> > when shared back, which I think isn't possible with this patch.
>
> But does pKVM really want to mmap/munmap a new region at the page level?
> That can cause VMA fragmentation if the conversion is frequent, as far as
> I can see. Even with a KVM ioctl for mapping as mentioned below, I think
> there will be the same issue.

pKVM doesn't really need to unmap the memory. What is really important
is that the memory is not GUP'able. Having private memory mapped and
then accessed by a misbehaving/malicious process will reinject a fault
into the misbehaving process.

Cheers,
/fuad

> >
> > You were initially considering a KVM ioctl for mapping, which might be
> > better suited for this since KVM knows which pages are shared and
> > which ones are private. So routing things through KVM might simplify
> > things and allow it to enforce all the necessary restrictions (e.g.,
> > private memory cannot be mapped). What do you think?
> >
> > Thanks,
> > /fuad

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-26 15:51           ` Fuad Tabba
@ 2022-09-27 22:47             ` Sean Christopherson
  2022-09-30 16:19               ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-09-27 22:47 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Chao Peng, David Hildenbrand, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier

On Mon, Sep 26, 2022, Fuad Tabba wrote:
> Hi,
> 
> On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > >
> > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > >      memory into the guest (after pre-boot phase).
> > > >
> > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > >      and only if the entire gfn range of the associated memslot is shared.
> > >
> > > In general I think that this would work with pKVM. However, limiting
> > > private<->shared conversions to the granularity of a whole memslot
> > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > concept of memslots. For example, in pKVM right now, when a guest
> > > shares back its restricted DMA pool with the host it does so at the
> > > page-level.

Y'all are killing me :-)

Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
X doesn't even have to be that high to get reasonable performance, e.g. assuming
X = 2MiB and a DMA pool of at most 2GiB, that's "only" 2GiB / 2MiB = 1024
memslots, which is supposed to work just fine in KVM.

> > > pKVM would also need a way to make an fd accessible again
> > > when shared back, which I think isn't possible with this patch.
> >
> > But does pKVM really want to mmap/munmap a new region at the page level?
> > That can cause VMA fragmentation if the conversion is frequent, as far as
> > I can see. Even with a KVM ioctl for mapping as mentioned below, I think
> > there will be the same issue.
> 
> pKVM doesn't really need to unmap the memory. What is really important
> is that the memory is not GUP'able.

Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
otherwise KVM wouldn't be able to get the PFN to map into guest memory.

The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
the end result is the same.

Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
current approach is to do that only in the stage-2 page tables, i.e. only in the
context of the hypervisor.  Which is also the source of the gup() problems; the
untrusted kernel is blissfully unaware that the memory is inaccessible.

Any approach that moves some of that information into the untrusted kernel so that
the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
all of guest memory becomes unguppable, but that's likely not a viable option.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-26 14:53           ` David Hildenbrand
@ 2022-09-27 23:23             ` Sean Christopherson
  2022-09-28 13:36               ` Kirill A. Shutemov
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-09-27 23:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Kirill A . Shutemov, Paolo Bonzini,
	Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > When using DAX, what happens with the shared <->private conversion? Which
> > > "type" is supposed to use dax, which not?
> > > 
> > > In other words, I'm missing too many details on the bigger picture of how
> > > this would work at all to see why it makes sense right now to prepare for
> > > that.
> > 
> > IIUC, KVM doesn't really care about pages or folios. It needs a PFN to
> > populate the SEPT. Returning a page/folio would make KVM do additional
> > steps to extract the PFN and add one more place to have a bug.
> 
> Fair enough. Smells KVM specific, though.

TL;DR: I'm good with either approach, though providing a "struct page" might avoid
       refactoring the API in the nearish future.

Playing devil's advocate for a second, the counter argument is that KVM is the
only user for the foreseeable future.

That said, it might make sense to return a "struct page" from the core API and
force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
it's not exactly new code.
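
E.g. a rough sketch (the function name is assumed, mirroring the
inaccessible_get_pfn() quoted earlier in this thread):

        /* Core API hands back a page + order instead of a pfn_t ... */
        int inaccessible_get_page(struct file *file, pgoff_t offset,
                                  struct page **pagep, int *order);

        /*
         * ... and KVM extracts the PFN itself, as it already does for
         * HVA-based memory:
         */
        struct page *page;
        kvm_pfn_t pfn;
        int order, ret;

        ret = inaccessible_get_page(file, offset, &page, &order);
        if (ret)
                return ret;

        pfn = page_to_pfn(page);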

More importantly, KVM may actually need/want the "struct page" in the not-too-distant
future to support mapping non-refcounted "struct page" memory into the guest.  The
ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get handed a
"struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier integration
is fixed, the remaining issue is that KVM doesn't currently have a way to determine
whether or not it holds a reference to the page.  Instead, KVM assumes that if the
page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().

KVM's current workaround for this is to refuse to map these pages into the guest,
i.e. KVM simply forces its assumption that normal pages are refcounted to be true.
To remove that workaround, the likely solution will be to pass around a tuple of
page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".

At that point, getting handed a "struct page" from the core API would be a good
thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
refcounted page.

Note, I still want the order to be provided by the API so that KVM doesn't need
to run through a bunch of helpers to try and figure out the allowed mapping size.

[*] https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=PRt73YmFzP6Jw@mail.gmail.com


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-27 23:23             ` Sean Christopherson
@ 2022-09-28 13:36               ` Kirill A. Shutemov
  0 siblings, 0 replies; 97+ messages in thread
From: Kirill A. Shutemov @ 2022-09-28 13:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Kirill A . Shutemov, Paolo Bonzini, Chao Peng,
	kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Tue, Sep 27, 2022 at 11:23:24PM +0000, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > > When using DAX, what happens with the shared <->private conversion? Which
> > > > "type" is supposed to use dax, which not?
> > > > 
> > > > In other words, I'm missing too many details on the bigger picture of how
> > > > this would work at all to see why it makes sense right now to prepare for
> > > > that.
> > > 
> > > IIUC, KVM doesn't really care about pages or folios. It needs a PFN to
> > > populate the SEPT. Returning a page/folio would make KVM do additional
> > > steps to extract the PFN and add one more place to have a bug.
> > 
> > Fair enough. Smells KVM specific, though.
> 
> TL;DR: I'm good with either approach, though providing a "struct page" might avoid
>        refactoring the API in the nearish future.
> 
> Playing devil's advocate for a second, the counter argument is that KVM is the
> only user for the foreseeable future.
> 
> That said, it might make sense to return a "struct page" from the core API and
> force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
> it's not exactly new code.

Core MM tries to move away from struct page in favour of struct folio. We
can make the interface return a folio.

But it would require more work on the KVM side.

folio_pfn(folio) + (offset % folio_nr_pages(folio)) would give you the
base-pagesize PFN for a given offset. I guess it is not too hard.
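
Something like this on the KVM side (a sketch; inaccessible_get_folio() is
hypothetical, and it assumes folios are naturally aligned in the file, as
they are for THP):

        struct folio *folio;
        kvm_pfn_t pfn;
        int order, ret;

        ret = inaccessible_get_folio(file, offset, &folio);
        if (ret)
                return ret;

        /* Base-pagesize PFN for this offset within the folio. */
        pfn = folio_pfn(folio) + (offset % folio_nr_pages(folio));
        order = folio_order(folio);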

It also gives KVM the capability to populate multiple EPT entries for a
non-zero order folio and save a few cycles.

Does it work for you?

> More importantly, KVM may actually need/want the "struct page" in the not-too-distant
> future to support mapping non-refcounted "struct page" memory into the guest.  The
> ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get handed a
> "struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier integration
> is fixed, the remaining issue is that KVM doesn't currently have a way to determine
> whether or not it holds a reference to the page.  Instead, KVM assumes that if the
> page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().
> 
> KVM's current workaround for this is to refuse to map these pages into the guest,
> i.e. KVM simply forces its assumption that normal pages are refcounted to be true.
> To remove that workaround, the likely solution will be to pass around a tuple of
> page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".
> 
> At that point, getting handed a "struct page" from the core API would be a good
> thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
> refcounted page.
> 
> Note, I still want the order to be provided by the API so that KVM doesn't need
> to run through a bunch of helpers to try and figure out the allowed mapping size.
> 
> [*] https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=PRt73YmFzP6Jw@mail.gmail.com

These non-refcounted "struct page"s confuse me.

IIUC (probably not), the idea is to share a buffer between host and guest
and avoid double buffering in the page cache on the guest ("guest shadow
buffer" means page cache, right?). Don't we already have DAX interfaces to
bypass the guest page cache?

And do you think it would need to be handled at the inaccessible API level,
or is it a KVM-only thing that uses the inaccessible API for some use cases?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-09-15 14:29 ` [PATCH v8 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-09-29 16:52   ` Isaku Yamahata
  2022-09-30  8:59     ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Isaku Yamahata @ 2022-09-29 16:52 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, isaku.yamahata

On Thu, Sep 15, 2022 at 10:29:11PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 08abad4f3e6f..a0f198cede3d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
...
> @@ -6894,3 +6899,115 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  	if (kvm->arch.nx_lpage_recovery_thread)
>  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
>  }
> +
> +static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
> +			      gfn_t start, gfn_t end)
> +{
> +	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	gfn_t gfn = start;
> +	void *entry;
> +	bool shared, private;
> +	bool mixed = false;
> +
> +	if (attr == KVM_MEM_ATTR_SHARED) {
> +		shared = true;
> +		private = false;
> +	} else {
> +		shared = false;
> +		private = true;
> +	}

We don't have to care whether the target is shared or private. We only
need to check whether the entries are the same or not.

> +
> +	rcu_read_lock();
> +	entry = xas_load(&xas);
> +	while (gfn < end) {
> +		if (xas_retry(&xas, entry))
> +			continue;
> +
> +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> +
> +		if (entry)
> +			private = true;
> +		else
> +			shared = true;
> +
> +		if (private && shared) {
> +			mixed = true;
> +			goto out;
> +		}
> +
> +		entry = xas_next(&xas);
> +		gfn++;
> +	}
> +out:
> +	rcu_read_unlock();
> +	return mixed;
> +}
> +
> +static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> +{
> +	if (mixed)
> +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +	else
> +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static void update_mem_lpage_info(struct kvm *kvm,
> +				  struct kvm_memory_slot *slot,
> +				  unsigned int attr,
> +				  gfn_t start, gfn_t end)
> +{
> +	unsigned long lpage_start, lpage_end;
> +	unsigned long gfn, pages, mask;
> +	int level;
> +
> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +		pages = KVM_PAGES_PER_HPAGE(level);
> +		mask = ~(pages - 1);
> +		lpage_start = start & mask;
> +		lpage_end = (end - 1) & mask;
> +
> +		/*
> +		 * We only need to scan the head and tail page, for middle pages
> +		 * we know they are not mixed.
> +		 */
> +		update_mixed(lpage_info_slot(lpage_start, slot, level),
> +			     mem_attr_is_mixed(kvm, attr, lpage_start,
> +							  lpage_start + pages));
> +
> +		if (lpage_start == lpage_end)
> +			return;
> +
> +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> +			update_mixed(lpage_info_slot(gfn, slot, level), false);


For the >2M case, we don't have to check every entry; just check the
lower-level case.

> +
> +		update_mixed(lpage_info_slot(lpage_end, slot, level),
> +			     mem_attr_is_mixed(kvm, attr, lpage_end,
> +							  lpage_end + pages));
> +	}
> +}
> +
> +void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
> +			      gfn_t start, gfn_t end)
> +{
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	struct kvm_memslot_iter iter;
> +	int i;
> +
> +	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> +			"Unsupported mem attribute.\n");
> +
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		slots = __kvm_memslots(kvm, i);
> +
> +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> +			slot = iter.slot;
> +			start = max(start, slot->base_gfn);
> +			end = min(end, slot->base_gfn + slot->npages);
> +			if (WARN_ON_ONCE(start >= end))
> +				continue;
> +
> +			update_mem_lpage_info(kvm, slot, attr, start, end);
> +		}
> +	}
> +}


Here is my updated version.

bool kvm_mem_attr_is_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level)
{
	gfn_t pages = KVM_PAGES_PER_HPAGE(level);
	gfn_t mask = ~(pages - 1);
	struct kvm_lpage_info *linfo = lpage_info_slot(gfn & mask, slot, level);

	WARN_ON_ONCE(level == PG_LEVEL_4K);
	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
}

#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM_ATTR
static void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
{
	if (mixed)
		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
	else
		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
}

static bool __mem_attr_is_mixed(struct kvm *kvm, gfn_t start, gfn_t end)
{
	XA_STATE(xas, &kvm->mem_attr_array, start);
	bool mixed = false;
	gfn_t gfn = start;
	void *s_entry;
	void *entry;

	rcu_read_lock();
	s_entry = xas_load(&xas);
	entry = s_entry;
	while (gfn < end) {
		if (xas_retry(&xas, entry))
			continue;

		KVM_BUG_ON(gfn != xas.xa_index, kvm);

		entry = xas_next(&xas);
		if (entry != s_entry) {
			mixed = true;
			break;
		}
		gfn++;
	}
	rcu_read_unlock();
	return mixed;
}

static bool mem_attr_is_mixed(struct kvm *kvm,
			      struct kvm_memory_slot *slot, int level,
			      gfn_t start, gfn_t end)
{
	struct kvm_lpage_info *child_linfo;
	unsigned long child_pages;
	bool mixed = false;
	unsigned long gfn;
	void *entry;

	if (WARN_ON_ONCE(level == PG_LEVEL_4K))
		return false;

	if (level == PG_LEVEL_2M)
		return __mem_attr_is_mixed(kvm, start, end);

	/* This assumes that level - 1 is already updated. */
	rcu_read_lock();
	child_pages = KVM_PAGES_PER_HPAGE(level - 1);
	entry = xa_load(&kvm->mem_attr_array, start);
	for (gfn = start; gfn < end; gfn += child_pages) {
		child_linfo = lpage_info_slot(gfn, slot, level - 1);
		if (child_linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED) {
			mixed = true;
			break;
		}
		if (xa_load(&kvm->mem_attr_array, gfn) != entry) {
			mixed = true;
			break;
		}
	}
	rcu_read_unlock();
	return mixed;
}

static void update_mem_lpage_info(struct kvm *kvm,
				  struct kvm_memory_slot *slot,
				  unsigned int attr,
				  gfn_t start, gfn_t end)
{
	unsigned long lpage_start, lpage_end;
	unsigned long gfn, pages, mask;
	int level;

	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
		pages = KVM_PAGES_PER_HPAGE(level);
		mask = ~(pages - 1);
		lpage_start = start & mask;
		lpage_end = (end - 1) & mask;

		/*
		 * We only need to scan the head and tail page, for middle pages
		 * we know they are not mixed.
		 */
		update_mixed(lpage_info_slot(lpage_start, slot, level),
			     mem_attr_is_mixed(kvm, slot, level,
					       lpage_start, lpage_start + pages));

		if (lpage_start == lpage_end)
			return;

		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
			update_mixed(lpage_info_slot(gfn, slot, level), false);

		update_mixed(lpage_info_slot(lpage_end, slot, level),
			     mem_attr_is_mixed(kvm, slot, level,
					       lpage_end, lpage_end + pages));
	}
}

void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
			      gfn_t start, gfn_t end)
{
	struct kvm_memory_slot *slot;
	struct kvm_memslots *slots;
	struct kvm_memslot_iter iter;
	int idx;
	int i;

	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
		  "Unsupported mem attribute.\n");

	idx = srcu_read_lock(&kvm->srcu);
	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
		slots = __kvm_memslots(kvm, i);

		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
			slot = iter.slot;
			start = max(start, slot->base_gfn);
			end = min(end, slot->base_gfn + slot->npages);
			if (WARN_ON_ONCE(start >= end))
				continue;

			update_mem_lpage_info(kvm, slot, attr, start, end);
		}
	}
	srcu_read_unlock(&kvm->srcu, idx);
}
#endif


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
  2022-09-16  9:14   ` Bagas Sanjaya
  2022-09-26 10:26   ` Fuad Tabba
@ 2022-09-29 22:45   ` Isaku Yamahata
  2022-09-29 23:22     ` Sean Christopherson
  2022-10-05 13:04   ` Jarkko Sakkinen
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 97+ messages in thread
From: Isaku Yamahata @ 2022-09-29 22:45 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang, isaku.yamahata

On Thu, Sep 15, 2022 at 10:29:07PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:
...
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 584a5bab3af3..12dc0dc57b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
...
> @@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>  	return fd;
>  }
>  
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
> +do {										\
> +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
> +		     offsetof(struct kvm_userspace_memory_region, field));	\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=		\
> +		     sizeof_field(struct kvm_userspace_memory_region, field));	\
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)					\
> +do {											\
> +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=			\
> +		     offsetof(struct kvm_userspace_memory_region_ext, field));		\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=			\
> +		     sizeof_field(struct kvm_userspace_memory_region_ext, field));	\
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +	SANITY_CHECK_MEM_REGION_FIELD(slot);
> +	SANITY_CHECK_MEM_REGION_FIELD(flags);
> +	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
> +	SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>  			   unsigned int ioctl, unsigned long arg)
>  {
> @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  	case KVM_SET_USER_MEMORY_REGION: {
> -		struct kvm_userspace_memory_region kvm_userspace_mem;
> +		struct kvm_user_mem_region mem;
> +		unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> +		kvm_sanity_check_user_mem_region_alias();
>  
>  		r = -EFAULT;
> -		if (copy_from_user(&kvm_userspace_mem, argp,
> -						sizeof(kvm_userspace_mem)))
> +		if (copy_from_user(&mem, argp, size))
> +			goto out;
> +
> +		r = -EINVAL;
> +		if (mem.flags & KVM_MEM_PRIVATE)
>  			goto out;

Nit:  It's better to check if padding is zero.  Maybe rename it to reserved.

+               if (mem.pad1 || memchr_inv(mem.pad2, 0, sizeof(mem.pad2)))
+                       goto out;
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-29 22:45   ` Isaku Yamahata
@ 2022-09-29 23:22     ` Sean Christopherson
  0 siblings, 0 replies; 97+ messages in thread
From: Sean Christopherson @ 2022-09-29 23:22 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Sep 29, 2022, Isaku Yamahata wrote:
> On Thu, Sep 15, 2022 at 10:29:07PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		break;
> >  	}
> >  	case KVM_SET_USER_MEMORY_REGION: {
> > -		struct kvm_userspace_memory_region kvm_userspace_mem;
> > +		struct kvm_user_mem_region mem;
> > +		unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +		kvm_sanity_check_user_mem_region_alias();
> >  
> >  		r = -EFAULT;
> > -		if (copy_from_user(&kvm_userspace_mem, argp,
> > -						sizeof(kvm_userspace_mem)))
> > +		if (copy_from_user(&mem, argp, size))
> > +			goto out;
> > +
> > +		r = -EINVAL;
> > +		if (mem.flags & KVM_MEM_PRIVATE)
> >  			goto out;
> 
> Nit:  It's better to check if padding is zero.  Maybe rename it to reserved.
> 
> +               if (mem.pad1 || memchr_inv(mem.pad2, 0, sizeof(mem.pad2)))
> +                       goto out;

No need, KVM has more or less settled on using flags instead of "reserving" bytes.
E.g. if/when another fancy feature comes along, we'll add another KVM_MEM_XYZ
and only consume the relevant fields when the flag is set.  Reserving bytes
doesn't work very well because it assumes that '0' is an invalid value, e.g. if
the future expansion is for a non-private file descriptor, then we'd need a new
flag even if KVM reserved bytes since fd=0 is valid.

The only reason to bother with pad2[14] at this time is to avoid having to define
yet another struct if/when the struct needs to expand again.  The struct definition
will still need to be changed, but at least we won't end up with struct
kvm_userspace_memory_region_really_extended.
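
For illustration, a minimal sketch of that flag-gated pattern (the two-step
copy below is an assumption about how the ioctl could consume the extended
fields, not necessarily the final form of the patch):

	struct kvm_user_mem_region mem;

	r = -EFAULT;
	if (copy_from_user(&mem, argp,
			   sizeof(struct kvm_userspace_memory_region)))
		goto out;

	/* The extended fields are meaningful only when the flag is set. */
	if ((mem.flags & KVM_MEM_PRIVATE) &&
	    copy_from_user(&mem, argp,
			   sizeof(struct kvm_userspace_memory_region_ext)))
		goto out;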

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 6/8] KVM: Update lpage info when private/shared memory are mixed
  2022-09-29 16:52   ` Isaku Yamahata
@ 2022-09-30  8:59     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-09-30  8:59 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Sep 29, 2022 at 09:52:06AM -0700, Isaku Yamahata wrote:
> On Thu, Sep 15, 2022 at 10:29:11PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 08abad4f3e6f..a0f198cede3d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> ...
> > @@ -6894,3 +6899,115 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> >  	if (kvm->arch.nx_lpage_recovery_thread)
> >  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> >  }
> > +
> > +static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
> > +			      gfn_t start, gfn_t end)
> > +{
> > +	XA_STATE(xas, &kvm->mem_attr_array, start);
> > +	gfn_t gfn = start;
> > +	void *entry;
> > +	bool shared, private;
> > +	bool mixed = false;
> > +
> > +	if (attr == KVM_MEM_ATTR_SHARED) {
> > +		shared = true;
> > +		private = false;
> > +	} else {
> > +		shared = false;
> > +		private = true;
> > +	}
> 
> We don't have to care whether the target is shared or private.  We only
> need to check whether the attributes are the same or not.

There is an optimization chance if we know what we are going to set: we can
return 'mixed = true' earlier, as soon as we find the first opposite attr,
e.g. it's unnecessary to check all the child page attrs in one largepage to
reach a conclusion.

After a further look, the code can be refined as below:

--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7255,17 +7255,9 @@ static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
 	XA_STATE(xas, &kvm->mem_attr_array, start);
 	gfn_t gfn = start;
 	void *entry;
-	bool shared, private;
+	bool shared = attr == KVM_MEM_ATTR_SHARED;
 	bool mixed = false;
 
-	if (attr == KVM_MEM_ATTR_SHARED) {
-		shared = true;
-		private = false;
-	} else {
-		shared = false;
-		private = true;
-	}
-
 	rcu_read_lock();
 	entry = xas_load(&xas);
 	while (gfn < end) {
@@ -7274,12 +7266,7 @@ static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
 
 		KVM_BUG_ON(gfn != xas.xa_index, kvm);
 
-		if (entry)
-			private = true;
-		else
-			shared = true;
-
-		if (private && shared) {
+		if ((entry && !shared) || (!entry && shared)) {
 			mixed = true;
 			goto out;
 		}
@@ -7320,8 +7307,7 @@ static void update_mem_lpage_info(struct kvm *kvm,
 		 * we know they are not mixed.
 		 */
 		update_mixed(lpage_info_slot(lpage_start, slot, level),
-			     mem_attr_is_mixed(kvm, attr, lpage_start,
-							  lpage_start + pages));
+			     mem_attr_is_mixed(kvm, attr, lpage_start, start));
 
 		if (lpage_start == lpage_end)
 			return;
@@ -7330,7 +7316,7 @@ static void update_mem_lpage_info(struct kvm *kvm,
 			update_mixed(lpage_info_slot(gfn, slot, level), false);
 
 		update_mixed(lpage_info_slot(lpage_end, slot, level),
-			     mem_attr_is_mixed(kvm, attr, lpage_end,
+			     mem_attr_is_mixed(kvm, attr, end,
 							  lpage_end + pages));
 	}
 }
> 
> > +
> > +	rcu_read_lock();
> > +	entry = xas_load(&xas);
> > +	while (gfn < end) {
> > +		if (xas_retry(&xas, entry))
> > +			continue;
> > +
> > +		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > +		if (entry)
> > +			private = true;
> > +		else
> > +			shared = true;
> > +
> > +		if (private && shared) {
> > +			mixed = true;
> > +			goto out;
> > +		}
> > +
> > +		entry = xas_next(&xas);
> > +		gfn++;
> > +	}
> > +out:
> > +	rcu_read_unlock();
> > +	return mixed;
> > +}
> > +
> > +static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> > +{
> > +	if (mixed)
> > +		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +	else
> > +		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static void update_mem_lpage_info(struct kvm *kvm,
> > +				  struct kvm_memory_slot *slot,
> > +				  unsigned int attr,
> > +				  gfn_t start, gfn_t end)
> > +{
> > +	unsigned long lpage_start, lpage_end;
> > +	unsigned long gfn, pages, mask;
> > +	int level;
> > +
> > +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +		pages = KVM_PAGES_PER_HPAGE(level);
> > +		mask = ~(pages - 1);
> > +		lpage_start = start & mask;
> > +		lpage_end = (end - 1) & mask;
> > +
> > +		/*
> > +		 * We only need to scan the head and tail page, for middle pages
> > +		 * we know they are not mixed.
> > +		 */
> > +		update_mixed(lpage_info_slot(lpage_start, slot, level),
> > +			     mem_attr_is_mixed(kvm, attr, lpage_start,
> > +							  lpage_start + pages));
> > +
> > +		if (lpage_start == lpage_end)
> > +			return;
> > +
> > +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> > +			update_mixed(lpage_info_slot(gfn, slot, level), false);
> 
> 
> For the >2M case, we don't have to check every entry; just check the lower-level case.

Sounds good, we can reduce some scanning.

Thanks,
Chao
> 
> > +
> > +		update_mixed(lpage_info_slot(lpage_end, slot, level),
> > +			     mem_attr_is_mixed(kvm, attr, lpage_end,
> > +							  lpage_end + pages));
> > +	}
> > +}
> > +
> > +void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
> > +			      gfn_t start, gfn_t end)
> > +{
> > +	struct kvm_memory_slot *slot;
> > +	struct kvm_memslots *slots;
> > +	struct kvm_memslot_iter iter;
> > +	int i;
> > +
> > +	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> > +			"Unsupported mem attribute.\n");
> > +
> > +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +		slots = __kvm_memslots(kvm, i);
> > +
> > +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > +			slot = iter.slot;
> > +			start = max(start, slot->base_gfn);
> > +			end = min(end, slot->base_gfn + slot->npages);
> > +			if (WARN_ON_ONCE(start >= end))
> > +				continue;
> > +
> > +			update_mem_lpage_info(kvm, slot, attr, start, end);
> > +		}
> > +	}
> > +}
> 
> 
> Here is my updated version.
> 
> bool kvm_mem_attr_is_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> {
> 	gfn_t pages = KVM_PAGES_PER_HPAGE(level);
> 	gfn_t mask = ~(pages - 1);
> 	struct kvm_lpage_info *linfo = lpage_info_slot(gfn & mask, slot, level);
> 
> 	WARN_ON_ONCE(level == PG_LEVEL_4K);
> 	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> }
> 
> #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM_ATTR
> static void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
> {
> 	if (mixed)
> 		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> 	else
> 		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> }
> 
> static bool __mem_attr_is_mixed(struct kvm *kvm, gfn_t start, gfn_t end)
> {
> 	XA_STATE(xas, &kvm->mem_attr_array, start);
> 	bool mixed = false;
> 	gfn_t gfn = start;
> 	void *s_entry;
> 	void *entry;
> 
> 	rcu_read_lock();
> 	s_entry = xas_load(&xas);
> 	entry = s_entry;
> 	while (gfn < end) {
> 		if (xas_retry(&xas, entry))
> 			continue;
> 
> 		KVM_BUG_ON(gfn != xas.xa_index, kvm);
> 
> 		entry = xas_next(&xas);
> 		if (entry != s_entry) {
> 			mixed = true;
> 			break;
> 		}
> 		gfn++;
> 	}
> 	rcu_read_unlock();
> 	return mixed;
> }
> 
> static bool mem_attr_is_mixed(struct kvm *kvm,
> 			      struct kvm_memory_slot *slot, int level,
> 			      gfn_t start, gfn_t end)
> {
> 	struct kvm_lpage_info *child_linfo;
> 	unsigned long child_pages;
> 	bool mixed = false;
> 	unsigned long gfn;
> 	void *entry;
> 
> 	if (WARN_ON_ONCE(level == PG_LEVEL_4K))
> 		return false;
> 
> 	if (level == PG_LEVEL_2M)
> 		return __mem_attr_is_mixed(kvm, start, end);
> 
> 	/* This assumes that level - 1 is already updated. */
> 	rcu_read_lock();
> 	child_pages = KVM_PAGES_PER_HPAGE(level - 1);
> 	entry = xa_load(&kvm->mem_attr_array, start);
> 	for (gfn = start; gfn < end; gfn += child_pages) {
> 		child_linfo = lpage_info_slot(gfn, slot, level - 1);
> 		if (child_linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED) {
> 			mixed = true;
> 			break;
> 		}
> 		if (xa_load(&kvm->mem_attr_array, gfn) != entry) {
> 			mixed = true;
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> 	return mixed;
> }
> 
> static void update_mem_lpage_info(struct kvm *kvm,
> 				  struct kvm_memory_slot *slot,
> 				  unsigned int attr,
> 				  gfn_t start, gfn_t end)
> {
> 	unsigned long lpage_start, lpage_end;
> 	unsigned long gfn, pages, mask;
> 	int level;
> 
> 	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> 		pages = KVM_PAGES_PER_HPAGE(level);
> 		mask = ~(pages - 1);
> 		lpage_start = start & mask;
> 		lpage_end = (end - 1) & mask;
> 
> 		/*
> 		 * We only need to scan the head and tail page, for middle pages
> 		 * we know they are not mixed.
> 		 */
> 		update_mixed(lpage_info_slot(lpage_start, slot, level),
> 			     mem_attr_is_mixed(kvm, slot, level,
> 					       lpage_start, lpage_start + pages));
> 
> 		if (lpage_start == lpage_end)
> 			return;
> 
> 		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages)
> 			update_mixed(lpage_info_slot(gfn, slot, level), false);
> 
> 		update_mixed(lpage_info_slot(lpage_end, slot, level),
> 			     mem_attr_is_mixed(kvm, slot, level,
> 					       lpage_end, lpage_end + pages));
> 	}
> }
> 
> void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
> 			      gfn_t start, gfn_t end)
> {
> 	struct kvm_memory_slot *slot;
> 	struct kvm_memslots *slots;
> 	struct kvm_memslot_iter iter;
> 	int idx;
> 	int i;
> 
> 	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> 		  "Unsupported mem attribute.\n");
> 
> 	idx = srcu_read_lock(&kvm->srcu);
> 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> 		slots = __kvm_memslots(kvm, i);
> 
> 		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> 			slot = iter.slot;
> 			start = max(start, slot->base_gfn);
> 			end = min(end, slot->base_gfn + slot->npages);
> 			if (WARN_ON_ONCE(start >= end))
> 				continue;
> 
> 			update_mem_lpage_info(kvm, slot, attr, start, end);
> 		}
> 	}
> 	srcu_read_unlock(&kvm->srcu, idx);
> }
> #endif
> 
> 
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
  2022-09-19  9:12   ` David Hildenbrand
  2022-09-22 13:26   ` Wang, Wei W
@ 2022-09-30 16:14   ` Fuad Tabba
  2022-09-30 16:23     ` Kirill A . Shutemov
  2022-10-06  8:50   ` Fuad Tabba
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-09-30 16:14 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

<...>

> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/memfd.h>
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +
> +struct inaccessible_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> +                                pgoff_t start, pgoff_t end)
> +{
> +       struct inaccessible_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               notifier->ops->invalidate(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int inaccessible_release(struct inode *inode, struct file *file)
> +{
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long inaccessible_fallocate(struct file *file, int mode,
> +                                  loff_t offset, loff_t len)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);

I think that shmem_file_operations.fallocate is only set if
CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
initialization that fallocate is set, or maybe a config dependency, or
can we count on it always being enabled?
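
For instance, a minimal sketch of such an initialization-time check (the
helper below is hypothetical, not something this series defines):

	/* Refuse a backing file that cannot service fallocate(). */
	static inline bool memfd_supports_fallocate(struct file *memfd)
	{
		return memfd->f_op && memfd->f_op->fallocate;
	}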

> +       inaccessible_notifier_invalidate(data, offset, offset + len);
> +       return ret;
> +}
> +

<...>

> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);

If the memfd wasn't marked as inaccessible, or more generally
speaking, if the file isn't a memfd_inaccessible file, this ends up
accessing an uninitialized pointer for the notifier list. Should there
be a check for that here, and have this function return an error if
that's not the case?

Thanks,
/fuad



> +
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pfn = page_to_pfn_t(page);
> +       *order = thp_order(compound_head(page));
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +       struct page *page = pfn_t_to_page(pfn);
> +
> +       if (WARN_ON_ONCE(!page))
> +               return;
> +
> +       put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-27 22:47             ` Sean Christopherson
@ 2022-09-30 16:19               ` Fuad Tabba
  2022-10-13 13:34                 ` Chao Peng
  2022-10-18  0:33                 ` Sean Christopherson
  0 siblings, 2 replies; 97+ messages in thread
From: Fuad Tabba @ 2022-09-30 16:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, David Hildenbrand, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier

Hi,

On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > >      memory into the guest (after pre-boot phase).
> > > > >
> > > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > >      and only if the entire gfn range of the associated memslot is shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)

 :D

> Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
> granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> X doesn't even have to be that high to get reasonable performance, e.g. assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> work just fine in KVM.

The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it
doesn't know where the guest might want to place the DMA pool. If I
understand this correctly, for this to work, all memslots would need
to be the same size and sharing would always need to happen at that
granularity.

Moreover, for something like a small DMA pool this might scale, but
I'm not sure about potential future workloads (e.g., multimedia
in-place sharing).

>
> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page level?
> > > That can cause VMA fragmentation if the conversion is frequent, as far as
> > > I can see. Even with a KVM ioctl for mapping as mentioned below, I think
> > > there will be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> the end result is the same.
>
> Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
> current approach is to do that only in the stage-2 page tables, i.e. only in the
> context of the hypervisor.  Which is also the source of the gup() problems; the
> untrusted kernel is blissfully unaware that the memory is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so that
> the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> all of guest memory becomes unguppable, but that's likely not a viable option.

Actually, for pKVM, there is no need for the guest memory to be
GUP'able at all if we use the new inaccessible_get_pfn(). This of
course goes back to what I'd mentioned before in v7; it seems that
representing the memslot memory as a file descriptor should be
orthogonal to whether the memory is shared or private, rather than a
private_fd for private memory and the userspace_addr for shared
memory. The host can then map or unmap the shared/private memory using
the fd, which allows it more freedom in even choosing to unmap shared
memory when not needed, for example.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-30 16:14   ` Fuad Tabba
@ 2022-09-30 16:23     ` Kirill A . Shutemov
  2022-10-03  7:33       ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Kirill A . Shutemov @ 2022-09-30 16:23 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, luto, jun.nakajima, dave.hansen, ak, david, aarcange,
	ddutile, dhildenb, Quentin Perret, Michael Roth, mhocko,
	Muchun Song, wei.w.wang

On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index 000000000000..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> > @@ -0,0 +1,219 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/memfd.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +
> > +struct inaccessible_data {
> > +       struct mutex lock;
> > +       struct file *memfd;
> > +       struct list_head notifiers;
> > +};
> > +
> > +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> > +                                pgoff_t start, pgoff_t end)
> > +{
> > +       struct inaccessible_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->invalidate(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static int inaccessible_release(struct inode *inode, struct file *file)
> > +{
> > +       struct inaccessible_data *data = inode->i_mapping->private_data;
> > +
> > +       fput(data->memfd);
> > +       kfree(data);
> > +       return 0;
> > +}
> > +
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +                                  loff_t offset, loff_t len)
> > +{
> > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       int ret;
> > +
> > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +                       return -EINVAL;
> > +       }
> > +
> > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> 
> I think that shmem_file_operations.fallocate is only set if
> CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> initialization that fallocate is set, or maybe a config dependency, or
> can we count on it always being enabled?

It is already there:

	config MEMFD_CREATE
		def_bool TMPFS || HUGETLBFS

And we reject inaccessible memfd_create() for HUGETLBFS.
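
Roughly along these lines (a sketch; the exact flag handling in the
series may differ):

	/* In memfd_create(): the two flags are mutually exclusive. */
	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
		return -EINVAL;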

But if we go with a separate syscall, yes, we need the dependency.

> > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > +       return ret;
> > +}
> > +
> 
> <...>
> 
> > +void inaccessible_register_notifier(struct file *file,
> > +                                   struct inaccessible_notifier *notifier)
> > +{
> > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_add(&notifier->list, &data->notifiers);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> 
> If the memfd wasn't marked as inaccessible, or more generally
> speaking, if the file isn't a memfd_inaccessible file, this ends up
> accessing an uninitialized pointer for the notifier list. Should there
> be a check for that here, and have this function return an error if
> that's not the case?

I think it is "don't do that" category. inaccessible_register_notifier()
caller has to know what file it operates on, no?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-30 16:23     ` Kirill A . Shutemov
@ 2022-10-03  7:33       ` Fuad Tabba
  2022-10-03 11:01         ` Kirill A. Shutemov
  0 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-03  7:33 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, luto, jun.nakajima, dave.hansen, ak, david, aarcange,
	ddutile, dhildenb, Quentin Perret, Michael Roth, mhocko,
	Muchun Song, wei.w.wang

Hi

On Fri, Sep 30, 2022 at 5:23 PM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> > Hi,
> >
> > <...>
> >
> > > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > > new file mode 100644
> > > index 000000000000..2d33cbdd9282
> > > --- /dev/null
> > > +++ b/mm/memfd_inaccessible.c
> > > @@ -0,0 +1,219 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "linux/sbitmap.h"
> > > +#include <linux/memfd.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/pseudo_fs.h>
> > > +#include <linux/shmem_fs.h>
> > > +#include <uapi/linux/falloc.h>
> > > +#include <uapi/linux/magic.h>
> > > +
> > > +struct inaccessible_data {
> > > +       struct mutex lock;
> > > +       struct file *memfd;
> > > +       struct list_head notifiers;
> > > +};
> > > +
> > > +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> > > +                                pgoff_t start, pgoff_t end)
> > > +{
> > > +       struct inaccessible_notifier *notifier;
> > > +
> > > +       mutex_lock(&data->lock);
> > > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > > +               notifier->ops->invalidate(notifier, start, end);
> > > +       }
> > > +       mutex_unlock(&data->lock);
> > > +}
> > > +
> > > +static int inaccessible_release(struct inode *inode, struct file *file)
> > > +{
> > > +       struct inaccessible_data *data = inode->i_mapping->private_data;
> > > +
> > > +       fput(data->memfd);
> > > +       kfree(data);
> > > +       return 0;
> > > +}
> > > +
> > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > +                                  loff_t offset, loff_t len)
> > > +{
> > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > +       struct file *memfd = data->memfd;
> > > +       int ret;
> > > +
> > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +                       return -EINVAL;
> > > +       }
> > > +
> > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >
> > I think that shmem_file_operations.fallocate is only set if
> > CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> > initialization that fallocate is set, or maybe a config dependency, or
> > can we count on it always being enabled?
>
> It is already there:
>
>         config MEMFD_CREATE
>                 def_bool TMPFS || HUGETLBFS
>
> And we reject inaccessible memfd_create() for HUGETLBFS.
>
> But if we go with a separate syscall, yes, we need the dependency.

I missed that, thanks.

>
> > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > +       return ret;
> > > +}
> > > +
> >
> > <...>
> >
> > > +void inaccessible_register_notifier(struct file *file,
> > > +                                   struct inaccessible_notifier *notifier)
> > > +{
> > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > +
> > > +       mutex_lock(&data->lock);
> > > +       list_add(&notifier->list, &data->notifiers);
> > > +       mutex_unlock(&data->lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> >
> > If the memfd wasn't marked as inaccessible, or more generally
> > speaking, if the file isn't a memfd_inaccessible file, this ends up
> > accessing an uninitialized pointer for the notifier list. Should there
> > be a check for that here, and have this function return an error if
> > that's not the case?
>
> I think it is "don't do that" category. inaccessible_register_notifier()
> caller has to know what file it operates on, no?

The thing is, you could oops the kernel from userspace. For that, all
you have to do is a memfd_create without the MFD_INACCESSIBLE,
followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
I ran into this using my port of this patch series to arm64.
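
Concretely, a sketch of the repro (field names follow this series; error
handling omitted, so treat it as illustrative only):

	int fd = memfd_create("guest-mem", 0);	/* note: no MFD_INACCESSIBLE */
	struct kvm_userspace_memory_region_ext region = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0,
			.memory_size     = 0x200000,
		},
		.private_fd     = fd,
		.private_offset = 0,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);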

Cheers,
/fuad


> --
>   Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-03  7:33       ` Fuad Tabba
@ 2022-10-03 11:01         ` Kirill A. Shutemov
  2022-10-04 15:39           ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Kirill A. Shutemov @ 2022-10-03 11:01 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Kirill A . Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > I think it is "don't do that" category. inaccessible_register_notifier()
> > caller has to know what file it operates on, no?
> 
> The thing is, you could oops the kernel from userspace. For that, all
> you have to do is a memfd_create without the MFD_INACCESSIBLE,
> followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> I ran into this using my port of this patch series to arm64.

My point is that it has to be handled on a different level. KVM has to
reject the private_fd if it is not inaccessible. It should be trivial by
checking file->f_inode->i_sb->s_magic.
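
Something along these lines (INACCESSIBLE_MAGIC is a placeholder for
whatever magic the pseudo filesystem registers):

	/* Refuse a private_fd that is not an inaccessible memfd. */
	if (file->f_inode->i_sb->s_magic != INACCESSIBLE_MAGIC)
		return -EINVAL;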

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-09-15 14:29 ` [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-10-04 14:55   ` Jarkko Sakkinen
  2022-10-10  8:31     ` Chao Peng
  2022-10-06  8:55   ` Fuad Tabba
  1 sibling, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-04 14:55 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

[-- Attachment #1: Type: text/plain, Size: 1762 bytes --]

On Thu, Sep 15, 2022 at 10:29:13PM +0800, Chao Peng wrote:
> Expose KVM_MEM_PRIVATE and memslot fields private_fd/offset to
> userspace. KVM will register/unregister private memslots with the fd-based
> memory backing store and respond to invalidation events from
> inaccessible_notifier to zap the existing memory mappings in the
> secondary page table.
> 
> Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> by architecture code, which can turn it on by overriding the default
> kvm_arch_has_private_mem().
> 
> A 'kvm' reference is added to the memslot structure since in the
> inaccessible_notifier callback we can only obtain a memslot reference,
> but 'kvm' is needed to do the zapping.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>

ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_free_memslot':
kvm_main.c:(.text+0x1385): undefined reference to `inaccessible_unregister_notifier'
ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_set_memslot':
kvm_main.c:(.text+0x1b86): undefined reference to `inaccessible_register_notifier'
ld: kvm_main.c:(.text+0x1c85): undefined reference to `inaccessible_unregister_notifier'
ld: arch/x86/kvm/mmu/mmu.o: in function `kvm_faultin_pfn':
mmu.c:(.text+0x1e38): undefined reference to `inaccessible_get_pfn'
ld: arch/x86/kvm/mmu/mmu.o: in function `direct_page_fault':
mmu.c:(.text+0x67ca): undefined reference to `inaccessible_put_pfn'
make: *** [Makefile:1169: vmlinux] Error 1

I attached kernel config for reproduction.

The problem is that CONFIG_MEMFD_CREATE does not get enabled:

mm/Makefile:obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o

BR, Jarkko

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 45091 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 6.0.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (GCC) 12.2.1 20220819 (Red Hat 12.2.1-2)"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=120201
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=23700
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=23700
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=123
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_XZ=y
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SYSVIPC is not set
# CONFIG_WATCH_QUEUE is not set
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_USELIB is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100
# end of Timers subsystem

CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
# CONFIG_BPF_SYSCALL is not set
# end of BPF subsystem

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_DYNAMIC is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

#
# RCU Subsystem
#
CONFIG_TINY_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TINY_SRCU=y
# end of RCU Subsystem

# CONFIG_IKCONFIG is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC12_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_CGROUPS is not set
# CONFIG_CHECKPOINT_RESTORE is not set
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_RELAY is not set
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_BOOT_CONFIG is not set
# CONFIG_INITRAMFS_PRESERVE_MTIME is not set
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_LD_ORPHAN_WARN=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
# CONFIG_MULTIUSER is not set
# CONFIG_SGETMASK_SYSCALL is not set
# CONFIG_SYSFS_SYSCALL is not set
# CONFIG_FHANDLE is not set
# CONFIG_POSIX_TIMERS is not set
# CONFIG_PRINTK is not set
# CONFIG_BUG is not set
# CONFIG_PCSPKR_PLATFORM is not set
# CONFIG_BASE_FULL is not set
# CONFIG_FUTEX is not set
# CONFIG_EPOLL is not set
# CONFIG_SIGNALFD is not set
# CONFIG_TIMERFD is not set
CONFIG_EVENTFD=y
# CONFIG_SHMEM is not set
# CONFIG_AIO is not set
# CONFIG_IO_URING is not set
# CONFIG_ADVISE_SYSCALLS is not set
# CONFIG_MEMBARRIER is not set
# CONFIG_KALLSYMS is not set
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
# CONFIG_KCMP is not set
# CONFIG_RSEQ is not set
CONFIG_EMBEDDED=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_GUEST_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

# CONFIG_PROFILING is not set
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_NR_GPIO=1024
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DYNAMIC_PHYSICAL_MASK=y
CONFIG_PGTABLE_LEVELS=5
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
# CONFIG_SMP is not set
# CONFIG_X86_FEATURE_NAMES is not set
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
# CONFIG_X86_CPU_RESCTRL is not set
# CONFIG_X86_EXTENDED_PLATFORM is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
# CONFIG_IOSF_MBI is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
# CONFIG_HYPERVISOR_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
# CONFIG_PROCESSOR_SELECT is not set
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
# CONFIG_DMI is not set
# CONFIG_GART_IOMMU is not set
CONFIG_NR_CPUS_RANGE_BEGIN=1
CONFIG_NR_CPUS_RANGE_END=1
CONFIG_NR_CPUS_DEFAULT=1
CONFIG_NR_CPUS=1
CONFIG_UP_LATE_INIT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
# CONFIG_PERF_EVENTS_AMD_UNCORE is not set
# CONFIG_PERF_EVENTS_AMD_BRS is not set
# end of Performance monitoring

CONFIG_X86_VSYSCALL_EMULATION=y
# CONFIG_X86_IOPL_IOPERM is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
CONFIG_X86_5LEVEL=y
CONFIG_X86_DIRECT_GBPAGES=y
CONFIG_X86_MEM_ENCRYPT=y
CONFIG_AMD_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
# CONFIG_MTRR is not set
# CONFIG_X86_UMIP is not set
CONFIG_CC_HAS_IBT=y
# CONFIG_X86_KERNEL_IBT is not set
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_X86_INTEL_TSX_MODE_OFF=y
# CONFIG_X86_INTEL_TSX_MODE_ON is not set
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
CONFIG_X86_SGX=y
# CONFIG_EFI is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
# CONFIG_KEXEC is not set
# CONFIG_KEXEC_FILE is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x1000000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_DYNAMIC_MEMORY_LAYOUT=y
CONFIG_LEGACY_VSYSCALL_XONLY=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
# CONFIG_MODIFY_LDT_SYSCALL is not set
# CONFIG_STRICT_SIGALTSTACK_SIZE is not set
CONFIG_HAVE_LIVEPATCH=y
# end of Processor type and features

CONFIG_CC_HAS_SLS=y
CONFIG_CC_HAS_RETURN_THUNK=y
# CONFIG_SPECULATION_MITIGATIONS is not set
CONFIG_ARCH_HAS_ADD_PAGES=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y

#
# Power management and ACPI options
#
# CONFIG_SUSPEND is not set
# CONFIG_PM is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
# CONFIG_ACPI_FPDT is not set
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
# CONFIG_ACPI_TINY_POWER_BUTTON is not set
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_CUSTOM_DSDT_FILE=""
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_CONTAINER is not set
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
# CONFIG_ACPI_APEI is not set
# CONFIG_ACPI_DPTF is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_ACPI_PFRUT is not set
# CONFIG_PMIC_OPREGION is not set
CONFIG_X86_PM_TIMER=y

#
# CPU Frequency scaling
#
# CONFIG_CPU_FREQ is not set
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
# CONFIG_CPU_IDLE_GOV_MENU is not set
# CONFIG_CPU_IDLE_GOV_TEO is not set
# end of CPU Idle

# CONFIG_INTEL_IDLE is not set
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
# CONFIG_ISA_BUS is not set
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
# CONFIG_IA32_EMULATION is not set
# CONFIG_X86_X32_ABI is not set
# end of Binary Emulations

CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_PFNCACHE=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_DIRTY_RING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_HAVE_KVM_NO_POLL=y
CONFIG_KVM_XFER_TO_GUEST_WORK=y
CONFIG_HAVE_KVM_PRIVATE_MEM=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_WERROR=y
# CONFIG_KVM_INTEL is not set
CONFIG_KVM_AMD=y
CONFIG_KVM_AMD_SEV=y
# CONFIG_KVM_XEN is not set
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y

#
# General architecture-dependent options
#
CONFIG_GENERIC_ENTRY=y
# CONFIG_JUMP_LABEL is not set
# CONFIG_STATIC_CALL_SELFTEST is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_MERGE_VMAS=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
# CONFIG_SECCOMP is not set
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
# CONFIG_STACKPROTECTOR is not set
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_LTO_NONE=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING_USER=y
CONFIG_HAVE_CONTEXT_TRACKING_USER_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_HAVE_OBJTOOL=y
CONFIG_HAVE_JUMP_LABEL_HACK=y
CONFIG_HAVE_NOINSTR_HACK=y
CONFIG_HAVE_NOINSTR_VALIDATION=y
CONFIG_HAVE_UACCESS_VALIDATION=y
CONFIG_HAVE_STACK_VALIDATION=y
# CONFIG_COMPAT_32BIT_TIME is not set
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_ARCH_HAS_CC_PLATFORM=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y
CONFIG_DYNAMIC_SIGFRAME=y
CONFIG_HAVE_ARCH_NODE_DEV_GROUP=y

#
# GCOV-based kernel profiling
#
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
# CONFIG_GCC_PLUGINS is not set
# end of General architecture-dependent options

CONFIG_BASE_SMALL=1
# CONFIG_MODULES is not set
# CONFIG_BLOCK is not set
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_ASN1=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y

#
# Executable file formats
#
# CONFIG_BINFMT_ELF is not set
# CONFIG_BINFMT_SCRIPT is not set
# CONFIG_BINFMT_MISC is not set
# CONFIG_COREDUMP is not set
# end of Executable file formats

#
# Memory Management options
#

#
# SLAB allocator options
#
# CONFIG_SLAB is not set
# CONFIG_SLUB is not set
CONFIG_SLOB=y
# end of SLAB allocator options

# CONFIG_SHUFFLE_PAGE_ALLOCATOR is not set
# CONFIG_COMPAT_BRK is not set
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_EXCLUSIVE_SYSTEM_RAM=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
# CONFIG_COMPACTION is not set
# CONFIG_PAGE_REPORTING is not set
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_WANTS_THP_SWAP=y
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_NEED_PER_CPU_KM=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
# CONFIG_CMA is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ARCH_HAS_ZONE_DMA_SET=y
# CONFIG_ZONE_DMA is not set
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
# CONFIG_VM_EVENT_COUNTERS is not set
# CONFIG_PERCPU_STATS is not set

#
# GUP_TEST needs to have DEBUG_FS enabled
#
CONFIG_ARCH_HAS_PTE_SPECIAL=y
# CONFIG_USERFAULTFD is not set

#
# Data Access Monitoring
#
# CONFIG_DAMON is not set
# end of Data Access Monitoring
# end of Memory Management options

# CONFIG_NET is not set

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
# CONFIG_PCIE_PTM is not set
# CONFIG_PCI_MSI is not set
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_PCI_LOCKLESS_CONFIG=y
# CONFIG_PCI_IOV is not set
# CONFIG_PCI_PRI is not set
# CONFIG_PCI_PASID is not set
CONFIG_PCI_LABEL=y
# CONFIG_PCIE_BUS_TUNE_OFF is not set
CONFIG_PCIE_BUS_DEFAULT=y
# CONFIG_PCIE_BUS_SAFE is not set
# CONFIG_PCIE_BUS_PERFORMANCE is not set
# CONFIG_PCIE_BUS_PEER2PEER is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16

#
# PCI controller drivers
#

#
# DesignWare PCI Core Support
#
# end of DesignWare PCI Core Support

#
# Mobiveil PCIe Core Support
#
# end of Mobiveil PCIe Core Support

#
# Cadence PCIe controllers support
#
# end of Cadence PCIe controllers support
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

# CONFIG_CXL_BUS is not set
# CONFIG_PCCARD is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
# CONFIG_UEVENT_HELPER is not set
# CONFIG_DEVTMPFS is not set
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set

#
# Firmware loader
#
# CONFIG_FW_LOADER is not set
# end of Firmware loader

# CONFIG_ALLOW_DEV_COREDUMP is not set
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# CONFIG_MHI_BUS_EP is not set
# end of Bus devices

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

# CONFIG_EDD is not set
# CONFIG_FIRMWARE_MEMMAP is not set
# CONFIG_SYSFB_SIMPLEFB is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

# CONFIG_GNSS is not set
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y

#
# NVME Support
#
# end of NVME Support

#
# Misc devices
#
# CONFIG_DUMMY_IRQ is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_SRAM is not set
# CONFIG_DW_XDATA_PCIE is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_93CX6 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# end of Texas Instruments shared transport line discipline

#
# Altera FPGA firmware download module (requires I2C)
#
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_HABANA_AI is not set
# CONFIG_PVPANIC is not set
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# end of SCSI device support

# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

# CONFIG_MACINTOSH_DRIVERS is not set

#
# Input device support
#
# CONFIG_INPUT is not set

#
# Hardware I/O ports
#
# CONFIG_SERIO is not set
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
# CONFIG_TTY is not set
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
# CONFIG_HW_RANDOM_BA431 is not set
CONFIG_HW_RANDOM_VIA=y
# CONFIG_HW_RANDOM_XIPHERA is not set
# CONFIG_APPLICOM is not set
# CONFIG_DEVMEM is not set
# CONFIG_NVRAM is not set
CONFIG_DEVPORT=y
# CONFIG_HPET is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# CONFIG_RANDOM_TRUST_CPU is not set
# CONFIG_RANDOM_TRUST_BOOTLOADER is not set
# end of Character devices

#
# I2C support
#
# CONFIG_I2C is not set
# end of I2C support

# CONFIG_I3C is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
# CONFIG_PPS is not set

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK_OPTIONAL=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# end of PTP clock support

# CONFIG_PINCTRL is not set
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_SAMSUNG_SDI is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_BATTERY_GOLDFISH is not set
# CONFIG_HWMON is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_X86_PKG_TEMP_THERMAL=y
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# end of ACPI INT340X thermal drivers

# CONFIG_INTEL_PCH_THERMAL is not set
# CONFIG_INTEL_TCC_COOLING is not set
# CONFIG_INTEL_MENLOW is not set
# end of Intel thermal drivers

# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_MADERA is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set

#
# CEC support
#
# CONFIG_MEDIA_CEC_SUPPORT is not set
# end of CEC support

# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
# CONFIG_AGP is not set
# CONFIG_VGA_SWITCHEROO is not set
# CONFIG_DRM is not set
# CONFIG_DRM_DEBUG_MODESET_LOCK is not set

#
# ARM devices
#
# end of ARM devices

#
# Frame buffer Devices
#
# CONFIG_FB is not set
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
# CONFIG_LCD_CLASS_DEVICE is not set
# CONFIG_BACKLIGHT_CLASS_DEVICE is not set
# end of Backlight & LCD device support
# end of Graphics support

# CONFIG_SOUND is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
# CONFIG_USB_SUPPORT is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
# CONFIG_RTC_CLASS is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_ACPI=y
# CONFIG_ALTERA_MSGDMA is not set
# CONFIG_INTEL_IDMA64 is not set
# CONFIG_INTEL_IDXD_COMPAT is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_PLX_DMA is not set
# CONFIG_AMD_PTDMA is not set
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
# CONFIG_SF_PDMA is not set
# CONFIG_INTEL_LDMA is not set

#
# DMA Clients
#
# CONFIG_ASYNC_TX_DMA is not set
# CONFIG_DMATEST is not set

#
# DMABUF options
#
# CONFIG_SYNC_FILE is not set
# CONFIG_DMABUF_HEAPS is not set
# end of DMABUF options

# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=y
# CONFIG_VIRT_DRIVERS is not set
# CONFIG_VIRTIO_MENU is not set
# CONFIG_VHOST_MENU is not set

#
# Microsoft Hyper-V guest support
#
# end of Microsoft Hyper-V guest support

# CONFIG_COMEDI is not set
# CONFIG_STAGING is not set
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
# CONFIG_SURFACE_PLATFORMS is not set
# CONFIG_X86_PLATFORM_DEVICES is not set
# CONFIG_P2SB is not set
# CONFIG_COMMON_CLK is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

# CONFIG_MAILBOX is not set
# CONFIG_IOMMU_SUPPORT is not set

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# fujitsu SoC drivers
#
# end of fujitsu SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set

#
# PHY Subsystem
#
# CONFIG_GENERIC_PHY is not set
# CONFIG_PHY_CAN_TRANSCEIVER is not set

#
# PHY drivers for Broadcom platforms
#
# CONFIG_BCM_KONA_USB2_PHY is not set
# end of PHY drivers for Broadcom platforms

# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_INTEL_LGM_EMMC is not set
# end of PHY Subsystem

# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_USB4 is not set

#
# Android
#
# CONFIG_ANDROID_BINDER_IPC is not set
# end of Android

# CONFIG_DAX is not set
# CONFIG_NVMEM is not set

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
# CONFIG_TEE is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_PECI is not set
# CONFIG_HTE is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_VALIDATE_FS_PARSER is not set
# CONFIG_EXPORTFS_BLOCK_OPS is not set
# CONFIG_FILE_LOCKING is not set
# CONFIG_FS_ENCRYPTION is not set
# CONFIG_FS_VERITY is not set
# CONFIG_DNOTIFY is not set
# CONFIG_INOTIFY_USER is not set
# CONFIG_FANOTIFY is not set
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS4_FS is not set
# CONFIG_AUTOFS_FS is not set
# CONFIG_FUSE_FS is not set
# CONFIG_OVERLAY_FS is not set

#
# Caches
#
# CONFIG_FSCACHE is not set
# end of Caches

#
# Pseudo filesystems
#
# CONFIG_PROC_FS is not set
# CONFIG_PROC_CHILDREN is not set
# CONFIG_SYSFS is not set
# CONFIG_HUGETLBFS is not set
CONFIG_ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
# CONFIG_CONFIGFS_FS is not set
# end of Pseudo filesystems

# CONFIG_MISC_FILESYSTEMS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
# CONFIG_NLS_UTF8 is not set
# CONFIG_UNICODE is not set
# end of File systems

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
# CONFIG_SECURITYFS is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,bpf"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_CC_HAS_AUTO_VAR_INIT_PATTERN=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO=y
# CONFIG_INIT_STACK_NONE is not set
# CONFIG_INIT_STACK_ALL_PATTERN is not set
CONFIG_INIT_STACK_ALL_ZERO=y
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
# CONFIG_ZERO_CALL_USED_REGS is not set
# end of Memory initialization

CONFIG_RANDSTRUCT_NONE=y
# end of Kernel hardening options
# end of Security options

CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=y
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
# CONFIG_CRYPTO_ECDSA is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_SM2 is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# CONFIG_CRYPTO_CURVE25519_X86 is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
# CONFIG_CRYPTO_SEQIV is not set
# CONFIG_CRYPTO_ECHAINIV is not set

#
# Block modes
#
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_CFB is not set
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_ADIANTUM is not set
# CONFIG_CRYPTO_HCTR2 is not set
# CONFIG_CRYPTO_ESSIV is not set

#
# Hash modes
#
# CONFIG_CRYPTO_CMAC is not set
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
# CONFIG_CRYPTO_XXHASH is not set
# CONFIG_CRYPTO_BLAKE2B is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
# CONFIG_CRYPTO_CRCT10DIF is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_POLYVAL_CLMUL_NI is not set
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_MD4 is not set
# CONFIG_CRYPTO_MD5 is not set
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD160 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
CONFIG_CRYPTO_SHA256=y
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_SM3_GENERIC is not set
# CONFIG_CRYPTO_SM3_AVX_X86_64 is not set
# CONFIG_CRYPTO_STREEBOG is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
# CONFIG_CRYPTO_AES is not set
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_ARIA is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4_GENERIC is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_LZO is not set
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_DRBG_MENU is not set
# CONFIG_CRYPTO_JITTERENTROPY is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
CONFIG_CRYPTO_DEV_CCP=y
CONFIG_CRYPTO_DEV_CCP_DD=y
CONFIG_CRYPTO_DEV_SP_CCP=y
CONFIG_CRYPTO_DEV_CCP_CRYPTO=y
CONFIG_CRYPTO_DEV_SP_PSP=y
# CONFIG_CRYPTO_DEV_CCP_DEBUGFS is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_4XXX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
# CONFIG_CRYPTO_DEV_SAFEXCEL is not set
# CONFIG_CRYPTO_DEV_AMLOGIC_GXL is not set

#
# Certificates for signature checking
#
# end of Certificates for signature checking

#
# Library routines
#
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA1=y
CONFIG_CRYPTO_LIB_SHA256=y
# end of Crypto library routines

CONFIG_LIB_MEMNEQ=y
# CONFIG_CRC_CCITT is not set
# CONFIG_CRC16 is not set
# CONFIG_CRC_T10DIF is not set
# CONFIG_CRC64_ROCKSOFT is not set
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC64 is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
# CONFIG_CRC8 is not set
# CONFIG_RANDOM32_SELFTEST is not set
# CONFIG_XZ_DEC is not set
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED=y
CONFIG_SWIOTLB=y
CONFIG_DMA_COHERENT_POOL=y
# CONFIG_DMA_API_DEBUG is not set
CONFIG_SGL_ALLOC=y
CONFIG_CLZ_TAB=y
# CONFIG_IRQ_POLL is not set
CONFIG_MPILIB=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
# end of Library routines

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_SYMBOLIC_ERRNAME is not set
# end of printk and dmesg options

CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_MISC is not set

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO_NONE=y
# CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT is not set
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_DEBUG_INFO_DWARF5 is not set
CONFIG_FRAME_WARN=1024
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
# CONFIG_HEADERS_INSTALL is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
# CONFIG_SECTION_MISMATCH_WARN_ONLY is not set
# CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B is not set
CONFIG_OBJTOOL=y
# CONFIG_VMLINUX_MAP is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
# CONFIG_MAGIC_SYSRQ is not set
# CONFIG_DEBUG_FS is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y
# CONFIG_KCSAN is not set
# end of Generic Kernel Debugging Instruments

#
# Networking Debugging
#
# CONFIG_NET_DEV_REFCNT_TRACKER is not set
# CONFIG_NET_NS_REFCNT_TRACKER is not set
# end of Networking Debugging

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_TABLE_CHECK is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_DEBUG_OBJECTS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VM_PGTABLE is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_MEMORY_INIT is not set
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
# CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
CONFIG_HAVE_ARCH_KFENCE=y
# end of Memory Debugging

# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Oops, Lockups and Hangs
#
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
# CONFIG_SOFTLOCKUP_DETECTOR is not set
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_HARDLOCKUP_DETECTOR is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_INFO=y
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

# CONFIG_DEBUG_IRQFLAGS is not set
# CONFIG_STACKTRACE is not set
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# end of Debug kernel data structures

# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_HAVE_RETHOOK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_HAVE_BUILDTIME_MCOUNT_SORT=y
CONFIG_TRACING_SUPPORT=y
# CONFIG_FTRACE is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y

#
# x86 Debugging
#
# CONFIG_X86_VERBOSE_BOOTUP is not set
# CONFIG_EARLY_PRINTK is not set
# CONFIG_DEBUG_TLBFLUSH is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_PUNIT_ATOM_DEBUG is not set
# CONFIG_UNWINDER_ORC is not set
# CONFIG_UNWINDER_FRAME_POINTER is not set
CONFIG_UNWINDER_GUESS=y
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
# CONFIG_RUNTIME_TESTING_MENU is not set
CONFIG_ARCH_USE_MEMTEST=y
# CONFIG_MEMTEST is not set
# end of Kernel Testing and Coverage
# end of Kernel hacking

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-03 11:01         ` Kirill A. Shutemov
@ 2022-10-04 15:39           ` Fuad Tabba
  0 siblings, 0 replies; 97+ messages in thread
From: Fuad Tabba @ 2022-10-04 15:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A . Shutemov, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, luto, jun.nakajima,
	dave.hansen, ak, david, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

On Mon, Oct 3, 2022 at 12:01 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > > I think it is in the "don't do that" category. The
> > > inaccessible_register_notifier() caller has to know what file it
> > > operates on, no?
> >
> > The thing is, you could oops the kernel from userspace. For that, all
> > you have to do is a memfd_create without the MFD_INACCESSIBLE,
> > followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> > I ran into this using my port of this patch series to arm64.
>
> My point is that it has to be handled on a different level. KVM has to
> reject private_fd if it is not inaccessible. It should be trivial by
> checking file->f_inode->i_sb->s_magic.

Yes, that makes sense.
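
For reference, the triggering sequence on the userspace side is
roughly this (a sketch; vm_fd is the VM file descriptor, error
handling omitted):

	/* memfd created *without* MFD_INACCESSIBLE */
	int memfd = memfd_create("guest-private", 0);

	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0,
			.memory_size     = 0x200000,
		},
		.private_fd     = memfd,
		.private_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);

With the s_magic check in place, KVM could fail the ioctl early
instead of oopsing later.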

Thanks,
/fuad

> --
>   Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
                     ` (2 preceding siblings ...)
  2022-09-29 22:45   ` Isaku Yamahata
@ 2022-10-05 13:04   ` Jarkko Sakkinen
  2022-10-05 22:05     ` Jarkko Sakkinen
  2022-10-06  9:00   ` Fuad Tabba
  2022-10-06 14:58   ` Jarkko Sakkinen
  5 siblings, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-05 13:04 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with a
> special key and can be accessed only by the VM itself. We call such
> memory private memory. It is of little value, and sometimes harmful,
> to allow userspace to access guest private memory. This patch extends
> the KVM memslot definition so that guest private memory can be
> provided through an inaccessible_notifier-enlightened file descriptor
> (fd), without being mmaped into userspace.
> 
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds
> two additional KVM memslot fields, private_fd/private_offset, to allow
> userspace to specify that guest private memory is provided from the
> private_fd and that guest_phys_addr is mapped at the private_offset of
> the private_fd, spanning a range of memory_size.
> 
> The extended memslot can still have the userspace_addr (hva). When in
> use, a single memslot can maintain both private memory through the
> private fd (private_fd/private_offset) and shared memory through the
> hva (userspace_addr). Whether the private or the shared part is
> visible to the guest is maintained by other KVM code.
> 
> Since there is no userspace mapping for the private fd, we cannot use
> get_user_pages() to get the pfn in KVM; instead we add a new
> inaccessible_notifier in the internal memslot structure and rely on it
> to get the pfn by interacting with the memory file systems.
> 
> Together with this change, a new config, HAVE_KVM_PRIVATE_MEM, is
> added; right now it is selected on X86_64 for Intel TDX usage.
> 
> To make code maintenance easy, internally we use a binary-compatible
> alias, struct kvm_user_mem_region, to handle both the normal and the
> '_ext' variants.
> 
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>

What if userspace_addr contained the address of an extension structure
when the flag is set, instead of the shared address? I.e. interpret
that field differently (it could be turned into a union too, of
course).

That idea could at least be reused if there are ever any new KVM_MEM_*
flags that need an extension.

E.g. have a struct kvm_userspace_memory_private, which contains the
shared address, the fd and the offset.
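
For illustration, something like this (only a sketch; the field names
are invented):

	struct kvm_userspace_memory_private {
		__u64 userspace_addr;	/* the shared address */
		__u64 private_offset;
		__u32 private_fd;
		__u32 pad;
	};

With KVM_MEM_PRIVATE set, the userspace_addr field of the existing
struct would then carry a pointer to this structure rather than the
shared address itself.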

BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-05 13:04   ` Jarkko Sakkinen
@ 2022-10-05 22:05     ` Jarkko Sakkinen
  0 siblings, 0 replies; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-05 22:05 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Oct 05, 2022 at 04:04:05PM +0300, Jarkko Sakkinen wrote:
> On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with a
> > special key and can be accessed only by the VM itself. We call such
> > memory private memory. It is of little value, and sometimes harmful,
> > to allow userspace to access guest private memory. This patch extends
> > the KVM memslot definition so that guest private memory can be
> > provided through an inaccessible_notifier-enlightened file descriptor
> > (fd), without being mmaped into userspace.
> > 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds
> > two additional KVM memslot fields, private_fd/private_offset, to allow
> > userspace to specify that guest private memory is provided from the
> > private_fd and that guest_phys_addr is mapped at the private_offset of
> > the private_fd, spanning a range of memory_size.
> > 
> > The extended memslot can still have the userspace_addr (hva). When in
> > use, a single memslot can maintain both private memory through the
> > private fd (private_fd/private_offset) and shared memory through the
> > hva (userspace_addr). Whether the private or the shared part is
> > visible to the guest is maintained by other KVM code.
> > 
> > Since there is no userspace mapping for the private fd, we cannot use
> > get_user_pages() to get the pfn in KVM; instead we add a new
> > inaccessible_notifier in the internal memslot structure and rely on it
> > to get the pfn by interacting with the memory file systems.
> > 
> > Together with this change, a new config, HAVE_KVM_PRIVATE_MEM, is
> > added; right now it is selected on X86_64 for Intel TDX usage.
> > 
> > To make code maintenance easy, internally we use a binary-compatible
> > alias, struct kvm_user_mem_region, to handle both the normal and the
> > '_ext' variants.
> > 
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> 
> What if userspace_addr would contain address of an extension structure,
> if the flag is set, instead of shared address? I.e. interpret that field
> differently (could be turned into union too ofc).
> 
> That idea could be at least re-used, if there's ever any new KVM_MEM_*
> flags that would need an extension.
> 
> E.g. have struct kvm_userspace_memory_private, which contains shared
> address, fd and the offset.

Or add a new ioctl number instead of messing with the existing
parameter structure, e.g. KVM_SET_USER_MEMORY_REGION_PRIVATE.

With both this alternative and the current approach in the patch,
it would be better just to redefine the struct fields that are
common.
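
Roughly (a sketch; the ioctl number below is made up and the fields
just mirror the existing layout):

	struct kvm_userspace_memory_region_private {
		__u32 slot;
		__u32 flags;
		__u64 guest_phys_addr;
		__u64 memory_size;
		__u64 userspace_addr;
		__u64 private_offset;
		__u32 private_fd;
		__u32 pad;
	};

	#define KVM_SET_USER_MEMORY_REGION_PRIVATE \
		_IOW(KVMIO, 0x49, struct kvm_userspace_memory_region_private)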

It would actually reduce redundancy, because then there is no
need to create that somewhat confusing kernel version of the
same struct, right? You don't save any redundancy with this
"embedded struct" approach.

BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
                     ` (2 preceding siblings ...)
  2022-09-30 16:14   ` Fuad Tabba
@ 2022-10-06  8:50   ` Fuad Tabba
  2022-10-06 13:04     ` Kirill A. Shutemov
  2022-10-17 13:00   ` Vlastimil Babka
  2022-10-19 12:23   ` Vishal Annapurve
  5 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-06  8:50 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

<...>


> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c

<...>

> +struct file *memfd_mkinaccessible(struct file *memfd)
> +{
> +       struct inaccessible_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &inaccessible_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, inaccessible_mnt,
> +                                "[memfd:inaccessible]", O_RDWR,
> +                                &inaccessible_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);

I think this might be missing a return at this point.
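
Perhaps, mirroring the inode error path just above:

	if (IS_ERR(file)) {
		iput(inode);
		kfree(data);
		return ERR_CAST(file);
	}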

> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}

Thanks,
/fuad



> +
> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> +
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pfn = page_to_pfn_t(page);
> +       *order = thp_order(compound_head(page));
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +       struct page *page = pfn_t_to_page(pfn);
> +
> +       if (WARN_ON_ONCE(!page))
> +               return;
> +
> +       put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-09-15 14:29 ` [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
  2022-10-04 14:55   ` Jarkko Sakkinen
@ 2022-10-06  8:55   ` Fuad Tabba
  2022-10-10  8:33     ` Chao Peng
  1 sibling, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-06  8:55 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

On Thu, Sep 15, 2022 at 3:37 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Expose KVM_MEM_PRIVATE and the memslot fields private_fd/offset to
> userspace. KVM will register/unregister the private memslot with the
> fd-based memory backing store and respond to invalidation events from
> the inaccessible_notifier to zap the existing memory mappings in the
> secondary page table.
>
> Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> by architecture code, which can turn it on by overriding the default
> kvm_arch_has_private_mem().
>
> A 'kvm' reference is added to the memslot structure since, in the
> inaccessible_notifier callback, we can only obtain a memslot
> reference, but 'kvm' is needed to do the zapping.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/kvm_host.h |   1 +
>  virt/kvm/kvm_main.c      | 116 +++++++++++++++++++++++++++++++++++++--
>  2 files changed, 111 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b9906cdf468b..cb4eefac709c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -589,6 +589,7 @@ struct kvm_memory_slot {
>         struct file *private_file;
>         loff_t private_offset;
>         struct inaccessible_notifier notifier;
> +       struct kvm *kvm;
>  };
>
>  static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 97d893f7482c..87e239d35b96 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -983,6 +983,57 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>                 xa_erase(&kvm->mem_attr_array, index);
>         return r;
>  }
> +
> +static void kvm_private_notifier_invalidate(struct inaccessible_notifier *notifier,
> +                                           pgoff_t start, pgoff_t end)
> +{
> +       struct kvm_memory_slot *slot = container_of(notifier,
> +                                                   struct kvm_memory_slot,
> +                                                   notifier);
> +       unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
> +       gfn_t start_gfn = slot->base_gfn;
> +       gfn_t end_gfn = slot->base_gfn + slot->npages;
> +
> +
> +       if (start > base_pgoff)
> +               start_gfn = slot->base_gfn + start - base_pgoff;
> +
> +       if (end < base_pgoff + slot->npages)
> +               end_gfn = slot->base_gfn + end - base_pgoff;
> +
> +       if (start_gfn >= end_gfn)
> +               return;
> +
> +       kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
> +}
> +
> +static struct inaccessible_notifier_ops kvm_private_notifier_ops = {
> +       .invalidate = kvm_private_notifier_invalidate,
> +};
> +
> +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> +{
> +       slot->notifier.ops = &kvm_private_notifier_ops;
> +       inaccessible_register_notifier(slot->private_file, &slot->notifier);
> +}
> +
> +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> +{
> +       inaccessible_unregister_notifier(slot->private_file, &slot->notifier);
> +}
> +
> +#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
> +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> +{
> +       WARN_ON_ONCE(1);
> +}
> +
> +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> +{
> +       WARN_ON_ONCE(1);
> +}
> +
>  #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
>
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> @@ -1029,6 +1080,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>  /* This does not remove the slot from struct kvm_memslots data structures */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
> +       if (slot->flags & KVM_MEM_PRIVATE) {
> +               kvm_private_mem_unregister(slot);
> +               fput(slot->private_file);
> +       }
> +
>         kvm_destroy_dirty_bitmap(slot);
>
>         kvm_arch_free_memslot(kvm, slot);
> @@ -1600,10 +1656,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
>         return false;
>  }
>
> -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> +                                    const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       if (kvm_arch_has_private_mem(kvm))
> +               valid_flags |= KVM_MEM_PRIVATE;
> +#endif
> +
>  #ifdef __KVM_HAVE_READONLY_MEM
>         valid_flags |= KVM_MEM_READONLY;
>  #endif
> @@ -1679,6 +1741,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>  {
>         int r;
>
> +       if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +               kvm_private_mem_register(new);
> +

Following the discussion I had with Kirill on the first patch [*]:
should this check that the private_fd is inaccessible?
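
Something along these lines, perhaps (assuming the inaccessible fs
gets a dedicated superblock magic; the constant name here is made up):

	if (file_inode(new->private_file)->i_sb->s_magic !=
	    INACCESSIBLE_MAGIC)
		return -EINVAL;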

[*] https://lore.kernel.org/all/20221003110129.bbee7kawhw5ed745@box.shutemov.name/

Cheers,
/fuad

>         /*
>          * If dirty logging is disabled, nullify the bitmap; the old bitmap
>          * will be freed on "commit".  If logging is enabled in both old and
> @@ -1707,6 +1772,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
>         if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
>                 kvm_destroy_dirty_bitmap(new);
>
> +       if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> +               kvm_private_mem_unregister(new);
> +
>         return r;
>  }
>
> @@ -2004,7 +2072,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>         int as_id, id;
>         int r;
>
> -       r = check_memory_region_flags(mem);
> +       r = check_memory_region_flags(kvm, mem);
>         if (r)
>                 return r;
>
> @@ -2023,6 +2091,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>              !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>                         mem->memory_size))
>                 return -EINVAL;
> +       if (mem->flags & KVM_MEM_PRIVATE &&
> +               (mem->private_offset & (PAGE_SIZE - 1) ||
> +                mem->private_offset > U64_MAX - mem->memory_size))
> +               return -EINVAL;
>         if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
>                 return -EINVAL;
>         if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2061,6 +2133,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>                 if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>                         return -EINVAL;
>         } else { /* Modify an existing slot. */
> +               /* Private memslots are immutable, they can only be deleted. */
> +               if (mem->flags & KVM_MEM_PRIVATE)
> +                       return -EINVAL;
>                 if ((mem->userspace_addr != old->userspace_addr) ||
>                     (npages != old->npages) ||
>                     ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2089,10 +2164,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
>         new->npages = npages;
>         new->flags = mem->flags;
>         new->userspace_addr = mem->userspace_addr;
> +       if (mem->flags & KVM_MEM_PRIVATE) {
> +               new->private_file = fget(mem->private_fd);
> +               if (!new->private_file) {
> +                       r = -EINVAL;
> +                       goto out;
> +               }
> +               new->private_offset = mem->private_offset;
> +       }
> +
> +       new->kvm = kvm;
>
>         r = kvm_set_memslot(kvm, old, new, change);
>         if (r)
> -               kfree(new);
> +               goto out;
> +
> +       return 0;
> +
> +out:
> +       if (new->private_file)
> +               fput(new->private_file);
> +       kfree(new);
>         return r;
>  }
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> @@ -4747,16 +4839,28 @@ static long kvm_vm_ioctl(struct file *filp,
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
>                 struct kvm_user_mem_region mem;
> -               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +               unsigned int flags_offset = offsetof(typeof(mem), flags);
> +               unsigned long size;
> +               u32 flags;
>
>                 kvm_sanity_check_user_mem_region_alias();
>
> +               memset(&mem, 0, sizeof(mem));
> +
>                 r = -EFAULT;
> -               if (copy_from_user(&mem, argp, size))
> +               if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> +                       goto out;
> +
> +               if (flags & KVM_MEM_PRIVATE)
> +                       size = sizeof(struct kvm_userspace_memory_region_ext);
> +               else
> +                       size = sizeof(struct kvm_userspace_memory_region);
> +
> +               if (copy_from_user(&mem, argp, size))
>                         goto out;
>
>                 r = -EINVAL;
> -               if (mem.flags & KVM_MEM_PRIVATE)
> +               if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
>                         goto out;
>
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
                     ` (3 preceding siblings ...)
  2022-10-05 13:04   ` Jarkko Sakkinen
@ 2022-10-06  9:00   ` Fuad Tabba
  2022-10-06 14:58   ` Jarkko Sakkinen
  5 siblings, 0 replies; 97+ messages in thread
From: Fuad Tabba @ 2022-10-06  9:00 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

I'm not sure whether this patch or the last one is the best place for
it, but I think it would be useful to have a KVM_CAP associated with
this. I am working on getting kvmtool to work with this, and I haven't
found a clean way for it to discover whether mem_private is
supported.
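
On the userspace side, something as simple as this would do
(capability name invented for the sake of the example):

	int ret = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PRIVATE_MEM);
	if (ret > 0)
		have_private_mem = true;	/* KVM_MEM_PRIVATE usable */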

Thanks,
/fuad

On Thu, Sep 15, 2022 at 3:35 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In memory encryption usage, guest memory may be encrypted with a
> special key and can be accessed only by the VM itself. We call such
> memory private memory. It is of little value, and sometimes harmful,
> to allow userspace to access guest private memory. This patch extends
> the KVM memslot definition so that guest private memory can be
> provided through an inaccessible_notifier-enlightened file descriptor
> (fd), without being mmaped into userspace.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds
> two additional KVM memslot fields, private_fd/private_offset, to allow
> userspace to specify that guest private memory is provided from the
> private_fd and that guest_phys_addr is mapped at the private_offset of
> the private_fd, spanning a range of memory_size.
>
> The extended memslot can still have the userspace_addr (hva). When in
> use, a single memslot can maintain both private memory through the
> private fd (private_fd/private_offset) and shared memory through the
> hva (userspace_addr). Whether the private or the shared part is
> visible to the guest is maintained by other KVM code.
>
> Since there is no userspace mapping for the private fd, we cannot use
> get_user_pages() to get the pfn in KVM; instead we add a new
> inaccessible_notifier in the internal memslot structure and rely on it
> to get the pfn by interacting with the memory file systems.
>
> Together with this change, a new config, HAVE_KVM_PRIVATE_MEM, is
> added; right now it is selected on X86_64 for Intel TDX usage.
>
> To make code maintenance easy, internally we use a binary-compatible
> alias, struct kvm_user_mem_region, to handle both the normal and the
> '_ext' variants.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst | 38 +++++++++++++++++++++-----
>  arch/x86/kvm/Kconfig           |  1 +
>  arch/x86/kvm/x86.c             |  2 +-
>  include/linux/kvm_host.h       | 13 +++++++--
>  include/uapi/linux/kvm.h       | 28 +++++++++++++++++++
>  virt/kvm/Kconfig               |  3 +++
>  virt/kvm/kvm_main.c            | 49 ++++++++++++++++++++++++++++------
>  7 files changed, 116 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index abd7c32126ce..c1fac1e9f820 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
>  :Capability: KVM_CAP_USER_MEMORY
>  :Architectures: all
>  :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
>  :Returns: 0 on success, -1 on error
>
>  ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>    };
>
> +  struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 private_offset;
> +       __u32 private_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +  };
> +
>    /* for kvm_memory_region::flags */
>    #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
>    #define KVM_MEM_READONLY     (1UL << 1)
> +  #define KVM_MEM_PRIVATE              (1UL << 2)
>
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
> +fields, plus additional fields for some specific features. See the
> +description of the flags field below for more information. It is recommended
> +to use kvm_userspace_memory_region_ext in new userspace code.
> +
> +The flags field supports below flags:
> +
> +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
> +  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
> +
> +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
> +  make a new slot read-only.  In this case, writes to this memory will be posted
> +  to userspace as KVM_EXIT_MMIO exits.
> +
> +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed
> +  by a file descriptor (fd) whose content is invisible to userspace. In this
> +  case, userspace should use private_fd/private_offset in
> +  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> +  to the guest. Userspace should guarantee not to map the same pfn indicated
> +  by private_fd/private_offset to different gfns with multiple memslots;
> +  failing to do so may result in undefined behavior.
>
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index e3cbd7706136..31db64ec0b33 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -48,6 +48,7 @@ config KVM
>         select SRCU
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
> +       select HAVE_KVM_PRIVATE_MEM if X86_64
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7374d768296..081f62ccc9a1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12183,7 +12183,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
>         }
>
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> -               struct kvm_userspace_memory_region m;
> +               struct kvm_user_mem_region m;
>
>                 m.slot = id | (i << 16);
>                 m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f4519d3689e1..eac1787b899b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -44,6 +44,7 @@
>
>  #include <asm/kvm_host.h>
>  #include <linux/kvm_dirty_ring.h>
> +#include <linux/memfd.h>
>
>  #ifndef KVM_MAX_VCPU_IDS
>  #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> @@ -576,8 +577,16 @@ struct kvm_memory_slot {
>         u32 flags;
>         short id;
>         u16 as_id;
> +       struct file *private_file;
> +       loff_t private_offset;
> +       struct inaccessible_notifier notifier;
>  };
>
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> +       return slot && (slot->flags & KVM_MEM_PRIVATE);
> +}
> +
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
>  {
>         return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -1104,9 +1113,9 @@ enum kvm_mr_change {
>  };
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem);
> +                         const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem);
> +                           const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index eed0315a77a6..3ef462fb3b2a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
>         __u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>
> +struct kvm_userspace_memory_region_ext {
> +       struct kvm_userspace_memory_region region;
> +       __u64 private_offset;
> +       __u32 private_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/*
> + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> + * all fields from the top-level "extended" region.
> + */
> +struct kvm_user_mem_region {
> +       __u32 slot;
> +       __u32 flags;
> +       __u64 guest_phys_addr;
> +       __u64 memory_size;
> +       __u64 userspace_addr;
> +       __u64 private_offset;
> +       __u32 private_fd;
> +       __u32 pad1;
> +       __u64 pad2[14];
> +};
> +#endif
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
>  #define KVM_MEM_READONLY       (1UL << 1)
> +#define KVM_MEM_PRIVATE                (1UL << 2)
>
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index a8c5c9f06b3c..ccaff13cc5b8 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK
>
>  config HAVE_KVM_PM_NOTIFIER
>         bool
> +
> +config HAVE_KVM_PRIVATE_MEM
> +       bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 584a5bab3af3..12dc0dc57b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>   * Must be called holding kvm->slots_lock for write.
>   */
>  int __kvm_set_memory_region(struct kvm *kvm,
> -                           const struct kvm_userspace_memory_region *mem)
> +                           const struct kvm_user_mem_region *mem)
>  {
>         struct kvm_memory_slot *old, *new;
>         struct kvm_memslots *slots;
> @@ -2024,7 +2024,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
>  int kvm_set_memory_region(struct kvm *kvm,
> -                         const struct kvm_userspace_memory_region *mem)
> +                         const struct kvm_user_mem_region *mem)
>  {
>         int r;
>
> @@ -2036,7 +2036,7 @@ int kvm_set_memory_region(struct kvm *kvm,
>  EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
>  static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> -                                         struct kvm_userspace_memory_region *mem)
> +                                         struct kvm_user_mem_region *mem)
>  {
>         if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
>                 return -EINVAL;
> @@ -4622,6 +4622,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
>         return fd;
>  }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)                                   \
> +do {                                                                           \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=             \
> +                    offsetof(struct kvm_userspace_memory_region, field));      \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=         \
> +                    sizeof_field(struct kvm_userspace_memory_region, field));  \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)                                       \
> +do {                                                                                   \
> +       BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=                     \
> +                    offsetof(struct kvm_userspace_memory_region_ext, field));          \
> +       BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=                 \
> +                    sizeof_field(struct kvm_userspace_memory_region_ext, field));      \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +       SANITY_CHECK_MEM_REGION_FIELD(slot);
> +       SANITY_CHECK_MEM_REGION_FIELD(flags);
> +       SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +       SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +       SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
> +       SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>         case KVM_SET_USER_MEMORY_REGION: {
> -               struct kvm_userspace_memory_region kvm_userspace_mem;
> +               struct kvm_user_mem_region mem;
> +               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> +               kvm_sanity_check_user_mem_region_alias();
>
>                 r = -EFAULT;
> -               if (copy_from_user(&kvm_userspace_mem, argp,
> -                                               sizeof(kvm_userspace_mem)))
> +               if (copy_from_user(&mem, argp, size))
> +                       goto out;
> +
> +               r = -EINVAL;
> +               if (mem.flags & KVM_MEM_PRIVATE)
>                         goto out;
>
> -               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> +               r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
>         case KVM_GET_DIRTY_LOG: {
> --
> 2.25.1
>

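As a concrete illustration of the uapi above: a minimal userspace sketch,
assuming a kernel with this series applied (so <linux/kvm.h> provides the
_ext struct and KVM_MEM_PRIVATE), an existing vm_fd, a shared mapping at
shared_hva, and a private_fd obtained from an inaccessible memfd (patch
1/8). Note that this particular patch still rejects KVM_MEM_PRIVATE in the
ioctl handler; a later patch in the series accepts the flag and copies the
larger layout.

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int set_private_slot(int vm_fd, void *shared_hva, int private_fd)
  {
          struct kvm_userspace_memory_region_ext ext = {
                  .region = {
                          .slot            = 0,
                          .flags           = KVM_MEM_PRIVATE,
                          .guest_phys_addr = 0x100000000ULL,
                          .memory_size     = 2UL << 20,  /* 2 MiB */
                          /* shared part, still an ordinary hva */
                          .userspace_addr  = (__u64)(unsigned long)shared_hva,
                  },
                  /* private part, provided by the fd, never mmaped */
                  .private_fd     = private_fd,
                  .private_offset = 0,
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
  }
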
^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-06  8:50   ` Fuad Tabba
@ 2022-10-06 13:04     ` Kirill A. Shutemov
  0 siblings, 0 replies; 97+ messages in thread
From: Kirill A. Shutemov @ 2022-10-06 13:04 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 06, 2022 at 09:50:28AM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index 000000000000..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> 
> <...>
> 
> > +struct file *memfd_mkinaccessible(struct file *memfd)
> > +{
> > +       struct inaccessible_data *data;
> > +       struct address_space *mapping;
> > +       struct inode *inode;
> > +       struct file *file;
> > +
> > +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +       if (!data)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       data->memfd = memfd;
> > +       mutex_init(&data->lock);
> > +       INIT_LIST_HEAD(&data->notifiers);
> > +
> > +       inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> > +       if (IS_ERR(inode)) {
> > +               kfree(data);
> > +               return ERR_CAST(inode);
> > +       }
> > +
> > +       inode->i_mode |= S_IFREG;
> > +       inode->i_op = &inaccessible_iops;
> > +       inode->i_mapping->private_data = data;
> > +
> > +       file = alloc_file_pseudo(inode, inaccessible_mnt,
> > +                                "[memfd:inaccessible]", O_RDWR,
> > +                                &inaccessible_fops);
> > +       if (IS_ERR(file)) {
> > +               iput(inode);
> > +               kfree(data);
> 
> I think this might be missing a return at this point.

Good catch! Thanks!

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

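For reference, a minimal version of the fix Fuad points out above, assuming
the error path in memfd_mkinaccessible() otherwise stays as quoted:

          file = alloc_file_pseudo(inode, inaccessible_mnt,
                                   "[memfd:inaccessible]", O_RDWR,
                                   &inaccessible_fops);
          if (IS_ERR(file)) {
                  iput(inode);
                  kfree(data);
                  return ERR_CAST(file);  /* the missing return */
          }
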
^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
                     ` (4 preceding siblings ...)
  2022-10-06  9:00   ` Fuad Tabba
@ 2022-10-06 14:58   ` Jarkko Sakkinen
  2022-10-06 15:07     ` Jarkko Sakkinen
  5 siblings, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-06 14:58 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields private_fd/private_offset to allow
> userspace to specify that guest private memory provided from the
> private_fd and guest_phys_addr mapped at the private_offset of the
> private_fd, spanning a range of memory_size.
> 
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through private
> fd(private_fd/private_offset) and shared memory through
> hva(userspace_addr). Whether the private or shared part is visible to
> guest is maintained by other KVM code.

What is, anyway, the appeal of the private_offset field, instead of having
just a 1:1 association between regions and files, i.e. one memfd per region?

If this were the case, then an extended struct would not be needed in the
first place. A simple union inside the existing struct would do:

        union {
                __u64 userspace_addr;
                __u64 private_fd;
        };

BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-06 14:58   ` Jarkko Sakkinen
@ 2022-10-06 15:07     ` Jarkko Sakkinen
  2022-10-06 15:34       ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-06 15:07 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields private_fd/private_offset to allow
> > userspace to specify that guest private memory provided from the
> > private_fd and guest_phys_addr mapped at the private_offset of the
> > private_fd, spanning a range of memory_size.
> > 
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through private
> > fd(private_fd/private_offset) and shared memory through
> > hva(userspace_addr). Whether the private or shared part is visible to
> > guest is maintained by other KVM code.
> 
> What is anyway the appeal of private_offset field, instead of having just
> 1:1 association between regions and files, i.e. one memfd per region?
> 
> If this was the case, then an extended struct would not be needed in the
> first place. A simple union inside the existing struct would do:
> 
>         union {
>                 __u64 userspace_addr,
>                 __u64 private_fd,
>         };

Also, why is this mechanism just for fds with the MFD_INACCESSIBLE flag? I'd
consider instead having a KVM_MEM_FD flag. For generic KVM (if the memfd does
not have MFD_INACCESSIBLE set), KVM could just use the memory as it uses
mapped memory today. This would simplify userspace code, as you could then
use the same thing for both cases.

BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-06 15:07     ` Jarkko Sakkinen
@ 2022-10-06 15:34       ` Sean Christopherson
  2022-10-07 11:14         ` Jarkko Sakkinen
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-10-06 15:34 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields private_fd/private_offset to allow
> > > userspace to specify that guest private memory provided from the
> > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > private_fd, spanning a range of memory_size.
> > > 
> > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > single memslot can maintain both private memory through private
> > > fd(private_fd/private_offset) and shared memory through
> > > hva(userspace_addr). Whether the private or shared part is visible to
> > > guest is maintained by other KVM code.
> > 
> > What is anyway the appeal of private_offset field, instead of having just
> > 1:1 association between regions and files, i.e. one memfd per region?

Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
E.g. if a vCPU converts a single page, it will be forced to wait until all other
vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
memory.  KVM's memslot updates also hold a mutex for the entire duration of the
update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
the SRCU problem.

KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
memslot is deleted.

Taking both a private_fd and a shared userspace address allows userspace to convert
between private and shared without having to manipulate memslots.

Paolo's original idea (sent off-list):

  : The problem is that KVM_SET_USER_MEMORY_REGION and memslots in general
  : are designed around (S)RCU.  It is way too slow (in both QEMU and KVM)
  : to be called on every private<->shared conversion with 4K granularity,
  : and it tends naturally to have quadratic behavior (though, at least for
  : KVM, the in-progress "fast memslots" series would avoid that).
  : 
  : Since private PTEs are persistent, and userspace cannot access the memfd
  : in any other way, userspace could use fallocate() to map/unmap an
  : address range as private, and KVM can treat everything that userspace
  : hasn't mapped as shared.
  : 
  : This would be a new entry in struct guest_ops, called by fallocate(),
  : and the callback can take the mmu_lock for write to avoid racing with
  : page faults.  This doesn't add any more contention than
  : KVM_SET_USER_MEMORY_REGION, since the latter takes slots_lock.  If
  : there's something I'm missing then the mapping operation can use a
  : ioctl, while the unmapping can keep using FALLOC_FL_PUNCH_HOLE.
  : 
  : Then:
  : 
  : - for simplicity, mapping a private memslot fails if there are any
  : mappings (similar to the handling when F_SEAL_GUEST is set).
  : 
  : - for TDX, accessing a nonexistent private PTE will cause a userspace
  : exit for a shared->private conversion request.  For SNP, the guest will
  : do a page state change VMGEXIT to request an RMPUPDATE, which can cause
  : a userspace exit too; the consequent fallocate() on the private fd
  : invokes RMPUPDATE.
  : 
  : - trying to map a shared PTE where there's already a private PTE causes
  : a userspace exit for a private->shared conversion request.
  : kvm_faultin_pfn or handle_abnormal_pfn can query this in the private-fd
  : inode, which is essentially a single pagecache_get_page call.
  : 
  : - if userspace asks to map a private PTE where there's already a shared
  : PTE (which it can check because it has the mmu_lock taken for write),
  : KVM unmaps the shared PTE.

> > 
> > If this was the case, then an extended struct would not be needed in the
> > first place. A simple union inside the existing struct would do:
> > 
> >         union {
> >                 __u64 userspace_addr;
> >                 __u64 private_fd;
> >         };
> 
> Also, why is this mechanism just for fd's with MFD_INACCESSIBLE flag? I'd
> consider instead having KVM_MEM_FD flag. For generic KVM (if memfd does not
> have MFD_INACCESSIBLE set), KVM could just use the memory as it is using
> mapped memory. This would simplify user space code, as you can the use the
> same thing for both cases.

I explored this idea too[*].  Because we want to support specifying both the
private and shared backing stores in a single memslot, we need two file
descriptors so that shared memory can also use fd-based memory.

[*] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com

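To make the quoted idea concrete: under Paolo's scheme, the private/shared
state of a range is driven entirely through the private fd rather than
through memslot updates. A hedged userspace sketch; the exact mapping-side
interface (plain fallocate() vs. a dedicated ioctl) is still open in the
quote above, while FALLOC_FL_PUNCH_HOLE for unmapping is as described:

  #define _GNU_SOURCE
  #include <fcntl.h>

  /* Shared -> private: allocate backing pages in the private fd. */
  static int convert_to_private(int private_fd, off_t offset, off_t len)
  {
          return fallocate(private_fd, 0, offset, len);
  }

  /* Private -> shared: punch a hole; KVM then treats the range as shared. */
  static int convert_to_shared(int private_fd, off_t offset, off_t len)
  {
          return fallocate(private_fd,
                           FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           offset, len);
  }
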
^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-06 15:34       ` Sean Christopherson
@ 2022-10-07 11:14         ` Jarkko Sakkinen
  2022-10-07 14:58           ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-07 11:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > userspace to specify that guest private memory provided from the
> > > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > > private_fd, spanning a range of memory_size.
> > > > 
> > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > single memslot can maintain both private memory through private
> > > > fd(private_fd/private_offset) and shared memory through
> > > > hva(userspace_addr). Whether the private or shared part is visible to
> > > > guest is maintained by other KVM code.
> > > 
> > > What is anyway the appeal of private_offset field, instead of having just
> > > 1:1 association between regions and files, i.e. one memfd per region?
> 
> Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> E.g. if a vCPU converts a single page, it will be forced to wait until all other
> vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> the SRCU problem.
> 
> KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> memslot is deleted.
> 
> Taking both a private_fd and a shared userspace address allows userspace to convert
> between private and shared without having to manipulate memslots.

Right, this was a really good explanation, thank you.

Still wondering whether this could possibly work (or not):

1. Union userspace_addr and private_fd.
2. Instead of introducing private_offset, use guest_phys_addr as the
   offset.
  
BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-07 11:14         ` Jarkko Sakkinen
@ 2022-10-07 14:58           ` Sean Christopherson
  2022-10-07 21:54             ` Jarkko Sakkinen
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-10-07 14:58 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Oct 07, 2022, Jarkko Sakkinen wrote:
> On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > > userspace to specify that guest private memory provided from the
> > > > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > > > private_fd, spanning a range of memory_size.
> > > > > 
> > > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > > single memslot can maintain both private memory through private
> > > > > fd(private_fd/private_offset) and shared memory through
> > > > > hva(userspace_addr). Whether the private or shared part is visible to
> > > > > guest is maintained by other KVM code.
> > > > 
> > > > What is anyway the appeal of private_offset field, instead of having just
> > > > 1:1 association between regions and files, i.e. one memfd per region?
> > 
> > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> > E.g. if a vCPU converts a single page, it will be forced to wait until all other
> > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> > memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> > the SRCU problem.
> > 
> > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> > memslot is deleted.
> > 
> > Taking both a private_fd and a shared userspace address allows userspace to convert
> > between private and shared without having to manipulate memslots.
> 
> Right, this was really good explanation, thank you.
> 
> Still wondering could this possibly work (or not):
> 
> 1. Union userspace_addr and private_fd.

No, because userspace needs to be able to provide both userspace_addr (shared
memory) and private_fd (private memory) for a single memslot.

> 2. Instead of introducing private_offset, use guest_phys_addr as the
>    offset.

No, because that would force userspace to use a single private_fd for all of guest
memory since it effectively means private_offset=0.  And userspace couldn't skip
over holes in guest memory, i.e. the size of the memfd would need to follow the
max guest gpa.  In other words, dropping private_offset could work, but it'd be
quite kludgy and not worth saving 8 bytes.

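To illustrate the point with made-up numbers: with private_offset, two
discontiguous guest ranges can be packed densely into one private fd,
whereas dropping the field would force the fd layout to mirror guest
physical address space, holes included. A sketch, assuming fd is an
inaccessible memfd of 2 GiB:

  /* Slot 0: guest [0, 1G) backed by fd range [0, 1G). */
  struct kvm_userspace_memory_region_ext lo = {
          .region = { .slot = 0, .flags = KVM_MEM_PRIVATE,
                      .guest_phys_addr = 0, .memory_size = 1ULL << 30 },
          .private_fd = fd, .private_offset = 0,
  };

  /* Slot 1: guest [4G, 5G) backed by fd range [1G, 2G); the 3 GiB hole
   * in guest physical space costs nothing in the memfd. */
  struct kvm_userspace_memory_region_ext hi = {
          .region = { .slot = 1, .flags = KVM_MEM_PRIVATE,
                      .guest_phys_addr = 4ULL << 30, .memory_size = 1ULL << 30 },
          .private_fd = fd, .private_offset = 1ULL << 30,
  };
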
^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-07 14:58           ` Sean Christopherson
@ 2022-10-07 21:54             ` Jarkko Sakkinen
  2022-10-08 16:15               ` Jarkko Sakkinen
  0 siblings, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-07 21:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote:
> On Fri, Oct 07, 2022, Jarkko Sakkinen wrote:
> > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > > > userspace to specify that guest private memory provided from the
> > > > > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > > > > private_fd, spanning a range of memory_size.
> > > > > > 
> > > > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > > > single memslot can maintain both private memory through private
> > > > > > fd(private_fd/private_offset) and shared memory through
> > > > > > hva(userspace_addr). Whether the private or shared part is visible to
> > > > > > guest is maintained by other KVM code.
> > > > > 
> > > > > What is anyway the appeal of private_offset field, instead of having just
> > > > > 1:1 association between regions and files, i.e. one memfd per region?
> > > 
> > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> > > E.g. if a vCPU converts a single page, it will be forced to wait until all other
> > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> > > memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> > > the SRCU problem.
> > > 
> > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> > > memslot is deleted.
> > > 
> > > Taking both a private_fd and a shared userspace address allows userspace to convert
> > > between private and shared without having to manipulate memslots.
> > 
> > Right, this was really good explanation, thank you.
> > 
> > Still wondering could this possibly work (or not):
> > 
> > 1. Union userspace_addr and private_fd.
> 
> No, because userspace needs to be able to provide both userspace_addr (shared
> memory) and private_fd (private memory) for a single memslot.

Got it, thanks for clearing up my misunderstandings on this topic; it
is quite obviously visible in 5/8 and 7/8. I.e., if I got it right, a
memblock can be partially private, and you dig the shared holes with
KVM_MEMORY_ENCRYPT_UNREG_REGION. We (in Enarx) currently have one
memblock per host mmap; I was looking at this through that mindset, but
it definitely makes sense to support that.

BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-07 21:54             ` Jarkko Sakkinen
@ 2022-10-08 16:15               ` Jarkko Sakkinen
  2022-10-08 17:35                 ` Jarkko Sakkinen
  0 siblings, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-08 16:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote:
> On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote:
> > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > > > > userspace to specify that guest private memory provided from the
> > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > > > > > private_fd, spanning a range of memory_size.
> > > > > > > 
> > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > > > > single memslot can maintain both private memory through private
> > > > > > > fd(private_fd/private_offset) and shared memory through
> > > > > > > hva(userspace_addr). Whether the private or shared part is visible to
> > > > > > > guest is maintained by other KVM code.
> > > > > > 
> > > > > > What is anyway the appeal of private_offset field, instead of having just
> > > > > > 1:1 association between regions and files, i.e. one memfd per region?
> > > > 
> > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other
> > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> > > > memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> > > > the SRCU problem.
> > > > 
> > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> > > > memslot is deleted.
> > > > 
> > > > Taking both a private_fd and a shared userspace address allows userspace to convert
> > > > between private and shared without having to manipulate memslots.
> > > 
> > > Right, this was really good explanation, thank you.
> > > 
> > > Still wondering could this possibly work (or not):
> > > 
> > > 1. Union userspace_addr and private_fd.
> > 
> > No, because userspace needs to be able to provide both userspace_addr (shared
> > memory) and private_fd (private memory) for a single memslot.
> 
> Got it, thanks for clearing my misunderstandings on this topic, and it
> is quite obviously visible in 5/8 and 7/8. I.e. if I got it right,
> memblock can be partially private, and you dig the shared holes with
> KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock
> per host mmap, I was looking into this dilated by that mindset but makes
> definitely sense to support that.

For me, the most useful reference for this feature is the kvm_set_phys_mem()
implementation in the privmem-v8 branch. It took a while to find because I
did not have much experience with the QEMU code base. I'd even recommend
mentioning that function in the cover letter, because it is a really good
reference on how this feature is supposed to be used.

BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-08 16:15               ` Jarkko Sakkinen
@ 2022-10-08 17:35                 ` Jarkko Sakkinen
  2022-10-10  8:25                   ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-08 17:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Sat, Oct 08, 2022 at 07:15:17PM +0300, Jarkko Sakkinen wrote:
> On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote:
> > On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote:
> > > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote:
> > > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> > > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > > > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > > > > > userspace to specify that guest private memory provided from the
> > > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > > > > > > private_fd, spanning a range of memory_size.
> > > > > > > > 
> > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > > > > > single memslot can maintain both private memory through private
> > > > > > > > fd(private_fd/private_offset) and shared memory through
> > > > > > > > hva(userspace_addr). Whether the private or shared part is visible to
> > > > > > > > guest is maintained by other KVM code.
> > > > > > > 
> > > > > > > What is anyway the appeal of private_offset field, instead of having just
> > > > > > > 1:1 association between regions and files, i.e. one memfd per region?
> > > > > 
> > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> > > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other
> > > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> > > > > memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> > > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> > > > > the SRCU problem.
> > > > > 
> > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> > > > > memslot is deleted.
> > > > > 
> > > > > Taking both a private_fd and a shared userspace address allows userspace to convert
> > > > > between private and shared without having to manipulate memslots.
> > > > 
> > > > Right, this was really good explanation, thank you.
> > > > 
> > > > Still wondering could this possibly work (or not):
> > > > 
> > > > 1. Union userspace_addr and private_fd.
> > > 
> > > No, because userspace needs to be able to provide both userspace_addr (shared
> > > memory) and private_fd (private memory) for a single memslot.
> > 
> > Got it, thanks for clearing my misunderstandings on this topic, and it
> > is quite obviously visible in 5/8 and 7/8. I.e. if I got it right,
> > memblock can be partially private, and you dig the shared holes with
> > KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock
> > per host mmap, I was looking into this dilated by that mindset but makes
> > definitely sense to support that.
> 
> For me the most useful reference with this feature is kvm_set_phys_mem()
> implementation in privmem-v8 branch. Took while to find it because I did
> not have much experience with QEMU code base. I'd even recommend to mention
> that function in the cover letter because it is really good reference on
> how this feature is supposed to be used.

While learning the QEMU code, I also noticed a bunch of comparisons like
this:

if (slot->flags | KVM_MEM_PRIVATE)

I guess those could just be replaced with unconditional fills, as it does
no harm if KVM_MEM_PRIVATE is not set.

BR, Jarkko

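Worth noting: the quoted test uses a bitwise OR, so it is always true for a
nonzero flag; as written it already behaves as an unconditional fill. If an
actual check were intended, it would presumably be a bitwise AND:

  if (slot->flags & KVM_MEM_PRIVATE) {
          /* fill private_fd/private_offset for this slot */
  }
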
^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-08 17:35                 ` Jarkko Sakkinen
@ 2022-10-10  8:25                   ` Chao Peng
  2022-10-12  8:14                     ` Jarkko Sakkinen
  0 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-10-10  8:25 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Sat, Oct 08, 2022 at 08:35:47PM +0300, Jarkko Sakkinen wrote:
> On Sat, Oct 08, 2022 at 07:15:17PM +0300, Jarkko Sakkinen wrote:
> > On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote:
> > > On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote:
> > > > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote:
> > > > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> > > > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > > > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > > > > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > > > > > > userspace to specify that guest private memory provided from the
> > > > > > > > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > > > > > > > private_fd, spanning a range of memory_size.
> > > > > > > > > 
> > > > > > > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > > > > > > single memslot can maintain both private memory through private
> > > > > > > > > fd(private_fd/private_offset) and shared memory through
> > > > > > > > > hva(userspace_addr). Whether the private or shared part is visible to
> > > > > > > > > guest is maintained by other KVM code.
> > > > > > > > 
> > > > > > > > What is anyway the appeal of private_offset field, instead of having just
> > > > > > > > 1:1 association between regions and files, i.e. one memfd per region?
> > > > > > 
> > > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> > > > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other
> > > > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> > > > > > memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> > > > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> > > > > > the SRCU problem.
> > > > > > 
> > > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> > > > > > memslot is deleted.
> > > > > > 
> > > > > > Taking both a private_fd and a shared userspace address allows userspace to convert
> > > > > > between private and shared without having to manipulate memslots.
> > > > > 
> > > > > Right, this was really good explanation, thank you.
> > > > > 
> > > > > Still wondering could this possibly work (or not):
> > > > > 
> > > > > 1. Union userspace_addr and private_fd.
> > > > 
> > > > No, because userspace needs to be able to provide both userspace_addr (shared
> > > > memory) and private_fd (private memory) for a single memslot.
> > > 
> > > Got it, thanks for clearing my misunderstandings on this topic, and it
> > > is quite obviously visible in 5/8 and 7/8. I.e. if I got it right,
> > > memblock can be partially private, and you dig the shared holes with
> > > KVM_MEMORY_ENCRYPT_UNREG_REGION. We have (in Enarx) ATM have memblock
> > > per host mmap, I was looking into this dilated by that mindset but makes
> > > definitely sense to support that.
> > 
> > For me the most useful reference with this feature is kvm_set_phys_mem()
> > implementation in privmem-v8 branch. Took while to find it because I did
> > not have much experience with QEMU code base. I'd even recommend to mention
> > that function in the cover letter because it is really good reference on
> > how this feature is supposed to be used.

That's a good point; I can mention that if people find it useful.

> 
> While learning QEMU code, I also noticed bunch of comparison like this:
> 
> if (slot->flags | KVM_MEM_PRIVATE)
> 
> I guess those could be just replaced with unconditional fills as it does
> not do any harm, if KVM_MEM_PRIVATE is not set.

Make sense, thanks.

Chao
> 
> BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-10-04 14:55   ` Jarkko Sakkinen
@ 2022-10-10  8:31     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-10-10  8:31 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 04, 2022 at 05:55:28PM +0300, Jarkko Sakkinen wrote:
> On Thu, Sep 15, 2022 at 10:29:13PM +0800, Chao Peng wrote:
> > Expose KVM_MEM_PRIVATE and the memslot fields private_fd/offset to
> > userspace. KVM will register/unregister the private memslot with the
> > fd-based memory backing store and respond to invalidation events from
> > inaccessible_notifier to zap the existing memory mappings in the
> > secondary page table.
> > 
> > Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> > by architecture code, which can turn it on by overriding the default
> > kvm_arch_has_private_mem().
> > 
> > A 'kvm' reference is added in the memslot structure since, in the
> > inaccessible_notifier callback, we can only obtain a memslot reference,
> > but 'kvm' is needed to do the zapping.
> > 
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> 
> ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_free_memslot':
> kvm_main.c:(.text+0x1385): undefined reference to `inaccessible_unregister_notifier'
> ld: arch/x86/../../virt/kvm/kvm_main.o: in function `kvm_set_memslot':
> kvm_main.c:(.text+0x1b86): undefined reference to `inaccessible_register_notifier'
> ld: kvm_main.c:(.text+0x1c85): undefined reference to `inaccessible_unregister_notifier'
> ld: arch/x86/kvm/mmu/mmu.o: in function `kvm_faultin_pfn':
> mmu.c:(.text+0x1e38): undefined reference to `inaccessible_get_pfn'
> ld: arch/x86/kvm/mmu/mmu.o: in function `direct_page_fault':
> mmu.c:(.text+0x67ca): undefined reference to `inaccessible_put_pfn'
> make: *** [Makefile:1169: vmlinux] Error 1
> 
> I attached kernel config for reproduction.
> 
> The problem is that CONFIG_MEMFD_CREATE does not get enabled:
> 
> mm/Makefile:obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o

Thanks for reporting. Yes, there is a dependency issue that needs to be fixed.

Chao


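One way the dependency could be expressed, sketched here as an assumption
rather than what the series ultimately does, is to have the KVM-side option
pull in the memfd code:

  config HAVE_KVM_PRIVATE_MEM
          bool
          select MEMFD_CREATE

Note that select ignores the selected symbol's own dependencies, so a
"depends on" expression may end up being the safer choice.
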
^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE
  2022-10-06  8:55   ` Fuad Tabba
@ 2022-10-10  8:33     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-10-10  8:33 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 06, 2022 at 09:55:31AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Thu, Sep 15, 2022 at 3:37 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Expose KVM_MEM_PRIVATE and the memslot fields private_fd/offset to
> > userspace. KVM will register/unregister the private memslot with the
> > fd-based memory backing store and respond to invalidation events from
> > inaccessible_notifier to zap the existing memory mappings in the
> > secondary page table.
> >
> > Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
> > by architecture code, which can turn it on by overriding the default
> > kvm_arch_has_private_mem().
> >
> > A 'kvm' reference is added in the memslot structure since, in the
> > inaccessible_notifier callback, we can only obtain a memslot reference,
> > but 'kvm' is needed to do the zapping.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/linux/kvm_host.h |   1 +
> >  virt/kvm/kvm_main.c      | 116 +++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 111 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index b9906cdf468b..cb4eefac709c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -589,6 +589,7 @@ struct kvm_memory_slot {
> >         struct file *private_file;
> >         loff_t private_offset;
> >         struct inaccessible_notifier notifier;
> > +       struct kvm *kvm;
> >  };
> >
> >  static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 97d893f7482c..87e239d35b96 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -983,6 +983,57 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> >                 xa_erase(&kvm->mem_attr_array, index);
> >         return r;
> >  }
> > +
> > +static void kvm_private_notifier_invalidate(struct inaccessible_notifier *notifier,
> > +                                           pgoff_t start, pgoff_t end)
> > +{
> > +       struct kvm_memory_slot *slot = container_of(notifier,
> > +                                                   struct kvm_memory_slot,
> > +                                                   notifier);
> > +       unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
> > +       gfn_t start_gfn = slot->base_gfn;
> > +       gfn_t end_gfn = slot->base_gfn + slot->npages;
> > +
> > +
> > +       if (start > base_pgoff)
> > +               start_gfn = slot->base_gfn + start - base_pgoff;
> > +
> > +       if (end < base_pgoff + slot->npages)
> > +               end_gfn = slot->base_gfn + end - base_pgoff;
> > +
> > +       if (start_gfn >= end_gfn)
> > +               return;
> > +
> > +       kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
> > +}
> > +
> > +static struct inaccessible_notifier_ops kvm_private_notifier_ops = {
> > +       .invalidate = kvm_private_notifier_invalidate,
> > +};
> > +
> > +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> > +{
> > +       slot->notifier.ops = &kvm_private_notifier_ops;
> > +       inaccessible_register_notifier(slot->private_file, &slot->notifier);
> > +}
> > +
> > +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> > +{
> > +       inaccessible_unregister_notifier(slot->private_file, &slot->notifier);
> > +}
> > +
> > +#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
> > +
> > +static inline void kvm_private_mem_register(struct kvm_memory_slot *slot)
> > +{
> > +       WARN_ON_ONCE(1);
> > +}
> > +
> > +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot)
> > +{
> > +       WARN_ON_ONCE(1);
> > +}
> > +
> >  #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> >
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > @@ -1029,6 +1080,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> >  /* This does not remove the slot from struct kvm_memslots data structures */
> >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> >  {
> > +       if (slot->flags & KVM_MEM_PRIVATE) {
> > +               kvm_private_mem_unregister(slot);
> > +               fput(slot->private_file);
> > +       }
> > +
> >         kvm_destroy_dirty_bitmap(slot);
> >
> >         kvm_arch_free_memslot(kvm, slot);
> > @@ -1600,10 +1656,16 @@ bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> >         return false;
> >  }
> >
> > -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > +static int check_memory_region_flags(struct kvm *kvm,
> > +                                    const struct kvm_user_mem_region *mem)
> >  {
> >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +       if (kvm_arch_has_private_mem(kvm))
> > +               valid_flags |= KVM_MEM_PRIVATE;
> > +#endif
> > +
> >  #ifdef __KVM_HAVE_READONLY_MEM
> >         valid_flags |= KVM_MEM_READONLY;
> >  #endif
> > @@ -1679,6 +1741,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> >  {
> >         int r;
> >
> > +       if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> > +               kvm_private_mem_register(new);
> > +
> 
> From the discussion I had with Kirill in the first patch *, should
> this check that the private_fd is inaccessible?

Yes I can add a check in KVM code, see below for where I want to add it.

> 
> [*] https://lore.kernel.org/all/20221003110129.bbee7kawhw5ed745@box.shutemov.name/
> 
> Cheers,
> /fuad
> 
> >         /*
> >          * If dirty logging is disabled, nullify the bitmap; the old bitmap
> >          * will be freed on "commit".  If logging is enabled in both old and
> > @@ -1707,6 +1772,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> >         if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
> >                 kvm_destroy_dirty_bitmap(new);
> >
> > +       if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> > +               kvm_private_mem_unregister(new);
> > +
> >         return r;
> >  }
> >
> > @@ -2004,7 +2072,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >         int as_id, id;
> >         int r;
> >
> > -       r = check_memory_region_flags(mem);
> > +       r = check_memory_region_flags(kvm, mem);
> >         if (r)
> >                 return r;
> >
> > @@ -2023,6 +2091,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >              !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> >                         mem->memory_size))
> >                 return -EINVAL;
> > +       if (mem->flags & KVM_MEM_PRIVATE &&
> > +               (mem->private_offset & (PAGE_SIZE - 1) ||
> > +                mem->private_offset > U64_MAX - mem->memory_size))
> > +               return -EINVAL;
> >         if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> >                 return -EINVAL;
> >         if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > @@ -2061,6 +2133,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >                 if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> >                         return -EINVAL;
> >         } else { /* Modify an existing slot. */
> > +               /* Private memslots are immutable, they can only be deleted. */
> > +               if (mem->flags & KVM_MEM_PRIVATE)
> > +                       return -EINVAL;
> >                 if ((mem->userspace_addr != old->userspace_addr) ||
> >                     (npages != old->npages) ||
> >                     ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > @@ -2089,10 +2164,27 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >         new->npages = npages;
> >         new->flags = mem->flags;
> >         new->userspace_addr = mem->userspace_addr;
> > +       if (mem->flags & KVM_MEM_PRIVATE) {
> > +               new->private_file = fget(mem->private_fd);
> > +               if (!new->private_file) {
> > +                       r = -EINVAL;

The check will go here.
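
Roughly like the below; file_is_inaccessible() is a helper that patch 1
would still need to grow, the name is just a placeholder:

	new->private_file = fget(mem->private_fd);
	/* Reject fds that were not created as userspace inaccessible. */
	if (!new->private_file ||
	    !file_is_inaccessible(new->private_file)) {
		r = -EINVAL;
		goto out;
	}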

> > +                       goto out;
> > +               }
> > +               new->private_offset = mem->private_offset;
> > +       }
> > +
> > +       new->kvm = kvm;
> >
> >         r = kvm_set_memslot(kvm, old, new, change);
> >         if (r)
> > -               kfree(new);
> > +               goto out;
> > +
> > +       return 0;
> > +
> > +out:
> > +       if (new->private_file)
> > +               fput(new->private_file);
> > +       kfree(new);
> >         return r;
> >  }
> >  EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > @@ -4747,16 +4839,28 @@ static long kvm_vm_ioctl(struct file *filp,
> >         }
> >         case KVM_SET_USER_MEMORY_REGION: {
> >                 struct kvm_user_mem_region mem;
> > -               unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +               unsigned int flags_offset = offsetof(typeof(mem), flags);
> > +               unsigned long size;
> > +               u32 flags;
> >
> >                 kvm_sanity_check_user_mem_region_alias();
> >
> > +               memset(&mem, 0, sizeof(mem));
> > +
> >                 r = -EFAULT;
> > -               if (copy_from_user(&mem, argp, size))
> > +               if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> > +                       goto out;
> > +
> > +               if (flags & KVM_MEM_PRIVATE)
> > +                       size = sizeof(struct kvm_userspace_memory_region_ext);
> > +               else
> > +                       size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +               if (copy_from_user(&mem, argp, size))
> >                         goto out;
> >
> >                 r = -EINVAL;
> > -               if (mem.flags & KVM_MEM_PRIVATE)
> > +               if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
> >                         goto out;
> >
> >                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-09-15 14:29 ` [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
  2022-09-26 10:36   ` Fuad Tabba
@ 2022-10-11  9:48   ` Fuad Tabba
  2022-10-12  2:35     ` Chao Peng
  1 sibling, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-11  9:48 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

On Thu, Sep 15, 2022 at 3:38 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> ioctls. The patch reuses the existing SEV ioctl numbers but differs in that
> the address in the region for the KVM_PRIVATE_MEM case is a gpa while for
> the SEV case it's an hva. Which usage the ioctls serve is determined by the
> newly added kvm_arch_has_private_mem(). Architectures which support
> KVM_PRIVATE_MEM should override this function.
>
> The current implementation defaults all memory to private. The shared
> memory regions are stored in an xarray for memory efficiency, and zapping
> existing memory mappings is also a side effect of these two ioctls when
> defined.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  Documentation/virt/kvm/api.rst  | 17 ++++++--
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu.h              |  2 -
>  include/linux/kvm_host.h        | 13 ++++++
>  virt/kvm/kvm_main.c             | 73 +++++++++++++++++++++++++++++++++
>  5 files changed, 100 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 1a6c003b2a0b..c0f800d04ffc 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
>  This ioctl can be used to register a guest memory region which may
>  contain encrypted data (e.g. guest RAM, SMRAM etc).
>
> -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> -memory region may contain encrypted data. The SEV memory encryption
> -engine uses a tweak such that two identical plaintext pages, each at
> -different locations will have differing ciphertexts. So swapping or
> +Currently this ioctl supports registering memory regions for two usages:
> +private memory and SEV-encrypted memory.
> +
> +When private memory is enabled, this ioctl is used to register a guest
> +private memory region, and the addr/size of kvm_enc_region represents a
> +guest physical address (GPA). In this usage, this ioctl zaps the existing
> +guest memory mappings in KVM that fall into the region.
> +
> +When SEV-encrypted memory is enabled, this ioctl is used to register a guest
> +memory region which may contain encrypted data for a SEV-enabled guest. The
> +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> +memory encryption engine uses a tweak such that two identical plaintext pages,
> +each at different locations will have differing ciphertexts. So swapping or
>  moving ciphertext of those pages will not result in plaintext being
>  swapped. So relocating (or migrating) physical backing pages for the SEV
>  guest will require some additional steps.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2c96c43c313a..cfad6ba1a70a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -37,6 +37,7 @@
>  #include <asm/hyperv-tlfs.h>
>
>  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ZAP_GFN_RANGE
>
>  #define KVM_MAX_VCPUS 1024
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 6bdaacb6faa0..c94b620bf94b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>         return -(u32)fault & errcode;
>  }
>
> -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> -
>  int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>
>  int kvm_mmu_post_init_vm(struct kvm *kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2125b50f6345..d65690cae80b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  #endif
>
> +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> +#else
> +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
> +                                                     gfn_t gfn_end)
> +{
> +}
> +#endif
> +
>  enum {
>         OUTSIDE_GUEST_MODE,
>         IN_GUEST_MODE,
> @@ -795,6 +804,9 @@ struct kvm {
>         struct notifier_block pm_notifier;
>  #endif
>         char stats_id[KVM_STATS_NAME_SIZE];
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       struct xarray mem_attr_array;
> +#endif
>  };
>
>  #define kvm_err(fmt, ...) \
> @@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
>  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fa9dd2d2c001..de5cce8c82c7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_SHARED    0x0001
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> +                                    bool is_private)
> +{

I wonder if this ioctl should be implemented as an arch-specific
ioctl. In this patch it performs some actions that pKVM might not need
or might want to do differently.

pKVM tracks the sharing status in the stage-2 page table's software
bits, so it can avoid the overhead of using mem_attr_array.

Also, this ioctl calls kvm_zap_gfn_range(), as does the invalidation
notifier (introduced in patch 8). For pKVM, the kind of zapping (or
the information conveyed to the hypervisor) might need to be different
depending on the cause; whether it's invalidation or change of sharing
status.

Thanks,
/fuad


> +       gfn_t start, end;
> +       unsigned long index;
> +       void *entry;
> +       int r;
> +
> +       if (size == 0 || gpa + size < gpa)
> +               return -EINVAL;
> +       if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> +               return -EINVAL;
> +
> +       start = gpa >> PAGE_SHIFT;
> +       end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +       /*
> +        * Guest memory defaults to private, kvm->mem_attr_array only stores
> +        * shared memory.
> +        */
> +       entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> +       for (index = start; index < end; index++) {
> +               r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> +                                   GFP_KERNEL_ACCOUNT));
> +               if (r)
> +                       goto err;
> +       }
> +
> +       kvm_zap_gfn_range(kvm, start, end);
> +
> +       return r;
> +err:
> +       while (index-- > start)
> +               xa_erase(&kvm->mem_attr_array, index);
> +       return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>                                 unsigned long state,
> @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>         spin_lock_init(&kvm->mn_invalidate_lock);
>         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>         xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       xa_init(&kvm->mem_attr_array);
> +#endif
>
>         INIT_LIST_HEAD(&kvm->gpc_list);
>         spin_lock_init(&kvm->gpc_lock);
> @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>         }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       xa_destroy(&kvm->mem_attr_array);
> +#endif
>         cleanup_srcu_struct(&kvm->irq_srcu);
>         cleanup_srcu_struct(&kvm->srcu);
>         kvm_arch_free_vm(kvm);
> @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
>         }
>  }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +       return false;
> +}
> +
>  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
>         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>                 break;
>         }
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +       case KVM_MEMORY_ENCRYPT_REG_REGION:
> +       case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +               struct kvm_enc_region region;
> +               bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> +               if (!kvm_arch_has_private_mem(kvm))
> +                       goto arch_vm_ioctl;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&region, argp, sizeof(region)))
> +                       goto out;
> +
> +               r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> +                                             region.size, set);
> +               break;
> +       }
> +#endif
>         case KVM_GET_DIRTY_LOG: {
>                 struct kvm_dirty_log log;
>
> @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
>                 r = kvm_vm_ioctl_get_stats_fd(kvm);
>                 break;
>         default:
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +arch_vm_ioctl:
> +#endif
>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>         }
>  out:
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-11  9:48   ` Fuad Tabba
@ 2022-10-12  2:35     ` Chao Peng
  2022-10-17 10:15       ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-10-12  2:35 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 11, 2022 at 10:48:58AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Thu, Sep 15, 2022 at 3:38 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the
> > guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> > ioctls. The patch reuses the existing SEV ioctl numbers but differs in that
> > the address in the region for the KVM_PRIVATE_MEM case is a gpa while for
> > the SEV case it's an hva. Which usage the ioctls serve is determined by the
> > newly added kvm_arch_has_private_mem(). Architectures which support
> > KVM_PRIVATE_MEM should override this function.
> >
> > The current implementation defaults all memory to private. The shared
> > memory regions are stored in an xarray for memory efficiency, and zapping
> > existing memory mappings is also a side effect of these two ioctls when
> > defined.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  Documentation/virt/kvm/api.rst  | 17 ++++++--
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/mmu.h              |  2 -
> >  include/linux/kvm_host.h        | 13 ++++++
> >  virt/kvm/kvm_main.c             | 73 +++++++++++++++++++++++++++++++++
> >  5 files changed, 100 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 1a6c003b2a0b..c0f800d04ffc 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> >  This ioctl can be used to register a guest memory region which may
> >  contain encrypted data (e.g. guest RAM, SMRAM etc).
> >
> > -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> > -memory region may contain encrypted data. The SEV memory encryption
> > -engine uses a tweak such that two identical plaintext pages, each at
> > -different locations will have differing ciphertexts. So swapping or
> > +Currently this ioctl supports registering memory regions for two usages:
> > +private memory and SEV-encrypted memory.
> > +
> > +When private memory is enabled, this ioctl is used to register a guest
> > +private memory region, and the addr/size of kvm_enc_region represents a
> > +guest physical address (GPA). In this usage, this ioctl zaps the existing
> > +guest memory mappings in KVM that fall into the region.
> > +
> > +When SEV-encrypted memory is enabled, this ioctl is used to register a guest
> > +memory region which may contain encrypted data for a SEV-enabled guest. The
> > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> > +memory encryption engine uses a tweak such that two identical plaintext pages,
> > +each at different locations will have differing ciphertexts. So swapping or
> >  moving ciphertext of those pages will not result in plaintext being
> >  swapped. So relocating (or migrating) physical backing pages for the SEV
> >  guest will require some additional steps.
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 2c96c43c313a..cfad6ba1a70a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -37,6 +37,7 @@
> >  #include <asm/hyperv-tlfs.h>
> >
> >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > +#define __KVM_HAVE_ZAP_GFN_RANGE
> >
> >  #define KVM_MAX_VCPUS 1024
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 6bdaacb6faa0..c94b620bf94b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >         return -(u32)fault & errcode;
> >  }
> >
> > -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > -
> >  int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> >
> >  int kvm_mmu_post_init_vm(struct kvm *kvm);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 2125b50f6345..d65690cae80b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -260,6 +260,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  #endif
> >
> > +#ifdef __KVM_HAVE_ZAP_GFN_RANGE
> > +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > +#else
> > +static inline void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
> > +                                                     gfn_t gfn_end)
> > +{
> > +}
> > +#endif
> > +
> >  enum {
> >         OUTSIDE_GUEST_MODE,
> >         IN_GUEST_MODE,
> > @@ -795,6 +804,9 @@ struct kvm {
> >         struct notifier_block pm_notifier;
> >  #endif
> >         char stats_id[KVM_STATS_NAME_SIZE];
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +       struct xarray mem_attr_array;
> > +#endif
> >  };
> >
> >  #define kvm_err(fmt, ...) \
> > @@ -1454,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> >  int kvm_arch_post_init_vm(struct kvm *kvm);
> >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> >  /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fa9dd2d2c001..de5cce8c82c7 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -937,6 +937,47 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +#define KVM_MEM_ATTR_SHARED    0x0001
> > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > +                                    bool is_private)
> > +{
> 
> I wonder if this ioctl should be implemented as an arch-specific
> ioctl. In this patch it performs some actions that pKVM might not need
> or might want to do differently.

I think it's doable. We can provide the mem_attr_array kind of thing in
common code and let arch code decide whether to use it or not. Currently
mem_attr_array is defined in struct kvm; if those bytes are unnecessary
for pKVM it can even be moved to the arch definition, but that also
loses the potential code sharing for confidential usages on other
architectures, e.g. if ARM also supports such a usage. Or it can be
provided through a different CONFIG_ instead of
CONFIG_HAVE_KVM_PRIVATE_MEM.
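
E.g. a very rough sketch of that direction (all names below are made up,
just to illustrate the split):

#ifdef CONFIG_KVM_GENERIC_MEM_ATTR
/* Common code keeps mem_attr_array and does the xa_store() loop. */
int kvm_vm_set_mem_attr(struct kvm *kvm, gfn_t start, gfn_t end,
			bool is_private);
#endif

/*
 * Arch hook: x86 would kvm_zap_gfn_range(), while pKVM could update the
 * stage-2 software bits instead and not select the generic tracking.
 */
void kvm_arch_set_mem_attr(struct kvm *kvm, gfn_t start, gfn_t end,
			   bool is_private);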

Thanks,
Chao
> 
> pKVM tracks the sharing status in the stage-2 page table's software
> bits, so it can avoid the overhead of using mem_attr_array.
> 
> Also, this ioctl calls kvm_zap_gfn_range(), as does the invalidation
> notifier (introduced in patch 8). For pKVM, the kind of zapping (or
> the information conveyed to the hypervisor) might need to be different
> depending on the cause; whether it's invalidation or change of sharing
> status.

> 
> Thanks,
> /fuad
> 
> 
> > +       gfn_t start, end;
> > +       unsigned long index;
> > +       void *entry;
> > +       int r;
> > +
> > +       if (size == 0 || gpa + size < gpa)
> > +               return -EINVAL;
> > +       if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > +               return -EINVAL;
> > +
> > +       start = gpa >> PAGE_SHIFT;
> > +       end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +       /*
> > +        * Guest memory defaults to private, kvm->mem_attr_array only stores
> > +        * shared memory.
> > +        */
> > +       entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +
> > +       for (index = start; index < end; index++) {
> > +               r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> > +                                   GFP_KERNEL_ACCOUNT));
> > +               if (r)
> > +                       goto err;
> > +       }
> > +
> > +       kvm_zap_gfn_range(kvm, start, end);
> > +
> > +       return r;
> > +err:
> > +       while (index-- > start)
> > +               xa_erase(&kvm->mem_attr_array, index);
> > +       return r;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> > +
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> >  static int kvm_pm_notifier_call(struct notifier_block *bl,
> >                                 unsigned long state,
> > @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >         spin_lock_init(&kvm->mn_invalidate_lock);
> >         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >         xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +       xa_init(&kvm->mem_attr_array);
> > +#endif
> >
> >         INIT_LIST_HEAD(&kvm->gpc_list);
> >         spin_lock_init(&kvm->gpc_lock);
> > @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> >                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> >         }
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +       xa_destroy(&kvm->mem_attr_array);
> > +#endif
> >         cleanup_srcu_struct(&kvm->irq_srcu);
> >         cleanup_srcu_struct(&kvm->srcu);
> >         kvm_arch_free_vm(kvm);
> > @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> >         }
> >  }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > +       return false;
> > +}
> > +
> >  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >  {
> >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> >                 break;
> >         }
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +       case KVM_MEMORY_ENCRYPT_REG_REGION:
> > +       case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > +               struct kvm_enc_region region;
> > +               bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > +
> > +               if (!kvm_arch_has_private_mem(kvm))
> > +                       goto arch_vm_ioctl;
> > +
> > +               r = -EFAULT;
> > +               if (copy_from_user(&region, argp, sizeof(region)))
> > +                       goto out;
> > +
> > +               r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > +                                             region.size, set);
> > +               break;
> > +       }
> > +#endif
> >         case KVM_GET_DIRTY_LOG: {
> >                 struct kvm_dirty_log log;
> >
> > @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 r = kvm_vm_ioctl_get_stats_fd(kvm);
> >                 break;
> >         default:
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +arch_vm_ioctl:
> > +#endif
> >                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> >         }
> >  out:
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
  2022-10-10  8:25                   ` Chao Peng
@ 2022-10-12  8:14                     ` Jarkko Sakkinen
  0 siblings, 0 replies; 97+ messages in thread
From: Jarkko Sakkinen @ 2022-10-12  8:14 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Oct 10, 2022 at 04:25:07PM +0800, Chao Peng wrote:
> On Sat, Oct 08, 2022 at 08:35:47PM +0300, Jarkko Sakkinen wrote:
> > On Sat, Oct 08, 2022 at 07:15:17PM +0300, Jarkko Sakkinen wrote:
> > > On Sat, Oct 08, 2022 at 12:54:32AM +0300, Jarkko Sakkinen wrote:
> > > > On Fri, Oct 07, 2022 at 02:58:54PM +0000, Sean Christopherson wrote:
> > > > > On Fri, Oct 07, 2022, Jarkko Sakkinen wrote:
> > > > > > On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> > > > > > > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > > > > > > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > > > > > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > > > > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > > > > > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > > > > > > > userspace to specify that guest private memory is provided from the
> > > > > > > > > > private_fd and that guest_phys_addr is mapped at the private_offset of
> > > > > > > > > > the private_fd, spanning a range of memory_size.
> > > > > > > > > > 
> > > > > > > > > > The extended memslot can still have the userspace_addr (hva). When used,
> > > > > > > > > > a single memslot can maintain both private memory through the private
> > > > > > > > > > fd (private_fd/private_offset) and shared memory through the
> > > > > > > > > > hva (userspace_addr). Whether the private or shared part is visible to
> > > > > > > > > > the guest is maintained by other KVM code.
> > > > > > > > > 
> > > > > > > > > What is the appeal of the private_offset field anyway, instead of having
> > > > > > > > > just a 1:1 association between regions and files, i.e. one memfd per region?
> > > > > > > 
> > > > > > > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> > > > > > > E.g. if a vCPU converts a single page, it will be forced to wait until all other
> > > > > > > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> > > > > > > memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> > > > > > > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> > > > > > > the SRCU problem.
> > > > > > > 
> > > > > > > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> > > > > > > memslot is deleted.
> > > > > > > 
> > > > > > > Taking both a private_fd and a shared userspace address allows userspace to convert
> > > > > > > between private and shared without having to manipulate memslots.
> > > > > > 
> > > > > > Right, this was really good explanation, thank you.
> > > > > > 
> > > > > > Still wondering could this possibly work (or not):
> > > > > > 
> > > > > > 1. Union userspace_addr and private_fd.
> > > > > 
> > > > > No, because userspace needs to be able to provide both userspace_addr (shared
> > > > > memory) and private_fd (private memory) for a single memslot.
> > > > 
> > > > Got it, thanks for clearing up my misunderstandings on this topic; it
> > > > is quite obviously visible in 5/8 and 7/8. I.e. if I got it right, a
> > > > memslot can be partially private, and you dig the shared holes with
> > > > KVM_MEMORY_ENCRYPT_UNREG_REGION. In Enarx we ATM have a memslot per
> > > > host mmap; I was looking at this through that mindset, but it
> > > > definitely makes sense to support this.
> > > 
> > > For me the most useful reference for this feature is the kvm_set_phys_mem()
> > > implementation in the privmem-v8 branch. It took a while to find because I
> > > did not have much experience with the QEMU code base. I'd even recommend
> > > mentioning that function in the cover letter because it is a really good
> > > reference on how this feature is supposed to be used.
> 
> That's a good point, I can mention that if people find it useful.

Yeah, I did an implementation for Enarx (https://www.enarx.dev/) using just
that part as a reference. It has all the essentials you need to consider
when you are already using the KVM API and want to add private regions.
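
In case it helps others, the flow boils down to something like this
(simplified from the Enarx code; field names as in this series, modulo
the exact struct nesting, and slot/gpa/size etc. are placeholders):

	int memfd = memfd_create("guest", MFD_INACCESSIBLE);

	struct kvm_userspace_memory_region_ext region = {
		.region = {
			.slot = slot,
			.flags = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size = size,
			.userspace_addr = (__u64)shared_va,
		},
		.private_fd = memfd,
		.private_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

	/* Guest memory defaults to private; dig a shared hole: */
	struct kvm_enc_region shared = { .addr = hole_gpa, .size = hole_size };
	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &shared);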

BR, Jarkko

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-30 16:19               ` Fuad Tabba
@ 2022-10-13 13:34                 ` Chao Peng
  2022-10-17 10:31                   ` Fuad Tabba
  2022-10-18  0:33                 ` Sean Christopherson
  1 sibling, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-10-13 13:34 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang,
	Will Deacon, Marc Zyngier

On Fri, Sep 30, 2022 at 05:19:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > > >
> > > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > > >      memory into the guest (after pre-boot phase).
> > > > > >
> > > > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > > >      and only if the entire gfn range of the associated memslot is shared.
> > > > >
> > > > > In general I think that this would work with pKVM. However, limiting
> > > > > private<->shared conversions to the granularity of a whole memslot
> > > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > > shares back its restricted DMA pool with the host it does so at the
> > > > > page-level.
> >
> > Y'all are killing me :-)
> 
>  :D
> 
> > Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
> > granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> > X doesn't even have to be that high to get reasonable performance, e.g. assuming
> > the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> > work just fine in KVM.
> 
> The guest is potentially enlightened, but the host doesn't necessarily
> know which memslot the guest might want to share back, since it
> doesn't know where the guest might want to place the DMA pool. If I
> understand this correctly, for this to work, all memslots would need
> to be the same size and sharing would always need to happen at that
> granularity.
> 
> Moreover, for something like a small DMA pool this might scale, but
> I'm not sure about potential future workloads (e.g., multimedia
> in-place sharing).
> 
> >
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.  Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
> 
> Actually, for pKVM, there is no need for the guest memory to be
> GUP'able at all if we use the new inaccessible_get_pfn().

If pKVM can use inaccessible_get_pfn() to get the pfn and can avoid GUP
(I think that is the major concern?), do you see any other gap in the
existing API?

> This of
> course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be
> orthogonal to whether the memory is shared or private, rather than a
> private_fd for private memory and the userspace_addr for shared
> memory. The host can then map or unmap the shared/private memory using
> the fd, which allows it more freedom in even choosing to unmap shared
> memory when not needed, for example.

Using both private_fd and userspace_addr is only needed in TDX and other
confidential computing scenarios; pKVM may only use private_fd if the fd
can also be mmaped as a whole to userspace, as Sean suggested.

Thanks,
Chao
> 
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 7/8] KVM: Handle page fault for private memory
  2022-09-15 14:29 ` [PATCH v8 7/8] KVM: Handle page fault for private memory Chao Peng
@ 2022-10-14 18:57   ` Sean Christopherson
  2022-10-17 14:48     ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-10-14 18:57 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

On Thu, Sep 15, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a0f198cede3d..81ab20003824 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3028,6 +3028,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>  			break;
>  	}
>  
> +	if (kvm_mem_is_private(kvm, gfn))

Rather than reload the Xarray info, which is unnecessary overhead, pass in
@is_private.  The caller must hold mmu_lock, i.e. invalidations from
private<->shared conversions will be stalled and will zap the new SPTE if the
state is changed.

E.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d68944f07b4b..44eea47697d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3072,8 +3072,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
         * Enforce the iTLB multihit workaround after capturing the requested
         * level, which will be used to do precise, accurate accounting.
         */
-       fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-                                                    fault->gfn, fault->max_level);
+       fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, fault->gfn,
+                                                    fault->max_level, fault->is_private);
        if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
                return;
 
@@ -6460,7 +6460,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
                 */
                if (sp->role.direct &&
                    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
-                                                              PG_LEVEL_NUM)) {
+                                                              PG_LEVEL_NUM, false)) {
                        kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
 
                        if (kvm_available_flush_tlb_with_range())
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 7670c13ce251..9acdf72537ce 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
        return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_private_spte(u64 spte)
+{
+       /* FIXME: Query C-bit/S-bit for SEV/TDX. */
+       return false;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
                                int level)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 672f0432d777..69ba00157e90 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1767,8 +1767,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
                if (iter.gfn < start || iter.gfn >= end)
                        continue;
 
-               max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-                                                             iter.gfn, PG_LEVEL_NUM);
+               max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
+                                                             PG_LEVEL_NUM,
+                                                             is_private_spte(iter.old_spte));
                if (max_mapping_level < iter.level)
                        continue;
 

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-12  2:35     ` Chao Peng
@ 2022-10-17 10:15       ` Fuad Tabba
  2022-10-17 22:17         ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-17 10:15 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

Hi,

> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > +#define KVM_MEM_ATTR_SHARED    0x0001
> > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > +                                    bool is_private)
> > > +{
> >
> > I wonder if this ioctl should be implemented as an arch-specific
> > ioctl. In this patch it performs some actions that pKVM might not need
> > or might want to do differently.
>
> I think it's doable. We can provide the mem_attr_array kind of thing in
> common code and let arch code decide whether to use it or not. Currently
> mem_attr_array is defined in struct kvm; if those bytes are unnecessary
> for pKVM it can even be moved to the arch definition, but that also
> loses the potential code sharing for confidential usages on other
> architectures, e.g. if ARM also supports such a usage. Or it can be
> provided through a different CONFIG_ instead of
> CONFIG_HAVE_KVM_PRIVATE_MEM.

This sounds good. Thank you.


/fuad

> Thanks,
> Chao
> >
> > pKVM tracks the sharing status in the stage-2 page table's software
> > bits, so it can avoid the overhead of using mem_attr_array.
> >
> > Also, this ioctl calls kvm_zap_gfn_range(), as does the invalidation
> > notifier (introduced in patch 8). For pKVM, the kind of zapping (or
> > the information conveyed to the hypervisor) might need to be different
> > depending on the cause; whether it's invalidation or change of sharing
> > status.
>
> >
> > Thanks,
> > /fuad
> >
> >
> > > +       gfn_t start, end;
> > > +       unsigned long index;
> > > +       void *entry;
> > > +       int r;
> > > +
> > > +       if (size == 0 || gpa + size < gpa)
> > > +               return -EINVAL;
> > > +       if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > > +               return -EINVAL;
> > > +
> > > +       start = gpa >> PAGE_SHIFT;
> > > +       end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > +
> > > +       /*
> > > +        * Guest memory defaults to private, kvm->mem_attr_array only stores
> > > +        * shared memory.
> > > +        */
> > > +       entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > > +
> > > +       for (index = start; index < end; index++) {
> > > +               r = xa_err(xa_store(&kvm->mem_attr_array, index, entry,
> > > +                                   GFP_KERNEL_ACCOUNT));
> > > +               if (r)
> > > +                       goto err;
> > > +       }
> > > +
> > > +       kvm_zap_gfn_range(kvm, start, end);
> > > +
> > > +       return r;
> > > +err:
> > > +       while (index-- > start)
> > > +               xa_erase(&kvm->mem_attr_array, index);
> > > +       return r;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> > > +
> > >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > >  static int kvm_pm_notifier_call(struct notifier_block *bl,
> > >                                 unsigned long state,
> > > @@ -1165,6 +1206,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > >         spin_lock_init(&kvm->mn_invalidate_lock);
> > >         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > >         xa_init(&kvm->vcpu_array);
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > +       xa_init(&kvm->mem_attr_array);
> > > +#endif
> > >
> > >         INIT_LIST_HEAD(&kvm->gpc_list);
> > >         spin_lock_init(&kvm->gpc_lock);
> > > @@ -1338,6 +1382,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > >                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> > >                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> > >         }
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > +       xa_destroy(&kvm->mem_attr_array);
> > > +#endif
> > >         cleanup_srcu_struct(&kvm->irq_srcu);
> > >         cleanup_srcu_struct(&kvm->srcu);
> > >         kvm_arch_free_vm(kvm);
> > > @@ -1541,6 +1588,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > >         }
> > >  }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > >  static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > >  {
> > >         u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > @@ -4703,6 +4755,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > >                 r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > >                 break;
> > >         }
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > +       case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > +       case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > +               struct kvm_enc_region region;
> > > +               bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > +
> > > +               if (!kvm_arch_has_private_mem(kvm))
> > > +                       goto arch_vm_ioctl;
> > > +
> > > +               r = -EFAULT;
> > > +               if (copy_from_user(&region, argp, sizeof(region)))
> > > +                       goto out;
> > > +
> > > +               r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > > +                                             region.size, set);
> > > +               break;
> > > +       }
> > > +#endif
> > >         case KVM_GET_DIRTY_LOG: {
> > >                 struct kvm_dirty_log log;
> > >
> > > @@ -4856,6 +4926,9 @@ static long kvm_vm_ioctl(struct file *filp,
> > >                 r = kvm_vm_ioctl_get_stats_fd(kvm);
> > >                 break;
> > >         default:
> > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > +arch_vm_ioctl:
> > > +#endif
> > >                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > >         }
> > >  out:
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-13 13:34                 ` Chao Peng
@ 2022-10-17 10:31                   ` Fuad Tabba
  2022-10-17 14:58                     ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-17 10:31 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang,
	Will Deacon, Marc Zyngier

Hi,

> >
> > Actually, for pKVM, there is no need for the guest memory to be
> > GUP'able at all if we use the new inaccessible_get_pfn().
>
> If pKVM can use inaccessible_get_pfn() to get the pfn and can avoid GUP
> (I think that is the major concern?), do you see any other gap in the
> existing API?

Actually for this part no, there aren't any gaps and
inaccessible_get_pfn() is sufficient.
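
For reference, the pKVM fault path would then boil down to roughly the
below (signatures as in patch 1, if I read them right; the stage-2 side
is elided):

	pgoff_t pgoff = (gfn - slot->base_gfn) +
			(slot->private_offset >> PAGE_SHIFT);
	pfn_t pfn;
	int order, ret;

	ret = inaccessible_get_pfn(slot->private_file, pgoff, &pfn, &order);
	if (ret)
		return ret;
	/* ... donate the pfn to the guest / map it in stage-2 ... */
	inaccessible_put_pfn(slot->private_file, pfn);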

> > This of
> > course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be
> > orthogonal to whether the memory is shared or private, rather than a
> > private_fd for private memory and the userspace_addr for shared
> > memory. The host can then map or unmap the shared/private memory using
> > the fd, which allows it more freedom in even choosing to unmap shared
> > memory when not needed, for example.
>
> Using both private_fd and userspace_addr is only needed in TDX and other
> confidential computing scenarios; pKVM may only use private_fd if the fd
> can also be mmaped as a whole to userspace, as Sean suggested.

That does work in practice, for now at least, and is what I do in my
current port. However, it goes against the naming and how the API is
defined, as implied by the name and the documentation. By calling the
field private_fd, it does imply that it should not be mapped, which is
also what api.rst says in PATCH v8 5/8. My worry is that in that case
pKVM would be mis/ab-using this interface, and that future changes
could cause unforeseen issues for pKVM.
Maybe rename this to something like "guest_fd", and specify in the
documentation that it can be restricted, e.g., instead of "the content
of the private memory is invisible to userspace", something along the
lines of "the content of the guest memory may be restricted to
userspace".

What do you think?

Cheers,
/fuad

>
> Thanks,
> Chao
> >
> > Cheers,
> > /fuad

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
                     ` (3 preceding siblings ...)
  2022-10-06  8:50   ` Fuad Tabba
@ 2022-10-17 13:00   ` Vlastimil Babka
  2022-10-17 16:19     ` Kirill A . Shutemov
  2022-10-19 12:23   ` Vishal Annapurve
  5 siblings, 1 reply; 97+ messages in thread
From: Vlastimil Babka @ 2022-10-17 13:00 UTC (permalink / raw)
  To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On 9/15/22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
> 
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with a special key for a special
> software domain (e.g. a KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to a host crash, so it should be prevented.
> 
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
> 
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
> 
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
> 
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier; it then gets a chance to remove any mapped entries
> of the range in the secondary page tables.
> 
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs, but this patch
> only implements tmpfs. The allocated memory is currently marked as
> unmovable and unevictable; this is required for the current confidential
> usage, but in the future this might change.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

...

> +static long inaccessible_fallocate(struct file *file, int mode,
> +				   loff_t offset, loff_t len)
> +{
> +	struct inaccessible_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +			return -EINVAL;
> +	}
> +
> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	inaccessible_notifier_invalidate(data, offset, offset + len);

Wonder if invalidate should precede the actual hole punch, otherwise we open
a window where the page tables point to memory no longer valid?
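
I.e. something like:

	if (mode & FALLOC_FL_PUNCH_HOLE)
		inaccessible_notifier_invalidate(data, offset, offset + len);
	ret = memfd->f_op->fallocate(memfd, mode, offset, len);

(also guarding the invalidate with FALLOC_FL_PUNCH_HOLE while at it, as
plain allocation shouldn't need to invalidate anything?)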

> +	return ret;
> +}
> +

...

> +
> +static struct file_system_type inaccessible_fs = {
> +	.owner		= THIS_MODULE,
> +	.name		= "[inaccessible]",

Dunno where exactly this name is visible, but shouldn't it rather be
"[memfd:inaccessible]"?

> +	.init_fs_context = inaccessible_init_fs_context,
> +	.kill_sb	= kill_anon_super,
> +};
> +



* Re: [PATCH v8 7/8] KVM: Handle page fault for private memory
  2022-10-14 18:57   ` Sean Christopherson
@ 2022-10-17 14:48     ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-10-17 14:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
	luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

On Fri, Oct 14, 2022 at 06:57:20PM +0000, Sean Christopherson wrote:
> On Thu, Sep 15, 2022, Chao Peng wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a0f198cede3d..81ab20003824 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3028,6 +3028,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >  			break;
> >  	}
> >  
> > +	if (kvm_mem_is_private(kvm, gfn))
> 
> Rather than reload the Xarray info, which is unnecessary overhead, pass in
> @is_private.  The caller must hold mmu_lock, i.e. invalidations from
> private<->shared conversions will be stalled and will zap the new SPTE if the
> state is changed.

Makes sense. For TDX/SEV it should be easy to query that.

Chao
> 
> E.g.
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d68944f07b4b..44eea47697d8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3072,8 +3072,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>          * Enforce the iTLB multihit workaround after capturing the requested
>          * level, which will be used to do precise, accurate accounting.
>          */
> -       fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -                                                    fault->gfn, fault->max_level);
> +       fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, fault->gfn,
> +                                                    fault->max_level, fault->is_private);
>         if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>                 return;
>  
> @@ -6460,7 +6460,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>                  */
>                 if (sp->role.direct &&
>                     sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> -                                                              PG_LEVEL_NUM)) {
> +                                                              PG_LEVEL_NUM, false)) {
>                         kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>  
>                         if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 7670c13ce251..9acdf72537ce 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
>         return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
>  }
>  
> +static inline bool is_private_spte(u64 spte)
> +{
> +       /* FIXME: Query C-bit/S-bit for SEV/TDX. */
> +       return false;
> +}
> +
>  static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
>                                 int level)
>  {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 672f0432d777..69ba00157e90 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1767,8 +1767,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>                 if (iter.gfn < start || iter.gfn >= end)
>                         continue;
>  
> -               max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> -                                                             iter.gfn, PG_LEVEL_NUM);
> +               max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
> +                                                             PG_LEVEL_NUM,
> +                                                             is_private_spte(iter.old_spte));
>                 if (max_mapping_level < iter.level)
>                         continue;
>  


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-17 10:31                   ` Fuad Tabba
@ 2022-10-17 14:58                     ` Chao Peng
  2022-10-17 19:05                       ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-10-17 14:58 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang,
	Will Deacon, Marc Zyngier

On Mon, Oct 17, 2022 at 11:31:19AM +0100, Fuad Tabba wrote:
> Hi,
> 
> > >
> > > Actually, for pKVM, there is no need for the guest memory to be
> > > GUP'able at all if we use the new inaccessible_get_pfn().
> >
> > If pKVM can use inaccessible_get_pfn() to get the pfn and can avoid
> > GUP (I think that is the major concern?), do you see any other gap in
> > the existing API?
> 
> Actually for this part no, there aren't any gaps and
> inaccessible_get_pfn() is sufficient.

Thanks for the confirmation.

> 
> > > This of
> > > course goes back to what I'd mentioned before in v7; it seems that
> > > representing the memslot memory as a file descriptor should be
> > > orthogonal to whether the memory is shared or private, rather than a
> > > private_fd for private memory and the userspace_addr for shared
> > > memory. The host can then map or unmap the shared/private memory using
> > > the fd, which allows it more freedom in even choosing to unmap shared
> > > memory when not needed, for example.
> >
> > Using both private_fd and userspace_addr is only needed in TDX and other
> > confidential computing scenarios; pKVM may use only private_fd if the fd
> > can also be mmapped as a whole to userspace, as Sean suggested.
> 
> That does work in practice, for now at least, and is what I do in my
> current port. However, what matters is the naming and how the API is
> defined, as implied by the name and the documentation. By calling the
> field private_fd, it does imply that it should not be mapped, which is
> also what api.rst says in PATCH v8 5/8. My worry is that in that case
> pKVM would be mis/ab-using this interface, and that future changes
> could cause unforeseen issues for pKVM.

That is fair enough. We can change the naming and the documentation.

> 
> Maybe rename this to something like "guest_fd", and specify in the
> documentation that it can be restricted, e.g., instead of "the content
> of the private memory is invisible to userspace" something along the
> lines of "the content of the guest memory may be restricted to
> userspace".

Some other candidates in my mind:
- restricted_fd: to pair with the mm-side restricted_memfd
- protected_fd: as Sean suggested before
- fd: how it is interpreted relies on the memslot flag.

Thanks,
Chao
> 
> What do you think?
> 
> Cheers,
> /fuad
> 
> >
> > Thanks,
> > Chao
> > >
> > > Cheers,
> > > /fuad


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-17 13:00   ` Vlastimil Babka
@ 2022-10-17 16:19     ` Kirill A . Shutemov
  2022-10-17 16:39       ` Gupta, Pankaj
  0 siblings, 1 reply; 97+ messages in thread
From: Kirill A . Shutemov @ 2022-10-17 16:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> On 9/15/22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > ...
> 
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +				   loff_t offset, loff_t len)
> > +{
> > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	int ret;
> > +
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +			return -EINVAL;
> > +	}
> > +
> > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +	inaccessible_notifier_invalidate(data, offset, offset + len);
> 
> Wonder if invalidate should precede the actual hole punch, otherwise we open
> a window where the page tables point to memory no longer valid?

Yes, you are right. Thanks for catching this.

> > +	return ret;
> > +}
> > +
> 
> ...
> 
> > +
> > +static struct file_system_type inaccessible_fs = {
> > +	.owner		= THIS_MODULE,
> > +	.name		= "[inaccessible]",
> 
> > Dunno where exactly this name is visible, but shouldn't it rather be
> > "[memfd:inaccessible]"?

Maybe. And skip brackets.

> 
> > +	.init_fs_context = inaccessible_init_fs_context,
> > +	.kill_sb	= kill_anon_super,
> > +};
> > +
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-17 16:19     ` Kirill A . Shutemov
@ 2022-10-17 16:39       ` Gupta, Pankaj
  2022-10-17 21:56         ` Kirill A . Shutemov
  0 siblings, 1 reply; 97+ messages in thread
From: Gupta, Pankaj @ 2022-10-17 16:39 UTC (permalink / raw)
  To: Kirill A . Shutemov, Vlastimil Babka
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vishal Annapurve, Yu Zhang, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
>> On 9/15/22 16:29, Chao Peng wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> ...
>>
>>> +static long inaccessible_fallocate(struct file *file, int mode,
>>> +				   loff_t offset, loff_t len)
>>> +{
>>> +	struct inaccessible_data *data = file->f_mapping->private_data;
>>> +	struct file *memfd = data->memfd;
>>> +	int ret;
>>> +
>>> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
>>> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
>>> +			return -EINVAL;
>>> +	}
>>> +
>>> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>>> +	inaccessible_notifier_invalidate(data, offset, offset + len);
>>
>> Wonder if invalidate should precede the actual hole punch, otherwise we open
>> a window where the page tables point to memory no longer valid?
> 
> Yes, you are right. Thanks for catching this.

I also noticed this. But then I thought the memory would be zeroed anyway
(hole punched) before this call?

> 
>>> +	return ret;
>>> +}
>>> +
>>
>> ...
>>
>>> +
>>> +static struct file_system_type inaccessible_fs = {
>>> +	.owner		= THIS_MODULE,
>>> +	.name		= "[inaccessible]",
>>
>> Dunno where exactly this name is visible, but shouldn't it rather be
>> "[memfd:inaccessible]"?
> 
> Maybe. And skip brackets.
> 
>>
>>> +	.init_fs_context = inaccessible_init_fs_context,
>>> +	.kill_sb	= kill_anon_super,
>>> +};
>>> +
>>
> 



* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-17 14:58                     ` Chao Peng
@ 2022-10-17 19:05                       ` Fuad Tabba
  2022-10-19 13:30                         ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-17 19:05 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang,
	Will Deacon, Marc Zyngier

Hi,

> > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > confidential computing scenarios; pKVM may use only private_fd if the fd
> > > can also be mmapped as a whole to userspace, as Sean suggested.
> >
> > That does work in practice, for now at least, and is what I do in my
> > current port. However, what matters is the naming and how the API is
> > defined, as implied by the name and the documentation. By calling the
> > field private_fd, it does imply that it should not be mapped, which is
> > also what api.rst says in PATCH v8 5/8. My worry is that in that case
> > pKVM would be mis/ab-using this interface, and that future changes
> > could cause unforeseen issues for pKVM.
>
> That is fair enough. We can change the naming and the documentation.
>
> >
> > Maybe rename this to something like "guest_fd", and specify in the
> > documentation that it can be restricted, e.g., instead of "the content
> > of the private memory is invisible to userspace" something along the
> > lines of "the content of the guest memory may be restricted to
> > userspace".
>
> Some other candidates in my mind:
> - restricted_fd: to pair with the mm-side restricted_memfd
> - protected_fd: as Sean suggested before
> - fd: how it is interpreted relies on the memslot flag.

All these sound good, since they all capture the potential use cases.
Restricted might be the most logical choice if that's going to also
become the name for the mem_fd.

Thanks,
/fuad

> Thanks,
> Chao
> >
> > What do you think?
> >
> > Cheers,
> > /fuad
> >
> > >
> > > Thanks,
> > > Chao
> > > >
> > > > Cheers,
> > > > /fuad


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-17 16:39       ` Gupta, Pankaj
@ 2022-10-17 21:56         ` Kirill A . Shutemov
  2022-10-18 13:42           ` Vishal Annapurve
  0 siblings, 1 reply; 97+ messages in thread
From: Kirill A . Shutemov @ 2022-10-17 21:56 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Vlastimil Babka, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins,
	Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
	Mike Rapoport, Steven Price, Maciej S . Szmigiero,
	Vishal Annapurve, Yu Zhang, luto, jun.nakajima, dave.hansen, ak,
	david, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > On 9/15/22 16:29, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > 
> > > > > ...
> > > >
> > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > +				   loff_t offset, loff_t len)
> > > > +{
> > > > +	struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	int ret;
> > > > +
> > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > +	inaccessible_notifier_invalidate(data, offset, offset + len);
> > > 
> > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > a window where the page tables point to memory no longer valid?
> > 
> > Yes, you are right. Thanks for catching this.
> 
> > I also noticed this. But then I thought the memory would be zeroed anyway
> > (hole punched) before this call?

Hole punching can free pages, given that offset/len covers a full page.
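
For reference, the userspace side of such a punch is ordinary
fallocate(2); a minimal sketch, nothing here being specific to this
series:

    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>

    /* Punch a page-aligned hole; the kernel may then free the backing
     * page(s), which is why the invalidate ordering discussed above
     * matters. */
    static void punch_hole(int fd, off_t offset, off_t len)
    {
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          offset, len))
                    err(1, "fallocate");
    }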

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-17 10:15       ` Fuad Tabba
@ 2022-10-17 22:17         ` Sean Christopherson
  2022-10-19 13:23           ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-10-17 22:17 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Oct 17, 2022, Fuad Tabba wrote:
> Hi,
> 
> > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > +#define KVM_MEM_ATTR_SHARED    0x0001
> > > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > > +                                    bool is_private)
> > > > +{
> > >
> > > I wonder if this ioctl should be implemented as an arch-specific
> > > ioctl. In this patch it performs some actions that pKVM might not need
> > > or might want to do differently.
> >
> > I think it's doable. We can provide the mem_attr_array kind of thing
> > in common code and let arch code decide whether to use it. Currently
> > mem_attr_array is defined in struct kvm; if those bytes are
> > unnecessary for pKVM it can even be moved to an arch definition, but
> > that also loses the potential code sharing for confidential usage on
> > other non-x86 architectures, e.g. if ARM also supports such usage. Or
> > it can be provided through a different CONFIG_ instead of
> > CONFIG_HAVE_KVM_PRIVATE_MEM.
> 
> This sounds good. Thank you.

I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
something.  I highly doubt there will be any non-x86 users for multiple years,
if ever, but it would allow testing the private memory stuff on ARM (and any other
non-x86 arch) without needing full pKVM support and with only minor KVM
modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
trivial.

[*] https://lore.kernel.org/all/Y0mu1FKugNQG5T8K@google.com
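
For reference, the common-code piece under discussion would look
something along these lines (a hedged sketch pieced together from the
thread's description of mem_attr_array as an xarray in struct kvm; not
the series' actual code):

    /* Sketch: per-gfn private/shared tracking in common KVM code,
     * compiled only when the proposed generic Kconfig is selected. */
    #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
    static int kvm_set_mem_attr(struct kvm *kvm, gfn_t start, gfn_t end,
                                bool is_private)
    {
            void *entry = is_private ? xa_mk_value(1) : NULL;
            gfn_t gfn;

            for (gfn = start; gfn < end; gfn++) {
                    int r = xa_err(xa_store(&kvm->mem_attr_array, gfn,
                                            entry, GFP_KERNEL_ACCOUNT));
                    if (r)
                            return r;
            }
            return 0;
    }
    #endif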


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-30 16:19               ` Fuad Tabba
  2022-10-13 13:34                 ` Chao Peng
@ 2022-10-18  0:33                 ` Sean Christopherson
  2022-10-19 15:04                   ` Fuad Tabba
  1 sibling, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-10-18  0:33 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Chao Peng, David Hildenbrand, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier

On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page level?
> > > > That can cause VMA fragmentation if the conversion is frequent, as far
> > > > as I can see. Even with a KVM ioctl for mapping, as mentioned below, I
> > > > think there would be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.  Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
> 
> Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> all if we use the new inaccessible_get_pfn().

Ya, I was referring to pKVM without UPM / inaccessible memory.

Jumping back to blocking gup(), what about using the same tricks as secretmem to
block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block fast
gup() on struct page?  With a Kconfig that's selected by pKVM (which would also
need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
controversial.

I suspect the fast gup() path could even be optimized to avoid the page_mapping()
lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
selected.  I'm guessing pKVM isn't expected to be deployed on massive NUMA systems
anytime soon, so there should be plenty of page flags to go around.

Blocking gup() instead of trying to play refcount games when converting back to
private would eliminate the need to put heavy restrictions on mapping, as the goal
of those was purely to simplify the KVM implementation, e.g. the "one mapping per
memslot" thing would go away entirely.

> This of course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be orthogonal to
> whether the memory is shared or private, rather than a private_fd for private
> memory and the userspace_addr for shared memory.

I also explored the idea of backing any guest memory with an fd, but came to
the conclusion that private memory needs a separate handle[1], at least on x86.

For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
TDX steal GPA bits to differentiate private vs. shared), the two types need to be
treated as separate mappings[2].  Post-boot, converting is lossy in both directions,
so even conceptually they are two distinct pages that just happen to share (some)
GPA bits.

To allow conversions, i.e. changing which mapping to use, without memslot updates,
KVM needs to let userspace provide both mappings in a single memslot.  So while
fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
KVM would still need a dedicated private handle.
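
For concreteness, "both mappings in a single memslot" is roughly the
extended memslot this series proposes (a sketch; the exact layout and
padding are illustrative):

    /* userspace_addr (inside 'region') maps the shared pages, while
     * private_fd/private_offset name the userspace-inaccessible backing
     * for the private pages.  Conversions flip which backing serves a
     * given gfn without a memslot update. */
    struct kvm_userspace_memory_region_ext {
            struct kvm_userspace_memory_region region;
            __u64 private_offset;
            __u32 private_fd;
            __u32 pad1;
            __u64 pad2[14];
    };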

For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
userspace_addr, but since the private_fd is going to be added for x86, I think it
makes sense to use that instead of adding generic fd-based memory for pKVM's use
case (which is arguably still "private" memory but with special semantics).

[1] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
[2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df119@kernel.org

>  The host can then map or unmap the shared/private memory using the fd, which
>  allows it more freedom in even choosing to unmap shared memory when not
>  needed, for example.


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-17 21:56         ` Kirill A . Shutemov
@ 2022-10-18 13:42           ` Vishal Annapurve
  2022-10-19 15:32             ` Kirill A . Shutemov
  0 siblings, 1 reply; 97+ messages in thread
From: Vishal Annapurve @ 2022-10-18 13:42 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Gupta, Pankaj, Vlastimil Babka, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Yu Zhang, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > >
> > > > > ...
> > > >
> > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > +                                  loff_t offset, loff_t len)
> > > > > +{
> > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > +       struct file *memfd = data->memfd;
> > > > > +       int ret;
> > > > > +
> > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > +                       return -EINVAL;
> > > > > +       }
> > > > > +
> > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > >
> > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > a window where the page tables point to memory no longer valid?
> > >
> > > Yes, you are right. Thanks for catching this.
> >
> > I also noticed this. But then I thought the memory would be zeroed anyway
> > (hole punched) before this call?
>
> Hole punching can free pages, given that offset/len covers a full page.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

I think moving this notifier_invalidate before the fallocate may not
solve the problem completely. Is it possible that between the invalidate
and the fallocate, KVM tries to handle a page fault for the guest VM
from another vcpu and uses the pages that are about to be freed to back
gpa ranges? Should hole punching here also update mem_attr first, to say
that KVM should consider the corresponding gpa ranges as no longer
backed by the inaccessible memfd?


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
                     ` (4 preceding siblings ...)
  2022-10-17 13:00   ` Vlastimil Babka
@ 2022-10-19 12:23   ` Vishal Annapurve
  2022-10-21 13:47     ` Chao Peng
  5 siblings, 1 reply; 97+ messages in thread
From: Vishal Annapurve @ 2022-10-19 12:23 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Thu, Sep 15, 2022 at 8:04 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with a special key for a
> special software domain (e.g. a KVM guest) and is not expected to be
> directly accessed by userspace. More precisely, userspace access to
> such encrypted memory may lead to a host crash, so it should be
> prevented.
>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via an
> in-kernel interface, so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
>
> It provides the semantics required for KVM guest private (encrypted)
> memory support, namely that a file descriptor with this flag set is
> going to be used as the source of guest memory in confidential
> computing environments such as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM gets notified through
> inaccessible_notifier and then gets a chance to remove any mapped
> entries of the range in the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs, but this patch
> only implements tmpfs. The allocated memory is currently marked as
> unmovable and unevictable; this is required for the current
> confidential usage, but in the future this might change.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/linux/memfd.h      |  24 ++++
>  include/uapi/linux/magic.h |   1 +
>  include/uapi/linux/memfd.h |   1 +
>  mm/Makefile                |   2 +-
>  mm/memfd.c                 |  25 ++++-
>  mm/memfd_inaccessible.c    | 219 +++++++++++++++++++++++++++++++++++++
>  6 files changed, 270 insertions(+), 2 deletions(-)
>  create mode 100644 mm/memfd_inaccessible.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_MEMFD_H
>
>  #include <linux/file.h>
> +#include <linux/pfn_t.h>
>
>  #ifdef CONFIG_MEMFD_CREATE
>  extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
>  }
>  #endif
>
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> +       void (*invalidate)(struct inaccessible_notifier *notifier,
> +                          pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> +       struct list_head list;
> +       const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
>  #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
>  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
>  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> +#define INACCESSIBLE_MAGIC     0x494e4143      /* "INAC" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
>  #define MFD_CLOEXEC            0x0001U
>  #define MFD_ALLOW_SEALING      0x0002U
>  #define MFD_HUGETLB            0x0004U
> +#define MFD_INACCESSIBLE       0x0008U
>
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..f82e5d4b4388 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>  obj-$(CONFIG_ZONE_DEVICE) += memremap.o
>  obj-$(CONFIG_HMM_MIRROR) += hmm.o
> -obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> +obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
>  obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
>  obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>  obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..1853a90f49ff 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +                      MFD_INACCESSIBLE)
>
>  SYSCALL_DEFINE2(memfd_create,
>                 const char __user *, uname,
> @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
>                         return -EINVAL;
>         }
>
> +       /* Disallow sealing when MFD_INACCESSIBLE is set. */
> +       if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
> +               return -EINVAL;
> +
> +       /* TODO: add hugetlb support */
> +       if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
> +               return -EINVAL;
> +
>         /* length includes terminating zero */
>         len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>         if (len <= 0)
> @@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
>                 *file_seals &= ~F_SEAL_SEAL;
>         }
>
> +       if (flags & MFD_INACCESSIBLE) {
> +               struct file *inaccessible_file;
> +
> +               inaccessible_file = memfd_mkinaccessible(file);
> +               if (IS_ERR(inaccessible_file)) {
> +                       error = PTR_ERR(inaccessible_file);
> +                       goto err_file;
> +               }
> +
> +               file = inaccessible_file;
> +       }
> +
>         fd_install(fd, file);
>         kfree(name);
>         return fd;
>
> +err_file:
> +       fput(file);
>  err_fd:
>         put_unused_fd(fd);
>  err_name:
> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index 000000000000..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/memfd.h>
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +
> +struct inaccessible_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> +                                pgoff_t start, pgoff_t end)
> +{
> +       struct inaccessible_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               notifier->ops->invalidate(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int inaccessible_release(struct inode *inode, struct file *file)
> +{
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long inaccessible_fallocate(struct file *file, int mode,
> +                                  loff_t offset, loff_t len)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +       inaccessible_notifier_invalidate(data, offset, offset + len);
> +       return ret;
> +}
> +
> +static const struct file_operations inaccessible_fops = {
> +       .release = inaccessible_release,
> +       .fallocate = inaccessible_fallocate,
> +};
> +
> +static int inaccessible_getattr(struct user_namespace *mnt_userns,
> +                               const struct path *path, struct kstat *stat,
> +                               u32 request_mask, unsigned int query_flags)
> +{
> +       struct inode *inode = d_inode(path->dentry);
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +                                            request_mask, query_flags);
> +}
> +
> +static int inaccessible_setattr(struct user_namespace *mnt_userns,
> +                               struct dentry *dentry, struct iattr *attr)
> +{
> +       struct inode *inode = d_inode(dentry);
> +       struct inaccessible_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (attr->ia_valid & ATTR_SIZE) {
> +               if (memfd->f_inode->i_size)
> +                       return -EPERM;
> +
> +               if (!PAGE_ALIGNED(attr->ia_size))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +                                           file_dentry(memfd), attr);
> +       return ret;
> +}
> +
> +static const struct inode_operations inaccessible_iops = {
> +       .getattr = inaccessible_getattr,
> +       .setattr = inaccessible_setattr,
> +};
> +
> +static int inaccessible_init_fs_context(struct fs_context *fc)
> +{
> +       if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
> +               return -ENOMEM;
> +
> +       fc->s_iflags |= SB_I_NOEXEC;
> +       return 0;
> +}
> +
> +static struct file_system_type inaccessible_fs = {
> +       .owner          = THIS_MODULE,
> +       .name           = "[inaccessible]",
> +       .init_fs_context = inaccessible_init_fs_context,
> +       .kill_sb        = kill_anon_super,
> +};
> +
> +static struct vfsmount *inaccessible_mnt;
> +
> +static __init int inaccessible_init(void)
> +{
> +       inaccessible_mnt = kern_mount(&inaccessible_fs);
> +       if (IS_ERR(inaccessible_mnt))
> +               return PTR_ERR(inaccessible_mnt);
> +       return 0;
> +}
> +fs_initcall(inaccessible_init);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd)
> +{
> +       struct inaccessible_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &inaccessible_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, inaccessible_mnt,
> +                                "[memfd:inaccessible]", O_RDWR,
> +                                &inaccessible_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);
> +               return ERR_CAST(file);
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}
> +
> +void inaccessible_register_notifier(struct file *file,
> +                                   struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> +
> +void inaccessible_unregister_notifier(struct file *file,
> +                                     struct inaccessible_notifier *notifier)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +                        int *order)
> +{
> +       struct inaccessible_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pfn = page_to_pfn_t(page);
> +       *order = thp_order(compound_head(page));
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +       struct page *page = pfn_t_to_page(pfn);
> +
> +       if (WARN_ON_ONCE(!page))
> +               return;
> +
> +       put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

In the context of a userspace inaccessible memfd, what would be the
suggested way to enforce NUMA memory policy for physical memory
allocation? mbind[1] won't work here in the absence of a virtual
address range.

[1] https://github.com/chao-p/qemu/blob/privmem-v8/backends/hostmem.c#L382
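
For context, mbind(2) fundamentally takes a virtual address range,
which an inaccessible memfd never exposes; the conventional flow it
would replace is sketched below (standard NUMA API, nothing
series-specific; link with -lnuma):

    #define _GNU_SOURCE
    #include <err.h>
    #include <numaif.h>

    /* Today's QEMU-style policy setup: mmap the backend, mbind the VMA.
     * With MFD_INACCESSIBLE there is no VMA to pass as 'addr', so an
     * fd-based policy knob would be needed instead. */
    static void bind_to_node(void *addr, unsigned long len, int node)
    {
            unsigned long nodemask = 1UL << node;

            if (mbind(addr, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, MPOL_MF_STRICT))
                    err(1, "mbind");
    }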


* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-17 22:17         ` Sean Christopherson
@ 2022-10-19 13:23           ` Chao Peng
  2022-10-19 15:02             ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-10-19 13:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Mon, Oct 17, 2022 at 10:17:45PM +0000, Sean Christopherson wrote:
> On Mon, Oct 17, 2022, Fuad Tabba wrote:
> > Hi,
> > 
> > > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > > +#define KVM_MEM_ATTR_SHARED    0x0001
> > > > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > > > +                                    bool is_private)
> > > > > +{
> > > >
> > > > I wonder if this ioctl should be implemented as an arch-specific
> > > > ioctl. In this patch it performs some actions that pKVM might not need
> > > > or might want to do differently.
> > >
> > > I think it's doable. We can provide the mem_attr_array kind of thing
> > > in common code and let arch code decide whether to use it. Currently
> > > mem_attr_array is defined in struct kvm; if those bytes are
> > > unnecessary for pKVM it can even be moved to an arch definition, but
> > > that also loses the potential code sharing for confidential usage on
> > > other non-x86 architectures, e.g. if ARM also supports such usage. Or
> > > it can be provided through a different CONFIG_ instead of
> > > CONFIG_HAVE_KVM_PRIVATE_MEM.
> > 
> > This sounds good. Thank you.
> 
> I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> something.  I highly doubt there will be any non-x86 users for multiple years,
> if ever, but it would allow testing the private memory stuff on ARM (and any other
> non-x86 arch) without needing full pKVM support and with only minor KVM
> modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> trivial.

CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
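
For reference, a minimal sketch of what the generic tracking could look
like (the xarray field, the KVM_MEM_ATTR_PRIVATE value, and the
shared-means-erased convention are assumptions for illustration only):

  /* Sketch: assumes an xarray in struct kvm, as discussed above. */
  static int kvm_set_mem_attr(struct kvm *kvm, gfn_t start, gfn_t end,
                              bool is_private)
  {
          void *entry = is_private ? xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
          gfn_t gfn;
          int r;

          for (gfn = start; gfn < end; gfn++) {
                  /* Storing NULL erases the entry, i.e. marks it shared. */
                  r = xa_err(xa_store(&kvm->mem_attr_array, gfn, entry,
                                      GFP_KERNEL_ACCOUNT));
                  if (r)
                          return r;
          }
          return 0;
  }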

Thanks,
Chao
> 
> [*] https://lore.kernel.org/all/Y0mu1FKugNQG5T8K@google.com

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-17 19:05                       ` Fuad Tabba
@ 2022-10-19 13:30                         ` Chao Peng
  0 siblings, 0 replies; 97+ messages in thread
From: Chao Peng @ 2022-10-19 13:30 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Sean Christopherson, David Hildenbrand, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang,
	Will Deacon, Marc Zyngier

On Mon, Oct 17, 2022 at 08:05:10PM +0100, Fuad Tabba wrote:
> Hi,
> 
> > > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > > confidential computing scenarios; pKVM may use only private_fd if the fd
> > > > can also be mmaped as a whole to userspace, as Sean suggested.
> > >
> > > That does work in practice, for now at least, and is what I do in my
> > > current port. However, the issue is the naming, and how the API is
> > > defined as implied by the name and the documentation. By calling the field
> > > private_fd, it does imply that it should not be mapped, which is also
> > > what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> > > would be mis/ab-using this interface, and that future changes could
> > > cause unforeseen issues for pKVM.
> >
> > That is fair enough. We can change the naming and the documentation.
> >
> > >
> > > Maybe renaming this to something like "guest_fd", and specifying in
> > > the documentation that it can be restricted, e.g., instead of "the
> > > content of the private memory is invisible to userspace" something
> > > along the lines of "the content of the guest memory may be restricted
> > > to userspace".
> >
> > Some other candidates in my mind:
> > - restricted_fd: to pair with the mm side restricted_memfd
> > - protected_fd: as Sean suggested before
> > - fd: how it's explained relies on the memslot.flag.
> 
> All these sound good, since they all capture the potential use cases.
> Restricted might be the most logical choice if that's going to also
> become the name for the mem_fd.

Thanks, I will use 'restricted' for them. e.g.:
- memfd_restricted() syscall
- restricted_fd
- restricted_offset

The memslot flags will still be KVM_MEM_PRIVATE, since I think pKVM will
create its own flag?
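
For reference, the extended memslot uAPI with that rename would look
roughly like this (a sketch; the layout is assumed to follow the v8
patches, with only the field names changed):

  struct kvm_userspace_memory_region_ext {
          struct kvm_userspace_memory_region region;
          __u64 restricted_offset;
          __u32 restricted_fd;
          __u32 pad1;
          __u64 pad2[14];
  };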

Chao
> 
> Thanks,
> /fuad
> 
> > Thanks,
> > Chao
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > /fuad
> > >
> > > >
> > > > Thanks,
> > > > Chao
> > > > >
> > > > > Cheers,
> > > > > /fuad

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-19 13:23           ` Chao Peng
@ 2022-10-19 15:02             ` Fuad Tabba
  2022-10-19 16:09               ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: Fuad Tabba @ 2022-10-19 15:02 UTC (permalink / raw)
  To: Chao Peng
  Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

> > > This sounds good. Thank you.
> >
> > I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> > something.  I highly doubt there will be any non-x86 users for multiple years,
> > if ever, but it would allow testing the private memory stuff on ARM (and any other
> > non-x86 arch) without needing full pKVM support and with only minor KVM
> > modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> > trivial.
>
> CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.

That sounds good to me, and just keeping the xarray isn't really an
issue for pKVM. We could end up using it instead of some of the other
structures we use for tracking.

Cheers,
/fuad

> Thanks,
> Chao
> >
> > [*] https://lore.kernel.org/all/Y0mu1FKugNQG5T8K@google.com

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-18  0:33                 ` Sean Christopherson
@ 2022-10-19 15:04                   ` Fuad Tabba
  0 siblings, 0 replies; 97+ messages in thread
From: Fuad Tabba @ 2022-10-19 15:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, David Hildenbrand, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang, Will Deacon, Marc Zyngier

Hi,

On Tue, Oct 18, 2022 at 1:34 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > > pKVM would also need a way to make an fd accessible again
> > > > > > when shared back, which I think isn't possible with this patch.
> > > > >
> > > > > > But does pKVM really want to mmap/munmap a new region at the page level?
> > > > > > That can cause VMA fragmentation if the conversion is frequent, as I see it.
> > > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > > be the same issue.
> > > >
> > > > pKVM doesn't really need to unmap the memory. What is really important
> > > > is that the memory is not GUP'able.
> > >
> > > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> > >
> > > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > > the end result is the same.
> > >
> > > Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
> > > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > > context of the hypervisor.  Which is also the source of the gup() problems; the
> > > untrusted kernel is blissfully unaware that the memory is inaccessible.
> > >
> > > Any approach that moves some of that information into the untrusted kernel so that
> > > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > > all of guest memory becomes unguppable, but that's likely not a viable option.
> >
> > Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> > all if we use the new inaccessible_get_pfn().
>
> Ya, I was referring to pKVM without UPM / inaccessible memory.
>
> Jumping back to blocking gup(), what about using the same tricks as secretmem to
> block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block fast
> gup() on struct page?  With a Kconfig that's selected by pKVM (which would also
> need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
> performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
> controversial.
>
> I suspect the fast gup() path could even be optimized to avoid the page_mapping()
> lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
> selected.  I'm guessing pKVM isn't expected to be deployed on massivve NUMA systems
> anytime soon, so there should be plenty of page flags to go around.
>
> Blocking gup() instead of trying to play refcount games when converting back to
> private would eliminate the need to put heavy restrictions on mapping, as the goal
> of those were purely to simplify the KVM implementation, e.g. the "one mapping per
> memslot" thing would go away entirely.

My implementation of mmap for inaccessible_fops was setting VM_PFNMAP.
That said, I realized that might be adding an unnecessary restriction,
and have now changed it to do it the secretmem way. That's
straightforward and works well.
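
For concreteness, a rough sketch of the secretmem-style check described
above (PG_inaccessible and the Kconfig name are proposals from this
thread, not existing kernel symbols):

  /* Sketch, modeled on secretmem's page_is_secretmem(). */
  static inline bool page_is_inaccessible(struct page *page)
  {
          if (!IS_ENABLED(CONFIG_INACCESSIBLE_MAPPABLE_MEM))
                  return false;

          /* Hypothetical PG_inaccessible flag, as suggested above. */
          return PageInaccessible(compound_head(page));
  }

The fast gup() path would bail to the slow path when this returns true,
and the slow path would refuse the VMA via a vm_ops check, analogous to
vma_is_secretmem().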

> > This of course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be orthogonal to
> > whether the memory is shared or private, rather than a private_fd for private
> > memory and the userspace_addr for shared memory.
>
> I also explored the idea of backing any guest memory with an fd, but came to
> the conclusion that private memory needs a separate handle[1], at least on x86.
>
> For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
> TDX steal GPA bits to differentiate private vs. shared), the two types need to be
> treated as separate mappings[2].  Post-boot, converting is lossy in both directions,
> so even conceptually they are two distinct pages that just happen to share (some)
> GPA bits.
>
> To allow conversions, i.e. changing which mapping to use, without memslot updates,
> KVM needs to let userspace provide both mappings in a single memslot.  So while
> fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
> KVM would still need a dedicated private handle.
>
> For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
> userspace_addr, but since the private_fd is going to be added for x86, I think it
> makes sense to use that instead of adding generic fd-based memory for pKVM's use
> case (which is arguably still "private" memory but with special semantics).
>
> [1] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
> [2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df119@kernel.org

As long as the API does not impose this limit (which would otherwise
imply pKVM is misusing it), then I agree. I think that's why renaming it
to something like "restricted" might be clearer.

Thanks,
/fuad

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-18 13:42           ` Vishal Annapurve
@ 2022-10-19 15:32             ` Kirill A . Shutemov
  2022-10-20 10:50               ` Vishal Annapurve
  0 siblings, 1 reply; 97+ messages in thread
From: Kirill A . Shutemov @ 2022-10-19 15:32 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Gupta, Pankaj, Vlastimil Babka, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Yu Zhang, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > >
> > > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > >
> > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > memfd-provided memory may be encrypted with a special key for a special
> > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > memory may lead to host crash so it should be prevented.
> > > > > >
> > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > > the need to map the memory into KVM userspace.
> > > > > >
> > > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > > support that a file descriptor with this flag set is going to be used as
> > > > > > the source of guest memory in confidential computing environments such
> > > > > > as Intel TDX/AMD SEV.
> > > > > >
> > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > > in this patch to obtain the physical memory address and then populate
> > > > > > the secondary page table entries.
> > > > > >
> > > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > > inaccessible_notifier; it then gets a chance to remove any mapped entries
> > > > > > of the range in the secondary page tables.
> > > > > >
> > > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > > only implements tmpfs. The allocated memory is currently marked as
> > > > > > unmovable and unevictable; this is required for current confidential
> > > > > > usage. But in future this might be changed.
> > > > > >
> > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > > > ---
> > > > >
> > > > > ...
> > > > >
> > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > +                                  loff_t offset, loff_t len)
> > > > > > +{
> > > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > +       struct file *memfd = data->memfd;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > +                       return -EINVAL;
> > > > > > +       }
> > > > > > +
> > > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > >
> > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > a window where the page tables point to memory no longer valid?
> > > >
> > > > Yes, you are right. Thanks for catching this.
> > >
> > > I also noticed this. But then I thought the memory would be zeroed
> > > anyway (hole punched) before this call?
> >
> > Hole punching can free pages, given that offset/len covers a full page.
> >
> > --
> >   Kiryl Shutsemau / Kirill A. Shutemov
> 
> I think moving this notifier_invalidate before fallocate may not solve
> the problem completely. Is it possible that between invalidate and
> fallocate, KVM tries to handle the page fault for the guest VM from
> another vcpu and uses the pages to be freed to back gpa ranges? Should
> hole punching here also update mem_attr first to say that KVM should
> consider the corresponding gpa ranges to no longer be backed by
> inaccessible memfd?

We rely on external synchronization to prevent this. See code around
mmu_invalidate_retry_hva().
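
For reference, the retry pattern in question looks roughly like this
(a simplified sketch of the x86 fault path; locking details and the
retry label are elided):

  mmu_seq = kvm->mmu_invalidate_seq;
  smp_rmb();

  /* ... resolve the pfn to map ... */

  write_lock(&kvm->mmu_lock);
  if (mmu_invalidate_retry_hva(kvm, mmu_seq, hva)) {
          write_unlock(&kvm->mmu_lock);
          goto retry;     /* an invalidation raced with this fault */
  }
  /* ... install the mapping ... */
  write_unlock(&kvm->mmu_lock);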

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-19 15:02             ` Fuad Tabba
@ 2022-10-19 16:09               ` Sean Christopherson
  2022-10-19 18:32                 ` Fuad Tabba
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-10-19 16:09 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Oct 19, 2022, Fuad Tabba wrote:
> > > > This sounds good. Thank you.
> > >
> > > I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> > > something.  I highly doubt there will be any non-x86 users for multiple years,
> > > if ever, but it would allow testing the private memory stuff on ARM (and any other
> > > non-x86 arch) without needing full pKVM support and with only minor KVM
> > > modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> > > trivial.
> >
> > CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
> 
> That sounds good to me, and just keeping the xarray isn't really an
> issue for pKVM.

The xarray won't exist for pKVM if the #ifdefs in this patch are changed from
CONFIG_HAVE_KVM_PRIVATE_MEM => CONFIG_KVM_GENERIC_PRIVATE_MEM.

> We could end up using it instead of some of the other
> structures we use for tracking.

I don't think pKVM should hijack the xarray for other purposes.  At best, it will
be confusing, at worst we'll end up with a mess if ARM ever supports the "generic"
implementation.  

* Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions
  2022-10-19 16:09               ` Sean Christopherson
@ 2022-10-19 18:32                 ` Fuad Tabba
  0 siblings, 0 replies; 97+ messages in thread
From: Fuad Tabba @ 2022-10-19 18:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api,
	linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
	Yu Zhang, Kirill A . Shutemov, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Oct 19, 2022 at 5:09 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Oct 19, 2022, Fuad Tabba wrote:
> > > > > This sounds good. Thank you.
> > > >
> > > > I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> > > > something.  I highly doubt there will be any non-x86 users for multiple years,
> > > > if ever, but it would allow testing the private memory stuff on ARM (and any other
> > > > non-x86 arch) without needing full pKVM support and with only minor KVM
> > > > modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> > > > trivial.
> > >
> > > CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
> >
> > That sounds good to me, and just keeping the xarray isn't really an
> > issue for pKVM.
>
> The xarray won't exist for pKVM if the #ifdefs in this patch are changed from
> CONFIG_HAVE_KVM_PRIVATE_MEM => CONFIG_KVM_GENERIC_PRIVATE_MEM.
>
> > We could end up using it instead of some of the other
> > structures we use for tracking.
>
> I don't think pKVM should hijack the xarray for other purposes.  At best, it will
> be confusing, at worst we'll end up with a mess if ARM ever supports the "generic"
> implementation.

Definitely wasn't referring to hijacking it for other purposes, which
is the main reason I wanted to clarify the documentation and the
naming of private_fd. Anyway, I'm glad to see that we're in agreement.
Once I've tightened the screws, I'll share the pKVM port as an RFC as
well.

Cheers,
/fuad

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-19 15:32             ` Kirill A . Shutemov
@ 2022-10-20 10:50               ` Vishal Annapurve
  2022-10-21 13:54                 ` Chao Peng
  0 siblings, 1 reply; 97+ messages in thread
From: Vishal Annapurve @ 2022-10-20 10:50 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Gupta, Pankaj, Vlastimil Babka, Chao Peng, kvm, linux-kernel,
	linux-mm, linux-fsdevel, linux-api, linux-doc, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Yu Zhang, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > > >
> > > > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > >
> > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > memfd-provided memory may be encrypted with a special key for a special
> > > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > >
> > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > > > the need to map the memory into KVM userspace.
> > > > > > >
> > > > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > > > support that a file descriptor with this flag set is going to be used as
> > > > > > > the source of guest memory in confidential computing environments such
> > > > > > > as Intel TDX/AMD SEV.
> > > > > > >
> > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > > > in this patch to obtain the physical memory address and then populate
> > > > > > > the secondary page table entries.
> > > > > > >
> > > > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > > > inaccessible_notifier; it then gets a chance to remove any mapped entries
> > > > > > > of the range in the secondary page tables.
> > > > > > >
> > > > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > > > only implements tmpfs. The allocated memory is currently marked as
> > > > > > > unmovable and unevictable; this is required for current confidential
> > > > > > > usage. But in future this might be changed.
> > > > > > >
> > > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > > > > ---
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > +                                  loff_t offset, loff_t len)
> > > > > > > +{
> > > > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > > +       struct file *memfd = data->memfd;
> > > > > > > +       int ret;
> > > > > > > +
> > > > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > +                       return -EINVAL;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > > >
> > > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > > a window where the page tables point to memory no longer valid?
> > > > >
> > > > > Yes, you are right. Thanks for catching this.
> > > >
> > > > I also noticed this. But then I thought the memory would be zeroed
> > > > anyway (hole punched) before this call?
> > >
> > > Hole punching can free pages, given that offset/len covers a full page.
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> >
> > I think moving this notifier_invalidate before fallocate may not solve
> > the problem completely. Is it possible that between invalidate and
> > fallocate, KVM tries to handle the page fault for the guest VM from
> > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > hole punching here also update mem_attr first to say that KVM should
> > consider the corresponding gpa ranges to no longer be backed by
> > inaccessible memfd?
>
> We rely on external synchronization to prevent this. See code around
> mmu_invalidate_retry_hva().
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
ranges that are being invalidated are retried till invalidation is
complete. In this case, is it possible that KVM tries to serve the
page fault after inaccessible_notifier_invalidate is complete but
before fallocate could punch a hole in the file?
e.g.
inaccessible_notifier_invalidate(...)
... (system event preempting this control flow, giving a window for
the guest to retry accessing the gfn range which was invalidated)
fallocate(.., PUNCH_HOLE..)

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-19 12:23   ` Vishal Annapurve
@ 2022-10-21 13:47     ` Chao Peng
  2022-10-21 16:18       ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-10-21 13:47 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

> 
> In the context of userspace inaccessible memfd, what would be a
> suggested way to enforce NUMA memory policy for physical memory
> allocation? mbind[1] won't work here in the absence of a virtual address
> range.

How about set_mempolicy():
https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
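
Since pages for the memfd are allocated in the context of the thread that
triggers the allocation (e.g. via fallocate()), the thread-wide policy
would apply. A minimal sketch (nodemask, maxnode, memfd, and size are
assumed to be set up by the caller):

  #include <numaif.h>     /* set_mempolicy(); link with -lnuma */
  #include <fcntl.h>      /* fallocate(), needs _GNU_SOURCE */

  /* Bind this thread's allocations to the desired node(s) ... */
  set_mempolicy(MPOL_BIND, &nodemask, maxnode);
  /* ... preallocate the inaccessible memfd on those nodes ... */
  fallocate(memfd, 0, 0, size);
  /* ... and restore the default policy afterwards. */
  set_mempolicy(MPOL_DEFAULT, NULL, 0);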

Chao
> 
> [1] https://github.com/chao-p/qemu/blob/privmem-v8/backends/hostmem.c#L382

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-20 10:50               ` Vishal Annapurve
@ 2022-10-21 13:54                 ` Chao Peng
  2022-10-21 16:53                   ` Sean Christopherson
  0 siblings, 1 reply; 97+ messages in thread
From: Chao Peng @ 2022-10-21 13:54 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Kirill A . Shutemov, Gupta, Pankaj, Vlastimil Babka, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-api, linux-doc,
	qemu-devel, Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Yu Zhang, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > > > >
> > > > > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > > >
> > > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > > memfd-provided memory may be encrypted with a special key for a special
> > > > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > > >
> > > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > > > > the need to map the memory into KVM userspace.
> > > > > > > >
> > > > > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > > > > support that a file descriptor with this flag set is going to be used as
> > > > > > > > the source of guest memory in confidential computing environments such
> > > > > > > > as Intel TDX/AMD SEV.
> > > > > > > >
> > > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > > > > in this patch to obtain the physical memory address and then populate
> > > > > > > > the secondary page table entries.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > > > > inaccessible_notifier; it then gets a chance to remove any mapped entries
> > > > > > > > of the range in the secondary page tables.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > > > > only implements tmpfs. The allocated memory is currently marked as
> > > > > > > > unmovable and unevictable; this is required for current confidential
> > > > > > > > usage. But in future this might be changed.
> > > > > > > >
> > > > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > > > > > ---
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > > +                                  loff_t offset, loff_t len)
> > > > > > > > +{
> > > > > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > > > +       struct file *memfd = data->memfd;
> > > > > > > > +       int ret;
> > > > > > > > +
> > > > > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > > +                       return -EINVAL;
> > > > > > > > +       }
> > > > > > > > +
> > > > > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > > > >
> > > > > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > > > > a window where the page tables point to memory no longer valid?
> > > > > >
> > > > > > Yes, you are right. Thanks for catching this.
> > > > >
> > > > > I also noticed this. But then I thought the memory would be zeroed
> > > > > anyway (hole punched) before this call?
> > > >
> > > > Hole punching can free pages, given that offset/len covers a full page.
> > > >
> > > > --
> > > >   Kiryl Shutsemau / Kirill A. Shutemov
> > >
> > > I think moving this notifier_invalidate before fallocate may not solve
> > > the problem completely. Is it possible that between invalidate and
> > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > hole punching here also update mem_attr first to say that KVM should
> > > consider the corresponding gpa ranges to no longer be backed by
> > > inaccessible memfd?
> >
> > We rely on external synchronization to prevent this. See code around
> > mmu_invalidate_retry_hva().
> >
> > --
> >   Kiryl Shutsemau / Kirill A. Shutemov
> 
> IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> ranges that are being invalidated are retried till invalidation is
> complete. In this case, is it possible that KVM tries to serve the
> page fault after inaccessible_notifier_invalidate is complete but
> before fallocate could punch a hole in the file?
> e.g.
> inaccessible_notifier_invalidate(...)
> ... (system event preempting this control flow, giving a window for
> the guest to retry accessing the gfn range which was invalidated)
> fallocate(.., PUNCH_HOLE..)

Looks like this is something that can happen. And it sounds to me like
the solution just needs to follow the mmu_notifier's way of using an
invalidate_start/end pair.

  invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
                          zap KVM page table entries;
  fallocate()
  invalidate_end()  --> kvm->mmu_invalidate_in_progress--;

Then during the invalidate_start/end time window mmu_invalidate_retry_gfn
checks 'mmu_invalidate_in_progress' and prevents repopulating the same
page in the KVM page table.

  if(kvm->mmu_invalidate_in_progress)
      return 1; /* retry */

Thanks,
Chao


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-21 13:47     ` Chao Peng
@ 2022-10-21 16:18       ` Sean Christopherson
  2022-10-24 14:59         ` Kirill A . Shutemov
  0 siblings, 1 reply; 97+ messages in thread
From: Sean Christopherson @ 2022-10-21 16:18 UTC (permalink / raw)
  To: Chao Peng
  Cc: Vishal Annapurve, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
	Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
	aarcange, ddutile, dhildenb, Quentin Perret, Michael Roth,
	mhocko, Muchun Song, wei.w.wang

On Fri, Oct 21, 2022, Chao Peng wrote:
> > 
> > In the context of userspace inaccessible memfd, what would be a
> > suggested way to enforce NUMA memory policy for physical memory
> > allocation? mbind[1] won't work here in the absence of a virtual address
> > range.
> 
> How about set_mempolicy():
> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html

Andy Lutomirski brought this up in an off-list discussion way back when the whole
private-fd thing was first being proposed.

  : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
  : we want to support them for TDX private memory, we either need TDX private
  : memory to have an HVA or we need file-based equivalents. Arguably we should add
  : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
  : even for tools like numactl.

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-21 13:54                 ` Chao Peng
@ 2022-10-21 16:53                   ` Sean Christopherson
  0 siblings, 0 replies; 97+ messages in thread
From: Sean Christopherson @ 2022-10-21 16:53 UTC (permalink / raw)
  To: Chao Peng
  Cc: Vishal Annapurve, Kirill A . Shutemov, Gupta, Pankaj,
	Vlastimil Babka, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Yu Zhang, luto, jun.nakajima, dave.hansen,
	ak, david, aarcange, ddutile, dhildenb, Quentin Perret,
	Michael Roth, mhocko, Muchun Song, wei.w.wang

On Fri, Oct 21, 2022, Chao Peng wrote:
> On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> > On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > > I think moving this notifier_invalidate before fallocate may not solve
> > > > the problem completely. Is it possible that between invalidate and
> > > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > > hole punching here also update mem_attr first to say that KVM should
> > > > consider the corresponding gpa ranges to no longer be backed by
> > > > inaccessible memfd?
> > >
> > > We rely on external synchronization to prevent this. See code around
> > > mmu_invalidate_retry_hva().
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> > 
> > IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> > ranges that are being invalidated are retried till invalidation is
> > complete. In this case, is it possible that KVM tries to serve the
> > page fault after inaccessible_notifier_invalidate is complete but
> > before fallocate could punch a hole in the file?

It's not just the page fault edge case.  In the more straightforward scenario
where the memory is already mapped into the guest, freeing pages back to the kernel
before they are removed from the guest will lead to use-after-free.

> > e.g.
> > inaccessible_notifier_invalidate(...)
> > ... (system event preempting this control flow, giving a window for
> > the guest to retry accessing the gfn range which was invalidated)
> > fallocate(.., PUNCH_HOLE..)
> 
> Looks like this is something that can happen.
> And it sounds to me like the solution just needs
> to follow the mmu_notifier's way of using an invalidate_start/end pair.
> 
>   invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
>                           zap KVM page table entries;
>   fallocate()
>   invalidate_end()  --> kvm->mmu_invalidate_in_progress--;
> 
> Then during the invalidate_start/end time window mmu_invalidate_retry_gfn
> checks 'mmu_invalidate_in_progress' and prevents repopulating the same
> page in the KVM page table.

Yes, if it's not safe to invalidate after making the change (fallocate()), then
the change needs to be bookended by a start+end pair.  The mmu_notifier's unpaired
invalidate() hook works by zapping the primary MMU's PTEs before invalidate(), but
frees the underlying physical page _after_ invalidate().

And the only reason the unpaired invalidate() exists is because there are secondary
MMUs that reuse the primary MMU's page tables, e.g. shared virtual addressing, in
which case bookending doesn't work because the secondary MMU can't remove PTEs, it
can only flush its TLBs.

For this case, the whole point is to not create PTEs in the primary MMU, so there
should never be a use case that _needs_ an unpaired invalidate().

TL;DR: a start+end pair is likely the simplest solution.
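
Concretely, the fallocate path could then look like this (a sketch of the
start+end pairing applied to the code quoted above; the notifier names are
placeholders, not the final API):

  static long inaccessible_fallocate(struct file *file, int mode,
                                     loff_t offset, loff_t len)
  {
          struct inaccessible_data *data = file->f_mapping->private_data;
          struct file *memfd = data->memfd;
          int ret;

          if (mode & FALLOC_FL_PUNCH_HOLE &&
              (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)))
                  return -EINVAL;

          /* Raise mmu_invalidate_in_progress and zap existing entries. */
          inaccessible_notifier_invalidate_start(data, offset, offset + len);

          ret = memfd->f_op->fallocate(memfd, mode, offset, len);

          /* Drop mmu_invalidate_in_progress; faults may repopulate now. */
          inaccessible_notifier_invalidate_end(data, offset, offset + len);

          return ret;
  }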

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-21 16:18       ` Sean Christopherson
@ 2022-10-24 14:59         ` Kirill A . Shutemov
  2022-10-24 15:26           ` David Hildenbrand
  2022-11-03 16:27           ` Vishal Annapurve
  0 siblings, 2 replies; 97+ messages in thread
From: Kirill A . Shutemov @ 2022-10-24 14:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Peng, Vishal Annapurve, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> On Fri, Oct 21, 2022, Chao Peng wrote:
> > > 
> > > In the context of userspace inaccessible memfd, what would be a
> > > suggested way to enforce NUMA memory policy for physical memory
> > > allocation? mbind[1] won't work here in the absence of a virtual address
> > > range.
> > 
> > How about set_mempolicy():
> > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> 
> Andy Lutomirski brought this up in an off-list discussion way back when the whole
> private-fd thing was first being proposed.
> 
>   : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
>   : we want to support them for TDX private memory, we either need TDX private
>   : memory to have an HVA or we need file-based equivalents. Arguably we should add
>   : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
>   : even for tools like numactl.

Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it
has to be addressed in the initial submission.

BTW, it is not a regression compared to old KVM slots, if the memory is
backed by a memfd or other file:

MBIND(2)
       The  specified policy will be ignored for any MAP_SHARED mappings in the
       specified memory range.  Rather the pages will be allocated according to
       the  memory  policy  of the thread that caused the page to be allocated.
       Again, this may not be the thread that called mbind().

It is not clear how to define fbind(2) semantics, considering that multiple
processes may compete for the same region of page cache.

Should it be per-inode or per-fd? Or maybe per-range in inode/fd?

fmove_pages(2) should be relatively straightforward, since it is
best-effort and does not guarantee that the page will not be moved
somewhere else just after return from the syscall.
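
For the per-range option, a purely hypothetical interface might look like:

  /* Hypothetical sketch; no such syscall exists today. This would be the
   * file-based analogue of mbind(2)'s address-range form. */
  long fbind(int fd, loff_t offset, loff_t len, int mode,
             const unsigned long *nodemask, unsigned long maxnode,
             unsigned int flags);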

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-24 14:59         ` Kirill A . Shutemov
@ 2022-10-24 15:26           ` David Hildenbrand
  2022-11-03 16:27           ` Vishal Annapurve
  1 sibling, 0 replies; 97+ messages in thread
From: David Hildenbrand @ 2022-10-24 15:26 UTC (permalink / raw)
  To: Kirill A . Shutemov, Sean Christopherson
  Cc: Chao Peng, Vishal Annapurve, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang, luto,
	jun.nakajima, dave.hansen, ak, aarcange, ddutile, dhildenb,
	Quentin Perret, Michael Roth, mhocko, Muchun Song, wei.w.wang

On 24.10.22 16:59, Kirill A . Shutemov wrote:
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
>> On Fri, Oct 21, 2022, Chao Peng wrote:
>>>>
>>>> In the context of userspace inaccessible memfd, what would be a
>>>> suggested way to enforce NUMA memory policy for physical memory
>>>> allocation? mbind[1] won't work here in the absence of a virtual address
>>>> range.
>>>
>>> How about set_mempolicy():
>>> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
>>
>> Andy Lutomirski brought this up in an off-list discussion way back when the whole
>> private-fd thing was first being proposed.
>>
>>    : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
>>    : we want to support them for TDX private memory, we either need TDX private
>>    : memory to have an HVA or we need file-based equivalents. Arguably we should add
>>    : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
>>    : even for tools like numactl.
> 
> Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it
> has to be addressed in the initial submission.
> 
> BTW, it is not a regression compared to old KVM slots, if the memory is
> backed by a memfd or other file:
> 
> MBIND(2)
>         The  specified policy will be ignored for any MAP_SHARED mappings in the
>         specified memory range.  Rather the pages will be allocated according to
>         the  memory  policy  of the thread that caused the page to be allocated.
>         Again, this may not be the thread that called mbind().

IIRC, that documentation is imprecise/incorrect especially when it comes 
to memfd. Page faults in shared mappings will similarly obey the set 
mbind() policy when allocating new pages.

QEMU relies on that.

The "fun" begins when we have multiple mappings, and only some have a 
policy set ... or if we already, previously allocated the pages.

-- 
Thanks,

David / dhildenb


* Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
  2022-10-24 14:59         ` Kirill A . Shutemov
  2022-10-24 15:26           ` David Hildenbrand
@ 2022-11-03 16:27           ` Vishal Annapurve
  1 sibling, 0 replies; 97+ messages in thread
From: Vishal Annapurve @ 2022-11-03 16:27 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: Sean Christopherson, Chao Peng, kvm, linux-kernel, linux-mm,
	linux-fsdevel, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang, luto,
	jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
	dhildenb, Quentin Perret, Michael Roth, mhocko, Muchun Song,
	wei.w.wang

On Mon, Oct 24, 2022 at 8:30 PM Kirill A . Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Chao Peng wrote:
> > > >
> > > > In the context of userspace inaccessible memfd, what would be a
> > > > suggested way to enforce NUMA memory policy for physical memory
> > > > allocation? mbind[1] won't work here in the absence of a virtual address
> > > > range.
> > >
> > > How about set_mempolicy():
> > > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> >
> > Andy Lutomirski brought this up in an off-list discussion way back when the whole
> > private-fd thing was first being proposed.
> >
> >   : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
> >   : we want to support them for TDX private memory, we either need TDX private
> >   : memory to have an HVA or we need file-based equivalents. Arguably we should add
> >   : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
> >   : even for tools like numactl.
>
> Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it
> has to be addressed in the initial submission.
>
> BTW, it is not a regression compared to old KVM slots, if the memory is
> backed by a memfd or other file:
>
> MBIND(2)
>        The  specified policy will be ignored for any MAP_SHARED mappings in the
>        specified memory range.  Rather the pages will be allocated according to
>        the  memory  policy  of the thread that caused the page to be allocated.
>        Again, this may not be the thread that called mbind().
>
> It is not clear how to define fbind(2) semantics, considering that multiple
> processes may compete for the same region of page cache.
>
> Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
>

David's analysis of mempolicy with shmem seems to be right. set_policy
on a virtual address range does seem to change the shared policy for the
inode irrespective of the mapping type.

Maybe having a way to set the NUMA policy per-range in the inode would be
on par with what we can do today via mbind on virtual address ranges.



> fmove_pages(2) should be relatively straightforward, since it is
> best-effort and does not guarantee that the page will not be moved
> somewhere else just after return from the syscall.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Thread overview: 97+ messages
2022-09-15 14:29 [PATCH v8 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
2022-09-15 14:29 ` [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd Chao Peng
2022-09-19  9:12   ` David Hildenbrand
2022-09-19 19:10     ` Sean Christopherson
2022-09-21 21:10       ` Andy Lutomirski
2022-09-22 13:23         ` Wang, Wei W
2022-09-23 15:20         ` Fuad Tabba
2022-09-23 15:19       ` Fuad Tabba
2022-09-26 14:23         ` Chao Peng
2022-09-26 15:51           ` Fuad Tabba
2022-09-27 22:47             ` Sean Christopherson
2022-09-30 16:19               ` Fuad Tabba
2022-10-13 13:34                 ` Chao Peng
2022-10-17 10:31                   ` Fuad Tabba
2022-10-17 14:58                     ` Chao Peng
2022-10-17 19:05                       ` Fuad Tabba
2022-10-19 13:30                         ` Chao Peng
2022-10-18  0:33                 ` Sean Christopherson
2022-10-19 15:04                   ` Fuad Tabba
2022-09-23  0:58     ` Kirill A . Shutemov
2022-09-26 10:35       ` David Hildenbrand
2022-09-26 14:48         ` Kirill A. Shutemov
2022-09-26 14:53           ` David Hildenbrand
2022-09-27 23:23             ` Sean Christopherson
2022-09-28 13:36               ` Kirill A. Shutemov
2022-09-22 13:26   ` Wang, Wei W
2022-09-22 19:49     ` Sean Christopherson
2022-09-23  0:53       ` Kirill A . Shutemov
2022-09-23 15:20         ` Fuad Tabba
2022-09-30 16:14   ` Fuad Tabba
2022-09-30 16:23     ` Kirill A . Shutemov
2022-10-03  7:33       ` Fuad Tabba
2022-10-03 11:01         ` Kirill A. Shutemov
2022-10-04 15:39           ` Fuad Tabba
2022-10-06  8:50   ` Fuad Tabba
2022-10-06 13:04     ` Kirill A. Shutemov
2022-10-17 13:00   ` Vlastimil Babka
2022-10-17 16:19     ` Kirill A . Shutemov
2022-10-17 16:39       ` Gupta, Pankaj
2022-10-17 21:56         ` Kirill A . Shutemov
2022-10-18 13:42           ` Vishal Annapurve
2022-10-19 15:32             ` Kirill A . Shutemov
2022-10-20 10:50               ` Vishal Annapurve
2022-10-21 13:54                 ` Chao Peng
2022-10-21 16:53                   ` Sean Christopherson
2022-10-19 12:23   ` Vishal Annapurve
2022-10-21 13:47     ` Chao Peng
2022-10-21 16:18       ` Sean Christopherson
2022-10-24 14:59         ` Kirill A . Shutemov
2022-10-24 15:26           ` David Hildenbrand
2022-11-03 16:27           ` Vishal Annapurve
2022-09-15 14:29 ` [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-09-16  9:14   ` Bagas Sanjaya
2022-09-16  9:53     ` Chao Peng
2022-09-26 10:26   ` Fuad Tabba
2022-09-26 14:04     ` Chao Peng
2022-09-29 22:45   ` Isaku Yamahata
2022-09-29 23:22     ` Sean Christopherson
2022-10-05 13:04   ` Jarkko Sakkinen
2022-10-05 22:05     ` Jarkko Sakkinen
2022-10-06  9:00   ` Fuad Tabba
2022-10-06 14:58   ` Jarkko Sakkinen
2022-10-06 15:07     ` Jarkko Sakkinen
2022-10-06 15:34       ` Sean Christopherson
2022-10-07 11:14         ` Jarkko Sakkinen
2022-10-07 14:58           ` Sean Christopherson
2022-10-07 21:54             ` Jarkko Sakkinen
2022-10-08 16:15               ` Jarkko Sakkinen
2022-10-08 17:35                 ` Jarkko Sakkinen
2022-10-10  8:25                   ` Chao Peng
2022-10-12  8:14                     ` Jarkko Sakkinen
2022-09-15 14:29 ` [PATCH v8 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
2022-09-16  9:17   ` Bagas Sanjaya
2022-09-16  9:54     ` Chao Peng
2022-09-15 14:29 ` [PATCH v8 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
2022-09-15 14:29 ` [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
2022-09-26 10:36   ` Fuad Tabba
2022-09-26 14:07     ` Chao Peng
2022-10-11  9:48   ` Fuad Tabba
2022-10-12  2:35     ` Chao Peng
2022-10-17 10:15       ` Fuad Tabba
2022-10-17 22:17         ` Sean Christopherson
2022-10-19 13:23           ` Chao Peng
2022-10-19 15:02             ` Fuad Tabba
2022-10-19 16:09               ` Sean Christopherson
2022-10-19 18:32                 ` Fuad Tabba
2022-09-15 14:29 ` [PATCH v8 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
2022-09-29 16:52   ` Isaku Yamahata
2022-09-30  8:59     ` Chao Peng
2022-09-15 14:29 ` [PATCH v8 7/8] KVM: Handle page fault for private memory Chao Peng
2022-10-14 18:57   ` Sean Christopherson
2022-10-17 14:48     ` Chao Peng
2022-09-15 14:29 ` [PATCH v8 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-10-04 14:55   ` Jarkko Sakkinen
2022-10-10  8:31     ` Chao Peng
2022-10-06  8:55   ` Fuad Tabba
2022-10-10  8:33     ` Chao Peng
