* [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
@ 2022-12-02 6:13 Chao Peng
2022-12-02 6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
` (10 more replies)
0 siblings, 11 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If the host accesses TDX-protected
guest memory, a machine check can occur and crash the running host
system, which is unacceptable for multi-tenant configurations. Host
accesses include those from KVM userspace like QEMU. This series
addresses the KVM-userspace-induced crash by introducing new mm and KVM
interfaces so that KVM userspace can still manage guest memory via an
fd-based approach, but can never access the guest memory content.
The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick up these patches.
Any other reviews are always welcome.
- 01: mm change, target for mm tree
- 02-09: KVM change, target for KVM tree
Given that KVM is currently the only user of the mm part, I have chatted
with Paolo and he is OK with merging the mm change through the KVM tree,
but Reviewed-by/Acked-by tags are still expected from the mm people.
The patches have been verified in an Intel TDX environment, but Vishal
has done excellent work on the selftests[4] dedicated to this series,
making it possible to test the series without special hardware or the
elaborate steps of building a VM environment. See the Test section below
for more info.
Introduction
============
KVM userspace being able to crash the host is unacceptable. Under the
current KVM architecture, all guest memory is inherently accessible from
KVM userspace and is therefore exposed to the crash issue described
above. The goal of this series is to align mm and KVM on an approach
that exposes guest memory to KVM without making it accessible to
userspace.
Normally, KVM populates the secondary page table (e.g. EPT) using a host
virtual address (hva) from the core mm page table (e.g. the x86 userspace
page table). This requires guest memory to be mmaped into KVM userspace,
which is also where the crash issue described above originates. In
theory, apart from the 'shared' memory used for device emulation etc.,
guest memory does not have to be mmaped into KVM userspace.
This series introduces fd-based guest memory which will not be mmaped
into KVM userspace. KVM populates the secondary page table using a
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs, and KVM
interacts with it directly through a newly introduced in-kernel
interface, thereby removing KVM userspace from the path of
accessing/mmaping the guest memory.
Kirill had a patch [2] to address the same issue in a different way. It
tracks guest encrypted memory at the 'struct page' level and relies on
HWPOISON to reject userspace access. The patch has been discussed in
several online and offline threads and resulted in a design document [3],
which is also the original proposal for this series. This series later
evolved as more comments were received from the community, but the major
concepts in [3] still hold true, so it is recommended reading.
The patch series may also be useful for other use cases; for example, a
pure software approach may use it to harden itself against unintentional
access to guest memory. This series is designed with such use cases in
mind but doesn't contain code to directly support them, so extensions
might be needed.
mm change
=========
Introduces a new memfd_restricted() system call which creates a memory
file that is restricted from userspace access via normal operations like
read(), write() or mmap(). The only way to use it is to pass the fd to
an in-kernel user like KVM, which accesses the fd through the newly
added restrictedmem kernel interface. The restrictedmem interface
bridges the memory file subsystems (tmpfs/hugetlbfs etc.) and their
users (KVM in this case) and provides bi-directional communication
between them.
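For illustration, here is a minimal userspace sketch of creating and
sizing a restricted memfd. The syscall number matches what this series
wires up on x86, sizing via ftruncate() relies on the setattr support
added in patch 1, and guest_mem_size is a placeholder; error handling is
minimal:

    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_memfd_restricted
    #define __NR_memfd_restricted 451    /* as wired up in this series */
    #endif

    static int create_restricted_guest_mem(off_t guest_mem_size)
    {
            /* No flags are defined yet, so pass 0. */
            int memfd = syscall(__NR_memfd_restricted, 0);

            if (memfd < 0)
                    return -1;

            /* Size the file; read()/write()/mmap() on this fd are rejected. */
            if (ftruncate(memfd, guest_mem_size) < 0) {
                    close(memfd);
                    return -1;
            }
            return memfd;
    }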
KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory from
a fd. With this extension, a single memslot can maintain both private
memory through the private fd (restricted_fd/restricted_offset) and
shared (unencrypted) memory through the userspace mmaped host virtual
address (userspace_addr). For a particular guest page, the corresponding
page in the KVM memslot can be either private or shared, and only one of
the shared/private parts of the memslot is visible to the guest at a
time. For how this new extension is used in QEMU, please refer to
kvm_set_phys_mem() in the TDX-enabled QEMU repo below.
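As a rough, non-authoritative sketch (not the actual QEMU code),
registering such a memslot once KVM_MEM_PRIVATE is enabled by the final
patch could look like the following, where vm_fd, memfd, shared_hva and
guest_mem_size are placeholders:

    struct kvm_userspace_memory_region_ext region = {
            .region = {
                    .slot            = 0,
                    .flags           = KVM_MEM_PRIVATE,
                    .guest_phys_addr = 0,
                    .memory_size     = guest_mem_size,
                    .userspace_addr  = (__u64)shared_hva, /* shared part, mmaped */
            },
            .restricted_fd     = memfd, /* from memfd_restricted() */
            .restricted_offset = 0,     /* private part starts at fd offset 0 */
    };

    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);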
Introduces a new KVM_EXIT_MEMORY_FAULT exit to give userspace the chance
to make decisions on shared <-> private memory conversion. The exit can
be triggered by an implicit conversion in the KVM page fault handler or
by an explicit conversion request from the guest OS.
Introduces a new KVM ioctl, KVM_SET_MEMORY_ATTRIBUTES, to maintain
whether a page is private or shared. This ioctl allows userspace to
convert a page between private and shared. The maintained data is the
source of truth for whether a guest page is private or shared, and is
used in the KVM page fault handler to decide whether the private or the
shared part of the memslot is visible to the guest.
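For example, a hedged sketch of converting one guest page to private
from userspace, where vm_fd and gpa are placeholders:

    struct kvm_memory_attributes attrs = {
            .address    = gpa,     /* page-aligned guest physical address */
            .size       = 0x1000,  /* one 4K page */
            .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
            .flags      = 0,       /* must be zero */
    };

    ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
    /* On return, attrs.address/attrs.size describe any remaining range
     * that was not successfully updated. */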
Test
====
Ran two kinds of tests:
- Selftests [4] from Vishal and VM boot tests in a non-TDX environment
  Code is also available in the repo below: https://github.com/chao-p/linux/tree/privmem-v10
- Functional tests in a TDX-capable environment
  Tested the new functionality in a TDX environment. Code repos:
Linux: https://github.com/chao-p/linux/tree/privmem-v10-tdx
QEMU: https://github.com/chao-p/qemu/tree/privmem-v10
An example QEMU command line for TDX test:
-object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
-machine confidential-guest-support=tdx \
-object memory-backend-memfd-private,id=ram1,size=${mem} \
-machine memory-backend=ram1
TODO
====
- Page accounting and limiting for encrypted memory
- hugetlbfs support
Changelog
=========
v10:
- mm: hook up restricted_memfd to memory failure and route it to
kernel users through .error() callback.
- mm: call invalidate() notifier only for FALLOC_FL_PUNCH_HOLE, i.e.
not for allocation.
- KVM: introduce new ioctl KVM_SET_MEMORY_ATTRIBUTES for memory
conversion instead of reusing KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
- KVM: refine gfn-based mmu_notifier_retry() mechanism.
- KVM: improve lpage_info updating code.
- KVM: fix the bug in private memory handling that a private fault may
fall into a non-private memslot.
- KVM: handle memory machine check error for fd-based memory.
v9:
- mm: move inaccessible memfd into separated syscall.
- mm: return page instead of pfn_t for inaccessible_get_pfn and remove
inaccessible_put_pfn.
- KVM: rename inaccessible/private to restricted and CONFIG change to
make the code friendly to pKVM.
- KVM: add invalidate_begin/end pair to fix race contention and revise
the lock protection for invalidation path.
- KVM: optimize setting lpage_info for > 2M level by directly accessing
  the lower level's result.
- KVM: avoid loading the xarray in kvm_mmu_max_mapping_level() and
  instead let the caller pass in is_private.
- KVM: API doc improvement.
v8:
- mm: redesign the mm part by introducing a shim layer (inaccessible_memfd)
  in memfd to avoid touching the memory file systems directly.
- mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
  causes confusion in this series; will send it out separately.
- doc: exclude the man page change, it's not a kernel patch and will be
  sent out separately.
- KVM: adapt to use the new mm inaccessible_memfd interface.
- KVM: update lpage_info when setting mem_attr_array to support
large page.
- KVM: change from xa_store_range to xa_store for mem_attr_array because
  xa_store_range overrides all entries, which is not the intended
  behavior for us.
- KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
- KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
handling code suggested by Sean.
v7:
- mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
- KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
private/shared info.
- KVM: use similar sync mechanism between zap/page fault paths as
mmu_notifier for memfile_notifier based invalidation.
v6:
- mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
  feature consistency among all memfile_notifier users and get rid of
  internal flags like SHM_F_INACCESSIBLE.
- mm: make pfn_ops callbacks being members of memfile_backing_store
and then refer to it directly in memfile_notifier.
- mm: remove backing store unregister.
- mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
- KVM: reorganize patch sequence for page fault handling and private
memory enabling.
v5:
- Add man page for the MFD_INACCESSIBLE flag and improve the KVM API doc
  for the new memslot extensions.
- mm: introduce memfile_{un}register_backing_store to allow a memory
  backing store to register/unregister itself with memfile_notifier.
- mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
(SHM_F_INACCESSIBLE for shmem) instead.
- mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
MFD_INACCESSIBLE memory.
- KVM: remove the overlap check for mapping the same file+offset into
  multiple gfns for performance reasons; warned in the documentation.
v4:
- mm: rename memfd_ops to memfile_notifier and separate it from
memfd.c to standalone memfile-notifier.c.
- KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
registering multiple memslots to the same memory backing store.
- KVM: add a 'kvm' reference in memslot so that we can recover kvm in
memfile_notifier handlers.
- KVM: add 'private_' prefix for the new fields in memslot.
- KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
- Remove 'RFC' prefix.
- Fix race condition between memfile_notifier handlers and kvm destroy.
- mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
setting F_SEAL_INACCESSIBLE when the fd is created.
- KVM: add the shared part of the memslot back to make private/shared
pages live in one memslot.
Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com/T/
[3] Original design proposal:
https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com/
[4] Selftest:
https://lore.kernel.org/all/20221111014244.1714148-1-vannapurve@google.com/
Chao Peng (8):
KVM: Introduce per-page memory attributes
KVM: Extend the memslot to support fd-based private memory
KVM: Add KVM_EXIT_MEMORY_FAULT exit
KVM: Use gfn instead of hva for mmu_notifier_retry
KVM: Unmap existing mappings when change the memory attributes
KVM: Update lpage info when private/shared memory are mixed
KVM: Handle page fault for private memory
KVM: Enable and expose KVM_MEM_PRIVATE
Kirill A. Shutemov (1):
mm: Introduce memfd_restricted system call to create restricted user
memory
Documentation/virt/kvm/api.rst | 125 ++++++-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/kvm_host.h | 9 +
arch/x86/kvm/Kconfig | 3 +
arch/x86/kvm/mmu/mmu.c | 205 ++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 14 +-
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/x86.c | 17 +-
include/linux/kvm_host.h | 103 +++++-
include/linux/restrictedmem.h | 71 ++++
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/kvm.h | 53 +++
include/uapi/linux/magic.h | 1 +
kernel/sys_ni.c | 3 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/memory-failure.c | 3 +
mm/restrictedmem.c | 318 +++++++++++++++++
virt/kvm/Kconfig | 6 +
virt/kvm/kvm_main.c | 469 +++++++++++++++++++++----
23 files changed, 1323 insertions(+), 93 deletions(-)
create mode 100644 include/linux/restrictedmem.h
create mode 100644 mm/restrictedmem.c
base-commit: df0bb47baa95aad133820b149851d5b94cbc6790
--
2.25.1
* [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-06 14:57 ` Fuad Tabba
` (4 more replies)
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
` (9 subsequent siblings)
10 siblings, 5 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Introduce a 'memfd_restricted' system call with the ability to create
memory areas that are restricted from userspace access through ordinary
MMU operations (e.g. read/write/mmap). The memory content is expected to
be used through the new in-kernel interface by another kernel module
(e.g. KVM).
memfd_restricted() is useful for scenarios where a file descriptor (fd)
can be used as an interface into mm but userspace's ability to operate
on the fd needs to be restricted. Initially it is designed to provide
protection for KVM encrypted guest memory.
Normally KVM uses memfd memory by mmaping the memfd into KVM userspace
(e.g. QEMU) and then using the mmaped virtual address to set up the
mapping in the KVM secondary page table (e.g. EPT). With confidential
computing technologies like Intel TDX, the memfd memory may be encrypted
with a special key for a specific software domain (e.g. a KVM guest) and
is not expected to be directly accessed by userspace. More precisely,
userspace access to such encrypted memory may lead to a host crash, so
it should be prevented.
memfd_restricted() provides the semantics required for KVM guest
encrypted memory support: a fd created with memfd_restricted() is going
to be used as the source of guest memory in a confidential computing
environment, and KVM can directly interact with core-mm without the need
to expose the memory content to KVM userspace.
KVM userspace is still in charge of the lifecycle of the fd. It should
pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
obtain the physical memory page and then uses it to populate the KVM
secondary page table entries.
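A minimal, illustrative sketch of the consumer side (the real user is
KVM's fault path in a later patch; 'file' and 'offset' stand for the
restricted fd's struct file and the page offset derived from the
faulting gfn):

    struct page *page;
    int order, ret;

    ret = restrictedmem_get_page(file, offset, &page, &order);
    if (ret)
            return ret;

    /* Map page_to_pfn(page) into the secondary page table, at up to
     * 'order' granularity, then drop the reference. */
    put_page(page);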
The userspace restricted memfd can be fallocate-ed or hole-punched from
userspace. When a hole is punched, KVM gets notified through the
invalidate_start()/invalidate_end() callbacks and then has the chance to
remove any mapped entries of the range in the secondary page tables.
A machine check can happen for memory pages in the restricted memfd.
Instead of routing this directly to userspace, we call the error()
callback that KVM registered; KVM then gets the chance to handle it
correctly.
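A sketch of how a kernel user wires up these callbacks, using the
restrictedmem_notifier_ops/restrictedmem_notifier definitions added
below; the handler bodies and names are placeholders:

    static void kvm_priv_invalidate_start(struct restrictedmem_notifier *nb,
                                          pgoff_t start, pgoff_t end)
    {
            /* e.g. zap secondary page table entries for [start, end) */
    }

    static void kvm_priv_invalidate_end(struct restrictedmem_notifier *nb,
                                        pgoff_t start, pgoff_t end)
    {
            /* e.g. allow page faults on the range again */
    }

    static void kvm_priv_error(struct restrictedmem_notifier *nb,
                               pgoff_t start, pgoff_t end)
    {
            /* e.g. mark the range poisoned / inject a machine check */
    }

    static const struct restrictedmem_notifier_ops kvm_priv_ops = {
            .invalidate_start = kvm_priv_invalidate_start,
            .invalidate_end   = kvm_priv_invalidate_end,
            .error            = kvm_priv_error,
    };

    /* notifier->ops = &kvm_priv_ops; then: */
    restrictedmem_register_notifier(file, notifier);
    /* ... */
    restrictedmem_unregister_notifier(file, notifier);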
memfd_restricted() itself is implemented as a shim layer on top of real
memory file systems (currently tmpfs). Pages in restrictedmem are marked
as unmovable and unevictable; this is required for the current
confidential usage, but it might be changed in the future.
By default memfd_restricted() prevents userspace read, write and mmap.
By defining new bits in the 'flags', it can be extended to support other
restricted semantics in the future.
The system call is currently wired up for x86 arch.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/restrictedmem.h | 71 ++++++
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/magic.h | 1 +
kernel/sys_ni.c | 3 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/memory-failure.c | 3 +
mm/restrictedmem.c | 318 +++++++++++++++++++++++++
11 files changed, 408 insertions(+), 1 deletion(-)
create mode 100644 include/linux/restrictedmem.h
create mode 100644 mm/restrictedmem.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..dc70ba90247e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
448 i386 process_mrelease sys_process_mrelease
449 i386 futex_waitv sys_futex_waitv
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
+451 i386 memfd_restricted sys_memfd_restricted
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..06516abc8318 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common memfd_restricted sys_memfd_restricted
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
new file mode 100644
index 000000000000..c2700c5daa43
--- /dev/null
+++ b/include/linux/restrictedmem.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_RESTRICTEDMEM_H
+#define _LINUX_RESTRICTEDMEM_H
+
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/pfn_t.h>
+
+struct restrictedmem_notifier;
+
+struct restrictedmem_notifier_ops {
+ void (*invalidate_start)(struct restrictedmem_notifier *notifier,
+ pgoff_t start, pgoff_t end);
+ void (*invalidate_end)(struct restrictedmem_notifier *notifier,
+ pgoff_t start, pgoff_t end);
+ void (*error)(struct restrictedmem_notifier *notifier,
+ pgoff_t start, pgoff_t end);
+};
+
+struct restrictedmem_notifier {
+ struct list_head list;
+ const struct restrictedmem_notifier_ops *ops;
+};
+
+#ifdef CONFIG_RESTRICTEDMEM
+
+void restrictedmem_register_notifier(struct file *file,
+ struct restrictedmem_notifier *notifier);
+void restrictedmem_unregister_notifier(struct file *file,
+ struct restrictedmem_notifier *notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+ struct page **pagep, int *order);
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+ return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
+}
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping);
+
+#else
+
+static inline void restrictedmem_register_notifier(struct file *file,
+ struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline void restrictedmem_unregister_notifier(struct file *file,
+ struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
+ struct page **pagep, int *order)
+{
+ return -1;
+}
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+ return false;
+}
+
+static inline void restrictedmem_error_page(struct page *page,
+ struct address_space *mapping)
+{
+}
+
+#endif /* CONFIG_RESTRICTEDMEM */
+
+#endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..f9e9e0c820c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
+asmlinkage long sys_memfd_restricted(unsigned int flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..e93cd35e46d0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_memfd_restricted 451
+__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..8aa38324b90a 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
#define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
+#define RESTRICTEDMEM_MAGIC 0x5245534d /* "RESM" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..7c4a32cbd2e7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
/* memfd_secret */
COND_SYSCALL(memfd_secret);
+/* memfd_restricted */
+COND_SYSCALL(memfd_restricted);
+
/*
* Architecture specific weak syscall entries.
*/
diff --git a/mm/Kconfig b/mm/Kconfig
index 57e1d8c5b505..06b0e1d6b8c1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1076,6 +1076,10 @@ config IO_MAPPING
config SECRETMEM
def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+config RESTRICTEDMEM
+ bool
+ depends on TMPFS
+
config ANON_VMA_NAME
bool "Anonymous VMA name support"
depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..bcbb0edf9ba1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
obj-$(CONFIG_SECRETMEM) += secretmem.o
+obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 145bb561ddb3..f91b444e471e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -62,6 +62,7 @@
#include <linux/page-isolation.h>
#include <linux/pagewalk.h>
#include <linux/shmem_fs.h>
+#include <linux/restrictedmem.h>
#include "swap.h"
#include "internal.h"
#include "ras/ras_event.h"
@@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
goto out;
}
+ restrictedmem_error_page(p, mapping);
+
/*
* The shmem page is kept in page cache instead of truncating
* so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
new file mode 100644
index 000000000000..56953c204e5c
--- /dev/null
+++ b/mm/restrictedmem.c
@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/sbitmap.h"
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <linux/syscalls.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+#include <linux/restrictedmem.h>
+
+struct restrictedmem_data {
+ struct mutex lock;
+ struct file *memfd;
+ struct list_head notifiers;
+};
+
+static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
+ pgoff_t start, pgoff_t end)
+{
+ struct restrictedmem_notifier *notifier;
+
+ mutex_lock(&data->lock);
+ list_for_each_entry(notifier, &data->notifiers, list) {
+ notifier->ops->invalidate_start(notifier, start, end);
+ }
+ mutex_unlock(&data->lock);
+}
+
+static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
+ pgoff_t start, pgoff_t end)
+{
+ struct restrictedmem_notifier *notifier;
+
+ mutex_lock(&data->lock);
+ list_for_each_entry(notifier, &data->notifiers, list) {
+ notifier->ops->invalidate_end(notifier, start, end);
+ }
+ mutex_unlock(&data->lock);
+}
+
+static void restrictedmem_notifier_error(struct restrictedmem_data *data,
+ pgoff_t start, pgoff_t end)
+{
+ struct restrictedmem_notifier *notifier;
+
+ mutex_lock(&data->lock);
+ list_for_each_entry(notifier, &data->notifiers, list) {
+ notifier->ops->error(notifier, start, end);
+ }
+ mutex_unlock(&data->lock);
+}
+
+static int restrictedmem_release(struct inode *inode, struct file *file)
+{
+ struct restrictedmem_data *data = inode->i_mapping->private_data;
+
+ fput(data->memfd);
+ kfree(data);
+ return 0;
+}
+
+static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
+ loff_t offset, loff_t len)
+{
+ int ret;
+ pgoff_t start, end;
+ struct file *memfd = data->memfd;
+
+ if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+ return -EINVAL;
+
+ start = offset >> PAGE_SHIFT;
+ end = (offset + len) >> PAGE_SHIFT;
+
+ restrictedmem_invalidate_start(data, start, end);
+ ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+ restrictedmem_invalidate_end(data, start, end);
+
+ return ret;
+}
+
+static long restrictedmem_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len)
+{
+ struct restrictedmem_data *data = file->f_mapping->private_data;
+ struct file *memfd = data->memfd;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE)
+ return restrictedmem_punch_hole(data, mode, offset, len);
+
+ return memfd->f_op->fallocate(memfd, mode, offset, len);
+}
+
+static const struct file_operations restrictedmem_fops = {
+ .release = restrictedmem_release,
+ .fallocate = restrictedmem_fallocate,
+};
+
+static int restrictedmem_getattr(struct user_namespace *mnt_userns,
+ const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags)
+{
+ struct inode *inode = d_inode(path->dentry);
+ struct restrictedmem_data *data = inode->i_mapping->private_data;
+ struct file *memfd = data->memfd;
+
+ return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+ request_mask, query_flags);
+}
+
+static int restrictedmem_setattr(struct user_namespace *mnt_userns,
+ struct dentry *dentry, struct iattr *attr)
+{
+ struct inode *inode = d_inode(dentry);
+ struct restrictedmem_data *data = inode->i_mapping->private_data;
+ struct file *memfd = data->memfd;
+ int ret;
+
+ if (attr->ia_valid & ATTR_SIZE) {
+ if (memfd->f_inode->i_size)
+ return -EPERM;
+
+ if (!PAGE_ALIGNED(attr->ia_size))
+ return -EINVAL;
+ }
+
+ ret = memfd->f_inode->i_op->setattr(mnt_userns,
+ file_dentry(memfd), attr);
+ return ret;
+}
+
+static const struct inode_operations restrictedmem_iops = {
+ .getattr = restrictedmem_getattr,
+ .setattr = restrictedmem_setattr,
+};
+
+static int restrictedmem_init_fs_context(struct fs_context *fc)
+{
+ if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
+ return -ENOMEM;
+
+ fc->s_iflags |= SB_I_NOEXEC;
+ return 0;
+}
+
+static struct file_system_type restrictedmem_fs = {
+ .owner = THIS_MODULE,
+ .name = "memfd:restrictedmem",
+ .init_fs_context = restrictedmem_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static struct vfsmount *restrictedmem_mnt;
+
+static __init int restrictedmem_init(void)
+{
+ restrictedmem_mnt = kern_mount(&restrictedmem_fs);
+ if (IS_ERR(restrictedmem_mnt))
+ return PTR_ERR(restrictedmem_mnt);
+ return 0;
+}
+fs_initcall(restrictedmem_init);
+
+static struct file *restrictedmem_file_create(struct file *memfd)
+{
+ struct restrictedmem_data *data;
+ struct address_space *mapping;
+ struct inode *inode;
+ struct file *file;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (!data)
+ return ERR_PTR(-ENOMEM);
+
+ data->memfd = memfd;
+ mutex_init(&data->lock);
+ INIT_LIST_HEAD(&data->notifiers);
+
+ inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
+ if (IS_ERR(inode)) {
+ kfree(data);
+ return ERR_CAST(inode);
+ }
+
+ inode->i_mode |= S_IFREG;
+ inode->i_op = &restrictedmem_iops;
+ inode->i_mapping->private_data = data;
+
+ file = alloc_file_pseudo(inode, restrictedmem_mnt,
+ "restrictedmem", O_RDWR,
+ &restrictedmem_fops);
+ if (IS_ERR(file)) {
+ iput(inode);
+ kfree(data);
+ return ERR_CAST(file);
+ }
+
+ file->f_flags |= O_LARGEFILE;
+
+ /*
+ * These pages are currently unmovable so don't place them into movable
+ * pageblocks (e.g. CMA and ZONE_MOVABLE).
+ */
+ mapping = memfd->f_mapping;
+ mapping_set_unevictable(mapping);
+ mapping_set_gfp_mask(mapping,
+ mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+ return file;
+}
+
+SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
+{
+ struct file *file, *restricted_file;
+ int fd, err;
+
+ if (flags)
+ return -EINVAL;
+
+ fd = get_unused_fd_flags(0);
+ if (fd < 0)
+ return fd;
+
+ file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ goto err_fd;
+ }
+ file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+ file->f_flags |= O_LARGEFILE;
+
+ restricted_file = restrictedmem_file_create(file);
+ if (IS_ERR(restricted_file)) {
+ err = PTR_ERR(restricted_file);
+ fput(file);
+ goto err_fd;
+ }
+
+ fd_install(fd, restricted_file);
+ return fd;
+err_fd:
+ put_unused_fd(fd);
+ return err;
+}
+
+void restrictedmem_register_notifier(struct file *file,
+ struct restrictedmem_notifier *notifier)
+{
+ struct restrictedmem_data *data = file->f_mapping->private_data;
+
+ mutex_lock(&data->lock);
+ list_add(&notifier->list, &data->notifiers);
+ mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
+
+void restrictedmem_unregister_notifier(struct file *file,
+ struct restrictedmem_notifier *notifier)
+{
+ struct restrictedmem_data *data = file->f_mapping->private_data;
+
+ mutex_lock(&data->lock);
+ list_del(&notifier->list);
+ mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+ struct page **pagep, int *order)
+{
+ struct restrictedmem_data *data = file->f_mapping->private_data;
+ struct file *memfd = data->memfd;
+ struct folio *folio;
+ struct page *page;
+ int ret;
+
+ ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
+ if (ret)
+ return ret;
+
+ page = folio_file_page(folio, offset);
+ *pagep = page;
+ if (order)
+ *order = thp_order(compound_head(page));
+
+ SetPageUptodate(page);
+ unlock_page(page);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping)
+{
+ struct super_block *sb = restrictedmem_mnt->mnt_sb;
+ struct inode *inode, *next;
+
+ if (!shmem_mapping(mapping))
+ return;
+
+ spin_lock(&sb->s_inode_list_lock);
+ list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+ struct restrictedmem_data *data = inode->i_mapping->private_data;
+ struct file *memfd = data->memfd;
+
+ if (memfd->f_mapping == mapping) {
+ pgoff_t start, end;
+
+ spin_unlock(&sb->s_inode_list_lock);
+
+ start = page->index;
+ end = start + thp_nr_pages(page);
+ restrictedmem_notifier_error(data, start, end);
+ return;
+ }
+ }
+ spin_unlock(&sb->s_inode_list_lock);
+}
--
2.25.1
* [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
2022-12-02 6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-06 13:34 ` Fabiano Rosas
` (6 more replies)
2022-12-02 6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
` (8 subsequent siblings)
10 siblings, 7 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
- KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
a guest memory range.
- KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
memory attributes.
KVM internally uses an xarray to store the per-page memory attributes.
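The attributes are stored as xarray values indexed by gfn. As an
illustration (this helper is not part of this patch, though a later
patch in the series adds something similar), checking whether a gfn is
private could look like:

    static bool kvm_mem_attr_is_private(struct kvm *kvm, gfn_t gfn)
    {
            unsigned long attrs;

            attrs = xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
            return attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
    }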
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
---
Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 3 ++
include/uapi/linux/kvm.h | 17 ++++++++
virt/kvm/Kconfig | 3 ++
virt/kvm/kvm_main.c | 76 ++++++++++++++++++++++++++++++++++
6 files changed, 163 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5617bc4f899f..bb2f709c0900 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
The "pad" and "reserved" fields may be used for future extensions and should be
set to 0s by userspace.
+4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: u64 memory attributes bitmask(out)
+:Returns: 0 on success, <0 on error
+
+Returns supported memory attributes bitmask. Supported memory attributes will
+have the corresponding bits set in u64 memory attributes bitmask.
+
+The following memory attributes are defined::
+
+ #define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
+ #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
+ #define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
+ #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
+
+4.139 KVM_SET_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in/out)
+:Returns: 0 on success, <0 on error
+
+Sets memory attributes for pages in a guest memory range. Parameters are
+specified via the following structure::
+
+ struct kvm_memory_attributes {
+ __u64 address;
+ __u64 size;
+ __u64 attributes;
+ __u64 flags;
+ };
+
+The user sets the per-page memory attributes to a guest memory range indicated
+by address/size, and in return KVM adjusts address and size to reflect the
+actual pages of the memory range have been successfully set to the attributes.
+If the call returns 0, "address" is updated to the last successful address + 1
+and "size" is updated to the remaining address size that has not been set
+successfully. The user should check the return value as well as the size to
+decide if the operation succeeded for the whole range or not. The user may want
+to retry the operation with the returned address/size if the previous range was
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can be
+retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
5. The kvm_run structure
========================
@@ -8270,6 +8323,16 @@ structure.
When getting the Modified Change Topology Report value, the attr->addr
must point to a byte where the value will be stored or retrieved from.
+8.40 KVM_CAP_MEMORY_ATTRIBUTES
+------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
9. Known KVM API problems
=========================
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fbeaa9ddef59..a8e379a3afee 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,7 @@ config KVM
select SRCU
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
+ select HAVE_KVM_MEMORY_ATTRIBUTES
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f874a964313..a784e2b06625 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -800,6 +800,9 @@ struct kvm {
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
struct notifier_block pm_notifier;
+#endif
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+ struct xarray mem_attr_array;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
};
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 64dfe9c07c87..5d0941acb5bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_CPU_TOPOLOGY 222
#define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
#define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
+#define KVM_CAP_MEMORY_ATTRIBUTES 225
#ifdef KVM_CAP_IRQ_ROUTING
@@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
/* flags for kvm_s390_zpci_op->u.reg_aen.flags */
#define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
+/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
+#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES _IOR(KVMIO, 0xd2, __u64)
+#define KVM_SET_MEMORY_ATTRIBUTES _IOWR(KVMIO, 0xd3, struct kvm_memory_attributes)
+
+struct kvm_memory_attributes {
+ __u64 address;
+ __u64 size;
+ __u64 attributes;
+ __u64 flags;
+};
+
+#define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
+#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
+#define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
+#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..effdea5dd4f0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
config HAVE_KVM_DIRTY_RING
bool
+config HAVE_KVM_MEMORY_ATTRIBUTES
+ bool
+
# Only strongly ordered architectures can select this, as it doesn't
# put any explicit constraint on userspace ordering. They can also
# select the _ACQ_REL version.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1782c4555d94..7f0f5e9f2406 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
spin_lock_init(&kvm->mn_invalidate_lock);
rcuwait_init(&kvm->mn_memslots_update_rcuwait);
xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+ xa_init(&kvm->mem_attr_array);
+#endif
INIT_LIST_HEAD(&kvm->gpc_list);
spin_lock_init(&kvm->gpc_lock);
@@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
}
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+ xa_destroy(&kvm->mem_attr_array);
+#endif
cleanup_srcu_struct(&kvm->irq_srcu);
cleanup_srcu_struct(&kvm->srcu);
kvm_arch_free_vm(kvm);
@@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
}
#endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+ return 0;
+}
+
+static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
+ struct kvm_memory_attributes *attrs)
+{
+ gfn_t start, end;
+ unsigned long i;
+ void *entry;
+ u64 supported_attrs = kvm_supported_mem_attributes(kvm);
+
+ /* flags is currently not used. */
+ if (attrs->flags)
+ return -EINVAL;
+ if (attrs->attributes & ~supported_attrs)
+ return -EINVAL;
+ if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
+ return -EINVAL;
+ if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
+ return -EINVAL;
+
+ start = attrs->address >> PAGE_SHIFT;
+ end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
+
+ entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
+
+ mutex_lock(&kvm->lock);
+ for (i = start; i < end; i++)
+ if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+ GFP_KERNEL_ACCOUNT)))
+ break;
+ mutex_unlock(&kvm->lock);
+
+ attrs->address = i << PAGE_SHIFT;
+ attrs->size = (end - i) << PAGE_SHIFT;
+
+ return 0;
+}
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
+
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
{
return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#ifdef CONFIG_HAVE_KVM_MSI
case KVM_CAP_SIGNAL_MSI:
#endif
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+ case KVM_CAP_MEMORY_ATTRIBUTES:
+#endif
#ifdef CONFIG_HAVE_KVM_IRQFD
case KVM_CAP_IRQFD:
case KVM_CAP_IRQFD_RESAMPLE:
@@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
#endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+ case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
+ u64 attrs = kvm_supported_mem_attributes(kvm);
+
+ r = -EFAULT;
+ if (copy_to_user(argp, &attrs, sizeof(attrs)))
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_SET_MEMORY_ATTRIBUTES: {
+ struct kvm_memory_attributes attrs;
+
+ r = -EFAULT;
+ if (copy_from_user(&attrs, argp, sizeof(attrs)))
+ goto out;
+
+ r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
+
+ if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
+ r = -EFAULT;
+ break;
+ }
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
case KVM_CREATE_DEVICE: {
struct kvm_create_device cd;
--
2.25.1
* [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
2022-12-02 6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-05 9:03 ` Fuad Tabba
` (3 more replies)
2022-12-02 6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
` (7 subsequent siblings)
10 siblings, 4 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
In memory encryption usage, guest memory may be encrypted with a special
key and can be accessed only by the guest itself. We call such memory
private memory. It is worthless, and can sometimes cause problems, to
allow userspace to access guest private memory. This new KVM memslot
extension allows guest private memory to be provided through a
restrictedmem-backed file descriptor (fd), with userspace restricted
from accessing the memory backed by the fd.
This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields, restricted_fd/restricted_offset, to allow
userspace to instruct KVM to provide guest memory through restricted_fd.
'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
and the size is 'memory_size'.
The extended memslot can still have the userspace_addr (hva). When used,
a single memslot can maintain both private memory through restricted_fd
and shared memory through userspace_addr. Whether the private or the
shared part is visible to the guest is maintained by other KVM code.
A restrictedmem_notifier field is also added to the memslot structure to
allow the restricted_fd's backing store to notify KVM of memory changes;
KVM can then invalidate its page table entries or handle memory errors.
Together with this change, a new config HAVE_KVM_RESTRICTED_MEM is added;
right now it is selected on X86_64 only.
To make future maintenance easy, internally a binary compatible alias,
struct kvm_user_mem_region, is used to handle both the normal and the
'_ext' variants.
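As an illustrative, non-authoritative sketch of how KVM-internal code
can consume the new fields (a later patch in this series does something
similar in the fault path), translating a gfn in a private slot to a
page of the restricted file might look like:

    static int private_slot_gfn_to_page(struct kvm_memory_slot *slot,
                                        gfn_t gfn, struct page **page,
                                        int *order)
    {
            pgoff_t index = gfn - slot->base_gfn +
                            (slot->restricted_offset >> PAGE_SHIFT);

            return restrictedmem_get_page(slot->restricted_file, index,
                                          page, order);
    }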
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
---
Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
arch/x86/kvm/Kconfig | 2 ++
arch/x86/kvm/x86.c | 2 +-
include/linux/kvm_host.h | 8 ++++--
include/uapi/linux/kvm.h | 28 +++++++++++++++++++
virt/kvm/Kconfig | 3 +++
virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
7 files changed, 114 insertions(+), 18 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bb2f709c0900..99352170c130 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
:Capability: KVM_CAP_USER_MEMORY
:Architectures: all
:Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
:Returns: 0 on success, -1 on error
::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
__u64 userspace_addr; /* start of the userspace allocated memory */
};
+ struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 restricted_offset;
+ __u32 restricted_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+ };
+
/* for kvm_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+ #define KVM_MEM_PRIVATE (1UL << 2)
This ioctl allows the user to create, modify or delete a guest physical
memory slot. Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
be identical. This allows large pages in the guest to be backed by large
pages in the host.
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
-writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
-use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only. In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext struct includes all fields of
+kvm_userspace_memory_region struct, while also adds additional fields for some
+other features. See below description of flags field for more information.
+It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
+ within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
+
+- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
+ read-only. In this case, writes to this memory will be posted to userspace as
+ KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+ KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
+ memory backed by a file descriptor(fd) and userspace access to the fd may be
+ restricted. Userspace should use restricted_fd/restricted_offset in the
+ kvm_userspace_memory_region_ext to instruct KVM to provide private memory
+ to guest. Userspace should guarantee not to map the same host physical address
+ indicated by restricted_fd/restricted_offset to different guest physical
+ addresses within multiple memslots. Failed to do this may result undefined
+ behavior.
When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
the memory region are automatically reflected into the guest. For example, an
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a8e379a3afee..690cb21010e7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,8 @@ config KVM
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
select HAVE_KVM_MEMORY_ATTRIBUTES
+ select HAVE_KVM_RESTRICTED_MEM if X86_64
+ select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7f850dfb4086..9a07380f8d3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
}
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- struct kvm_userspace_memory_region m;
+ struct kvm_user_mem_region m;
m.slot = id | (i << 16);
m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a784e2b06625..02347e386ea2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>
+#include <linux/restrictedmem.h>
#ifndef KVM_MAX_VCPU_IDS
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -585,6 +586,9 @@ struct kvm_memory_slot {
u32 flags;
short id;
u16 as_id;
+ struct file *restricted_file;
+ loff_t restricted_offset;
+ struct restrictedmem_notifier notifier;
};
static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
@@ -1123,9 +1127,9 @@ enum kvm_mr_change {
};
int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_user_mem_region *mem);
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_user_mem_region *mem);
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5d0941acb5bb..13bff963b8b0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
};
+struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 restricted_offset;
+ __u32 restricted_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+};
+
+#ifdef __KERNEL__
+/*
+ * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
+ * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
+ * all fields from the top-level "extended" region.
+ */
+struct kvm_user_mem_region {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size;
+ __u64 userspace_addr;
+ __u64 restricted_offset;
+ __u32 restricted_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+};
+#endif
+
/*
* The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
* other bits are reserved for kvm internal use which are defined in
@@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_PRIVATE (1UL << 2)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index effdea5dd4f0..d605545d6dd1 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
config HAVE_KVM_PM_NOTIFIER
bool
+
+config HAVE_KVM_RESTRICTED_MEM
+ bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7f0f5e9f2406..b882eb2c76a2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
* Must be called holding kvm->slots_lock for write.
*/
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_user_mem_region *mem)
{
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
@@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_user_mem_region *mem)
{
int r;
@@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(kvm_set_memory_region);
static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
- struct kvm_userspace_memory_region *mem)
+ struct kvm_user_mem_region *mem)
{
if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
return -EINVAL;
@@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
return fd;
}
+#define SANITY_CHECK_MEM_REGION_FIELD(field) \
+do { \
+ BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
+ offsetof(struct kvm_userspace_memory_region, field)); \
+ BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
+ sizeof_field(struct kvm_userspace_memory_region, field)); \
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
+do { \
+ BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
+ offsetof(struct kvm_userspace_memory_region_ext, field)); \
+ BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
+ sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+ SANITY_CHECK_MEM_REGION_FIELD(slot);
+ SANITY_CHECK_MEM_REGION_FIELD(flags);
+ SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+ SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+ SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+ SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
+ SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
+}
+
static long kvm_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
case KVM_SET_USER_MEMORY_REGION: {
- struct kvm_userspace_memory_region kvm_userspace_mem;
+ struct kvm_user_mem_region mem;
+ unsigned long size = sizeof(struct kvm_userspace_memory_region);
+
+ kvm_sanity_check_user_mem_region_alias();
r = -EFAULT;
- if (copy_from_user(&kvm_userspace_mem, argp,
- sizeof(kvm_userspace_mem)))
+ if (copy_from_user(&mem, argp, size))
+ goto out;
+
+ r = -EINVAL;
+ if (mem.flags & KVM_MEM_PRIVATE)
goto out;
- r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+ r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
break;
}
case KVM_GET_DIRTY_LOG: {
--
2.25.1
* [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
` (2 preceding siblings ...)
2022-12-02 6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-06 15:47 ` Fuad Tabba
2023-01-13 23:13 ` Sean Christopherson
2022-12-02 6:13 ` [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
` (6 subsequent siblings)
10 siblings, 2 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error occurred in KVM at the guest memory range
[gpa, gpa+size). The 'flags' field carries additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory': '1' indicates the error was caused by a private memory access
and '0' indicates it was caused by a shared memory access.
When private memory is enabled, this new exit will be used by KVM to
exit to userspace for shared <-> private memory conversion in memory
encryption usage. In such usage there are typically two kinds of memory
conversions (a userspace handling sketch follows the list):
- explicit conversion: happens when the guest explicitly calls into KVM
  to map a range (as private or shared); KVM then exits to userspace
  to perform the map/unmap operations.
- implicit conversion: happens in the KVM page fault handler, where KVM
  exits to userspace when the page is in a different state (private or
  shared) than the one requested.
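Not part of the patch itself, but for illustration, a minimal userspace
sketch of handling this exit for an implicit conversion could look like
the code below. It assumes the KVM_SET_MEMORY_ATTRIBUTES ioctl and
struct kvm_memory_attributes introduced earlier in this series are
present in the installed <linux/kvm.h>, and that vm_fd/run come from the
usual KVM_CREATE_VM/KVM_RUN setup:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Flip the host's view of [gpa, gpa+size) to match what the guest asked for. */
  static int handle_memory_fault(int vm_fd, struct kvm_run *run)
  {
          struct kvm_memory_attributes attrs = {
                  .address = run->memory.gpa,
                  .size    = run->memory.size,
                  .flags   = 0,
          };

          if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
                  attrs.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE; /* convert to private */
          else
                  attrs.attributes = 0;                            /* convert to shared */

          return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
  }

After the ioctl returns, userspace simply re-enters KVM_RUN and the
faulting guest access is retried.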
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
---
Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
include/uapi/linux/kvm.h | 8 ++++++++
2 files changed, 30 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 99352170c130..d9edb14ce30b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
values of SBI call before resuming the VCPU. For more details on RISC-V SBI
spec refer, https://github.com/riscv/riscv-sbi-doc.
+::
+
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+ #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
+ __u64 flags;
+ __u64 gpa;
+ __u64 size;
+ } memory;
+
+If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
+encountered a memory error which is not handled by KVM kernel module and
+userspace may choose to handle it. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+ private memory access when the bit is set. Otherwise the memory error is
+ caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range the error occurs at. The userspace
+may handle the error and return to KVM to retry the previous memory access.
+
::
/* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13bff963b8b0..c7e9d375a902 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_RISCV_SBI 35
#define KVM_EXIT_RISCV_CSR 36
#define KVM_EXIT_NOTIFY 37
+#define KVM_EXIT_MEMORY_FAULT 38
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -541,6 +542,13 @@ struct kvm_run {
#define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
__u32 flags;
} notify;
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
+ __u64 flags;
+ __u64 gpa;
+ __u64 size;
+ } memory;
/* Fix the size of the union. */
char padding[256];
};
--
2.25.1
* [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
` (3 preceding siblings ...)
2022-12-02 6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-05 9:23 ` Fuad Tabba
2022-12-02 6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
` (5 subsequent siblings)
10 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
Currently, in the mmu_notifier invalidate path, the hva range is recorded
and then checked against by mmu_notifier_retry_hva() in the page fault
handling path. However, for the soon-to-be-introduced private memory, a
page fault may not have an hva associated with it, so checking the
gfn (gpa) makes more sense. For existing hva-based shared memory, gfn is
expected to work as well. The only downside is that when multiple gfns
alias a single hva, the current algorithm of checking multiple ranges
could result in a much larger range being rejected; such aliasing should
be uncommon, so the impact is expected to be small.
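For illustration only (not actual KVM code), the retry pattern the
renamed helper serves looks roughly like the x86-flavoured sketch below;
example_fault_path() is a made-up name, and the real usage lives in
direct_page_fault()/is_page_fault_stale():

  /* Sample the invalidate sequence outside mmu_lock, re-check by gfn inside it. */
  static int example_fault_path(struct kvm *kvm, gfn_t gfn)
  {
          unsigned long mmu_seq = kvm->mmu_invalidate_seq;

          smp_rmb();

          /* ... resolve the pfn without holding mmu_lock ... */

          write_lock(&kvm->mmu_lock);
          if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
                  write_unlock(&kvm->mmu_lock);
                  return RET_PF_RETRY;    /* an invalidation raced with this fault */
          }
          /* ... install the mapping ... */
          write_unlock(&kvm->mmu_lock);
          return RET_PF_FIXED;
  }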
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
arch/x86/kvm/mmu/mmu.c | 8 +++++---
include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
3 files changed, 49 insertions(+), 24 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..e2c70b5afa3e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
return fault->slot &&
- mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+ mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
}
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
write_lock(&kvm->mmu_lock);
- kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+ kvm_mmu_invalidate_begin(kvm);
+
+ kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
@@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
gfn_end - gfn_start);
- kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+ kvm_mmu_invalidate_end(kvm);
write_unlock(&kvm->mmu_lock);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02347e386ea2..3d69484d2704 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
- unsigned long mmu_invalidate_range_start;
- unsigned long mmu_invalidate_range_end;
+ gfn_t mmu_invalidate_range_start;
+ gfn_t mmu_invalidate_range_end;
#endif
struct list_head devices;
u64 manual_dirty_log_protect;
@@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
#endif
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
- unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
return 0;
}
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
unsigned long mmu_seq,
- unsigned long hva)
+ gfn_t gfn)
{
lockdep_assert_held(&kvm->mmu_lock);
/*
@@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
* that might be being invalidated. Note that it may include some false
* positives, due to shortcuts when handing concurrent invalidations.
*/
- if (unlikely(kvm->mmu_invalidate_in_progress) &&
- hva >= kvm->mmu_invalidate_range_start &&
- hva < kvm->mmu_invalidate_range_end)
- return 1;
+ if (unlikely(kvm->mmu_invalidate_in_progress)) {
+ /*
+ * Dropping mmu_lock after bumping mmu_invalidate_in_progress
+ * but before updating the range is a KVM bug.
+ */
+ if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
+ kvm->mmu_invalidate_range_end == INVALID_GPA))
+ return 1;
+
+ if (gfn >= kvm->mmu_invalidate_range_start &&
+ gfn < kvm->mmu_invalidate_range_end)
+ return 1;
+ }
+
if (kvm->mmu_invalidate_seq != mmu_seq)
return 1;
return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b882eb2c76a2..ad55dfbc75d7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
- unsigned long end);
-
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
typedef void (*on_unlock_fn_t)(struct kvm *kvm);
struct kvm_hva_range {
@@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
locked = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm, range->start, range->end);
+ range->on_lock(kvm);
+
if (IS_KVM_NULL_FN(range->handler))
break;
}
@@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
}
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
{
/*
* The count increase must become visible at unlock time as no
@@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
* count is also read inside the mmu_lock critical section.
*/
kvm->mmu_invalidate_in_progress++;
+
+ if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = INVALID_GPA;
+ kvm->mmu_invalidate_range_end = INVALID_GPA;
+ }
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
if (likely(kvm->mmu_invalidate_in_progress == 1)) {
kvm->mmu_invalidate_range_start = start;
kvm->mmu_invalidate_range_end = end;
@@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
}
}
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+ return kvm_unmap_gfn_range(kvm, range);
+}
+
static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
@@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
.start = range->start,
.end = range->end,
.pte = __pte(0),
- .handler = kvm_unmap_gfn_range,
+ .handler = kvm_mmu_unmap_gfn_range,
.on_lock = kvm_mmu_invalidate_begin,
.on_unlock = kvm_arch_guest_memory_reclaimed,
.flush_on_ret = true,
@@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
return 0;
}
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
{
/*
* This sequence increase will notify the kvm page fault that
--
2.25.1
* [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
` (4 preceding siblings ...)
2022-12-02 6:13 ` [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-07 8:13 ` Yuan Yao
` (3 more replies)
2022-12-02 6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
` (4 subsequent siblings)
10 siblings, 4 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
Unmap the existing guest mappings when the memory attribute is changed
between shared and private. This is needed because shared pages and
private pages come from different backends; unmapping the existing ones
gives the page fault handler a chance to re-populate the mappings
according to the new attribute.
Only architectures that support private memory need this, and a
supporting architecture is expected to override the weak
kvm_arch_has_private_mem().
Also, during the window where the memory attribute is being changed and
the mappings are being unmapped, a page fault can occur in the same
memory range and lead to an incorrect page state, so invoke the
kvm_mmu_invalidate_* helpers to make the page fault handler retry during
this time frame.
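For illustration only, the ordering this patch establishes in
kvm_vm_ioctl_set_mem_attributes() can be condensed into the sketch below
(attr_change_outline() is a made-up name; SRCU, the xarray locking and
error handling are omitted, see the real code in the diff):

  static void attr_change_outline(struct kvm *kvm, gfn_t start, gfn_t end)
  {
          if (kvm_arch_has_private_mem(kvm)) {
                  KVM_MMU_LOCK(kvm);
                  kvm_mmu_invalidate_begin(kvm);          /* 1. make faults in the range retry */
                  kvm_mmu_invalidate_range_add(kvm, start, end);
                  KVM_MMU_UNLOCK(kvm);
          }

          /* 2. update kvm->mem_attr_array for [start, end) under kvm->lock */

          if (kvm_arch_has_private_mem(kvm)) {
                  KVM_MMU_LOCK(kvm);
                  kvm_unmap_mem_range(kvm, start, end);   /* 3. zap the stale mappings */
                  kvm_mmu_invalidate_end(kvm);            /* 4. let faults proceed again */
                  KVM_MMU_UNLOCK(kvm);
          }
  }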
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
include/linux/kvm_host.h | 7 +-
virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
2 files changed, 116 insertions(+), 59 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d69484d2704..3331c0c92838 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -264,6 +263,8 @@ struct kvm_gfn_range {
bool may_block;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -785,11 +786,12 @@ struct kvm {
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
struct mmu_notifier mmu_notifier;
+#endif
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
gfn_t mmu_invalidate_range_start;
gfn_t mmu_invalidate_range_end;
-#endif
+
struct list_head devices;
u64 manual_dirty_log_protect;
struct dentry *debugfs_dentry;
@@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);
#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ad55dfbc75d7..4e1e1e113bf0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
}
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
+{
+ /*
+ * The count increase must become visible at unlock time as no
+ * spte can be established without taking the mmu_lock and
+ * count is also read inside the mmu_lock critical section.
+ */
+ kvm->mmu_invalidate_in_progress++;
+
+ if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = INVALID_GPA;
+ kvm->mmu_invalidate_range_end = INVALID_GPA;
+ }
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
+ if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+ kvm->mmu_invalidate_range_start = start;
+ kvm->mmu_invalidate_range_end = end;
+ } else {
+ /*
+ * Fully tracking multiple concurrent ranges has diminishing
+ * returns. Keep things simple and just find the minimal range
+ * which includes the current and new ranges. As there won't be
+ * enough information to subtract a range after its invalidate
+ * completes, any ranges invalidated concurrently will
+ * accumulate and persist until all outstanding invalidates
+ * complete.
+ */
+ kvm->mmu_invalidate_range_start =
+ min(kvm->mmu_invalidate_range_start, start);
+ kvm->mmu_invalidate_range_end =
+ max(kvm->mmu_invalidate_range_end, end);
+ }
+}
+
+void kvm_mmu_invalidate_end(struct kvm *kvm)
+{
+ /*
+ * This sequence increase will notify the kvm page fault that
+ * the page that is going to be mapped in the spte could have
+ * been freed.
+ */
+ kvm->mmu_invalidate_seq++;
+ smp_wmb();
+ /*
+ * The above sequence increase must be visible before the
+ * below count decrease, which is ensured by the smp_wmb above
+ * in conjunction with the smp_rmb in mmu_invalidate_retry().
+ */
+ kvm->mmu_invalidate_in_progress--;
+}
+
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
{
@@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
}
-void kvm_mmu_invalidate_begin(struct kvm *kvm)
-{
- /*
- * The count increase must become visible at unlock time as no
- * spte can be established without taking the mmu_lock and
- * count is also read inside the mmu_lock critical section.
- */
- kvm->mmu_invalidate_in_progress++;
-
- if (likely(kvm->mmu_invalidate_in_progress == 1)) {
- kvm->mmu_invalidate_range_start = INVALID_GPA;
- kvm->mmu_invalidate_range_end = INVALID_GPA;
- }
-}
-
-void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
-{
- WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
-
- if (likely(kvm->mmu_invalidate_in_progress == 1)) {
- kvm->mmu_invalidate_range_start = start;
- kvm->mmu_invalidate_range_end = end;
- } else {
- /*
- * Fully tracking multiple concurrent ranges has diminishing
- * returns. Keep things simple and just find the minimal range
- * which includes the current and new ranges. As there won't be
- * enough information to subtract a range after its invalidate
- * completes, any ranges invalidated concurrently will
- * accumulate and persist until all outstanding invalidates
- * complete.
- */
- kvm->mmu_invalidate_range_start =
- min(kvm->mmu_invalidate_range_start, start);
- kvm->mmu_invalidate_range_end =
- max(kvm->mmu_invalidate_range_end, end);
- }
-}
-
static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
@@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
return 0;
}
-void kvm_mmu_invalidate_end(struct kvm *kvm)
-{
- /*
- * This sequence increase will notify the kvm page fault that
- * the page that is going to be mapped in the spte could have
- * been freed.
- */
- kvm->mmu_invalidate_seq++;
- smp_wmb();
- /*
- * The above sequence increase must be visible before the
- * below count decrease, which is ensured by the smp_wmb above
- * in conjunction with the smp_rmb in mmu_invalidate_retry().
- */
- kvm->mmu_invalidate_in_progress--;
-}
-
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
@@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
return 0;
}
+bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
+{
+ return false;
+}
+
static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
{
struct kvm *kvm = kvm_arch_alloc_vm();
@@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
return 0;
}
+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_gfn_range gfn_range;
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ int i;
+ int r = 0;
+
+ gfn_range.pte = __pte(0);
+ gfn_range.may_block = true;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+ slot = iter.slot;
+ gfn_range.start = max(start, slot->base_gfn);
+ gfn_range.end = min(end, slot->base_gfn + slot->npages);
+ if (gfn_range.start >= gfn_range.end)
+ continue;
+ gfn_range.slot = slot;
+
+ r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+ }
+ }
+
+ if (r)
+ kvm_flush_remote_tlbs(kvm);
+}
+
static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
struct kvm_memory_attributes *attrs)
{
gfn_t start, end;
unsigned long i;
void *entry;
+ int idx;
u64 supported_attrs = kvm_supported_mem_attributes(kvm);
- /* flags is currently not used. */
+ /* 'flags' is currently not used. */
if (attrs->flags)
return -EINVAL;
if (attrs->attributes & ~supported_attrs)
@@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
+ if (kvm_arch_has_private_mem(kvm)) {
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_begin(kvm);
+ kvm_mmu_invalidate_range_add(kvm, start, end);
+ KVM_MMU_UNLOCK(kvm);
+ }
+
mutex_lock(&kvm->lock);
for (i = start; i < end; i++)
if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
@@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
break;
mutex_unlock(&kvm->lock);
+ if (kvm_arch_has_private_mem(kvm)) {
+ idx = srcu_read_lock(&kvm->srcu);
+ KVM_MMU_LOCK(kvm);
+ if (i > start)
+ kvm_unmap_mem_range(kvm, start, i);
+ kvm_mmu_invalidate_end(kvm);
+ KVM_MMU_UNLOCK(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
+ }
+
attrs->address = i << PAGE_SHIFT;
attrs->size = (end - i) << PAGE_SHIFT;
--
2.25.1
* [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
` (5 preceding siblings ...)
2022-12-02 6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-05 22:49 ` Isaku Yamahata
` (2 more replies)
2022-12-02 6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
` (3 subsequent siblings)
10 siblings, 3 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
A large page with mixed private/shared subpages can't be mapped as a
large page, since its private and shared subpages come from different
memory backends and may also be treated differently by the architecture.
When private/shared memory is mixed within a large page, the current
lpage_info is not sufficient to decide whether the page can be mapped as
a large page, and additional private/shared mixed information is needed.
Tracking this 'mixed' information with the current count-style
'disallow_lpage' is a bit of a challenge, so reserve a bit in
'disallow_lpage' to indicate that a large page has mixed private/shared
subpages, and update this 'mixed' bit whenever the memory attribute is
changed between private and shared.
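For illustration, the resulting encoding of 'disallow_lpage' can be
sketched as follows; the lpage_* helper names here are made up for the
example and differ from the linfo_* helpers in the actual patch:

  /* Bit 31 carries the mixed flag, the low 31 bits remain the disallow count. */
  #define KVM_LPAGE_PRIVATE_SHARED_MIXED  (1U << 31)
  #define KVM_LPAGE_COUNT_MAX             ((1U << 31) - 1)

  static inline bool lpage_is_mixed(int disallow_lpage)
  {
          return disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
  }

  static inline int lpage_disallow_count(int disallow_lpage)
  {
          return disallow_lpage & KVM_LPAGE_COUNT_MAX;
  }

  /* A gfn can be mapped as a large page only when the whole field is zero. */
  static inline bool lpage_allowed(int disallow_lpage)
  {
          return !lpage_is_mixed(disallow_lpage) && !lpage_disallow_count(disallow_lpage);
  }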
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
arch/x86/include/asm/kvm_host.h | 8 ++
arch/x86/kvm/mmu/mmu.c | 134 +++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 2 +
include/linux/kvm_host.h | 19 +++++
virt/kvm/kvm_main.c | 9 ++-
5 files changed, 169 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 283cbb83d6ae..7772ab37ac89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
#include <asm/hyperv-tlfs.h>
#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
#define KVM_MAX_VCPUS 1024
@@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
#endif
};
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits are used as a reference count.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
+#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
+
struct kvm_lpage_info {
int disallow_lpage;
};
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2c70b5afa3e..2190fd8c95c0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
{
struct kvm_lpage_info *linfo;
int i;
+ int disallow_count;
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+ disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+ WARN_ON(disallow_count + count < 0 ||
+ disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
linfo->disallow_lpage += count;
- WARN_ON(linfo->disallow_lpage < 0);
}
}
@@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_huge_page_recovery_thread)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
}
+
+static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
+{
+ return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
+ int level, bool mixed)
+{
+ struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
+
+ if (mixed)
+ linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+ else
+ linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
+{
+ bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+ if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
+ if (!expect_private)
+ return false;
+ } else if (expect_private)
+ return false;
+
+ return true;
+}
+
+static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
+ gfn_t start, gfn_t end)
+{
+ XA_STATE(xas, &kvm->mem_attr_array, start);
+ gfn_t gfn = start;
+ void *entry;
+ bool mixed = false;
+
+ rcu_read_lock();
+ entry = xas_load(&xas);
+ while (gfn < end) {
+ if (xas_retry(&xas, entry))
+ continue;
+
+ KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+ if (!is_expected_attr_entry(entry, attrs)) {
+ mixed = true;
+ break;
+ }
+
+ entry = xas_next(&xas);
+ gfn++;
+ }
+
+ rcu_read_unlock();
+ return mixed;
+}
+
+static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
+ int level, unsigned long attrs,
+ gfn_t start, gfn_t end)
+{
+ unsigned long gfn;
+
+ if (level == PG_LEVEL_2M)
+ return mem_attrs_mixed_2m(kvm, attrs, start, end);
+
+ for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
+ if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
+ !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
+ attrs))
+ return true;
+ return false;
+}
+
+static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long attrs,
+ gfn_t start, gfn_t end)
+{
+ unsigned long pages, mask;
+ gfn_t gfn, gfn_end, first, last;
+ int level;
+ bool mixed;
+
+ /*
+ * The sequence matters here: we set the higher level basing on the
+ * lower level's scanning result.
+ */
+ for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ pages = KVM_PAGES_PER_HPAGE(level);
+ mask = ~(pages - 1);
+ first = start & mask;
+ last = (end - 1) & mask;
+
+ /*
+ * We only need to scan the head and tail page, for middle pages
+ * we know they will not be mixed.
+ */
+ gfn = max(first, slot->base_gfn);
+ gfn_end = min(first + pages, slot->base_gfn + slot->npages);
+ mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
+ linfo_set_mixed(gfn, slot, level, mixed);
+
+ if (first == last)
+ return;
+
+ for (gfn = first + pages; gfn < last; gfn += pages)
+ linfo_set_mixed(gfn, slot, level, false);
+
+ gfn = last;
+ gfn_end = min(last + pages, slot->base_gfn + slot->npages);
+ mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
+ linfo_set_mixed(gfn, slot, level, mixed);
+ }
+}
+
+void kvm_arch_set_memory_attributes(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long attrs,
+ gfn_t start, gfn_t end)
+{
+ if (kvm_slot_can_be_private(slot))
+ kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
+ start, end);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a07380f8d3c..5aefcff614d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
linfo[lpages - 1].disallow_lpage = 1;
ugfn = slot->userspace_addr >> PAGE_SHIFT;
+ if (kvm_slot_can_be_private(slot))
+ ugfn |= slot->restricted_offset >> PAGE_SHIFT;
/*
* If the gfn and userspace address are not aligned wrt each
* other, disable large page support for this slot.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3331c0c92838..25099c94e770 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -592,6 +592,11 @@ struct kvm_memory_slot {
struct restrictedmem_notifier notifier;
};
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+ return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
+void kvm_arch_set_memory_attributes(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long attrs,
+ gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long attrs,
+ gfn_t start, gfn_t end)
+{
+}
+#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4e1e1e113bf0..e107afea32f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
return 0;
}
-static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
+static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
+ unsigned long attrs)
{
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
@@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
gfn_range.slot = slot;
r |= kvm_unmap_gfn_range(kvm, &gfn_range);
+
+ kvm_arch_set_memory_attributes(kvm, slot, attrs,
+ gfn_range.start,
+ gfn_range.end);
}
}
@@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
idx = srcu_read_lock(&kvm->srcu);
KVM_MMU_LOCK(kvm);
if (i > start)
- kvm_unmap_mem_range(kvm, start, i);
+ kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
kvm_mmu_invalidate_end(kvm);
KVM_MMU_UNLOCK(kvm);
srcu_read_unlock(&kvm->srcu, idx);
--
2.25.1
* [PATCH v10 8/9] KVM: Handle page fault for private memory
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
` (6 preceding siblings ...)
2022-12-02 6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-08 2:29 ` Yuan Yao
` (2 more replies)
2022-12-02 6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
` (2 subsequent siblings)
10 siblings, 3 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
hva-based shared memory. Architecture code (like TDX code) can tell
whether the ongoing fault is private or not. This patch adds an
'is_private' field to kvm_page_fault to indicate this, and architecture
code is expected to set it.
To handle a page fault on such a memslot, the handling logic differs
depending on whether the fault is private or shared. KVM checks whether
'is_private' matches the host's view of the page (maintained in
mem_attr_array); a condensed sketch follows the list below.
- For a successful match, the private pfn is obtained with
  restrictedmem_get_page() and the shared pfn with the existing
  get_user_pages().
- For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
  userspace. Userspace can then convert the memory between private/shared
  in the host's view and retry the fault.
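For illustration only, the decision added to the fault path can be
condensed into the sketch below; faultin_decision() is a made-up name,
and the real code is in kvm_faultin_pfn()/kvm_faultin_pfn_private() in
the diff:

  static int faultin_decision(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  {
          /* The host's view (mem_attr_array) must agree with the fault type. */
          if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
                  return kvm_do_memory_fault_exit(vcpu, fault);   /* KVM_EXIT_MEMORY_FAULT */

          if (fault->is_private)
                  return kvm_faultin_pfn_private(vcpu, fault);    /* restrictedmem_get_page() */

          /* Shared fault: fall back to the existing hva/get_user_pages() path. */
          return RET_PF_CONTINUE;
  }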
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
include/linux/kvm_host.h | 30 ++++++++++++++++
5 files changed, 105 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2190fd8c95c0..b1953ebc012e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+ int max_level, bool is_private)
{
struct kvm_lpage_info *linfo;
int host_level;
@@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}
+ if (is_private)
+ return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
@@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* level, which will be used to do precise, accurate accounting.
*/
fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
- fault->gfn, fault->max_level);
+ fault->gfn, fault->max_level,
+ fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;
@@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
}
+static inline u8 order_to_level(int order)
+{
+ BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+ return PG_LEVEL_1G;
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+ if (fault->is_private)
+ vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+ else
+ vcpu->run->memory.flags = 0;
+ vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+ vcpu->run->memory.size = PAGE_SIZE;
+ return RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ int order;
+ struct kvm_memory_slot *slot = fault->slot;
+
+ if (!kvm_slot_can_be_private(slot))
+ return kvm_do_memory_fault_exit(vcpu, fault);
+
+ if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+ return RET_PF_RETRY;
+
+ fault->max_level = min(order_to_level(order), fault->max_level);
+ fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+ return RET_PF_CONTINUE;
+}
+
static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
@@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return RET_PF_EMULATE;
}
+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
+ return kvm_do_memory_fault_exit(vcpu, fault);
+
+ if (fault->is_private)
+ return kvm_faultin_pfn_private(vcpu, fault);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
fault->write, &fault->map_writable,
@@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
return -EIO;
}
+ if (r == RET_PF_USER)
+ return 0;
+
if (r < 0)
return r;
if (r != RET_PF_EMULATE)
@@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
*/
if (sp->role.direct &&
sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
- PG_LEVEL_NUM)) {
+ PG_LEVEL_NUM,
+ false)) {
kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
if (kvm_available_flush_tlb_with_range())
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index dbaf6755c5a7..5ccf08183b00 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -189,6 +189,7 @@ struct kvm_page_fault {
/* Derived from mmu and global state. */
const bool is_tdp;
+ const bool is_private;
const bool nx_huge_page_workaround_enabled;
/*
@@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
* RET_PF_RETRY: let CPU fault again on the address.
* RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
* RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_USER: need to exit to userspace to handle this fault.
* RET_PF_FIXED: The faulting entry has been fixed.
* RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
*
@@ -253,6 +255,7 @@ enum {
RET_PF_RETRY,
RET_PF_EMULATE,
RET_PF_INVALID,
+ RET_PF_USER,
RET_PF_FIXED,
RET_PF_SPURIOUS,
};
@@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level);
+ int max_level, bool is_private);
void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
@@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+ WARN_ON_ONCE(1);
+ return -EOPNOTSUPP;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..2d7555381955 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
TRACE_DEFINE_ENUM(RET_PF_RETRY);
TRACE_DEFINE_ENUM(RET_PF_EMULATE);
TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_USER);
TRACE_DEFINE_ENUM(RET_PF_FIXED);
TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 771210ce5181..8ba1a4afc546 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
continue;
max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
- iter.gfn, PG_LEVEL_NUM);
+ iter.gfn, PG_LEVEL_NUM, false);
if (max_mapping_level < iter.level)
continue;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 25099c94e770..153842bb33df 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
}
#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+ return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
+ KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+#else
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+ return false;
+}
+
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
+
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *order)
+{
+ int ret;
+ struct page *page;
+ pgoff_t index = gfn - slot->base_gfn +
+ (slot->restricted_offset >> PAGE_SHIFT);
+
+ ret = restrictedmem_get_page(slot->restricted_file, index,
+ &page, order);
+ *pfn = page_to_pfn(page);
+ return ret;
+}
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
#endif
--
2.25.1
* [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
` (7 preceding siblings ...)
2022-12-02 6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
@ 2022-12-02 6:13 ` Chao Peng
2022-12-09 9:11 ` Fuad Tabba
` (3 more replies)
2023-01-14 0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
2023-02-16 5:13 ` Mike Rapoport
10 siblings, 4 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-02 6:13 UTC (permalink / raw)
To: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
Register/unregister the private memslot with the fd-based memory backing
store restrictedmem and implement the callbacks for restrictedmem_notifier:
- invalidate_start()/invalidate_end() to zap the existing memory
  mappings in the KVM page table.
- error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
  with KVM_EXIT_SHUTDOWN.
Expose KVM_MEM_PRIVATE for memslots and KVM_MEMORY_ATTRIBUTE_PRIVATE for
KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace, but both are
controlled by kvm_arch_has_private_mem(), which should be overridden by
architecture code.
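Not part of the patch, but for illustration, a hedged userspace sketch of
registering such a private slot follows. It assumes the memfd_restricted()
syscall from patch 1 and the extended struct from patch 3 are present in
the installed headers; __NR_memfd_restricted, the slot id and the offsets
are placeholder assumptions:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int add_private_slot(int vm_fd, __u64 gpa, __u64 size, void *shared_hva)
  {
          int restricted_fd = syscall(__NR_memfd_restricted, 0);
          struct kvm_userspace_memory_region_ext ext = {
                  .region = {
                          .slot            = 0,              /* placeholder slot id */
                          .flags           = KVM_MEM_PRIVATE,
                          .guest_phys_addr = gpa,
                          .memory_size     = size,
                          /* the shared part of the slot is still hva-based */
                          .userspace_addr  = (__u64)(unsigned long)shared_hva,
                  },
                  .restricted_fd     = restricted_fd,
                  .restricted_offset = 0,
          };

          if (restricted_fd < 0 || ftruncate(restricted_fd, size))
                  return -1;

          /* KVM fget()s the fd and registers the restrictedmem notifier on it. */
          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
  }

Deleting the slot later (memory_size = 0) unregisters the notifier and
drops the file reference via kvm_free_memslot().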
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 13 +++
include/linux/kvm_host.h | 3 +
virt/kvm/kvm_main.c | 179 +++++++++++++++++++++++++++++++-
4 files changed, 191 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7772ab37ac89..27ef31133352 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -114,6 +114,7 @@
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_MEMORY_MCE KVM_ARCH_REQ(33)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5aefcff614d2..c67e22f3e2ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6587,6 +6587,13 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
}
#endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+void kvm_arch_memory_mce(struct kvm *kvm)
+{
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MEMORY_MCE);
+}
+#endif
+
static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
{
struct kvm_clock_data data = { 0 };
@@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+ if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
+ vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+ r = 0;
+ goto out;
+ }
}
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 153842bb33df..f032d878e034 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,7 @@ struct kvm_memory_slot {
struct file *restricted_file;
loff_t restricted_offset;
struct restrictedmem_notifier notifier;
+ struct kvm *kvm;
};
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
@@ -2363,6 +2364,8 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
*pfn = page_to_pfn(page);
return ret;
}
+
+void kvm_arch_memory_mce(struct kvm *kvm);
#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e107afea32f0..ac835fc77273 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -936,6 +936,121 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
+ pgoff_t start, pgoff_t end,
+ gfn_t *gfn_start, gfn_t *gfn_end)
+{
+ unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+
+ if (start > base_pgoff)
+ *gfn_start = slot->base_gfn + start - base_pgoff;
+ else
+ *gfn_start = slot->base_gfn;
+
+ if (end < base_pgoff + slot->npages)
+ *gfn_end = slot->base_gfn + end - base_pgoff;
+ else
+ *gfn_end = slot->base_gfn + slot->npages;
+
+ if (*gfn_start >= *gfn_end)
+ return false;
+
+ return true;
+}
+
+static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *notifier,
+ pgoff_t start, pgoff_t end)
+{
+ struct kvm_memory_slot *slot = container_of(notifier,
+ struct kvm_memory_slot,
+ notifier);
+ struct kvm *kvm = slot->kvm;
+ gfn_t gfn_start, gfn_end;
+ struct kvm_gfn_range gfn_range;
+ int idx;
+
+ if (!restrictedmem_range_is_valid(slot, start, end,
+ &gfn_start, &gfn_end))
+ return;
+
+ gfn_range.start = gfn_start;
+ gfn_range.end = gfn_end;
+ gfn_range.slot = slot;
+ gfn_range.pte = __pte(0);
+ gfn_range.may_block = true;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ KVM_MMU_LOCK(kvm);
+
+ kvm_mmu_invalidate_begin(kvm);
+ kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
+ if (kvm_unmap_gfn_range(kvm, &gfn_range))
+ kvm_flush_remote_tlbs(kvm);
+
+ KVM_MMU_UNLOCK(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier *notifier,
+ pgoff_t start, pgoff_t end)
+{
+ struct kvm_memory_slot *slot = container_of(notifier,
+ struct kvm_memory_slot,
+ notifier);
+ struct kvm *kvm = slot->kvm;
+ gfn_t gfn_start, gfn_end;
+
+ if (!restrictedmem_range_is_valid(slot, start, end,
+ &gfn_start, &gfn_end))
+ return;
+
+ KVM_MMU_LOCK(kvm);
+ kvm_mmu_invalidate_end(kvm);
+ KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_restrictedmem_error(struct restrictedmem_notifier *notifier,
+ pgoff_t start, pgoff_t end)
+{
+ struct kvm_memory_slot *slot = container_of(notifier,
+ struct kvm_memory_slot,
+ notifier);
+ kvm_arch_memory_mce(slot->kvm);
+}
+
+static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
+ .invalidate_start = kvm_restrictedmem_invalidate_begin,
+ .invalidate_end = kvm_restrictedmem_invalidate_end,
+ .error = kvm_restrictedmem_error,
+};
+
+static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
+{
+ slot->notifier.ops = &kvm_restrictedmem_notifier_ops;
+ restrictedmem_register_notifier(slot->restricted_file, &slot->notifier);
+}
+
+static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
+{
+ restrictedmem_unregister_notifier(slot->restricted_file,
+ &slot->notifier);
+}
+
+#else /* !CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
+static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
+{
+ WARN_ON_ONCE(1);
+}
+
+static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
+{
+ WARN_ON_ONCE(1);
+}
+
+#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
+
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
static int kvm_pm_notifier_call(struct notifier_block *bl,
unsigned long state,
@@ -980,6 +1095,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
/* This does not remove the slot from struct kvm_memslots data structures */
static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
+ if (slot->flags & KVM_MEM_PRIVATE) {
+ kvm_restrictedmem_unregister(slot);
+ fput(slot->restricted_file);
+ }
+
kvm_destroy_dirty_bitmap(slot);
kvm_arch_free_memslot(kvm, slot);
@@ -1551,10 +1671,14 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}
-static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+ const struct kvm_user_mem_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+ if (kvm_arch_has_private_mem(kvm))
+ valid_flags |= KVM_MEM_PRIVATE;
+
#ifdef __KVM_HAVE_READONLY_MEM
valid_flags |= KVM_MEM_READONLY;
#endif
@@ -1630,6 +1754,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
{
int r;
+ if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+ kvm_restrictedmem_register(new);
+
/*
* If dirty logging is disabled, nullify the bitmap; the old bitmap
* will be freed on "commit". If logging is enabled in both old and
@@ -1658,6 +1785,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
kvm_destroy_dirty_bitmap(new);
+ if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
+ kvm_restrictedmem_unregister(new);
+
return r;
}
@@ -1963,7 +2093,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int as_id, id;
int r;
- r = check_memory_region_flags(mem);
+ r = check_memory_region_flags(kvm, mem);
if (r)
return r;
@@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
!access_ok((void __user *)(unsigned long)mem->userspace_addr,
mem->memory_size))
return -EINVAL;
+ if (mem->flags & KVM_MEM_PRIVATE &&
+ (mem->restricted_offset & (PAGE_SIZE - 1) ||
+ mem->restricted_offset > U64_MAX - mem->memory_size))
+ return -EINVAL;
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
return -EINVAL;
} else { /* Modify an existing slot. */
+ /* Private memslots are immutable, they can only be deleted. */
+ if (mem->flags & KVM_MEM_PRIVATE)
+ return -EINVAL;
if ((mem->userspace_addr != old->userspace_addr) ||
(npages != old->npages) ||
((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
new->npages = npages;
new->flags = mem->flags;
new->userspace_addr = mem->userspace_addr;
+ if (mem->flags & KVM_MEM_PRIVATE) {
+ new->restricted_file = fget(mem->restricted_fd);
+ if (!new->restricted_file ||
+ !file_is_restrictedmem(new->restricted_file)) {
+ r = -EINVAL;
+ goto out;
+ }
+ new->restricted_offset = mem->restricted_offset;
+ }
+
+ new->kvm = kvm;
r = kvm_set_memslot(kvm, old, new, change);
if (r)
- kfree(new);
+ goto out;
+
+ return 0;
+
+out:
+ if (new->restricted_file)
+ fput(new->restricted_file);
+ kfree(new);
return r;
}
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -2351,6 +2506,8 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
static u64 kvm_supported_mem_attributes(struct kvm *kvm)
{
+ if (kvm_arch_has_private_mem(kvm))
+ return KVM_MEMORY_ATTRIBUTE_PRIVATE;
return 0;
}
@@ -4822,16 +4979,28 @@ static long kvm_vm_ioctl(struct file *filp,
}
case KVM_SET_USER_MEMORY_REGION: {
struct kvm_user_mem_region mem;
- unsigned long size = sizeof(struct kvm_userspace_memory_region);
+ unsigned int flags_offset = offsetof(typeof(mem), flags);
+ unsigned long size;
+ u32 flags;
kvm_sanity_check_user_mem_region_alias();
+ memset(&mem, 0, sizeof(mem));
+
r = -EFAULT;
+ if (get_user(flags, (u32 __user *)(argp + flags_offset)))
+ goto out;
+
+ if (flags & KVM_MEM_PRIVATE)
+ size = sizeof(struct kvm_userspace_memory_region_ext);
+ else
+ size = sizeof(struct kvm_userspace_memory_region);
+
if (copy_from_user(&mem, argp, size))
goto out;
r = -EINVAL;
- if (mem.flags & KVM_MEM_PRIVATE)
+ if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
goto out;
r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
--
2.25.1
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-02 6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
@ 2022-12-05 9:03 ` Fuad Tabba
2022-12-06 11:53 ` Chao Peng
2022-12-08 8:37 ` Xiaoyao Li
` (2 subsequent siblings)
3 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-05 9:03 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi Chao,
On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
>
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM of memory changes;
> KVM can then invalidate its page table entries or handle memory errors.
>
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>
V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
patch series anymore. Any reason you removed it, or is it just an
omission?
[*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
Thanks,
/fuad
> ---
> Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> arch/x86/kvm/Kconfig | 2 ++
> arch/x86/kvm/x86.c | 2 +-
> include/linux/kvm_host.h | 8 ++++--
> include/uapi/linux/kvm.h | 28 +++++++++++++++++++
> virt/kvm/Kconfig | 3 +++
> virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
> 7 files changed, 114 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index bb2f709c0900..99352170c130 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> :Capability: KVM_CAP_USER_MEMORY
> :Architectures: all
> :Type: vm ioctl
> -:Parameters: struct kvm_userspace_memory_region (in)
> +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> :Returns: 0 on success, -1 on error
>
> ::
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> __u64 userspace_addr; /* start of the userspace allocated memory */
> };
>
> + struct kvm_userspace_memory_region_ext {
> + struct kvm_userspace_memory_region region;
> + __u64 restricted_offset;
> + __u32 restricted_fd;
> + __u32 pad1;
> + __u64 pad2[14];
> + };
> +
> /* for kvm_memory_region::flags */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> + #define KVM_MEM_PRIVATE (1UL << 2)
>
> This ioctl allows the user to create, modify or delete a guest physical
> memory slot. Bits 0-15 of "slot" specify the slot id and this value
> @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> be identical. This allows large pages in the guest to be backed by large
> pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only. In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +The kvm_userspace_memory_region_ext struct includes all fields of the
> +kvm_userspace_memory_region struct and also adds additional fields for some
> +other features. See the description of the flags field below for more
> +information. It's recommended to use kvm_userspace_memory_region_ext in new
> +userspace code.
> +
> +The flags field supports following flags:
> +
> +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> + within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> +
> +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> + read-only. In this case, writes to this memory will be posted to userspace as
> + KVM_EXIT_MMIO exits.
> +
> +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> + memory backed by a file descriptor (fd) and that userspace access to the fd
> + may be restricted. Userspace should use restricted_fd/restricted_offset in
> + the kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> + to the guest. Userspace should guarantee not to map the same host physical
> + address indicated by restricted_fd/restricted_offset to different guest
> + physical addresses within multiple memslots. Failure to do so may result in
> + undefined behavior.
>
> When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> the memory region are automatically reflected into the guest. For example, an
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> select HAVE_KVM_MEMORY_ATTRIBUTES
> + select HAVE_KVM_RESTRICTED_MEM if X86_64
> + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7f850dfb4086..9a07380f8d3c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> }
>
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> - struct kvm_userspace_memory_region m;
> + struct kvm_user_mem_region m;
>
> m.slot = id | (i << 16);
> m.flags = 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a784e2b06625..02347e386ea2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -44,6 +44,7 @@
>
> #include <asm/kvm_host.h>
> #include <linux/kvm_dirty_ring.h>
> +#include <linux/restrictedmem.h>
>
> #ifndef KVM_MAX_VCPU_IDS
> #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> u32 flags;
> short id;
> u16 as_id;
> + struct file *restricted_file;
> + loff_t restricted_offset;
> + struct restrictedmem_notifier notifier;
> };
>
> static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> };
>
> int kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem);
> + const struct kvm_user_mem_region *mem);
> int __kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem);
> + const struct kvm_user_mem_region *mem);
> void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5d0941acb5bb..13bff963b8b0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> __u64 userspace_addr; /* start of the userspace allocated memory */
> };
>
> +struct kvm_userspace_memory_region_ext {
> + struct kvm_userspace_memory_region region;
> + __u64 restricted_offset;
> + __u32 restricted_fd;
> + __u32 pad1;
> + __u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/*
> + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> + * all fields from the top-level "extended" region.
> + */
> +struct kvm_user_mem_region {
> + __u32 slot;
> + __u32 flags;
> + __u64 guest_phys_addr;
> + __u64 memory_size;
> + __u64 userspace_addr;
> + __u64 restricted_offset;
> + __u32 restricted_fd;
> + __u32 pad1;
> + __u64 pad2[14];
> +};
> +#endif
> +
> /*
> * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> * other bits are reserved for kvm internal use which are defined in
> @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> +#define KVM_MEM_PRIVATE (1UL << 2)
>
> /* for KVM_IRQ_LINE */
> struct kvm_irq_level {
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index effdea5dd4f0..d605545d6dd1 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
>
> config HAVE_KVM_PM_NOTIFIER
> bool
> +
> +config HAVE_KVM_RESTRICTED_MEM
> + bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7f0f5e9f2406..b882eb2c76a2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
> }
>
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> * Must be called holding kvm->slots_lock for write.
> */
> int __kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem)
> + const struct kvm_user_mem_region *mem)
> {
> struct kvm_memory_slot *old, *new;
> struct kvm_memslots *slots;
> @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
> int kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem)
> + const struct kvm_user_mem_region *mem)
> {
> int r;
>
> @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
> static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> - struct kvm_userspace_memory_region *mem)
> + struct kvm_user_mem_region *mem)
> {
> if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> return -EINVAL;
> @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> return fd;
> }
>
> +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> +do { \
> + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> + offsetof(struct kvm_userspace_memory_region, field)); \
> + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> + sizeof_field(struct kvm_userspace_memory_region, field)); \
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> +do { \
> + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> + SANITY_CHECK_MEM_REGION_FIELD(slot);
> + SANITY_CHECK_MEM_REGION_FIELD(flags);
> + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> +}
> +
> static long kvm_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> break;
> }
> case KVM_SET_USER_MEMORY_REGION: {
> - struct kvm_userspace_memory_region kvm_userspace_mem;
> + struct kvm_user_mem_region mem;
> + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> +
> + kvm_sanity_check_user_mem_region_alias();
>
> r = -EFAULT;
> - if (copy_from_user(&kvm_userspace_mem, argp,
> - sizeof(kvm_userspace_mem)))
> + if (copy_from_user(&mem, argp, size))
> + goto out;
> +
> + r = -EINVAL;
> + if (mem.flags & KVM_MEM_PRIVATE)
> goto out;
>
> - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> break;
> }
> case KVM_GET_DIRTY_LOG: {
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
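As a side note on the kvm_user_mem_region alias above: the trick only works because the flattened, kernel-only struct is laid out byte-for-byte like kvm_userspace_memory_region(_ext), which the SANITY_CHECK_MEM_REGION_* macros enforce at build time. Below is a standalone userspace illustration of the same idea using C11 static_assert instead of BUILD_BUG_ON; the mirror struct is hypothetical, re-declared here only because the kernel-only alias is hidden behind __KERNEL__:

#include <assert.h>
#include <stddef.h>
#include <linux/kvm.h>   /* headers patched with this series assumed */

/* Userspace mirror of the kernel-only kvm_user_mem_region alias. */
struct user_mem_region_alias {
        __u32 slot;
        __u32 flags;
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;
        __u64 restricted_offset;
        __u32 restricted_fd;
        __u32 pad1;
        __u64 pad2[14];
};

/* Every aliased field: same offset and same size, checked at compile time. */
#define CHECK_ALIAS(alias, uapi, f)                                       \
        static_assert(offsetof(alias, f) == offsetof(uapi, f) &&         \
                      sizeof(((alias *)0)->f) == sizeof(((uapi *)0)->f), \
                      "layout mismatch: " #f)

CHECK_ALIAS(struct user_mem_region_alias,
            struct kvm_userspace_memory_region, slot);
CHECK_ALIAS(struct user_mem_region_alias,
            struct kvm_userspace_memory_region, userspace_addr);
CHECK_ALIAS(struct user_mem_region_alias,
            struct kvm_userspace_memory_region_ext, restricted_fd);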
* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
2022-12-02 6:13 ` [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
@ 2022-12-05 9:23 ` Fuad Tabba
2022-12-06 11:56 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-05 9:23 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi Chao,
On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Currently in the mmu_notifier invalidate path, the hva range is recorded
> and then checked against by mmu_invalidate_retry_hva() in the page fault
> handling path. However, for the soon-to-be-introduced private memory, a
> page fault may not have an associated hva, so checking the gfn (gpa)
> makes more sense.
>
> For existing hva-based shared memory, gfn is expected to work as well.
> The only downside is that when multiple gfns alias a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected to be small.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 8 +++++---
> include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
> 3 files changed, 49 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4736d7849c60..e2c70b5afa3e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> return true;
>
> return fault->slot &&
> - mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> + mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> }
>
> static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
> write_lock(&kvm->mmu_lock);
>
> - kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> + kvm_mmu_invalidate_begin(kvm);
> +
> + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>
> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>
> @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> gfn_end - gfn_start);
>
> - kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> + kvm_mmu_invalidate_end(kvm);
>
> write_unlock(&kvm->mmu_lock);
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 02347e386ea2..3d69484d2704 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -787,8 +787,8 @@ struct kvm {
> struct mmu_notifier mmu_notifier;
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> - unsigned long mmu_invalidate_range_start;
> - unsigned long mmu_invalidate_range_end;
> + gfn_t mmu_invalidate_range_start;
> + gfn_t mmu_invalidate_range_end;
> #endif
> struct list_head devices;
> u64 manual_dirty_log_protect;
> @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> - unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> - unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm);
>
> long kvm_arch_dev_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg);
> @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> return 0;
> }
>
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> unsigned long mmu_seq,
> - unsigned long hva)
> + gfn_t gfn)
> {
> lockdep_assert_held(&kvm->mmu_lock);
> /*
> @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> * that might be being invalidated. Note that it may include some false
nit: "might be" (or) "is being"
> * positives, due to shortcuts when handing concurrent invalidations.
nit: handling
> */
> - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> - hva >= kvm->mmu_invalidate_range_start &&
> - hva < kvm->mmu_invalidate_range_end)
> - return 1;
> + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> + /*
> + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> + * but before updating the range is a KVM bug.
> + */
> + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> + kvm->mmu_invalidate_range_end == INVALID_GPA))
INVALID_GPA is an x86-specific define in
arch/x86/include/asm/kvm_host.h, so this doesn't build on other
architectures. The obvious fix is to move it to
include/linux/kvm_host.h.
Cheers,
/fuad
> + return 1;
> +
> + if (gfn >= kvm->mmu_invalidate_range_start &&
> + gfn < kvm->mmu_invalidate_range_end)
> + return 1;
> + }
> +
> if (kvm->mmu_invalidate_seq != mmu_seq)
> return 1;
> return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b882eb2c76a2..ad55dfbc75d7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>
> typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> - unsigned long end);
> -
> +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
> struct kvm_hva_range {
> @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> locked = true;
> KVM_MMU_LOCK(kvm);
> if (!IS_KVM_NULL_FN(range->on_lock))
> - range->on_lock(kvm, range->start, range->end);
> + range->on_lock(kvm);
> +
> if (IS_KVM_NULL_FN(range->handler))
> break;
> }
> @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> - unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> {
> /*
> * The count increase must become visible at unlock time as no
> @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> * count is also read inside the mmu_lock critical section.
> */
> kvm->mmu_invalidate_in_progress++;
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = INVALID_GPA;
> + kvm->mmu_invalidate_range_end = INVALID_GPA;
> + }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> kvm->mmu_invalidate_range_start = start;
> kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> }
> }
>
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> + return kvm_unmap_gfn_range(kvm, range);
> +}
> +
> static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> .start = range->start,
> .end = range->end,
> .pte = __pte(0),
> - .handler = kvm_unmap_gfn_range,
> + .handler = kvm_mmu_unmap_gfn_range,
> .on_lock = kvm_mmu_invalidate_begin,
> .on_unlock = kvm_arch_guest_memory_reclaimed,
> .flush_on_ret = true,
> @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> return 0;
> }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> - unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> {
> /*
> * This sequence increase will notify the kvm page fault that
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
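To summarize the new API shape for readers skimming the thread: the old begin/end calls carrying an hva range are split into begin + range_add + end operating on gfns. A condensed sketch of the caller pattern, modeled on the kvm_zap_gfn_range() hunk above (not actual kernel code; error handling and TLB flushing elided):

static void zap_gfn_range_sketch(struct kvm *kvm, gfn_t start, gfn_t end)
{
        write_lock(&kvm->mmu_lock);

        kvm_mmu_invalidate_begin(kvm);                 /* count++, range reset */
        kvm_mmu_invalidate_range_add(kvm, start, end); /* record [start, end)  */

        /* ... zap SPTEs and flush TLBs for [start, end) ... */

        kvm_mmu_invalidate_end(kvm);                   /* bumps invalidate_seq */

        write_unlock(&kvm->mmu_lock);
}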
* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
2022-12-02 6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
@ 2022-12-05 22:49 ` Isaku Yamahata
2022-12-06 12:02 ` Chao Peng
2023-01-13 23:12 ` Sean Christopherson
2023-01-13 23:16 ` Sean Christopherson
2 siblings, 1 reply; 153+ messages in thread
From: Isaku Yamahata @ 2022-12-05 22:49 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang, isaku.yamahata
On Fri, Dec 02, 2022 at 02:13:45PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:
> A large page with mixed private/shared subpages can't be mapped as a
> large page, since its private and shared subpages come from different
> memory backends and may also be treated differently by the architecture.
> When private and shared memory are mixed in a large page, the current
> lpage_info is not sufficient to decide whether the page can be mapped as
> a large page or not, and additional private/shared 'mixed' information
> is needed.
>
> Tracking this 'mixed' information with the current count-like
> disallow_lpage is a bit challenging, so reserve a bit in 'disallow_lpage'
> to indicate that a large page has mixed private/shared subpages, and
> update this 'mixed' bit whenever the memory attribute is changed between
> private and shared.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
> arch/x86/include/asm/kvm_host.h | 8 ++
> arch/x86/kvm/mmu/mmu.c | 134 +++++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 2 +
> include/linux/kvm_host.h | 19 +++++
> virt/kvm/kvm_main.c | 9 ++-
> 5 files changed, 169 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 283cbb83d6ae..7772ab37ac89 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -38,6 +38,7 @@
> #include <asm/hyperv-tlfs.h>
>
> #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
>
> #define KVM_MAX_VCPUS 1024
>
> @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> #endif
> };
>
> +/*
> + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> + * level. The remaining bits are used as a reference count.
> + */
> +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
> +#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
> +
> struct kvm_lpage_info {
> int disallow_lpage;
> };
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2c70b5afa3e..2190fd8c95c0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> {
> struct kvm_lpage_info *linfo;
> int i;
> + int disallow_count;
>
> for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> linfo = lpage_info_slot(gfn, slot, i);
> +
> + disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> + WARN_ON(disallow_count + count < 0 ||
> + disallow_count > KVM_LPAGE_COUNT_MAX - count);
> +
> linfo->disallow_lpage += count;
> - WARN_ON(linfo->disallow_lpage < 0);
> }
> }
>
> @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> if (kvm->arch.nx_huge_page_recovery_thread)
> kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> }
> +
> +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> +{
> + return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> + int level, bool mixed)
> +{
> + struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> +
> + if (mixed)
> + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> + else
> + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> +{
> + bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +
> + if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> + if (!expect_private)
> + return false;
> + } else if (expect_private)
> + return false;
> +
> + return true;
> +}
> +
> +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + XA_STATE(xas, &kvm->mem_attr_array, start);
> + gfn_t gfn = start;
> + void *entry;
> + bool mixed = false;
> +
> + rcu_read_lock();
> + entry = xas_load(&xas);
> + while (gfn < end) {
> + if (xas_retry(&xas, entry))
> + continue;
> +
> + KVM_BUG_ON(gfn != xas.xa_index, kvm);
> +
> + if (!is_expected_attr_entry(entry, attrs)) {
> + mixed = true;
> + break;
> + }
> +
> + entry = xas_next(&xas);
> + gfn++;
> + }
> +
> + rcu_read_unlock();
> + return mixed;
> +}
> +
> +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> + int level, unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + unsigned long gfn;
> +
> + if (level == PG_LEVEL_2M)
> + return mem_attrs_mixed_2m(kvm, attrs, start, end);
> +
> + for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> + if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> + !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> + attrs))
> + return true;
> + return false;
> +}
> +
> +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + unsigned long pages, mask;
> + gfn_t gfn, gfn_end, first, last;
> + int level;
> + bool mixed;
> +
> + /*
> + * The sequence matters here: we set the higher level basing on the
> + * lower level's scanning result.
> + */
> + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> + pages = KVM_PAGES_PER_HPAGE(level);
> + mask = ~(pages - 1);
> + first = start & mask;
> + last = (end - 1) & mask;
> +
> + /*
> + * We only need to scan the head and tail page, for middle pages
> + * we know they will not be mixed.
> + */
> + gfn = max(first, slot->base_gfn);
> + gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> + linfo_set_mixed(gfn, slot, level, mixed);
> +
> + if (first == last)
> + return;
continue.
> +
> + for (gfn = first + pages; gfn < last; gfn += pages)
> + linfo_set_mixed(gfn, slot, level, false);
> +
> + gfn = last;
> + gfn_end = min(last + pages, slot->base_gfn + slot->npages);
if (gfn == gfn_end) continue.
> + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> + linfo_set_mixed(gfn, slot, level, mixed);
> + }
> +}
> +
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + if (kvm_slot_can_be_private(slot))
> + kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
> + start, end);
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a07380f8d3c..5aefcff614d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> linfo[lpages - 1].disallow_lpage = 1;
> ugfn = slot->userspace_addr >> PAGE_SHIFT;
> + if (kvm_slot_can_be_private(slot))
> + ugfn |= slot->restricted_offset >> PAGE_SHIFT;
Is there any alignment restriction? If not, it should be +=.
In practice, alignment will hold though.
Thanks,
> /*
> * If the gfn and userspace address are not aligned wrt each
> * other, disable large page support for this slot.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3331c0c92838..25099c94e770 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -592,6 +592,11 @@ struct kvm_memory_slot {
> struct restrictedmem_notifier notifier;
> };
>
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> + return slot && (slot->flags & KVM_MEM_PRIVATE);
> +}
> +
> static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> {
> return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end);
> +#else
> +static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> +}
> +#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> +
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4e1e1e113bf0..e107afea32f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> + unsigned long attrs)
> {
> struct kvm_gfn_range gfn_range;
> struct kvm_memory_slot *slot;
> @@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> gfn_range.slot = slot;
>
> r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +
> + kvm_arch_set_memory_attributes(kvm, slot, attrs,
> + gfn_range.start,
> + gfn_range.end);
> }
> }
>
> @@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> idx = srcu_read_lock(&kvm->srcu);
> KVM_MMU_LOCK(kvm);
> if (i > start)
> - kvm_unmap_mem_range(kvm, start, i);
> + kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
> kvm_mmu_invalidate_end(kvm);
> KVM_MMU_UNLOCK(kvm);
> srcu_read_unlock(&kvm->srcu, idx);
> --
> 2.25.1
>
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 153+ messages in thread
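A worked example of the head/tail computation discussed above, for the 2M level where KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) is 512 gfns (the numbers are illustrative only):

  start = 0x3f0, end = 0x1010, pages = 0x200, mask = ~(pages - 1)

  first = start & mask     = 0x200     (head 2M-aligned block)
  last  = (end - 1) & mask = 0x1000    (tail 2M-aligned block)

  - head block [0x200, 0x400): only partially covered by [start, end), so its
    attribute entries must be scanned for a private/shared mix;
  - middle blocks [0x400, 0x1000): fully covered, so they are simply marked
    not-mixed without scanning;
  - tail block [0x1000, 0x1200): partially covered, scanned like the head
    (both scans are additionally clamped to the memslot boundaries).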
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-05 9:03 ` Fuad Tabba
@ 2022-12-06 11:53 ` Chao Peng
2022-12-06 12:39 ` Fuad Tabba
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-06 11:53 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> Hi Chao,
>
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> >
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> >
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries or handle memory errors.
> >
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Reviewed-by: Fuad Tabba <tabba@google.com>
> > Tested-by: Fuad Tabba <tabba@google.com>
>
> V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> patch series anymore. Any reason you removed it, or is it just an
> omission?
We had some discussion in v9 [1] about adding generic memory attributes
ioctls, and KVM_CAP_PRIVATE_MEM can now be expressed as a new
KVM_MEMORY_ATTRIBUTE_PRIVATE flag reported via the
KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl [2]. The API doc has been updated:
+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+ KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
[1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
[2]
https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/
Thanks,
Chao
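
A minimal sketch of the resulting userspace check (illustrative only, not part of the series as posted; the convention of returning a __u64 attribute bitmask through the ioctl argument is assumed from patch 2, which is not quoted in this message):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>   /* headers patched with this series assumed */

/* Replaces the dropped KVM_CAP_PRIVATE_MEM check. */
static bool vm_supports_private_mem(int vm_fd)
{
        __u64 attrs = 0;

        if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &attrs))
                return false;

        return attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
}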
>
> [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
>
> Thanks,
> /fuad
>
> > ---
> > Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > arch/x86/kvm/Kconfig | 2 ++
> > arch/x86/kvm/x86.c | 2 +-
> > include/linux/kvm_host.h | 8 ++++--
> > include/uapi/linux/kvm.h | 28 +++++++++++++++++++
> > virt/kvm/Kconfig | 3 +++
> > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
> > 7 files changed, 114 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index bb2f709c0900..99352170c130 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > :Capability: KVM_CAP_USER_MEMORY
> > :Architectures: all
> > :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > :Returns: 0 on success, -1 on error
> >
> > ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> > };
> >
> > + struct kvm_userspace_memory_region_ext {
> > + struct kvm_userspace_memory_region region;
> > + __u64 restricted_offset;
> > + __u32 restricted_fd;
> > + __u32 pad1;
> > + __u64 pad2[14];
> > + };
> > +
> > /* for kvm_memory_region::flags */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY (1UL << 1)
> > + #define KVM_MEM_PRIVATE (1UL << 2)
> >
> > This ioctl allows the user to create, modify or delete a guest physical
> > memory slot. Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > be identical. This allows large pages in the guest to be backed by large
> > pages in the host.
> >
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > -to make a new slot read-only. In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +kvm_userspace_memory_region_ext struct includes all fields of
> > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > +other features. See below description of flags field for more information.
> > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > +
> > +The flags field supports following flags:
> > +
> > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > + within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > +
> > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > + read-only. In this case, writes to this memory will be posted to userspace as
> > + KVM_EXIT_MMIO exits.
> > +
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > + memory backed by a file descriptor(fd) and userspace access to the fd may be
> > + restricted. Userspace should use restricted_fd/restricted_offset in the
> > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > + to guest. Userspace should guarantee not to map the same host physical address
> > + indicated by restricted_fd/restricted_offset to different guest physical
> > + addresses within multiple memslots. Failed to do this may result undefined
> > + behavior.
> >
> > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > the memory region are automatically reflected into the guest. For example, an
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> > select INTERVAL_TREE
> > select HAVE_KVM_PM_NOTIFIER if PM
> > select HAVE_KVM_MEMORY_ATTRIBUTES
> > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > help
> > Support hosting fully virtualized guest machines using hardware
> > virtualization extensions. You will need a fairly recent
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7f850dfb4086..9a07380f8d3c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > }
> >
> > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > - struct kvm_userspace_memory_region m;
> > + struct kvm_user_mem_region m;
> >
> > m.slot = id | (i << 16);
> > m.flags = 0;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a784e2b06625..02347e386ea2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -44,6 +44,7 @@
> >
> > #include <asm/kvm_host.h>
> > #include <linux/kvm_dirty_ring.h>
> > +#include <linux/restrictedmem.h>
> >
> > #ifndef KVM_MAX_VCPU_IDS
> > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > u32 flags;
> > short id;
> > u16 as_id;
> > + struct file *restricted_file;
> > + loff_t restricted_offset;
> > + struct restrictedmem_notifier notifier;
> > };
> >
> > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > };
> >
> > int kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem);
> > + const struct kvm_user_mem_region *mem);
> > int __kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem);
> > + const struct kvm_user_mem_region *mem);
> > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 5d0941acb5bb..13bff963b8b0 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> > };
> >
> > +struct kvm_userspace_memory_region_ext {
> > + struct kvm_userspace_memory_region region;
> > + __u64 restricted_offset;
> > + __u32 restricted_fd;
> > + __u32 pad1;
> > + __u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/*
> > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > + * all fields from the top-level "extended" region.
> > + */
> > +struct kvm_user_mem_region {
> > + __u32 slot;
> > + __u32 flags;
> > + __u64 guest_phys_addr;
> > + __u64 memory_size;
> > + __u64 userspace_addr;
> > + __u64 restricted_offset;
> > + __u32 restricted_fd;
> > + __u32 pad1;
> > + __u64 pad2[14];
> > +};
> > +#endif
> > +
> > /*
> > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY (1UL << 1)
> > +#define KVM_MEM_PRIVATE (1UL << 2)
> >
> > /* for KVM_IRQ_LINE */
> > struct kvm_irq_level {
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index effdea5dd4f0..d605545d6dd1 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> >
> > config HAVE_KVM_PM_NOTIFIER
> > bool
> > +
> > +config HAVE_KVM_RESTRICTED_MEM
> > + bool
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 7f0f5e9f2406..b882eb2c76a2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > }
> > }
> >
> > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > {
> > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >
> > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > * Must be called holding kvm->slots_lock for write.
> > */
> > int __kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem)
> > + const struct kvm_user_mem_region *mem)
> > {
> > struct kvm_memory_slot *old, *new;
> > struct kvm_memslots *slots;
> > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> >
> > int kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem)
> > + const struct kvm_user_mem_region *mem)
> > {
> > int r;
> >
> > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> >
> > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > - struct kvm_userspace_memory_region *mem)
> > + struct kvm_user_mem_region *mem)
> > {
> > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > return -EINVAL;
> > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > return fd;
> > }
> >
> > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> > +do { \
> > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > + offsetof(struct kvm_userspace_memory_region, field)); \
> > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > + sizeof_field(struct kvm_userspace_memory_region, field)); \
> > +} while (0)
> > +
> > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> > +do { \
> > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> > +} while (0)
> > +
> > +static void kvm_sanity_check_user_mem_region_alias(void)
> > +{
> > + SANITY_CHECK_MEM_REGION_FIELD(slot);
> > + SANITY_CHECK_MEM_REGION_FIELD(flags);
> > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > +}
> > +
> > static long kvm_vm_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg)
> > {
> > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > break;
> > }
> > case KVM_SET_USER_MEMORY_REGION: {
> > - struct kvm_userspace_memory_region kvm_userspace_mem;
> > + struct kvm_user_mem_region mem;
> > + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +
> > + kvm_sanity_check_user_mem_region_alias();
> >
> > r = -EFAULT;
> > - if (copy_from_user(&kvm_userspace_mem, argp,
> > - sizeof(kvm_userspace_mem)))
> > + if (copy_from_user(&mem, argp, size))
> > + goto out;
> > +
> > + r = -EINVAL;
> > + if (mem.flags & KVM_MEM_PRIVATE)
> > goto out;
> >
> > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > break;
> > }
> > case KVM_GET_DIRTY_LOG: {
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 153+ messages in thread
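Related to the exchange above, the per-slot KVM_MEM_PRIVATE flag is complemented by the per-range attribute ioctl from patch 2: after registering the slot, userspace flips a GPA range between shared and private by setting or clearing KVM_MEMORY_ATTRIBUTE_PRIVATE. A hedged sketch follows; the kvm_memory_attributes field names and the KVM_SET_MEMORY_ATTRIBUTES ioctl name are assumed from patch 2, which is not quoted in this thread excerpt:

#include <sys/ioctl.h>
#include <linux/kvm.h>   /* headers patched with this series assumed */

/* Converts [gpa, gpa + size) to private; pass 0 attributes to make it shared. */
static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
{
        struct kvm_memory_attributes attr = {
                .address    = gpa,
                .size       = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
}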
* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
2022-12-05 9:23 ` Fuad Tabba
@ 2022-12-06 11:56 ` Chao Peng
2022-12-06 15:48 ` Fuad Tabba
2022-12-07 6:34 ` Isaku Yamahata
0 siblings, 2 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-06 11:56 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Mon, Dec 05, 2022 at 09:23:49AM +0000, Fuad Tabba wrote:
> Hi Chao,
>
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Currently in mmu_notifier invalidate path, hva range is recorded and
> > then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to be introduced private memory, a page
> > fault may not have a hva associated, checking gfn(gpa) makes more sense.
> >
> > For existing hva based shared memory, gfn is expected to also work. The
> > only downside is when aliasing multiple gfns to a single hva, the
> > current algorithm of checking multiple ranges could result in a much
> > larger range being rejected. Such aliasing should be uncommon, so the
> > impact is expected small.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 8 +++++---
> > include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> > virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
> > 3 files changed, 49 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4736d7849c60..e2c70b5afa3e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> > return true;
> >
> > return fault->slot &&
> > - mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > + mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> > }
> >
> > static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> > write_lock(&kvm->mmu_lock);
> >
> > - kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> > + kvm_mmu_invalidate_begin(kvm);
> > +
> > + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> >
> > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >
> > @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> > gfn_end - gfn_start);
> >
> > - kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> > + kvm_mmu_invalidate_end(kvm);
> >
> > write_unlock(&kvm->mmu_lock);
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 02347e386ea2..3d69484d2704 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -787,8 +787,8 @@ struct kvm {
> > struct mmu_notifier mmu_notifier;
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > - unsigned long mmu_invalidate_range_start;
> > - unsigned long mmu_invalidate_range_end;
> > + gfn_t mmu_invalidate_range_start;
> > + gfn_t mmu_invalidate_range_end;
> > #endif
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > #endif
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_mmu_invalidate_end(struct kvm *kvm);
> >
> > long kvm_arch_dev_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg);
> > @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> > return 0;
> > }
> >
> > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> > unsigned long mmu_seq,
> > - unsigned long hva)
> > + gfn_t gfn)
> > {
> > lockdep_assert_held(&kvm->mmu_lock);
> > /*
> > @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > * that might be being invalidated. Note that it may include some false
>
> nit: "might be" (or) "is being"
>
> > * positives, due to shortcuts when handing concurrent invalidations.
>
> nit: handling
Both are in existing code, but I can fix them either way.
>
> > */
> > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > - hva >= kvm->mmu_invalidate_range_start &&
> > - hva < kvm->mmu_invalidate_range_end)
> > - return 1;
> > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > + /*
> > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > + * but before updating the range is a KVM bug.
> > + */
> > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > + kvm->mmu_invalidate_range_end == INVALID_GPA))
>
> INVALID_GPA is an x86-specific define in
> arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> architectures. The obvious fix is to move it to
> include/linux/kvm_host.h.
Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this
is the correct choice for other architectures, but after a search it has
not been used by other architectures, so it should be safe to make it
common.
Thanks,
Chao
>
> Cheers,
> /fuad
>
> > + return 1;
> > +
> > + if (gfn >= kvm->mmu_invalidate_range_start &&
> > + gfn < kvm->mmu_invalidate_range_end)
> > + return 1;
> > + }
> > +
> > if (kvm->mmu_invalidate_seq != mmu_seq)
> > return 1;
> > return 0;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b882eb2c76a2..ad55dfbc75d7 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> >
> > typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> >
> > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > -
> > +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> > typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> >
> > struct kvm_hva_range {
> > @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > locked = true;
> > KVM_MMU_LOCK(kvm);
> > if (!IS_KVM_NULL_FN(range->on_lock))
> > - range->on_lock(kvm, range->start, range->end);
> > + range->on_lock(kvm);
> > +
> > if (IS_KVM_NULL_FN(range->handler))
> > break;
> > }
> > @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > - unsigned long end)
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > {
> > /*
> > * The count increase must become visible at unlock time as no
> > @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > * count is also read inside the mmu_lock critical section.
> > */
> > kvm->mmu_invalidate_in_progress++;
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > kvm->mmu_invalidate_range_start = start;
> > kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > }
> > }
> >
> > +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > + kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > + return kvm_unmap_gfn_range(kvm, range);
> > +}
> > +
> > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > .start = range->start,
> > .end = range->end,
> > .pte = __pte(0),
> > - .handler = kvm_unmap_gfn_range,
> > + .handler = kvm_mmu_unmap_gfn_range,
> > .on_lock = kvm_mmu_invalidate_begin,
> > .on_unlock = kvm_arch_guest_memory_reclaimed,
> > .flush_on_ret = true,
> > @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > - unsigned long end)
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > {
> > /*
> > * This sequence increase will notify the kvm page fault that
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 153+ messages in thread
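For completeness, the consumer side of the protocol after the hva->gfn switch, modeled on what is_page_fault_stale() in the patch does (a simplified sketch, not the actual x86 fault path; RET_PF_* follow the x86 MMU's conventions):

static int fault_sketch(struct kvm *kvm, gfn_t gfn)
{
        /* Snapshot the sequence before resolving the pfn, which may sleep. */
        unsigned long mmu_seq = kvm->mmu_invalidate_seq;

        smp_rmb();      /* pairs with the update on the invalidation side */

        /* ... resolve the pfn for 'gfn' outside mmu_lock ... */

        write_lock(&kvm->mmu_lock);
        if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
                write_unlock(&kvm->mmu_lock);
                return RET_PF_RETRY;    /* stale: let the vCPU fault again */
        }
        /* ... install the SPTE ... */
        write_unlock(&kvm->mmu_lock);

        return RET_PF_FIXED;
}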
* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
2022-12-05 22:49 ` Isaku Yamahata
@ 2022-12-06 12:02 ` Chao Peng
2022-12-07 6:42 ` Isaku Yamahata
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-06 12:02 UTC (permalink / raw)
To: Isaku Yamahata
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> > A large page with mixed private/shared subpages can't be mapped as a large
> > page since its private/shared subpages come from different memory
> > backends and may also be treated differently by the architecture. When
> > private and shared memory are mixed in a large page, the current lpage_info
> > is not sufficient to decide whether the page can be mapped as a large page
> > or not, and additional private/shared mixed information is needed.
> >
> > Tracking this 'mixed' information with the current count-based
> > 'disallow_lpage' is a bit challenging, so reserve a bit in 'disallow_lpage'
> > to indicate that a large page has mixed private/shared subpages, and update
> > this 'mixed' bit whenever the memory attribute is changed between
> > private and shared.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 8 ++
> > arch/x86/kvm/mmu/mmu.c | 134 +++++++++++++++++++++++++++++++-
> > arch/x86/kvm/x86.c | 2 +
> > include/linux/kvm_host.h | 19 +++++
> > virt/kvm/kvm_main.c | 9 ++-
> > 5 files changed, 169 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 283cbb83d6ae..7772ab37ac89 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -38,6 +38,7 @@
> > #include <asm/hyperv-tlfs.h>
> >
> > #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> >
> > #define KVM_MAX_VCPUS 1024
> >
> > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> > #endif
> > };
> >
> > +/*
> > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> > + * level. The remaining bits are used as a reference count.
> > + */
> > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
> > +#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
> > +
> > struct kvm_lpage_info {
> > int disallow_lpage;
> > };
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e2c70b5afa3e..2190fd8c95c0 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > {
> > struct kvm_lpage_info *linfo;
> > int i;
> > + int disallow_count;
> >
> > for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > linfo = lpage_info_slot(gfn, slot, i);
> > +
> > + disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > + WARN_ON(disallow_count + count < 0 ||
> > + disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > +
> > linfo->disallow_lpage += count;
> > - WARN_ON(linfo->disallow_lpage < 0);
> > }
> > }
> >
> > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > if (kvm->arch.nx_huge_page_recovery_thread)
> > kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> > }
> > +
> > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > +{
> > + return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> > + int level, bool mixed)
> > +{
> > + struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> > +
> > + if (mixed)
> > + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > + else
> > + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> > +{
> > + bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +
> > + if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> > + if (!expect_private)
> > + return false;
> > + } else if (expect_private)
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> > + gfn_t start, gfn_t end)
> > +{
> > + XA_STATE(xas, &kvm->mem_attr_array, start);
> > + gfn_t gfn = start;
> > + void *entry;
> > + bool mixed = false;
> > +
> > + rcu_read_lock();
> > + entry = xas_load(&xas);
> > + while (gfn < end) {
> > + if (xas_retry(&xas, entry))
> > + continue;
> > +
> > + KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > + if (!is_expected_attr_entry(entry, attrs)) {
> > + mixed = true;
> > + break;
> > + }
> > +
> > + entry = xas_next(&xas);
> > + gfn++;
> > + }
> > +
> > + rcu_read_unlock();
> > + return mixed;
> > +}
> > +
> > +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > + int level, unsigned long attrs,
> > + gfn_t start, gfn_t end)
> > +{
> > + unsigned long gfn;
> > +
> > + if (level == PG_LEVEL_2M)
> > + return mem_attrs_mixed_2m(kvm, attrs, start, end);
> > +
> > + for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> > + if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> > + !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> > + attrs))
> > + return true;
> > + return false;
> > +}
> > +
> > +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned long attrs,
> > + gfn_t start, gfn_t end)
> > +{
> > + unsigned long pages, mask;
> > + gfn_t gfn, gfn_end, first, last;
> > + int level;
> > + bool mixed;
> > +
> > + /*
> > + * The sequence matters here: we set the higher level basing on the
> > + * lower level's scanning result.
> > + */
> > + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > + pages = KVM_PAGES_PER_HPAGE(level);
> > + mask = ~(pages - 1);
> > + first = start & mask;
> > + last = (end - 1) & mask;
> > +
> > + /*
> > + * We only need to scan the head and tail page, for middle pages
> > + * we know they will not be mixed.
> > + */
> > + gfn = max(first, slot->base_gfn);
> > + gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> > + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > + linfo_set_mixed(gfn, slot, level, mixed);
> > +
> > + if (first == last)
> > + return;
>
>
> continue.
Ya!
>
> > +
> > + for (gfn = first + pages; gfn < last; gfn += pages)
> > + linfo_set_mixed(gfn, slot, level, false);
> > +
> > + gfn = last;
> > + gfn_end = min(last + pages, slot->base_gfn + slot->npages);
>
> if (gfn == gfn_end) continue.
Do you see a case where gfn can be equal to gfn_end? It does not hurt to
add a check, though.
>
>
> > + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > + linfo_set_mixed(gfn, slot, level, mixed);
> > + }
> > +}
> > +
> > +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned long attrs,
> > + gfn_t start, gfn_t end)
> > +{
> > + if (kvm_slot_can_be_private(slot))
> > + kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
> > + start, end);
> > +}
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9a07380f8d3c..5aefcff614d2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> > if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> > linfo[lpages - 1].disallow_lpage = 1;
> > ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > + if (kvm_slot_can_be_private(slot))
> > + ugfn |= slot->restricted_offset >> PAGE_SHIFT;
>
> Is there any alignment restriction? If no, It should be +=.
> In practice, alignment will hold though.
All we need here is to check whether both userspace_addr and
restricted_offset are aligned to HPAGE_SIZE. '+=' can actually yield the
wrong value when userspace_addr + restricted_offset is aligned to
HPAGE_SIZE even though the two values are not individually aligned.
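
A contrived illustration, with made-up pfn values and a 2MB page being 512
4KB frames -- not from the patch, just to show why '|' is used here:

    #include <stdio.h>

    int main(void)
    {
            const unsigned long mask = 512 - 1;   /* 2MB page, in 4KB frames */
            unsigned long hva_gfn = 0x100;        /* userspace_addr >> PAGE_SHIFT */
            unsigned long ofs_gfn = 0x100;        /* restricted_offset >> PAGE_SHIFT */

            /* '+' hides the misalignment: 0x100 + 0x100 == 0x200, which is aligned */
            printf("sum: %#lx\n", (hva_gfn + ofs_gfn) & mask);  /* 0x0: looks aligned */
            /* '|' preserves it: 0x100 | 0x100 == 0x100, still misaligned */
            printf("or:  %#lx\n", (hva_gfn | ofs_gfn) & mask);  /* 0x100: detected */
            return 0;
    }
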
Thanks,
Chao
>
> Thanks,
>
> > /*
> > * If the gfn and userspace address are not aligned wrt each
> > * other, disable large page support for this slot.
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3331c0c92838..25099c94e770 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -592,6 +592,11 @@ struct kvm_memory_slot {
> > struct restrictedmem_notifier notifier;
> > };
> >
> > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> > +{
> > + return slot && (slot->flags & KVM_MEM_PRIVATE);
> > +}
> > +
> > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > {
> > return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> > @@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> > /* Max number of entries allowed for each kvm dirty ring */
> > #define KVM_DIRTY_RING_MAX_ENTRIES 65536
> >
> > +#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> > +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned long attrs,
> > + gfn_t start, gfn_t end);
> > +#else
> > +static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned long attrs,
> > + gfn_t start, gfn_t end)
> > +{
> > +}
> > +#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> > +
> > #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 4e1e1e113bf0..e107afea32f0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > return 0;
> > }
> >
> > -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> > + unsigned long attrs)
> > {
> > struct kvm_gfn_range gfn_range;
> > struct kvm_memory_slot *slot;
> > @@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > gfn_range.slot = slot;
> >
> > r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > +
> > + kvm_arch_set_memory_attributes(kvm, slot, attrs,
> > + gfn_range.start,
> > + gfn_range.end);
> > }
> > }
> >
> > @@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > idx = srcu_read_lock(&kvm->srcu);
> > KVM_MMU_LOCK(kvm);
> > if (i > start)
> > - kvm_unmap_mem_range(kvm, start, i);
> > + kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
> > kvm_mmu_invalidate_end(kvm);
> > KVM_MMU_UNLOCK(kvm);
> > srcu_read_unlock(&kvm->srcu, idx);
> > --
> > 2.25.1
> >
>
> --
> Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-06 11:53 ` Chao Peng
@ 2022-12-06 12:39 ` Fuad Tabba
2022-12-07 15:10 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-06 12:39 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi Chao,
On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > Hi Chao,
> >
> > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > In memory encryption usage, guest memory may be encrypted with a special
> > > key and can be accessed only by the guest itself. We call such memory
> > > private memory. It is of little value, and can sometimes cause problems, to
> > > allow userspace to access guest private memory. This new KVM memslot
> > > extension allows guest private memory to be provided through a
> > > restrictedmem-backed file descriptor (fd), and userspace is restricted from
> > > accessing the memory backed by the fd.
> > >
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > and the size is 'memory_size'.
> > >
> > > The extended memslot can still have the userspace_addr (hva). When in use, a
> > > single memslot can maintain both private memory through restricted_fd
> > > and shared memory through userspace_addr. Whether the private or shared
> > > part is visible to the guest is maintained by other KVM code.
> > >
> > > A restrictedmem_notifier field is also added to the memslot structure to
> > > allow the restricted_fd's backing store to notify KVM of memory changes, so
> > > KVM can then invalidate its page table entries or handle memory errors.
> > >
> > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > and right now it is selected on X86_64 only.
> > >
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> > >
> > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > Tested-by: Fuad Tabba <tabba@google.com>
> >
> > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > patch series anymore. Any reason you removed it, or is it just an
> > omission?
>
> We had some discussion in v9 [1] about adding generic memory attributes
> ioctls, and KVM_CAP_PRIVATE_MEM can be implemented as a new
> KVM_MEMORY_ATTRIBUTE_PRIVATE flag via the KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> ioctl [2]. The API doc has been updated:
>
> +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
>
>
> [1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
> [2]
> https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/
I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
and my Reviewed/Tested-by still apply.
Cheers,
/fuad
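
For reference, the probe boils down to something like this on the userspace
side (sketch only; assumes a linux/kvm.h carrying this series' uapi
additions, and vm_fd is the VM file descriptor):

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static bool vm_supports_private_mem(int vm_fd)
    {
            __u64 attrs = 0;

            /* Fails if KVM_CAP_MEMORY_ATTRIBUTES is not available. */
            if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &attrs) < 0)
                    return false;

            return attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
    }
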
>
> Thanks,
> Chao
> >
> > [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> >
> > Thanks,
> > /fuad
> >
> > > ---
> > > Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > > arch/x86/kvm/Kconfig | 2 ++
> > > arch/x86/kvm/x86.c | 2 +-
> > > include/linux/kvm_host.h | 8 ++++--
> > > include/uapi/linux/kvm.h | 28 +++++++++++++++++++
> > > virt/kvm/Kconfig | 3 +++
> > > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
> > > 7 files changed, 114 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index bb2f709c0900..99352170c130 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > > :Capability: KVM_CAP_USER_MEMORY
> > > :Architectures: all
> > > :Type: vm ioctl
> > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > > :Returns: 0 on success, -1 on error
> > >
> > > ::
> > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > };
> > >
> > > + struct kvm_userspace_memory_region_ext {
> > > + struct kvm_userspace_memory_region region;
> > > + __u64 restricted_offset;
> > > + __u32 restricted_fd;
> > > + __u32 pad1;
> > > + __u64 pad2[14];
> > > + };
> > > +
> > > /* for kvm_memory_region::flags */
> > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > #define KVM_MEM_READONLY (1UL << 1)
> > > + #define KVM_MEM_PRIVATE (1UL << 2)
> > >
> > > This ioctl allows the user to create, modify or delete a guest physical
> > > memory slot. Bits 0-15 of "slot" specify the slot id and this value
> > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > > be identical. This allows large pages in the guest to be backed by large
> > > pages in the host.
> > >
> > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> > > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > -to make a new slot read-only. In this case, writes to this memory will be
> > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > +The kvm_userspace_memory_region_ext struct includes all fields of the
> > > +kvm_userspace_memory_region struct, while also adding fields for some
> > > +other features. See the description of the flags field below for more
> > > +information. It's recommended to use kvm_userspace_memory_region_ext in new
> > > +userspace code.
> > > +
> > > +The flags field supports the following flags:
> > > +
> > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > + within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > +
> > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > + read-only. In this case, writes to this memory will be posted to userspace as
> > > + KVM_EXIT_MMIO exits.
> > > +
> > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > > + memory backed by a file descriptor (fd) and that userspace access to the fd
> > > + may be restricted. Userspace should use restricted_fd/restricted_offset in the
> > > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > > + to the guest. Userspace should guarantee not to map the same host physical
> > > + address indicated by restricted_fd/restricted_offset to different guest
> > > + physical addresses within multiple memslots. Failure to do so may result in
> > > + undefined behavior.
> > >
> > > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > > the memory region are automatically reflected into the guest. For example, an
> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > index a8e379a3afee..690cb21010e7 100644
> > > --- a/arch/x86/kvm/Kconfig
> > > +++ b/arch/x86/kvm/Kconfig
> > > @@ -50,6 +50,8 @@ config KVM
> > > select INTERVAL_TREE
> > > select HAVE_KVM_PM_NOTIFIER if PM
> > > select HAVE_KVM_MEMORY_ATTRIBUTES
> > > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > help
> > > Support hosting fully virtualized guest machines using hardware
> > > virtualization extensions. You will need a fairly recent
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 7f850dfb4086..9a07380f8d3c 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > > }
> > >
> > > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > - struct kvm_userspace_memory_region m;
> > > + struct kvm_user_mem_region m;
> > >
> > > m.slot = id | (i << 16);
> > > m.flags = 0;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a784e2b06625..02347e386ea2 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -44,6 +44,7 @@
> > >
> > > #include <asm/kvm_host.h>
> > > #include <linux/kvm_dirty_ring.h>
> > > +#include <linux/restrictedmem.h>
> > >
> > > #ifndef KVM_MAX_VCPU_IDS
> > > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > > u32 flags;
> > > short id;
> > > u16 as_id;
> > > + struct file *restricted_file;
> > > + loff_t restricted_offset;
> > > + struct restrictedmem_notifier notifier;
> > > };
> > >
> > > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > > };
> > >
> > > int kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem);
> > > + const struct kvm_user_mem_region *mem);
> > > int __kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem);
> > > + const struct kvm_user_mem_region *mem);
> > > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > > int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 5d0941acb5bb..13bff963b8b0 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > };
> > >
> > > +struct kvm_userspace_memory_region_ext {
> > > + struct kvm_userspace_memory_region region;
> > > + __u64 restricted_offset;
> > > + __u32 restricted_fd;
> > > + __u32 pad1;
> > > + __u64 pad2[14];
> > > +};
> > > +
> > > +#ifdef __KERNEL__
> > > +/*
> > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > + * all fields from the top-level "extended" region.
> > > + */
> > > +struct kvm_user_mem_region {
> > > + __u32 slot;
> > > + __u32 flags;
> > > + __u64 guest_phys_addr;
> > > + __u64 memory_size;
> > > + __u64 userspace_addr;
> > > + __u64 restricted_offset;
> > > + __u32 restricted_fd;
> > > + __u32 pad1;
> > > + __u64 pad2[14];
> > > +};
> > > +#endif
> > > +
> > > /*
> > > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > > * other bits are reserved for kvm internal use which are defined in
> > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > > */
> > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > #define KVM_MEM_READONLY (1UL << 1)
> > > +#define KVM_MEM_PRIVATE (1UL << 2)
> > >
> > > /* for KVM_IRQ_LINE */
> > > struct kvm_irq_level {
> > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > index effdea5dd4f0..d605545d6dd1 100644
> > > --- a/virt/kvm/Kconfig
> > > +++ b/virt/kvm/Kconfig
> > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > >
> > > config HAVE_KVM_PM_NOTIFIER
> > > bool
> > > +
> > > +config HAVE_KVM_RESTRICTED_MEM
> > > + bool
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > }
> > > }
> > >
> > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > {
> > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > >
> > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > * Must be called holding kvm->slots_lock for write.
> > > */
> > > int __kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem)
> > > + const struct kvm_user_mem_region *mem)
> > > {
> > > struct kvm_memory_slot *old, *new;
> > > struct kvm_memslots *slots;
> > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > >
> > > int kvm_set_memory_region(struct kvm *kvm,
> > > - const struct kvm_userspace_memory_region *mem)
> > > + const struct kvm_user_mem_region *mem)
> > > {
> > > int r;
> > >
> > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > > EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > >
> > > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > - struct kvm_userspace_memory_region *mem)
> > > + struct kvm_user_mem_region *mem)
> > > {
> > > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > > return -EINVAL;
> > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > > return fd;
> > > }
> > >
> > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> > > +do { \
> > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > + offsetof(struct kvm_userspace_memory_region, field)); \
> > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > + sizeof_field(struct kvm_userspace_memory_region, field)); \
> > > +} while (0)
> > > +
> > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> > > +do { \
> > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> > > +} while (0)
> > > +
> > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > +{
> > > + SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > + SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > +}
> > > +
> > > static long kvm_vm_ioctl(struct file *filp,
> > > unsigned int ioctl, unsigned long arg)
> > > {
> > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > > break;
> > > }
> > > case KVM_SET_USER_MEMORY_REGION: {
> > > - struct kvm_userspace_memory_region kvm_userspace_mem;
> > > + struct kvm_user_mem_region mem;
> > > + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > +
> > > + kvm_sanity_check_user_mem_region_alias();
> > >
> > > r = -EFAULT;
> > > - if (copy_from_user(&kvm_userspace_mem, argp,
> > > - sizeof(kvm_userspace_mem)))
> > > + if (copy_from_user(&mem, argp, size))
> > > + goto out;
> > > +
> > > + r = -EINVAL;
> > > + if (mem.flags & KVM_MEM_PRIVATE)
> > > goto out;
> > >
> > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > break;
> > > }
> > > case KVM_GET_DIRTY_LOG: {
> > > --
> > > 2.25.1
> > >
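
For anyone wiring this up in userspace, a sketch of the extended ioctl call
(field names taken from this patch; note the generic kvm_vm_ioctl() path
above still rejects KVM_MEM_PRIVATE with -EINVAL, so this only works once
the flag is actually accepted; vm_fd/gpa/size/shared_hva are placeholders):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int set_private_slot(int vm_fd, __u64 gpa, __u64 size,
                                void *shared_hva, int restricted_fd)
    {
            struct kvm_userspace_memory_region_ext ext = {
                    .region = {
                            .slot            = 0,
                            .flags           = KVM_MEM_PRIVATE,
                            .guest_phys_addr = gpa,
                            .memory_size     = size,
                            .userspace_addr  = (__u64)shared_hva, /* shared part */
                    },
                    .restricted_fd     = restricted_fd, /* from memfd_restricted() */
                    .restricted_offset = 0,
            };

            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
    }
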
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
@ 2022-12-06 13:34 ` Fabiano Rosas
2022-12-07 14:31 ` Chao Peng
2022-12-06 15:07 ` Fuad Tabba
` (5 subsequent siblings)
6 siblings, 1 reply; 153+ messages in thread
From: Fabiano Rosas @ 2022-12-06 13:34 UTC (permalink / raw)
To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Chao Peng,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
Chao Peng <chao.p.peng@linux.intel.com> writes:
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
> - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> a guest memory range.
> - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
> Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> arch/x86/kvm/Kconfig | 1 +
> include/linux/kvm_host.h | 3 ++
> include/uapi/linux/kvm.h | 17 ++++++++
> virt/kvm/Kconfig | 3 ++
> virt/kvm/kvm_main.c | 76 ++++++++++++++++++++++++++++++++++
> 6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> The "pad" and "reserved" fields may be used for future extensions and should be
> set to 0s by userspace.
>
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> + #define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> + #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> + #define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> + struct kvm_memory_attributes {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
This wording could cause some confusion; what about something simpler:
"reflect the range of pages that had its attributes successfully set"
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully.
"address + 1 page" or "subsequent page" perhaps.
In fact, wouldn't this all become simpler if size were number of pages instead?
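
FWIW, the documented semantics would translate into a userspace retry loop
along these lines (illustrative sketch only; vm_fd/gpa/size are placeholders
and the headers are assumed to carry this series' uapi additions):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
    {
            struct kvm_memory_attributes attr = {
                    .address    = gpa,
                    .size       = size,
                    .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
            };

            while (attr.size) {
                    __u64 remaining = attr.size;

                    if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr) < 0)
                            return -1;
                    /* On success, address/size describe the not-yet-set tail. */
                    if (attr.size == remaining)
                            return -1;      /* no forward progress, give up */
            }
            return 0;
    }
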
> The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
...
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> + struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)
> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;
> +
> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
Given the alignment checks above, the '- 1 + PAGE_SIZE' here is effectively a no-op.
Consider using gpa_to_gfn as well.
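
i.e. something like (if I'm reading it right):

    start = gpa_to_gfn(attrs->address);
    end = gpa_to_gfn(attrs->address + attrs->size);
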
> +
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> + mutex_lock(&kvm->lock);
> + for (i = start; i < end; i++)
> + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT)))
> + break;
> + mutex_unlock(&kvm->lock);
> +
> + attrs->address = i << PAGE_SHIFT;
> + attrs->size = (end - i) << PAGE_SHIFT;
> +
> + return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #ifdef CONFIG_HAVE_KVM_MSI
> case KVM_CAP_SIGNAL_MSI:
> #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
> #ifdef CONFIG_HAVE_KVM_IRQFD
> case KVM_CAP_IRQFD:
> case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> break;
> }
> #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> + u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> + r = -EFAULT;
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + goto out;
> + r = 0;
> + break;
> + }
> + case KVM_SET_MEMORY_ATTRIBUTES: {
> + struct kvm_memory_attributes attrs;
> +
> + r = -EFAULT;
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> + if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> + r = -EFAULT;
> + break;
> + }
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> case KVM_CREATE_DEVICE: {
> struct kvm_create_device cd;
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-02 6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
@ 2022-12-06 14:57 ` Fuad Tabba
2022-12-07 13:50 ` Chao Peng
2022-12-13 23:49 ` Huang, Kai
` (3 subsequent siblings)
4 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-06 14:57 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through the new in-kernel interface by a third kernel module.
>
> memfd_restricted() is useful for scenarios where a file descriptor (fd)
> can be used as an interface into mm, but where we want to restrict
> userspace's ability to operate on the fd. Initially it is designed to
> provide protections for KVM encrypted guest memory.
>
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with a special key for a special software domain (e.g. a KVM guest) and is
> not expected to be directly accessed by userspace. More precisely, userspace
> access to such encrypted memory may lead to a host crash, so it should be
> prevented.
>
> memfd_restricted() provides the semantics required for KVM guest encrypted
> memory support: a fd created with memfd_restricted() is going to be
> used as the source of guest memory in a confidential computing environment,
> and KVM can directly interact with core-mm without the need to expose
> the memoy content into KVM userspace.
nit: memory
>
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
>
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When hole-punched, KVM gets notified through the
> invalidate_start/invalidate_end() callbacks and then gets a chance to
> remove any mapped entries of the range in the secondary page tables.
>
> A machine check can happen for memory pages in the restricted memfd;
> instead of routing this directly to userspace, we call the error()
> callback that KVM registered. KVM then gets a chance to handle it
> correctly.
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable; this is required for the current confidential
> usage, but it might be changed in the future.
>
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
>
> The system call is currently wired up for x86 arch.
Reviewed-by: Fuad Tabba <tabba@google.com>
After wiring the system call for arm64 (on qemu/arm64):
Tested-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
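
For reference, a minimal userspace flow for the new syscall would look
roughly like the below (sketch only; the syscall number is the x86-64 one
added by this patch, and size/offset/len are placeholders):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/syscall.h>

    #ifndef __NR_memfd_restricted
    #define __NR_memfd_restricted 451
    #endif

    /* Create a restricted memfd of 'size' bytes and punch a hole over
     * [offset, offset + len); the hole punch is what triggers the
     * invalidate_start/end notifier callbacks. */
    static int restrictedmem_example(size_t size, off_t offset, off_t len)
    {
            int fd = syscall(__NR_memfd_restricted, 0); /* no flags defined yet */

            if (fd < 0)
                    return -1;

            /* Only truncate/fallocate are wired up; read/write/mmap are not. */
            if (ftruncate(fd, size) < 0 ||
                fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          offset, len) < 0) {
                    close(fd);
                    return -1;
            }

            return fd;
    }
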
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> include/linux/restrictedmem.h | 71 ++++++
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/magic.h | 1 +
> kernel/sys_ni.c | 3 +
> mm/Kconfig | 4 +
> mm/Makefile | 1 +
> mm/memory-failure.c | 3 +
> mm/restrictedmem.c | 318 +++++++++++++++++++++++++
> 11 files changed, 408 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/restrictedmem.h
> create mode 100644 mm/restrictedmem.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
> 448 i386 process_mrelease sys_process_mrelease
> 449 i386 futex_waitv sys_futex_waitv
> 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 i386 memfd_restricted sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common memfd_restricted sys_memfd_restricted
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..c2700c5daa43
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> + void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> + pgoff_t start, pgoff_t end);
> + void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> + pgoff_t start, pgoff_t end);
> + void (*error)(struct restrictedmem_notifier *notifier,
> + pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> + struct list_head list;
> + const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> + struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> + struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> + struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> + return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> + struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> + struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> + struct page **pagep, int *order)
> +{
> + return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> + return false;
> +}
> +
> +static inline void restrictedmem_error_page(struct page *page,
> + struct address_space *mapping)
> +{
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> unsigned long home_node,
> unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> #define __NR_set_mempolicy_home_node 450
> __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
> #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
> #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> +#define RESTRICTEDMEM_MAGIC 0x5245534d /* "RESM" */
>
> #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> /* memfd_secret */
> COND_SYSCALL(memfd_secret);
>
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
> /*
> * Architecture specific weak syscall entries.
> */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 57e1d8c5b505..06b0e1d6b8c1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
> config SECRETMEM
> def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config RESTRICTEDMEM
> + bool
> + depends on TMPFS
> +
> config ANON_VMA_NAME
> bool "Anonymous VMA name support"
> depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 8e105e5b3e29..bcbb0edf9ba1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 145bb561ddb3..f91b444e471e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -62,6 +62,7 @@
> #include <linux/page-isolation.h>
> #include <linux/pagewalk.h>
> #include <linux/shmem_fs.h>
> +#include <linux/restrictedmem.h>
> #include "swap.h"
> #include "internal.h"
> #include "ras/ras_event.h"
> @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> goto out;
> }
>
> + restrictedmem_error_page(p, mapping);
> +
> /*
> * The shmem page is kept in page cache instead of truncating
> * so is expected to have an extra refcount after error-handling.
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..56953c204e5c
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,318 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> + struct mutex lock;
> + struct file *memfd;
> + struct list_head notifiers;
> +};
> +
> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> + pgoff_t start, pgoff_t end)
> +{
> + struct restrictedmem_notifier *notifier;
> +
> + mutex_lock(&data->lock);
> + list_for_each_entry(notifier, &data->notifiers, list) {
> + notifier->ops->invalidate_start(notifier, start, end);
> + }
> + mutex_unlock(&data->lock);
> +}
> +
> +static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
> + pgoff_t start, pgoff_t end)
> +{
> + struct restrictedmem_notifier *notifier;
> +
> + mutex_lock(&data->lock);
> + list_for_each_entry(notifier, &data->notifiers, list) {
> + notifier->ops->invalidate_end(notifier, start, end);
> + }
> + mutex_unlock(&data->lock);
> +}
> +
> +static void restrictedmem_notifier_error(struct restrictedmem_data *data,
> + pgoff_t start, pgoff_t end)
> +{
> + struct restrictedmem_notifier *notifier;
> +
> + mutex_lock(&data->lock);
> + list_for_each_entry(notifier, &data->notifiers, list) {
> + notifier->ops->error(notifier, start, end);
> + }
> + mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> + struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> + fput(data->memfd);
> + kfree(data);
> + return 0;
> +}
> +
> +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> + loff_t offset, loff_t len)
> +{
> + int ret;
> + pgoff_t start, end;
> + struct file *memfd = data->memfd;
> +
> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> + return -EINVAL;
> +
> + start = offset >> PAGE_SHIFT;
> + end = (offset + len) >> PAGE_SHIFT;
> +
> + restrictedmem_invalidate_start(data, start, end);
> + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> + restrictedmem_invalidate_end(data, start, end);
> +
> + return ret;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> + loff_t offset, loff_t len)
> +{
> + struct restrictedmem_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> +
> + if (mode & FALLOC_FL_PUNCH_HOLE)
> + return restrictedmem_punch_hole(data, mode, offset, len);
> +
> + return memfd->f_op->fallocate(memfd, mode, offset, len);
> +}
> +
> +static const struct file_operations restrictedmem_fops = {
> + .release = restrictedmem_release,
> + .fallocate = restrictedmem_fallocate,
> +};
> +
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> + const struct path *path, struct kstat *stat,
> + u32 request_mask, unsigned int query_flags)
> +{
> + struct inode *inode = d_inode(path->dentry);
> + struct restrictedmem_data *data = inode->i_mapping->private_data;
> + struct file *memfd = data->memfd;
> +
> + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> + request_mask, query_flags);
> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> + struct dentry *dentry, struct iattr *attr)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct restrictedmem_data *data = inode->i_mapping->private_data;
> + struct file *memfd = data->memfd;
> + int ret;
> +
> + if (attr->ia_valid & ATTR_SIZE) {
> + if (memfd->f_inode->i_size)
> + return -EPERM;
> +
> + if (!PAGE_ALIGNED(attr->ia_size))
> + return -EINVAL;
> + }
> +
> + ret = memfd->f_inode->i_op->setattr(mnt_userns,
> + file_dentry(memfd), attr);
> + return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> + .getattr = restrictedmem_getattr,
> + .setattr = restrictedmem_setattr,
> +};
> +
> +static int restrictedmem_init_fs_context(struct fs_context *fc)
> +{
> + if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> + return -ENOMEM;
> +
> + fc->s_iflags |= SB_I_NOEXEC;
> + return 0;
> +}
> +
> +static struct file_system_type restrictedmem_fs = {
> + .owner = THIS_MODULE,
> + .name = "memfd:restrictedmem",
> + .init_fs_context = restrictedmem_init_fs_context,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static struct vfsmount *restrictedmem_mnt;
> +
> +static __init int restrictedmem_init(void)
> +{
> + restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> + if (IS_ERR(restrictedmem_mnt))
> + return PTR_ERR(restrictedmem_mnt);
> + return 0;
> +}
> +fs_initcall(restrictedmem_init);
> +
> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> + struct restrictedmem_data *data;
> + struct address_space *mapping;
> + struct inode *inode;
> + struct file *file;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL);
> + if (!data)
> + return ERR_PTR(-ENOMEM);
> +
> + data->memfd = memfd;
> + mutex_init(&data->lock);
> + INIT_LIST_HEAD(&data->notifiers);
> +
> + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> + if (IS_ERR(inode)) {
> + kfree(data);
> + return ERR_CAST(inode);
> + }
> +
> + inode->i_mode |= S_IFREG;
> + inode->i_op = &restrictedmem_iops;
> + inode->i_mapping->private_data = data;
> +
> + file = alloc_file_pseudo(inode, restrictedmem_mnt,
> + "restrictedmem", O_RDWR,
> + &restrictedmem_fops);
> + if (IS_ERR(file)) {
> + iput(inode);
> + kfree(data);
> + return ERR_CAST(file);
> + }
> +
> + file->f_flags |= O_LARGEFILE;
> +
> + /*
> + * These pages are currently unmovable so don't place them into movable
> + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> + */
> + mapping = memfd->f_mapping;
> + mapping_set_unevictable(mapping);
> + mapping_set_gfp_mask(mapping,
> + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> + return file;
> +}
> +
> +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +{
> + struct file *file, *restricted_file;
> + int fd, err;
> +
> + if (flags)
> + return -EINVAL;
> +
> + fd = get_unused_fd_flags(0);
> + if (fd < 0)
> + return fd;
> +
> + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> + if (IS_ERR(file)) {
> + err = PTR_ERR(file);
> + goto err_fd;
> + }
> + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> + file->f_flags |= O_LARGEFILE;
> +
> + restricted_file = restrictedmem_file_create(file);
> + if (IS_ERR(restricted_file)) {
> + err = PTR_ERR(restricted_file);
> + fput(file);
> + goto err_fd;
> + }
> +
> + fd_install(fd, restricted_file);
> + return fd;
> +err_fd:
> + put_unused_fd(fd);
> + return err;
> +}
> +
> +void restrictedmem_register_notifier(struct file *file,
> + struct restrictedmem_notifier *notifier)
> +{
> + struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_add(&notifier->list, &data->notifiers);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> +
> +void restrictedmem_unregister_notifier(struct file *file,
> + struct restrictedmem_notifier *notifier)
> +{
> + struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> + mutex_lock(&data->lock);
> + list_del(&notifier->list);
> + mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> + struct page **pagep, int *order)
> +{
> + struct restrictedmem_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + struct folio *folio;
> + struct page *page;
> + int ret;
> +
> + ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
> + if (ret)
> + return ret;
> +
> + page = folio_file_page(folio, offset);
> + *pagep = page;
> + if (order)
> + *order = thp_order(compound_head(page));
> +
> + SetPageUptodate(page);
> + unlock_page(page);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> +
> +void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> +{
> + struct super_block *sb = restrictedmem_mnt->mnt_sb;
> + struct inode *inode, *next;
> +
> + if (!shmem_mapping(mapping))
> + return;
> +
> + spin_lock(&sb->s_inode_list_lock);
> + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> + struct restrictedmem_data *data = inode->i_mapping->private_data;
> + struct file *memfd = data->memfd;
> +
> + if (memfd->f_mapping == mapping) {
> + pgoff_t start, end;
> +
> + spin_unlock(&sb->s_inode_list_lock);
> +
> + start = page->index;
> + end = start + thp_nr_pages(page);
> + restrictedmem_notifier_error(data, start, end);
> + return;
> + }
> + }
> + spin_unlock(&sb->s_inode_list_lock);
> +}
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
2022-12-06 13:34 ` Fabiano Rosas
@ 2022-12-06 15:07 ` Fuad Tabba
2022-12-07 14:51 ` Chao Peng
2022-12-16 15:09 ` Borislav Petkov
` (4 subsequent siblings)
6 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-06 15:07 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
> - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> a guest memory range.
> - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
> Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> arch/x86/kvm/Kconfig | 1 +
> include/linux/kvm_host.h | 3 ++
> include/uapi/linux/kvm.h | 17 ++++++++
> virt/kvm/Kconfig | 3 ++
> virt/kvm/kvm_main.c | 76 ++++++++++++++++++++++++++++++++++
> 6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> The "pad" and "reserved" fields may be used for future extensions and should be
> set to 0s by userspace.
>
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> + #define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> + #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> + #define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> + struct kvm_memory_attributes {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
> 5. The kvm_run structure
> ========================
>
> @@ -8270,6 +8323,16 @@ structure.
> When getting the Modified Change Topology Report value, the attr->addr
> must point to a byte where the value will be stored or retrieved from.
>
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
> 9. Known KVM API problems
> =========================
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
> select SRCU
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> + select HAVE_KVM_MEMORY_ATTRIBUTES
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + struct xarray mem_attr_array;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_S390_CPU_TOPOLOGY 222
> #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
> /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES _IOR(KVMIO, 0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES _IOWR(KVMIO, 0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
nit: how about using the BIT() macro for these?
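Something like this, I mean (assuming _BITULL() from <linux/const.h> is
acceptable to pull into the uapi header):

	#define KVM_MEMORY_ATTRIBUTE_READ	_BITULL(0)
	#define KVM_MEMORY_ATTRIBUTE_WRITE	_BITULL(1)
	#define KVM_MEMORY_ATTRIBUTE_EXECUTE	_BITULL(2)
	#define KVM_MEMORY_ATTRIBUTE_PRIVATE	_BITULL(3)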
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
> config HAVE_KVM_DIRTY_RING
> bool
>
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> + bool
> +
> # Only strongly ordered architectures can select this, as it doesn't
> # put any explicit constraint on userspace ordering. They can also
> # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_init(&kvm->mem_attr_array);
> +#endif
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_destroy(&kvm->mem_attr_array);
> +#endif
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> + return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> + struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
nit: "is reserved"? I think it makes it a bit clearer what its purpose is.
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)
> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;
> +
> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
Would using existing helpers be better for getting the frame numbers?
Also, the code checks that the address and size are page aligned, so
the end rounding up seems redundant, and might even be wrong if the
address+size-1 is close to the gfn_t limit (which this code tries to
avoid in an earlier check).
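Something along these lines, maybe (just a sketch; gpa_to_gfn() already
exists in kvm_host.h, and the PAGE_ALIGNED checks above guarantee that size
is a whole number of pages, so no rounding is needed):

	start = gpa_to_gfn(attrs->address);
	end = gpa_to_gfn(attrs->address + attrs->size);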
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> + mutex_lock(&kvm->lock);
> + for (i = start; i < end; i++)
> + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT)))
> + break;
> + mutex_unlock(&kvm->lock);
> +
> + attrs->address = i << PAGE_SHIFT;
> + attrs->size = (end - i) << PAGE_SHIFT;
nit: helpers for these too?
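(e.g., assuming gfn_to_gpa() is fine to use for the size as well:

	attrs->address = gfn_to_gpa(i);
	attrs->size = gfn_to_gpa(end - i);
)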
With the end calculation fixed,
Reviewed-by: Fuad Tabba <tabba@google.com>
After adding the necessary configs for arm64 (on qemu/arm64):
Tested-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> +
> + return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #ifdef CONFIG_HAVE_KVM_MSI
> case KVM_CAP_SIGNAL_MSI:
> #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
> #ifdef CONFIG_HAVE_KVM_IRQFD
> case KVM_CAP_IRQFD:
> case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> break;
> }
> #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> + u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> + r = -EFAULT;
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + goto out;
> + r = 0;
> + break;
> + }
> + case KVM_SET_MEMORY_ATTRIBUTES: {
> + struct kvm_memory_attributes attrs;
> +
> + r = -EFAULT;
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> + if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> + r = -EFAULT;
> + break;
> + }
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> case KVM_CREATE_DEVICE: {
> struct kvm_create_device cd;
>
> --
> 2.25.1
>
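As an aside, for other userspace folks following along: my reading of the
documentation above gives the flow below, i.e. query the supported attributes,
then set PRIVATE on a range and retry on the partial-success case. This is
only a sketch against the uapi added in this patch (PRIVATE will only be
advertised once an architecture actually reports it), and a real VMM would
bound the number of retries:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>		/* with this series applied */

	static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
	{
		__u64 supported = 0;
		struct kvm_memory_attributes attrs = {
			.address = gpa,
			.size = size,
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		};

		if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &supported) ||
		    !(supported & KVM_MEMORY_ATTRIBUTE_PRIVATE))
			return -1;

		/* on return, address/size describe what has not been set yet */
		while (attrs.size) {
			if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
				return -1;
		}

		return 0;
	}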
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
2022-12-02 6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
@ 2022-12-06 15:47 ` Fuad Tabba
2022-12-07 15:11 ` Chao Peng
2023-01-13 23:13 ` Sean Christopherson
1 sibling, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-06 15:47 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates that an error happened in KVM at guest memory range
> [gpa, gpa+size). The 'flags' field carries additional information for
> userspace to handle the error. Currently bit 0 is defined as 'private
> memory': '1' indicates the error happened on a private memory access and
> '0' indicates it happened on a shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, typically there are two kinds of memory
> conversions:
> - explicit conversion: happens when guest explicitly calls into KVM
> to map a range (as private or shared), KVM then exits to userspace
> to perform the map/unmap operations.
> - implicit conversion: happens in KVM page fault handler where KVM
> exits to userspace for an implicit conversion when the page is in a
> different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> ---
> Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
> include/uapi/linux/kvm.h | 8 ++++++++
> 2 files changed, 30 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 99352170c130..d9edb14ce30b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
> values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> + __u64 flags;
I see you've removed the padding and increased the flag size.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> + __u64 gpa;
> + __u64 size;
> + } memory;
> +
> +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> +encountered a memory error which is not handled by KVM kernel module and
> +userspace may choose to handle it. The 'flags' field indicates the memory
> +properties of the exit.
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> + private memory access when the bit is set. Otherwise the memory error is
> + caused by shared memory access when the bit is clear.
> +
> +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> +may handle the error and return to KVM to retry the previous memory access.
> +
> ::
>
> /* KVM_EXIT_NOTIFY */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13bff963b8b0..c7e9d375a902 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_RISCV_SBI 35
> #define KVM_EXIT_RISCV_CSR 36
> #define KVM_EXIT_NOTIFY 37
> +#define KVM_EXIT_MEMORY_FAULT 38
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -541,6 +542,13 @@ struct kvm_run {
> #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> __u32 flags;
> } notify;
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> + __u64 flags;
> + __u64 gpa;
> + __u64 size;
> + } memory;
> /* Fix the size of the union. */
> char padding[256];
> };
> --
> 2.25.1
>
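For completeness, this is roughly how I expect a VMM to consume the exit in
its run loop. Sketch only: convert_range() is a made-up placeholder for
whatever the VMM does to flip the range between shared and private backing
(punch hole / allocate on the restricted memfd and update the KVM memory
attributes):

	#include <stdbool.h>
	#include <linux/kvm.h>		/* with this series applied */

	/* hypothetical VMM helper, not part of this series */
	int convert_range(__u64 gpa, __u64 size, bool to_private);

	static int handle_memory_fault(struct kvm_run *run)
	{
		bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

		if (convert_range(run->memory.gpa, run->memory.size, to_private))
			return -1;

		return 0;	/* re-enter the guest so it retries the access */
	}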
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
2022-12-06 11:56 ` Chao Peng
@ 2022-12-06 15:48 ` Fuad Tabba
2022-12-09 6:24 ` Chao Peng
2022-12-07 6:34 ` Isaku Yamahata
1 sibling, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-06 15:48 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Tue, Dec 6, 2022 at 12:01 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Mon, Dec 05, 2022 at 09:23:49AM +0000, Fuad Tabba wrote:
> > Hi Chao,
> >
> > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > Currently in mmu_notifier invalidate path, hva range is recorded and
> > > then checked against by mmu_notifier_retry_hva() in the page fault
> > > handling path. However, for the to be introduced private memory, a page
> > > fault may not have a hva associated, checking gfn(gpa) makes more sense.
> > >
> > > For existing hva based shared memory, gfn is expected to also work. The
> > > only downside is when aliasing multiple gfns to a single hva, the
> > > current algorithm of checking multiple ranges could result in a much
> > > larger range being rejected. Such aliasing should be uncommon, so the
> > > impact is expected small.
> > >
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 8 +++++---
> > > include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
> > > virt/kvm/kvm_main.c | 32 +++++++++++++++++++++++---------
> > > 3 files changed, 49 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 4736d7849c60..e2c70b5afa3e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> > > return true;
> > >
> > > return fault->slot &&
> > > - mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > > + mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> > > }
> > >
> > > static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > >
> > > write_lock(&kvm->mmu_lock);
> > >
> > > - kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> > > + kvm_mmu_invalidate_begin(kvm);
> > > +
> > > + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> > >
> > > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> > >
> > > @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > > kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> > > gfn_end - gfn_start);
> > >
> > > - kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> > > + kvm_mmu_invalidate_end(kvm);
> > >
> > > write_unlock(&kvm->mmu_lock);
> > > }
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 02347e386ea2..3d69484d2704 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -787,8 +787,8 @@ struct kvm {
> > > struct mmu_notifier mmu_notifier;
> > > unsigned long mmu_invalidate_seq;
> > > long mmu_invalidate_in_progress;
> > > - unsigned long mmu_invalidate_range_start;
> > > - unsigned long mmu_invalidate_range_end;
> > > + gfn_t mmu_invalidate_range_start;
> > > + gfn_t mmu_invalidate_range_end;
> > > #endif
> > > struct list_head devices;
> > > u64 manual_dirty_log_protect;
> > > @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > #endif
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > - unsigned long end);
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > > - unsigned long end);
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm);
> > >
> > > long kvm_arch_dev_ioctl(struct file *filp,
> > > unsigned int ioctl, unsigned long arg);
> > > @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> > > return 0;
> > > }
> > >
> > > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> > > unsigned long mmu_seq,
> > > - unsigned long hva)
> > > + gfn_t gfn)
> > > {
> > > lockdep_assert_held(&kvm->mmu_lock);
> > > /*
> > > @@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > > * that might be being invalidated. Note that it may include some false
> >
> > nit: "might be" (or) "is being"
> >
> > > * positives, due to shortcuts when handing concurrent invalidations.
> >
> > nit: handling
>
> Both are in existing code, but I can fix them either way.
That was just a nit, please feel free to ignore it, especially if it
might cause headaches in the future with merges.
>
> >
> > > */
> > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > - hva >= kvm->mmu_invalidate_range_start &&
> > > - hva < kvm->mmu_invalidate_range_end)
> > > - return 1;
> > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > + /*
> > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > + * but before updating the range is a KVM bug.
> > > + */
> > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> >
> > INVALID_GPA is an x86-specific define in
> > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > architectures. The obvious fix is to move it to
> > include/linux/kvm_host.h.
>
> Hmm, INVALID_GPA is defined as ZERO for x86, not 100% confident this is
> correct choice for other architectures, but after search it has not been
> used for other architectures, so should be safe to make it common.
With this fixed,
Reviewed-by: Fuad Tabba <tabba@google.com>
And the necessary work to port to arm64 (on qemu/arm64):
Tested-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
>
> Thanks,
> Chao
> >
> > Cheers,
> > /fuad
> >
> > > + return 1;
> > > +
> > > + if (gfn >= kvm->mmu_invalidate_range_start &&
> > > + gfn < kvm->mmu_invalidate_range_end)
> > > + return 1;
> > > + }
> > > +
> > > if (kvm->mmu_invalidate_seq != mmu_seq)
> > > return 1;
> > > return 0;
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index b882eb2c76a2..ad55dfbc75d7 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> > >
> > > typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> > >
> > > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > > - unsigned long end);
> > > -
> > > +typedef void (*on_lock_fn_t)(struct kvm *kvm);
> > > typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> > >
> > > struct kvm_hva_range {
> > > @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > > locked = true;
> > > KVM_MMU_LOCK(kvm);
> > > if (!IS_KVM_NULL_FN(range->on_lock))
> > > - range->on_lock(kvm, range->start, range->end);
> > > + range->on_lock(kvm);
> > > +
> > > if (IS_KVM_NULL_FN(range->handler))
> > > break;
> > > }
> > > @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > - unsigned long end)
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > {
> > > /*
> > > * The count increase must become visible at unlock time as no
> > > @@ -724,6 +722,17 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > * count is also read inside the mmu_lock critical section.
> > > */
> > > kvm->mmu_invalidate_in_progress++;
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > kvm->mmu_invalidate_range_start = start;
> > > kvm->mmu_invalidate_range_end = end;
> > > @@ -744,6 +753,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > > }
> > > }
> > >
> > > +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > + kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > + return kvm_unmap_gfn_range(kvm, range);
> > > +}
> > > +
> > > static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -752,7 +767,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > .start = range->start,
> > > .end = range->end,
> > > .pte = __pte(0),
> > > - .handler = kvm_unmap_gfn_range,
> > > + .handler = kvm_mmu_unmap_gfn_range,
> > > .on_lock = kvm_mmu_invalidate_begin,
> > > .on_unlock = kvm_arch_guest_memory_reclaimed,
> > > .flush_on_ret = true,
> > > @@ -791,8 +806,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > return 0;
> > > }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > > - unsigned long end)
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > {
> > > /*
> > > * This sequence increase will notify the kvm page fault that
> > > --
> > > 2.25.1
> > >
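One more note for anyone new to this code, on how the consumer side pairs
with these helpers. This is only a sketch of the pattern (resolve_pfn() is a
placeholder for gup or, later in the series, restrictedmem_get_page()), not
the literal x86 fault path:

	static int example_map_fault(struct kvm *kvm, struct kvm_page_fault *fault)
	{
		unsigned long mmu_seq = kvm->mmu_invalidate_seq;
		kvm_pfn_t pfn;

		smp_rmb();
		pfn = resolve_pfn(fault);	/* may sleep, runs outside mmu_lock */

		write_lock(&kvm->mmu_lock);
		if (mmu_invalidate_retry_gfn(kvm, mmu_seq, fault->gfn)) {
			write_unlock(&kvm->mmu_lock);
			return RET_PF_RETRY;	/* an invalidation raced with us */
		}
		/* ... install the mapping ... */
		write_unlock(&kvm->mmu_lock);
		return RET_PF_FIXED;
	}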
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
2022-12-06 11:56 ` Chao Peng
2022-12-06 15:48 ` Fuad Tabba
@ 2022-12-07 6:34 ` Isaku Yamahata
2022-12-07 15:14 ` Chao Peng
1 sibling, 1 reply; 153+ messages in thread
From: Isaku Yamahata @ 2022-12-07 6:34 UTC (permalink / raw)
To: Chao Peng
Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang,
isaku.yamahata
On Tue, Dec 06, 2022 at 07:56:23PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > - hva >= kvm->mmu_invalidate_range_start &&
> > > - hva < kvm->mmu_invalidate_range_end)
> > > - return 1;
> > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > + /*
> > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > + * but before updating the range is a KVM bug.
> > > + */
> > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> >
> > INVALID_GPA is an x86-specific define in
> > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > architectures. The obvious fix is to move it to
> > include/linux/kvm_host.h.
>
> Hmm, INVALID_GPA is defined as ZERO for x86, not 100% confident this is
> correct choice for other architectures, but after search it has not been
> used for other architectures, so should be safe to make it common.
INVALID_GPA is defined as all bits set to 1. Please note the "~" (tilde).
#define INVALID_GPA (~(gpa_t)0)
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
2022-12-06 12:02 ` Chao Peng
@ 2022-12-07 6:42 ` Isaku Yamahata
2022-12-08 11:17 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Isaku Yamahata @ 2022-12-07 6:42 UTC (permalink / raw)
To: Chao Peng
Cc: Isaku Yamahata, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Dec 06, 2022 at 08:02:24PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:
> On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> > On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > > A large page with mixed private/shared subpages can't be mapped as large
> > > page since its sub private/shared pages are from different memory
> > > backends and may also treated by architecture differently. When
> > > private/shared memory are mixed in a large page, the current lpage_info
> > > is not sufficient to decide whether the page can be mapped as large page
> > > or not and additional private/shared mixed information is needed.
> > >
> > > Tracking this 'mixed' information with the current 'count' like
> > > disallow_lpage is a bit challenge so reserve a bit in 'disallow_lpage'
> > > to indicate a large page has mixed private/share subpages and update
> > > this 'mixed' bit whenever the memory attribute is changed between
> > > private and shared.
> > >
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 8 ++
> > > arch/x86/kvm/mmu/mmu.c | 134 +++++++++++++++++++++++++++++++-
> > > arch/x86/kvm/x86.c | 2 +
> > > include/linux/kvm_host.h | 19 +++++
> > > virt/kvm/kvm_main.c | 9 ++-
> > > 5 files changed, 169 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 283cbb83d6ae..7772ab37ac89 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -38,6 +38,7 @@
> > > #include <asm/hyperv-tlfs.h>
> > >
> > > #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> > >
> > > #define KVM_MAX_VCPUS 1024
> > >
> > > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> > > #endif
> > > };
> > >
> > > +/*
> > > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> > > + * level. The remaining bits are used as a reference count.
> > > + */
> > > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
> > > +#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
> > > +
> > > struct kvm_lpage_info {
> > > int disallow_lpage;
> > > };
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e2c70b5afa3e..2190fd8c95c0 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > > {
> > > struct kvm_lpage_info *linfo;
> > > int i;
> > > + int disallow_count;
> > >
> > > for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > > linfo = lpage_info_slot(gfn, slot, i);
> > > +
> > > + disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > > + WARN_ON(disallow_count + count < 0 ||
> > > + disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > > +
> > > linfo->disallow_lpage += count;
> > > - WARN_ON(linfo->disallow_lpage < 0);
> > > }
> > > }
> > >
> > > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > > if (kvm->arch.nx_huge_page_recovery_thread)
> > > kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> > > }
> > > +
> > > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > > +{
> > > + return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > +}
> > > +
> > > +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> > > + int level, bool mixed)
> > > +{
> > > + struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> > > +
> > > + if (mixed)
> > > + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > + else
> > > + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > +}
> > > +
> > > +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> > > +{
> > > + bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > > +
> > > + if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> > > + if (!expect_private)
> > > + return false;
> > > + } else if (expect_private)
> > > + return false;
> > > +
> > > + return true;
> > > +}
> > > +
> > > +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> > > + gfn_t start, gfn_t end)
> > > +{
> > > + XA_STATE(xas, &kvm->mem_attr_array, start);
> > > + gfn_t gfn = start;
> > > + void *entry;
> > > + bool mixed = false;
> > > +
> > > + rcu_read_lock();
> > > + entry = xas_load(&xas);
> > > + while (gfn < end) {
> > > + if (xas_retry(&xas, entry))
> > > + continue;
> > > +
> > > + KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > > +
> > > + if (!is_expected_attr_entry(entry, attrs)) {
> > > + mixed = true;
> > > + break;
> > > + }
> > > +
> > > + entry = xas_next(&xas);
> > > + gfn++;
> > > + }
> > > +
> > > + rcu_read_unlock();
> > > + return mixed;
> > > +}
> > > +
> > > +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > + int level, unsigned long attrs,
> > > + gfn_t start, gfn_t end)
> > > +{
> > > + unsigned long gfn;
> > > +
> > > + if (level == PG_LEVEL_2M)
> > > + return mem_attrs_mixed_2m(kvm, attrs, start, end);
> > > +
> > > + for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> > > + if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> > > + !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> > > + attrs))
> > > + return true;
> > > + return false;
> > > +}
> > > +
> > > +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> > > + struct kvm_memory_slot *slot,
> > > + unsigned long attrs,
> > > + gfn_t start, gfn_t end)
> > > +{
> > > + unsigned long pages, mask;
> > > + gfn_t gfn, gfn_end, first, last;
> > > + int level;
> > > + bool mixed;
> > > +
> > > + /*
> > > + * The sequence matters here: we set the higher level basing on the
> > > + * lower level's scanning result.
> > > + */
> > > + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > > + pages = KVM_PAGES_PER_HPAGE(level);
> > > + mask = ~(pages - 1);
> > > + first = start & mask;
> > > + last = (end - 1) & mask;
> > > +
> > > + /*
> > > + * We only need to scan the head and tail page, for middle pages
> > > + * we know they will not be mixed.
> > > + */
> > > + gfn = max(first, slot->base_gfn);
> > > + gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> > > + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > > + linfo_set_mixed(gfn, slot, level, mixed);
> > > +
> > > + if (first == last)
> > > + return;
> >
> >
> > continue.
>
> Ya!
>
> >
> > > +
> > > + for (gfn = first + pages; gfn < last; gfn += pages)
> > > + linfo_set_mixed(gfn, slot, level, false);
> > > +
> > > + gfn = last;
> > > + gfn_end = min(last + pages, slot->base_gfn + slot->npages);
> >
> > if (gfn == gfn_end) continue.
>
> Do you see a case where gfn can equal gfn_end? Though it does not
> hurt to add a check.
If last == base_gfn + npages, gfn == gfn_end can occur.
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 9a07380f8d3c..5aefcff614d2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> > > if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> > > linfo[lpages - 1].disallow_lpage = 1;
> > > ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > > + if (kvm_slot_can_be_private(slot))
> > > + ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> >
> > Is there any alignment restriction? If not, it should be +=.
> > In practice, alignment will hold though.
>
> All we need here is to check whether both userspace_addr and
> restricted_offset are aligned to HPAGE_SIZE or not. '+=' can actually
> yield a wrong value in cases where userspace_addr + restricted_offset is
> aligned to HPAGE_SIZE even though neither value is aligned on its own.
Ah, got it. The below comment explains it.
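(Concretely, taking made-up numbers: userspace_addr = 0x1ff000 and
restricted_offset = 0x1000 sum to 0x200000, which is 2MB aligned even though
neither value is; '+=' would therefore hide the misalignment, while '|='
keeps the offending low bits set so the alignment check below still catches
it.)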
> Thanks,
> Chao
> >
> > Thanks,
> >
> > > /*
> > > * If the gfn and userspace address are not aligned wrt each
> > > * other, disable large page support for this slot.
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-02 6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
@ 2022-12-07 8:13 ` Yuan Yao
2022-12-08 11:20 ` Chao Peng
2022-12-07 17:16 ` Fuad Tabba
` (2 subsequent siblings)
3 siblings, 1 reply; 153+ messages in thread
From: Yuan Yao @ 2022-12-07 8:13 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> Unmap the existing guest mappings when memory attribute is changed
> between shared and private. This is needed because shared pages and
> private pages are from different backends, unmapping existing ones
> gives a chance for page fault handler to re-populate the mappings
> according to the new attribute.
>
> Only architectures that have private memory support need this, and such an
> architecture is expected to override the weak
> kvm_arch_has_private_mem().
>
> Also, during the memory attribute change and the unmapping time frame, a
> page fault may happen in the same memory range and could leave the page in
> an incorrect state, so invoke the kvm_mmu_invalidate_* helpers to make the
> page fault handler retry during this time frame.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
> include/linux/kvm_host.h | 7 +-
> virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> 2 files changed, 116 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d69484d2704..3331c0c92838 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> bool may_block;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -785,11 +786,12 @@ struct kvm {
>
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> struct mmu_notifier mmu_notifier;
> +#endif
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> gfn_t mmu_invalidate_range_start;
> gfn_t mmu_invalidate_range_end;
> -#endif
> +
> struct list_head devices;
> u64 manual_dirty_log_protect;
> struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad55dfbc75d7..4e1e1e113bf0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> }
> EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> +{
> + /*
> + * The count increase must become visible at unlock time as no
> + * spte can be established without taking the mmu_lock and
> + * count is also read inside the mmu_lock critical section.
> + */
> + kvm->mmu_invalidate_in_progress++;
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = INVALID_GPA;
> + kvm->mmu_invalidate_range_end = INVALID_GPA;
> + }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = start;
> + kvm->mmu_invalidate_range_end = end;
> + } else {
> + /*
> + * Fully tracking multiple concurrent ranges has diminishing
> + * returns. Keep things simple and just find the minimal range
> + * which includes the current and new ranges. As there won't be
> + * enough information to subtract a range after its invalidate
> + * completes, any ranges invalidated concurrently will
> + * accumulate and persist until all outstanding invalidates
> + * complete.
> + */
> + kvm->mmu_invalidate_range_start =
> + min(kvm->mmu_invalidate_range_start, start);
> + kvm->mmu_invalidate_range_end =
> + max(kvm->mmu_invalidate_range_end, end);
> + }
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> +{
> + /*
> + * This sequence increase will notify the kvm page fault that
> + * the page that is going to be mapped in the spte could have
> + * been freed.
> + */
> + kvm->mmu_invalidate_seq++;
> + smp_wmb();
> + /*
> + * The above sequence increase must be visible before the
> + * below count decrease, which is ensured by the smp_wmb above
> + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> + */
> + kvm->mmu_invalidate_in_progress--;
> +}
> +
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> {
> @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> -{
> - /*
> - * The count increase must become visible at unlock time as no
> - * spte can be established without taking the mmu_lock and
> - * count is also read inside the mmu_lock critical section.
> - */
> - kvm->mmu_invalidate_in_progress++;
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = INVALID_GPA;
> - kvm->mmu_invalidate_range_end = INVALID_GPA;
> - }
> -}
> -
> -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = start;
> - kvm->mmu_invalidate_range_end = end;
> - } else {
> - /*
> - * Fully tracking multiple concurrent ranges has diminishing
> - * returns. Keep things simple and just find the minimal range
> - * which includes the current and new ranges. As there won't be
> - * enough information to subtract a range after its invalidate
> - * completes, any ranges invalidated concurrently will
> - * accumulate and persist until all outstanding invalidates
> - * complete.
> - */
> - kvm->mmu_invalidate_range_start =
> - min(kvm->mmu_invalidate_range_start, start);
> - kvm->mmu_invalidate_range_end =
> - max(kvm->mmu_invalidate_range_end, end);
> - }
> -}
> -
> static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> return 0;
> }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm)
> -{
> - /*
> - * This sequence increase will notify the kvm page fault that
> - * the page that is going to be mapped in the spte could have
> - * been freed.
> - */
> - kvm->mmu_invalidate_seq++;
> - smp_wmb();
> - /*
> - * The above sequence increase must be visible before the
> - * below count decrease, which is ensured by the smp_wmb above
> - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> - */
> - kvm->mmu_invalidate_in_progress--;
> -}
> -
> static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> return 0;
> }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> + return false;
> +}
> +
> static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> {
> struct kvm *kvm = kvm_arch_alloc_vm();
> @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;
> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> struct kvm_memory_attributes *attrs)
> {
> gfn_t start, end;
> unsigned long i;
> void *entry;
> + int idx;
> u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */
> if (attrs->flags)
> return -EINVAL;
> if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, start, end);
Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but
the invalidation should also be necessary for changes to the other attributes:
KVM_MEMORY_ATTRIBUTE_READ
KVM_MEMORY_ATTRIBUTE_WRITE
KVM_MEMORY_ATTRIBUTE_EXECUTE
> + KVM_MMU_UNLOCK(kvm);
> + }
> +
> mutex_lock(&kvm->lock);
> for (i = start; i < end; i++)
> if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> break;
> mutex_unlock(&kvm->lock);
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> + if (i > start)
> + kvm_unmap_mem_range(kvm, start, i);
> + kvm_mmu_invalidate_end(kvm);
Ditto.
> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> + }
> +
> attrs->address = i << PAGE_SHIFT;
> attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-06 14:57 ` Fuad Tabba
@ 2022-12-07 13:50 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-07 13:50 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Tue, Dec 06, 2022 at 02:57:04PM +0000, Fuad Tabba wrote:
> Hi,
>
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through the new in-kernel interface by a third kernel module.
> >
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> >
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> >
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memoy content into KVM userspace.
>
> nit: memory
Ya!
>
> >
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> >
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When hole-punched, KVM can get notified through
> > invalidate_start/invalidate_end() callbacks, KVM then gets chance to
> > remove any mapped entries of the range in the secondary page tables.
> >
> > Machine check can happen for memory pages in the restricted memfd,
> > instead of routing this directly to userspace, we call the error()
> > callback that KVM registered. KVM then gets chance to handle it
> > correctly.
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> >
> > The system call is currently wired up for x86 arch.
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> After wiring the system call for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@google.com>
Thanks.
Chao
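For anyone who wants to poke at this from userspace without the dedicated
selftests, here is a minimal creation sketch. The syscall number and the zero
flags value come from this patch; the rest is ordinary memfd-style usage:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_memfd_restricted
	#define __NR_memfd_restricted 451	/* x86 number from this patch */
	#endif

	int main(void)
	{
		/* no flags are defined yet, so pass 0 */
		int fd = syscall(__NR_memfd_restricted, 0);

		if (fd < 0)
			return 1;

		/* back it with memory; read()/write()/mmap() on the fd are rejected */
		if (fallocate(fd, 0, 0, 2UL << 20))
			return 1;

		/* the fd plus an offset is then handed to KVM via the memslot
		 * extension later in this series */
		return 0;
	}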
>
> Cheers,
> /fuad
>
>
>
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> > arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > include/linux/restrictedmem.h | 71 ++++++
> > include/linux/syscalls.h | 1 +
> > include/uapi/asm-generic/unistd.h | 5 +-
> > include/uapi/linux/magic.h | 1 +
> > kernel/sys_ni.c | 3 +
> > mm/Kconfig | 4 +
> > mm/Makefile | 1 +
> > mm/memory-failure.c | 3 +
> > mm/restrictedmem.c | 318 +++++++++++++++++++++++++
> > 11 files changed, 408 insertions(+), 1 deletion(-)
> > create mode 100644 include/linux/restrictedmem.h
> > create mode 100644 mm/restrictedmem.c
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> > 448 i386 process_mrelease sys_process_mrelease
> > 449 i386 futex_waitv sys_futex_waitv
> > 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451 i386 memfd_restricted sys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> > 448 common process_mrelease sys_process_mrelease
> > 449 common futex_waitv sys_futex_waitv
> > 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451 common memfd_restricted sys_memfd_restricted
> >
> > #
> > # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..c2700c5daa43
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> > +
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> > +
> > +struct restrictedmem_notifier;
> > +
> > +struct restrictedmem_notifier_ops {
> > + void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> > + pgoff_t start, pgoff_t end);
> > + void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> > + pgoff_t start, pgoff_t end);
> > + void (*error)(struct restrictedmem_notifier *notifier,
> > + pgoff_t start, pgoff_t end);
> > +};
> > +
> > +struct restrictedmem_notifier {
> > + struct list_head list;
> > + const struct restrictedmem_notifier_ops *ops;
> > +};
> > +
> > +#ifdef CONFIG_RESTRICTEDMEM
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > + struct restrictedmem_notifier *notifier);
> > +void restrictedmem_unregister_notifier(struct file *file,
> > + struct restrictedmem_notifier *notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > + struct page **pagep, int *order);
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > + return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> > +}
> > +
> > +void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> > +
> > +#else
> > +
> > +static inline void restrictedmem_register_notifier(struct file *file,
> > + struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline void restrictedmem_unregister_notifier(struct file *file,
> > + struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > + struct page **pagep, int *order)
> > +{
> > + return -1;
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void restrictedmem_error_page(struct page *page,
> > + struct address_space *mapping)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index a34b0f9a9972..f9e9e0c820c5 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> > asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> > unsigned long home_node,
> > unsigned long flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags);
> >
> > /*
> > * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 45fa180cc56a..e93cd35e46d0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> > #define __NR_set_mempolicy_home_node 450
> > __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> >
> > +#define __NR_memfd_restricted 451
> > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> > +
> > #undef __NR_syscalls
> > -#define __NR_syscalls 451
> > +#define __NR_syscalls 452
> >
> > /*
> > * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..8aa38324b90a 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> > #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */
> > #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> > #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> > +#define RESTRICTEDMEM_MAGIC 0x5245534d /* "RESM" */
> >
> > #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 860b2dcf3ac4..7c4a32cbd2e7 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> > /* memfd_secret */
> > COND_SYSCALL(memfd_secret);
> >
> > +/* memfd_restricted */
> > +COND_SYSCALL(memfd_restricted);
> > +
> > /*
> > * Architecture specific weak syscall entries.
> > */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 57e1d8c5b505..06b0e1d6b8c1 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1076,6 +1076,10 @@ config IO_MAPPING
> > config SECRETMEM
> > def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> >
> > +config RESTRICTEDMEM
> > + bool
> > + depends on TMPFS
> > +
> > config ANON_VMA_NAME
> > bool "Anonymous VMA name support"
> > depends on PROC_FS && ADVISE_SYSCALLS && MMU
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 8e105e5b3e29..bcbb0edf9ba1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> > obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> > obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> > obj-$(CONFIG_SECRETMEM) += secretmem.o
> > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> > obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> > obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> > obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 145bb561ddb3..f91b444e471e 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -62,6 +62,7 @@
> > #include <linux/page-isolation.h>
> > #include <linux/pagewalk.h>
> > #include <linux/shmem_fs.h>
> > +#include <linux/restrictedmem.h>
> > #include "swap.h"
> > #include "internal.h"
> > #include "ras/ras_event.h"
> > @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> > goto out;
> > }
> >
> > + restrictedmem_error_page(p, mapping);
> > +
> > /*
> > * The shmem page is kept in page cache instead of truncating
> > * so is expected to have an extra refcount after error-handling.
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..56953c204e5c
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,318 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> > + struct mutex lock;
> > + struct file *memfd;
> > + struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + struct restrictedmem_notifier *notifier;
> > +
> > + mutex_lock(&data->lock);
> > + list_for_each_entry(notifier, &data->notifiers, list) {
> > + notifier->ops->invalidate_start(notifier, start, end);
> > + }
> > + mutex_unlock(&data->lock);
> > +}
> > +
> > +static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + struct restrictedmem_notifier *notifier;
> > +
> > + mutex_lock(&data->lock);
> > + list_for_each_entry(notifier, &data->notifiers, list) {
> > + notifier->ops->invalidate_end(notifier, start, end);
> > + }
> > + mutex_unlock(&data->lock);
> > +}
> > +
> > +static void restrictedmem_notifier_error(struct restrictedmem_data *data,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + struct restrictedmem_notifier *notifier;
> > +
> > + mutex_lock(&data->lock);
> > + list_for_each_entry(notifier, &data->notifiers, list) {
> > + notifier->ops->error(notifier, start, end);
> > + }
> > + mutex_unlock(&data->lock);
> > +}
> > +
> > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > +{
> > + struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +
> > + fput(data->memfd);
> > + kfree(data);
> > + return 0;
> > +}
> > +
> > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> > + loff_t offset, loff_t len)
> > +{
> > + int ret;
> > + pgoff_t start, end;
> > + struct file *memfd = data->memfd;
> > +
> > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > + return -EINVAL;
> > +
> > + start = offset >> PAGE_SHIFT;
> > + end = (offset + len) >> PAGE_SHIFT;
> > +
> > + restrictedmem_invalidate_start(data, start, end);
> > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > + restrictedmem_invalidate_end(data, start, end);
> > +
> > + return ret;
> > +}
> > +
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > + loff_t offset, loff_t len)
> > +{
> > + struct restrictedmem_data *data = file->f_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > +
> > + if (mode & FALLOC_FL_PUNCH_HOLE)
> > + return restrictedmem_punch_hole(data, mode, offset, len);
> > +
> > + return memfd->f_op->fallocate(memfd, mode, offset, len);
> > +}
> > +
> > +static const struct file_operations restrictedmem_fops = {
> > + .release = restrictedmem_release,
> > + .fallocate = restrictedmem_fallocate,
> > +};
> > +
> > +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> > + const struct path *path, struct kstat *stat,
> > + u32 request_mask, unsigned int query_flags)
> > +{
> > + struct inode *inode = d_inode(path->dentry);
> > + struct restrictedmem_data *data = inode->i_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > +
> > + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> > + request_mask, query_flags);
> > +}
> > +
> > +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> > + struct dentry *dentry, struct iattr *attr)
> > +{
> > + struct inode *inode = d_inode(dentry);
> > + struct restrictedmem_data *data = inode->i_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > + int ret;
> > +
> > + if (attr->ia_valid & ATTR_SIZE) {
> > + if (memfd->f_inode->i_size)
> > + return -EPERM;
> > +
> > + if (!PAGE_ALIGNED(attr->ia_size))
> > + return -EINVAL;
> > + }
> > +
> > + ret = memfd->f_inode->i_op->setattr(mnt_userns,
> > + file_dentry(memfd), attr);
> > + return ret;
> > +}
> > +
> > +static const struct inode_operations restrictedmem_iops = {
> > + .getattr = restrictedmem_getattr,
> > + .setattr = restrictedmem_setattr,
> > +};
> > +
> > +static int restrictedmem_init_fs_context(struct fs_context *fc)
> > +{
> > + if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> > + return -ENOMEM;
> > +
> > + fc->s_iflags |= SB_I_NOEXEC;
> > + return 0;
> > +}
> > +
> > +static struct file_system_type restrictedmem_fs = {
> > + .owner = THIS_MODULE,
> > + .name = "memfd:restrictedmem",
> > + .init_fs_context = restrictedmem_init_fs_context,
> > + .kill_sb = kill_anon_super,
> > +};
> > +
> > +static struct vfsmount *restrictedmem_mnt;
> > +
> > +static __init int restrictedmem_init(void)
> > +{
> > + restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> > + if (IS_ERR(restrictedmem_mnt))
> > + return PTR_ERR(restrictedmem_mnt);
> > + return 0;
> > +}
> > +fs_initcall(restrictedmem_init);
> > +
> > +static struct file *restrictedmem_file_create(struct file *memfd)
> > +{
> > + struct restrictedmem_data *data;
> > + struct address_space *mapping;
> > + struct inode *inode;
> > + struct file *file;
> > +
> > + data = kzalloc(sizeof(*data), GFP_KERNEL);
> > + if (!data)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + data->memfd = memfd;
> > + mutex_init(&data->lock);
> > + INIT_LIST_HEAD(&data->notifiers);
> > +
> > + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > + if (IS_ERR(inode)) {
> > + kfree(data);
> > + return ERR_CAST(inode);
> > + }
> > +
> > + inode->i_mode |= S_IFREG;
> > + inode->i_op = &restrictedmem_iops;
> > + inode->i_mapping->private_data = data;
> > +
> > + file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > + "restrictedmem", O_RDWR,
> > + &restrictedmem_fops);
> > + if (IS_ERR(file)) {
> > + iput(inode);
> > + kfree(data);
> > + return ERR_CAST(file);
> > + }
> > +
> > + file->f_flags |= O_LARGEFILE;
> > +
> > + /*
> > + * These pages are currently unmovable so don't place them into movable
> > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > + */
> > + mapping = memfd->f_mapping;
> > + mapping_set_unevictable(mapping);
> > + mapping_set_gfp_mask(mapping,
> > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > +
> > + return file;
> > +}
> > +
> > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > +{
> > + struct file *file, *restricted_file;
> > + int fd, err;
> > +
> > + if (flags)
> > + return -EINVAL;
> > +
> > + fd = get_unused_fd_flags(0);
> > + if (fd < 0)
> > + return fd;
> > +
> > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > + if (IS_ERR(file)) {
> > + err = PTR_ERR(file);
> > + goto err_fd;
> > + }
> > + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> > + file->f_flags |= O_LARGEFILE;
> > +
> > + restricted_file = restrictedmem_file_create(file);
> > + if (IS_ERR(restricted_file)) {
> > + err = PTR_ERR(restricted_file);
> > + fput(file);
> > + goto err_fd;
> > + }
> > +
> > + fd_install(fd, restricted_file);
> > + return fd;
> > +err_fd:
> > + put_unused_fd(fd);
> > + return err;
> > +}
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > + struct restrictedmem_notifier *notifier)
> > +{
> > + struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > + mutex_lock(&data->lock);
> > + list_add(¬ifier->list, &data->notifiers);
> > + mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> > +
> > +void restrictedmem_unregister_notifier(struct file *file,
> > + struct restrictedmem_notifier *notifier)
> > +{
> > + struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > + mutex_lock(&data->lock);
> > + list_del(¬ifier->list);
> > + mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > + struct page **pagep, int *order)
> > +{
> > + struct restrictedmem_data *data = file->f_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > + struct folio *folio;
> > + struct page *page;
> > + int ret;
> > +
> > + ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
> > + if (ret)
> > + return ret;
> > +
> > + page = folio_file_page(folio, offset);
> > + *pagep = page;
> > + if (order)
> > + *order = thp_order(compound_head(page));
> > +
> > + SetPageUptodate(page);
> > + unlock_page(page);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > +
> > +void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> > +{
> > + struct super_block *sb = restrictedmem_mnt->mnt_sb;
> > + struct inode *inode, *next;
> > +
> > + if (!shmem_mapping(mapping))
> > + return;
> > +
> > + spin_lock(&sb->s_inode_list_lock);
> > + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > + struct restrictedmem_data *data = inode->i_mapping->private_data;
> > + struct file *memfd = data->memfd;
> > +
> > + if (memfd->f_mapping == mapping) {
> > + pgoff_t start, end;
> > +
> > + spin_unlock(&sb->s_inode_list_lock);
> > +
> > + start = page->index;
> > + end = start + thp_nr_pages(page);
> > + restrictedmem_notifier_error(data, start, end);
> > + return;
> > + }
> > + }
> > + spin_unlock(&sb->s_inode_list_lock);
> > +}
> > --
> > 2.25.1
> >
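For reference, the userspace-visible surface of the new fd is deliberately
tiny; a minimal usage sketch (syscall number 451 taken from the x86_64 table
above, error handling trimmed) looks like:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/falloc.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451	/* x86_64, per the table above */
#endif

int main(void)
{
	/* No flags are defined yet, so 'flags' must be 0. */
	int fd = syscall(__NR_memfd_restricted, 0);
	if (fd < 0)
		return 1;

	/* The size can be set once, while the file is still empty. */
	if (ftruncate(fd, 2UL << 20))
		return 1;

	/*
	 * fallocate() is the only other userspace-visible operation;
	 * punching a hole releases (and, for guest private memory,
	 * invalidates) the backing pages.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, 1UL << 20))
		return 1;

	/* The fd itself is then handed to KVM via the memslot extension. */
	close(fd);
	return 0;
}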
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-06 13:34 ` Fabiano Rosas
@ 2022-12-07 14:31 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-07 14:31 UTC (permalink / raw)
To: Fabiano Rosas
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Dec 06, 2022 at 10:34:32AM -0300, Fabiano Rosas wrote:
> Chao Peng <chao.p.peng@linux.intel.com> writes:
>
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> > - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > a guest memory range.
> > - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> > Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> > arch/x86/kvm/Kconfig | 1 +
> > include/linux/kvm_host.h | 3 ++
> > include/uapi/linux/kvm.h | 17 ++++++++
> > virt/kvm/Kconfig | 3 ++
> > virt/kvm/kvm_main.c | 76 ++++++++++++++++++++++++++++++++++
> > 6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> > The "pad" and "reserved" fields may be used for future extensions and should be
> > set to 0s by userspace.
> >
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > + #define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> > + #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > + #define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> > + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > + struct kvm_memory_attributes {
> > + __u64 address;
> > + __u64 size;
> > + __u64 attributes;
> > + __u64 flags;
> > + };
> > +
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
>
> This wording could cause some confusion, what about a simpler:
>
> "reflect the range of pages that had its attributes successfully set"
Thanks, this is much better.
>
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully.
>
> "address + 1 page" or "subsequent page" perhaps.
>
> In fact, wouldn't this all become simpler if size were number of pages instead?
It would indeed be simpler if the size were a number of pages and the
address a gfn, but I don't think we want to imply to userspace that the
page size is 4K.
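To illustrate the intended byte-based usage, a minimal userspace loop could
look roughly like the following (just a sketch against the uapi in this
patch; set_mem_attributes() and vm_fd are made-up names, error handling is
minimal):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Set 'attributes' on the byte range [address, address + size). */
static int set_mem_attributes(int vm_fd, uint64_t address, uint64_t size,
			      uint64_t attributes)
{
	struct kvm_memory_attributes attrs = {
		.address    = address,
		.size       = size,
		.attributes = attributes,
		.flags      = 0,
	};

	while (attrs.size) {
		uint64_t remaining = attrs.size;

		if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
			return -1;
		/*
		 * On return, address/size describe the not-yet-processed
		 * remainder of the range, so retrying simply continues
		 * where the previous call stopped.
		 */
		if (attrs.size == remaining)
			return -1;	/* no forward progress, give up */
	}
	return 0;
}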
>
> > The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> > +
>
> ...
>
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > + struct kvm_memory_attributes *attrs)
> > +{
> > + gfn_t start, end;
> > + unsigned long i;
> > + void *entry;
> > + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > + /* flags is currently not used. */
> > + if (attrs->flags)
> > + return -EINVAL;
> > + if (attrs->attributes & ~supported_attrs)
> > + return -EINVAL;
> > + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > + return -EINVAL;
> > + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > + return -EINVAL;
> > +
> > + start = attrs->address >> PAGE_SHIFT;
> > + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
>
> Here PAGE_SIZE and -1 cancel out.
Correct!
>
> Consider using gpa_to_gfn as well.
Yes, using gpa_to_gfn() is appropriate.
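i.e. the relevant lines would become roughly (untested sketch):

	start = gpa_to_gfn(attrs->address);
	end = gpa_to_gfn(attrs->address + attrs->size);
	...
	attrs->address = gfn_to_gpa(i);
	attrs->size = gfn_to_gpa(end - i);

with no need for the "- 1 + PAGE_SIZE" round-up since address and size are
already checked to be page aligned.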
Thanks,
Chao
>
> > +
> > + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > + mutex_lock(&kvm->lock);
> > + for (i = start; i < end; i++)
> > + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > + GFP_KERNEL_ACCOUNT)))
> > + break;
> > + mutex_unlock(&kvm->lock);
> > +
> > + attrs->address = i << PAGE_SHIFT;
> > + attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > + return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> > {
> > return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > #ifdef CONFIG_HAVE_KVM_MSI
> > case KVM_CAP_SIGNAL_MSI:
> > #endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + case KVM_CAP_MEMORY_ATTRIBUTES:
> > +#endif
> > #ifdef CONFIG_HAVE_KVM_IRQFD
> > case KVM_CAP_IRQFD:
> > case KVM_CAP_IRQFD_RESAMPLE:
> > @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> > break;
> > }
> > #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> > + u64 attrs = kvm_supported_mem_attributes(kvm);
> > +
> > + r = -EFAULT;
> > + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > + goto out;
> > + r = 0;
> > + break;
> > + }
> > + case KVM_SET_MEMORY_ATTRIBUTES: {
> > + struct kvm_memory_attributes attrs;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> > + goto out;
> > +
> > + r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> > +
> > + if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> > + r = -EFAULT;
> > + break;
> > + }
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > case KVM_CREATE_DEVICE: {
> > struct kvm_create_device cd;
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-06 15:07 ` Fuad Tabba
@ 2022-12-07 14:51 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-07 14:51 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Tue, Dec 06, 2022 at 03:07:27PM +0000, Fuad Tabba wrote:
> Hi,
>
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> > - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > a guest memory range.
> > - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> > Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> > arch/x86/kvm/Kconfig | 1 +
> > include/linux/kvm_host.h | 3 ++
> > include/uapi/linux/kvm.h | 17 ++++++++
> > virt/kvm/Kconfig | 3 ++
> > virt/kvm/kvm_main.c | 76 ++++++++++++++++++++++++++++++++++
> > 6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> > The "pad" and "reserved" fields may be used for future extensions and should be
> > set to 0s by userspace.
> >
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > + #define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> > + #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > + #define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> > + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > + struct kvm_memory_attributes {
> > + __u64 address;
> > + __u64 size;
> > + __u64 attributes;
> > + __u64 flags;
> > + };
> > +
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> > +
> > 5. The kvm_run structure
> > ========================
> >
> > @@ -8270,6 +8323,16 @@ structure.
> > When getting the Modified Change Topology Report value, the attr->addr
> > must point to a byte where the value will be stored or retrieved from.
> >
> > +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> > +------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm
> > +
> > +This capability indicates KVM supports per-page memory attributes and ioctls
> > +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> > +
> > 9. Known KVM API problems
> > =========================
> >
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index fbeaa9ddef59..a8e379a3afee 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -49,6 +49,7 @@ config KVM
> > select SRCU
> > select INTERVAL_TREE
> > select HAVE_KVM_PM_NOTIFIER if PM
> > + select HAVE_KVM_MEMORY_ATTRIBUTES
> > help
> > Support hosting fully virtualized guest machines using hardware
> > virtualization extensions. You will need a fairly recent
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 8f874a964313..a784e2b06625 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -800,6 +800,9 @@ struct kvm {
> >
> > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > struct notifier_block pm_notifier;
> > +#endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + struct xarray mem_attr_array;
> > #endif
> > char stats_id[KVM_STATS_NAME_SIZE];
> > };
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 64dfe9c07c87..5d0941acb5bb 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
> > #define KVM_CAP_S390_CPU_TOPOLOGY 222
> > #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> > #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> > +#define KVM_CAP_MEMORY_ATTRIBUTES 225
> >
> > #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
> > /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> > #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
> >
> > +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> > +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES _IOR(KVMIO, 0xd2, __u64)
> > +#define KVM_SET_MEMORY_ATTRIBUTES _IOWR(KVMIO, 0xd3, struct kvm_memory_attributes)
> > +
> > +struct kvm_memory_attributes {
> > + __u64 address;
> > + __u64 size;
> > + __u64 attributes;
> > + __u64 flags;
> > +};
> > +
> > +#define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> > +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > +#define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> nit: how about using the BIT() macro for these?
It should probably be _BITULL() from include/uapi/linux/const.h, since
these definitions will also be used by userspace.
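i.e. something like the following (assuming <linux/const.h> is pulled into
the uapi header):

#define KVM_MEMORY_ATTRIBUTE_READ	_BITULL(0)
#define KVM_MEMORY_ATTRIBUTE_WRITE	_BITULL(1)
#define KVM_MEMORY_ATTRIBUTE_EXECUTE	_BITULL(2)
#define KVM_MEMORY_ATTRIBUTE_PRIVATE	_BITULL(3)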
>
> > +
> > #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 800f9470e36b..effdea5dd4f0 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
> > config HAVE_KVM_DIRTY_RING
> > bool
> >
> > +config HAVE_KVM_MEMORY_ATTRIBUTES
> > + bool
> > +
> > # Only strongly ordered architectures can select this, as it doesn't
> > # put any explicit constraint on userspace ordering. They can also
> > # select the _ACQ_REL version.
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..7f0f5e9f2406 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > spin_lock_init(&kvm->mn_invalidate_lock);
> > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + xa_init(&kvm->mem_attr_array);
> > +#endif
> >
> > INIT_LIST_HEAD(&kvm->gpc_list);
> > spin_lock_init(&kvm->gpc_lock);
> > @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> > kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> > }
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + xa_destroy(&kvm->mem_attr_array);
> > +#endif
> > cleanup_srcu_struct(&kvm->irq_srcu);
> > cleanup_srcu_struct(&kvm->srcu);
> > kvm_arch_free_vm(kvm);
> > @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > }
> > #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > +{
> > + return 0;
> > +}
> > +
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > + struct kvm_memory_attributes *attrs)
> > +{
> > + gfn_t start, end;
> > + unsigned long i;
> > + void *entry;
> > + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > + /* flags is currently not used. */
>
> nit: "is reserved"? I think it makes it a bit clearer what its purpose is.
OK, then:
flags is reserved for future extension and is currently unused.
>
> > + if (attrs->flags)
> > + return -EINVAL;
> > + if (attrs->attributes & ~supported_attrs)
> > + return -EINVAL;
> > + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > + return -EINVAL;
> > + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > + return -EINVAL;
> > +
> > + start = attrs->address >> PAGE_SHIFT;
> > + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
>
> Would using existing helpers be better for getting the frame numbers?
Yes, gpa_to_gfn() can be used.
> Also, the code checks that the address and size are page aligned, so
> the end rounding up seems redundant, and might even be wrong if the
> address+size-1 is close to the gfn_t limit (which this code tries to
> avoid in an earlier check).
That's right.
>
> > + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > + mutex_lock(&kvm->lock);
> > + for (i = start; i < end; i++)
> > + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > + GFP_KERNEL_ACCOUNT)))
> > + break;
> > + mutex_unlock(&kvm->lock);
> > +
> > + attrs->address = i << PAGE_SHIFT;
> > + attrs->size = (end - i) << PAGE_SHIFT;
>
> nit: helpers for these too?
Similarly, gfn_to_gpa() will be used.
>
> With the end calculation fixed,
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> After adding the necessary configs for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@google.com>
Thanks.
Chao
>
> Cheers,
> /fuad
>
> > +
> > + return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> > {
> > return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > #ifdef CONFIG_HAVE_KVM_MSI
> > case KVM_CAP_SIGNAL_MSI:
> > #endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + case KVM_CAP_MEMORY_ATTRIBUTES:
> > +#endif
> > #ifdef CONFIG_HAVE_KVM_IRQFD
> > case KVM_CAP_IRQFD:
> > case KVM_CAP_IRQFD_RESAMPLE:
> > @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> > break;
> > }
> > #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> > + u64 attrs = kvm_supported_mem_attributes(kvm);
> > +
> > + r = -EFAULT;
> > + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > + goto out;
> > + r = 0;
> > + break;
> > + }
> > + case KVM_SET_MEMORY_ATTRIBUTES: {
> > + struct kvm_memory_attributes attrs;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> > + goto out;
> > +
> > + r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> > +
> > + if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> > + r = -EFAULT;
> > + break;
> > + }
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > case KVM_CREATE_DEVICE: {
> > struct kvm_create_device cd;
> >
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-06 12:39 ` Fuad Tabba
@ 2022-12-07 15:10 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-07 15:10 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Tue, Dec 06, 2022 at 12:39:18PM +0000, Fuad Tabba wrote:
> Hi Chao,
>
> On Tue, Dec 6, 2022 at 11:58 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Mon, Dec 05, 2022 at 09:03:11AM +0000, Fuad Tabba wrote:
> > > Hi Chao,
> > >
> > > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > In memory encryption usage, guest memory may be encrypted with a special
> > > > key and can be accessed only by the guest itself. We call such memory
> > > > private memory. Allowing userspace to access guest private memory is of
> > > > little value and can sometimes cause problems. This new KVM memslot
> > > > extension allows guest private memory to be provided through a
> > > > restrictedmem-backed file descriptor (fd), and userspace is restricted
> > > > from accessing the memory backed by the fd.
> > > >
> > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > > and the size is 'memory_size'.
> > > >
> > > > The extended memslot can still have the userspace_addr (hva). When used,
> > > > a single memslot can maintain both private memory through restricted_fd
> > > > and shared memory through userspace_addr. Whether the private or shared
> > > > part is visible to the guest is maintained by other KVM code.
> > > >
> > > > A restrictedmem_notifier field is also added to the memslot structure to
> > > > allow the restricted_fd's backing store to notify KVM of memory changes,
> > > > so KVM can then invalidate its page table entries or handle memory errors.
> > > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > >
> > > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > Reviewed-by: Fuad Tabba <tabba@google.com>
> > > > Tested-by: Fuad Tabba <tabba@google.com>
> > >
> > > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > > patch series anymore. Any reason you removed it, or is it just an
> > > omission?
> >
> > We had some discussion in v9 [1] about adding generic memory attributes
> > ioctls, and KVM_CAP_PRIVATE_MEM can be implemented as a new
> > KVM_MEMORY_ATTRIBUTE_PRIVATE flag reported by the
> > KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl [2]. The API doc has been updated:
> >
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
> >
> >
> > [1] https://lore.kernel.org/linux-mm/Y2WB48kD0J4VGynX@google.com/
> > [2]
> > https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.peng@linux.intel.com/
>
> I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
> and my Reviewed/Tested-by still apply.
Thanks for the info.
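For anyone following along, once the whole series (plus arch enabling) is
in place the userspace flow is roughly the following -- purely an
illustrative sketch, with slot_id/gpa/mem_size/shared_hva assumed to come
from the VMM and __NR_memfd_restricted from patch 1:

#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/kvm.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451	/* x86_64 */
#endif

static int add_private_memslot(int vm_fd, __u32 slot_id, __u64 gpa,
			       __u64 mem_size, void *shared_hva)
{
	struct kvm_userspace_memory_region_ext ext;
	__u64 attrs = 0;
	int restricted_fd;

	/* KVM_MEM_PRIVATE is only usable if the PRIVATE attribute is reported. */
	if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &attrs) ||
	    !(attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE))
		return -1;

	restricted_fd = syscall(__NR_memfd_restricted, 0);
	if (restricted_fd < 0 || ftruncate(restricted_fd, mem_size))
		return -1;

	ext = (struct kvm_userspace_memory_region_ext) {
		.region = {
			.slot            = slot_id,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size     = mem_size,
			/* hva used for the shared part of the slot */
			.userspace_addr  = (__u64)(unsigned long)shared_hva,
		},
		.restricted_offset = 0,
		.restricted_fd     = restricted_fd,
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}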
Chao
>
> Cheers,
> /fuad
>
> >
> > Thanks,
> > Chao
> > >
> > > [*] https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.peng@linux.intel.com/
> > >
> > > Thanks,
> > > /fuad
> > >
> > > > ---
> > > > Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++-----
> > > > arch/x86/kvm/Kconfig | 2 ++
> > > > arch/x86/kvm/x86.c | 2 +-
> > > > include/linux/kvm_host.h | 8 ++++--
> > > > include/uapi/linux/kvm.h | 28 +++++++++++++++++++
> > > > virt/kvm/Kconfig | 3 +++
> > > > virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++------
> > > > 7 files changed, 114 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index bb2f709c0900..99352170c130 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> > > > :Capability: KVM_CAP_USER_MEMORY
> > > > :Architectures: all
> > > > :Type: vm ioctl
> > > > -:Parameters: struct kvm_userspace_memory_region (in)
> > > > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> > > > :Returns: 0 on success, -1 on error
> > > >
> > > > ::
> > > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > > };
> > > >
> > > > + struct kvm_userspace_memory_region_ext {
> > > > + struct kvm_userspace_memory_region region;
> > > > + __u64 restricted_offset;
> > > > + __u32 restricted_fd;
> > > > + __u32 pad1;
> > > > + __u64 pad2[14];
> > > > + };
> > > > +
> > > > /* for kvm_memory_region::flags */
> > > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > > #define KVM_MEM_READONLY (1UL << 1)
> > > > + #define KVM_MEM_PRIVATE (1UL << 2)
> > > >
> > > > This ioctl allows the user to create, modify or delete a guest physical
> > > > memory slot. Bits 0-15 of "slot" specify the slot id and this value
> > > > @@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> > > > be identical. This allows large pages in the guest to be backed by large
> > > > pages in the host.
> > > >
> > > > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > > > -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> > > > -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> > > > -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > > > -to make a new slot read-only. In this case, writes to this memory will be
> > > > -posted to userspace as KVM_EXIT_MMIO exits.
> > > > +kvm_userspace_memory_region_ext struct includes all fields of
> > > > +kvm_userspace_memory_region struct, while also adds additional fields for some
> > > > +other features. See below description of flags field for more information.
> > > > +It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
> > > > +
> > > > +The flags field supports following flags:
> > > > +
> > > > +- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
> > > > + within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
> > > > +
> > > > +- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
> > > > + read-only. In this case, writes to this memory will be posted to userspace as
> > > > + KVM_EXIT_MMIO exits.
> > > > +
> > > > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > > > + KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has private
> > > > + memory backed by a file descriptor(fd) and userspace access to the fd may be
> > > > + restricted. Userspace should use restricted_fd/restricted_offset in the
> > > > + kvm_userspace_memory_region_ext to instruct KVM to provide private memory
> > > > + to guest. Userspace should guarantee not to map the same host physical address
> > > > + indicated by restricted_fd/restricted_offset to different guest physical
> > > > + addresses within multiple memslots. Failed to do this may result undefined
> > > > + behavior.
> > > >
> > > > When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > > > the memory region are automatically reflected into the guest. For example, an
> > > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > > index a8e379a3afee..690cb21010e7 100644
> > > > --- a/arch/x86/kvm/Kconfig
> > > > +++ b/arch/x86/kvm/Kconfig
> > > > @@ -50,6 +50,8 @@ config KVM
> > > > select INTERVAL_TREE
> > > > select HAVE_KVM_PM_NOTIFIER if PM
> > > > select HAVE_KVM_MEMORY_ATTRIBUTES
> > > > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > > > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > > > help
> > > > Support hosting fully virtualized guest machines using hardware
> > > > virtualization extensions. You will need a fairly recent
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 7f850dfb4086..9a07380f8d3c 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12224,7 +12224,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> > > > }
> > > >
> > > > for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > > - struct kvm_userspace_memory_region m;
> > > > + struct kvm_user_mem_region m;
> > > >
> > > > m.slot = id | (i << 16);
> > > > m.flags = 0;
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index a784e2b06625..02347e386ea2 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -44,6 +44,7 @@
> > > >
> > > > #include <asm/kvm_host.h>
> > > > #include <linux/kvm_dirty_ring.h>
> > > > +#include <linux/restrictedmem.h>
> > > >
> > > > #ifndef KVM_MAX_VCPU_IDS
> > > > #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
> > > > @@ -585,6 +586,9 @@ struct kvm_memory_slot {
> > > > u32 flags;
> > > > short id;
> > > > u16 as_id;
> > > > + struct file *restricted_file;
> > > > + loff_t restricted_offset;
> > > > + struct restrictedmem_notifier notifier;
> > > > };
> > > >
> > > > static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> > > > @@ -1123,9 +1127,9 @@ enum kvm_mr_change {
> > > > };
> > > >
> > > > int kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem);
> > > > + const struct kvm_user_mem_region *mem);
> > > > int __kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem);
> > > > + const struct kvm_user_mem_region *mem);
> > > > void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> > > > void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> > > > int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > index 5d0941acb5bb..13bff963b8b0 100644
> > > > --- a/include/uapi/linux/kvm.h
> > > > +++ b/include/uapi/linux/kvm.h
> > > > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > > > __u64 userspace_addr; /* start of the userspace allocated memory */
> > > > };
> > > >
> > > > +struct kvm_userspace_memory_region_ext {
> > > > + struct kvm_userspace_memory_region region;
> > > > + __u64 restricted_offset;
> > > > + __u32 restricted_fd;
> > > > + __u32 pad1;
> > > > + __u64 pad2[14];
> > > > +};
> > > > +
> > > > +#ifdef __KERNEL__
> > > > +/*
> > > > + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
> > > > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
> > > > + * all fields from the top-level "extended" region.
> > > > + */
> > > > +struct kvm_user_mem_region {
> > > > + __u32 slot;
> > > > + __u32 flags;
> > > > + __u64 guest_phys_addr;
> > > > + __u64 memory_size;
> > > > + __u64 userspace_addr;
> > > > + __u64 restricted_offset;
> > > > + __u32 restricted_fd;
> > > > + __u32 pad1;
> > > > + __u64 pad2[14];
> > > > +};
> > > > +#endif
> > > > +
> > > > /*
> > > > * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > > > * other bits are reserved for kvm internal use which are defined in
> > > > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> > > > */
> > > > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > > > #define KVM_MEM_READONLY (1UL << 1)
> > > > +#define KVM_MEM_PRIVATE (1UL << 2)
> > > >
> > > > /* for KVM_IRQ_LINE */
> > > > struct kvm_irq_level {
> > > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > > index effdea5dd4f0..d605545d6dd1 100644
> > > > --- a/virt/kvm/Kconfig
> > > > +++ b/virt/kvm/Kconfig
> > > > @@ -89,3 +89,6 @@ config KVM_XFER_TO_GUEST_WORK
> > > >
> > > > config HAVE_KVM_PM_NOTIFIER
> > > > bool
> > > > +
> > > > +config HAVE_KVM_RESTRICTED_MEM
> > > > + bool
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 7f0f5e9f2406..b882eb2c76a2 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -1532,7 +1532,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > > > }
> > > > }
> > > >
> > > > -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> > > > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> > > > {
> > > > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> > > >
> > > > @@ -1934,7 +1934,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > > * Must be called holding kvm->slots_lock for write.
> > > > */
> > > > int __kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem)
> > > > + const struct kvm_user_mem_region *mem)
> > > > {
> > > > struct kvm_memory_slot *old, *new;
> > > > struct kvm_memslots *slots;
> > > > @@ -2038,7 +2038,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > > EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> > > >
> > > > int kvm_set_memory_region(struct kvm *kvm,
> > > > - const struct kvm_userspace_memory_region *mem)
> > > > + const struct kvm_user_mem_region *mem)
> > > > {
> > > > int r;
> > > >
> > > > @@ -2050,7 +2050,7 @@ int kvm_set_memory_region(struct kvm *kvm,
> > > > EXPORT_SYMBOL_GPL(kvm_set_memory_region);
> > > >
> > > > static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> > > > - struct kvm_userspace_memory_region *mem)
> > > > + struct kvm_user_mem_region *mem)
> > > > {
> > > > if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> > > > return -EINVAL;
> > > > @@ -4698,6 +4698,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> > > > return fd;
> > > > }
> > > >
> > > > +#define SANITY_CHECK_MEM_REGION_FIELD(field) \
> > > > +do { \
> > > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > > + offsetof(struct kvm_userspace_memory_region, field)); \
> > > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > > + sizeof_field(struct kvm_userspace_memory_region, field)); \
> > > > +} while (0)
> > > > +
> > > > +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \
> > > > +do { \
> > > > + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \
> > > > + offsetof(struct kvm_userspace_memory_region_ext, field)); \
> > > > + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \
> > > > + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> > > > +} while (0)
> > > > +
> > > > +static void kvm_sanity_check_user_mem_region_alias(void)
> > > > +{
> > > > + SANITY_CHECK_MEM_REGION_FIELD(slot);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(flags);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> > > > + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> > > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_offset);
> > > > + SANITY_CHECK_MEM_REGION_EXT_FIELD(restricted_fd);
> > > > +}
> > > > +
> > > > static long kvm_vm_ioctl(struct file *filp,
> > > > unsigned int ioctl, unsigned long arg)
> > > > {
> > > > @@ -4721,14 +4748,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > > > break;
> > > > }
> > > > case KVM_SET_USER_MEMORY_REGION: {
> > > > - struct kvm_userspace_memory_region kvm_userspace_mem;
> > > > + struct kvm_user_mem_region mem;
> > > > + unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > > > +
> > > > + kvm_sanity_check_user_mem_region_alias();
> > > >
> > > > r = -EFAULT;
> > > > - if (copy_from_user(&kvm_userspace_mem, argp,
> > > > - sizeof(kvm_userspace_mem)))
> > > > + if (copy_from_user(&mem, argp, size))
> > > > + goto out;
> > > > +
> > > > + r = -EINVAL;
> > > > + if (mem.flags & KVM_MEM_PRIVATE)
> > > > goto out;
> > > >
> > > > - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > > > + r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > > break;
> > > > }
> > > > case KVM_GET_DIRTY_LOG: {
> > > > --
> > > > 2.25.1
> > > >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
2022-12-06 15:47 ` Fuad Tabba
@ 2022-12-07 15:11 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-07 15:11 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Tue, Dec 06, 2022 at 03:47:20PM +0000, Fuad Tabba wrote:
> Hi,
>
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates that an error happened in KVM at the guest memory range
> > [gpa, gpa+size). The flags field includes additional information for
> > userspace to handle the error. Currently bit 0 is defined as 'private
> > memory', where '1' indicates the error happened due to a private memory
> > access and '0' indicates it happened due to a shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kinds of memory
> > conversions:
> > - explicit conversion: happens when guest explicitly calls into KVM
> > to map a range (as private or shared), KVM then exits to userspace
> > to perform the map/unmap operations.
> > - implicit conversion: happens in KVM page fault handler where KVM
> > exits to userspace for an implicit conversion when the page is in a
> > different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Reviewed-by: Fuad Tabba <tabba@google.com>
> > ---
> > Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
> > include/uapi/linux/kvm.h | 8 ++++++++
> > 2 files changed, 30 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 99352170c130..d9edb14ce30b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
> > values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> > spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > + /* KVM_EXIT_MEMORY_FAULT */
> > + struct {
> > + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> > + __u64 flags;
>
> I see you've removed the padding and increased the flag size.
Yes, Sean suggested this and it also looks good to me.
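For completeness, userspace consumes this exit roughly as below -- only a
sketch; kvm_convert_memory() is a made-up VMM helper standing in for
whatever updates the attributes (KVM_SET_MEMORY_ATTRIBUTES) and adjusts the
backing memory for [gpa, gpa + size):

	/* In the VMM's vcpu loop, after ioctl(vcpu_fd, KVM_RUN, 0): */
	switch (run->exit_reason) {
	case KVM_EXIT_MEMORY_FAULT: {
		bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

		/* Convert the range, then re-enter the guest to retry the access. */
		kvm_convert_memory(run->memory.gpa, run->memory.size, to_private);
		break;
	}
	default:
		/* other exit reasons handled as before */
		break;
	}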
Chao
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Tested-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
>
>
>
>
> > + __u64 gpa;
> > + __u64 size;
> > + } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > + private memory access when the bit is set. Otherwise the memory error is
> > + caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> > ::
> >
> > /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 13bff963b8b0..c7e9d375a902 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> > #define KVM_EXIT_RISCV_SBI 35
> > #define KVM_EXIT_RISCV_CSR 36
> > #define KVM_EXIT_NOTIFY 37
> > +#define KVM_EXIT_MEMORY_FAULT 38
> >
> > /* For KVM_EXIT_INTERNAL_ERROR */
> > /* Emulate instruction failed. */
> > @@ -541,6 +542,13 @@ struct kvm_run {
> > #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> > __u32 flags;
> > } notify;
> > + /* KVM_EXIT_MEMORY_FAULT */
> > + struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> > + __u64 flags;
> > + __u64 gpa;
> > + __u64 size;
> > + } memory;
> > /* Fix the size of the union. */
> > char padding[256];
> > };
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
2022-12-07 6:34 ` Isaku Yamahata
@ 2022-12-07 15:14 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-07 15:14 UTC (permalink / raw)
To: Isaku Yamahata
Cc: Fuad Tabba, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
Steven Price, Maciej S . Szmigiero, Vlastimil Babka,
Vishal Annapurve, Yu Zhang, Kirill A . Shutemov, luto,
jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Tue, Dec 06, 2022 at 10:34:11PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 07:56:23PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> > > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > - hva >= kvm->mmu_invalidate_range_start &&
> > > > - hva < kvm->mmu_invalidate_range_end)
> > > > - return 1;
> > > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > + /*
> > > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > > + * but before updating the range is a KVM bug.
> > > > + */
> > > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> > >
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> >
> > Hmm, INVALID_GPA is defined as ZERO for x86. I'm not 100% confident this is
> > the correct choice for other architectures, but after searching it has not
> > been used by other architectures, so it should be safe to make it common.
>
> INVALID_GPA is defined with all bits set to 1. Please notice the "~" (tilde).
>
> #define INVALID_GPA (~(gpa_t)0)
Thanks for mentioning that. It still looks right to move it to include/linux/kvm_host.h.
Chao
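The move itself would be a one-liner; a sketch of what the common definition
could look like (the value is taken from Isaku's quote above, the exact
placement in the header is an assumption):

    /*
     * include/linux/kvm_host.h -- the x86-only copy in
     * arch/x86/include/asm/kvm_host.h would then be dropped.
     */
    #define INVALID_GPA	(~(gpa_t)0)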
> --
> Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-02 6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
2022-12-07 8:13 ` Yuan Yao
@ 2022-12-07 17:16 ` Fuad Tabba
2022-12-08 11:13 ` Chao Peng
2022-12-13 23:51 ` Huang, Kai
2023-01-13 22:50 ` Sean Christopherson
3 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-07 17:16 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Unmap the existing guest mappings when the memory attribute is changed
> between shared and private. This is needed because shared pages and
> private pages come from different backends; unmapping the existing ones
> gives the page fault handler a chance to re-populate the mappings
> according to the new attribute.
>
> Only architectures with private memory support need this, and a
> supporting architecture is expected to override the weak
> kvm_arch_has_private_mem().
This kind of ties into the discussion of being able to share memory in
place. For pKVM for example, shared and private memory would have the
same backend, and the unmapping wouldn't be needed.
So I guess that, instead of kvm_arch_has_private_mem(), can the check
be done differently, e.g., with a different function, say
kvm_arch_private_notify_attribute_change() (but maybe with a more
friendly name than what I suggested :) )?
Thanks,
/fuad
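A rough sketch of the direction suggested above; the hook name below is only a
placeholder (not an existing KVM symbol), and the default mirrors the behaviour
of this patch:

    /*
     * Placeholder name: architectures that convert by switching backends
     * (TDX-style restricted memory) keep the unmap on conversion; pKVM,
     * which shares/converts in place, could override this to return false
     * and skip the zap in kvm_vm_ioctl_set_mem_attributes().
     */
    bool __weak kvm_arch_unmap_on_attribute_change(struct kvm *kvm)
    {
            return true;
    }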
>
> Also, during the memory attribute change and the unmapping time frame, a
> page fault may occur in the same memory range and could establish an
> incorrect page state, so invoke the kvm_mmu_invalidate_* helpers to make
> the page fault handler retry during this time frame.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
> include/linux/kvm_host.h | 7 +-
> virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> 2 files changed, 116 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d69484d2704..3331c0c92838 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> bool may_block;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> @@ -785,11 +786,12 @@ struct kvm {
>
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> struct mmu_notifier mmu_notifier;
> +#endif
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> gfn_t mmu_invalidate_range_start;
> gfn_t mmu_invalidate_range_end;
> -#endif
> +
> struct list_head devices;
> u64 manual_dirty_log_protect;
> struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
>
> #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad55dfbc75d7..4e1e1e113bf0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> }
> EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
>
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> +{
> + /*
> + * The count increase must become visible at unlock time as no
> + * spte can be established without taking the mmu_lock and
> + * count is also read inside the mmu_lock critical section.
> + */
> + kvm->mmu_invalidate_in_progress++;
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = INVALID_GPA;
> + kvm->mmu_invalidate_range_end = INVALID_GPA;
> + }
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
> + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> + kvm->mmu_invalidate_range_start = start;
> + kvm->mmu_invalidate_range_end = end;
> + } else {
> + /*
> + * Fully tracking multiple concurrent ranges has diminishing
> + * returns. Keep things simple and just find the minimal range
> + * which includes the current and new ranges. As there won't be
> + * enough information to subtract a range after its invalidate
> + * completes, any ranges invalidated concurrently will
> + * accumulate and persist until all outstanding invalidates
> + * complete.
> + */
> + kvm->mmu_invalidate_range_start =
> + min(kvm->mmu_invalidate_range_start, start);
> + kvm->mmu_invalidate_range_end =
> + max(kvm->mmu_invalidate_range_end, end);
> + }
> +}
> +
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
> +{
> + /*
> + * This sequence increase will notify the kvm page fault that
> + * the page that is going to be mapped in the spte could have
> + * been freed.
> + */
> + kvm->mmu_invalidate_seq++;
> + smp_wmb();
> + /*
> + * The above sequence increase must be visible before the
> + * below count decrease, which is ensured by the smp_wmb above
> + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> + */
> + kvm->mmu_invalidate_in_progress--;
> +}
> +
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> {
> @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> -{
> - /*
> - * The count increase must become visible at unlock time as no
> - * spte can be established without taking the mmu_lock and
> - * count is also read inside the mmu_lock critical section.
> - */
> - kvm->mmu_invalidate_in_progress++;
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = INVALID_GPA;
> - kvm->mmu_invalidate_range_end = INVALID_GPA;
> - }
> -}
> -
> -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> -{
> - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> -
> - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> - kvm->mmu_invalidate_range_start = start;
> - kvm->mmu_invalidate_range_end = end;
> - } else {
> - /*
> - * Fully tracking multiple concurrent ranges has diminishing
> - * returns. Keep things simple and just find the minimal range
> - * which includes the current and new ranges. As there won't be
> - * enough information to subtract a range after its invalidate
> - * completes, any ranges invalidated concurrently will
> - * accumulate and persist until all outstanding invalidates
> - * complete.
> - */
> - kvm->mmu_invalidate_range_start =
> - min(kvm->mmu_invalidate_range_start, start);
> - kvm->mmu_invalidate_range_end =
> - max(kvm->mmu_invalidate_range_end, end);
> - }
> -}
> -
> static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> return 0;
> }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm)
> -{
> - /*
> - * This sequence increase will notify the kvm page fault that
> - * the page that is going to be mapped in the spte could have
> - * been freed.
> - */
> - kvm->mmu_invalidate_seq++;
> - smp_wmb();
> - /*
> - * The above sequence increase must be visible before the
> - * below count decrease, which is ensured by the smp_wmb above
> - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> - */
> - kvm->mmu_invalidate_in_progress--;
> -}
> -
> static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> return 0;
> }
>
> +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> + return false;
> +}
> +
> static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> {
> struct kvm *kvm = kvm_arch_alloc_vm();
> @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;
> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> struct kvm_memory_attributes *attrs)
> {
> gfn_t start, end;
> unsigned long i;
> void *entry;
> + int idx;
> u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */
> if (attrs->flags)
> return -EINVAL;
> if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, start, end);
> + KVM_MMU_UNLOCK(kvm);
> + }
> +
> mutex_lock(&kvm->lock);
> for (i = start; i < end; i++)
> if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> break;
> mutex_unlock(&kvm->lock);
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> + if (i > start)
> + kvm_unmap_mem_range(kvm, start, i);
> + kvm_mmu_invalidate_end(kvm);
> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> + }
> +
> attrs->address = i << PAGE_SHIFT;
> attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
2022-12-02 6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
@ 2022-12-08 2:29 ` Yuan Yao
2022-12-08 11:23 ` Chao Peng
2022-12-09 9:01 ` Fuad Tabba
2023-01-13 23:29 ` Sean Christopherson
2 siblings, 1 reply; 153+ messages in thread
From: Yuan Yao @ 2022-12-08 2:29 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> hva-based shared memory. Architecture code (like TDX code) can tell
> whether the on-going fault is private or not. This patch adds a
> 'is_private' field to kvm_page_fault to indicate this and architecture
> code is expected to set it.
>
> To handle page fault for such memslot, the handling logic is different
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
> - For a successful match, private pfn is obtained with
> restrictedmem_get_page() and shared pfn is obtained with existing
> get_user_pages().
> - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace then can convert memory between private/shared
> in host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> include/linux/kvm_host.h | 30 ++++++++++++++++
> 5 files changed, 105 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2190fd8c95c0..b1953ebc012e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level)
> + int max_level, bool is_private)
> {
> struct kvm_lpage_info *linfo;
> int host_level;
> @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> break;
> }
>
> + if (is_private)
> + return max_level;
The lpage mixed information is already saved, so is it possible
to query info->disallow_lpage without caring about 'is_private'?
> +
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * level, which will be used to do precise, accurate accounting.
> */
> fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> - fault->gfn, fault->max_level);
> + fault->gfn, fault->max_level,
> + fault->is_private);
> if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> return;
>
> @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> }
>
> +static inline u8 order_to_level(int order)
> +{
> + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> + return PG_LEVEL_1G;
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> + return PG_LEVEL_2M;
> +
> + return PG_LEVEL_4K;
> +}
> +
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + if (fault->is_private)
> + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> + else
> + vcpu->run->memory.flags = 0;
> + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->memory.size = PAGE_SIZE;
> + return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + int order;
> + struct kvm_memory_slot *slot = fault->slot;
> +
> + if (!kvm_slot_can_be_private(slot))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> + return RET_PF_RETRY;
> +
> + fault->max_level = min(order_to_level(order), fault->max_level);
> + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> + return RET_PF_CONTINUE;
> +}
> +
> static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_memory_slot *slot = fault->slot;
> @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return RET_PF_EMULATE;
> }
>
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (fault->is_private)
> + return kvm_faultin_pfn_private(vcpu, fault);
> +
> async = false;
> fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> fault->write, &fault->map_writable,
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> return -EIO;
> }
>
> + if (r == RET_PF_USER)
> + return 0;
> +
> if (r < 0)
> return r;
> if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> */
> if (sp->role.direct &&
> sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> - PG_LEVEL_NUM)) {
> + PG_LEVEL_NUM,
> + false)) {
> kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index dbaf6755c5a7..5ccf08183b00 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -189,6 +189,7 @@ struct kvm_page_fault {
>
> /* Derived from mmu and global state. */
> const bool is_tdp;
> + const bool is_private;
> const bool nx_huge_page_workaround_enabled;
>
> /*
> @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * RET_PF_RETRY: let CPU fault again on the address.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
> * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> *
> @@ -253,6 +255,7 @@ enum {
> RET_PF_RETRY,
> RET_PF_EMULATE,
> RET_PF_INVALID,
> + RET_PF_USER,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
> @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level);
> + int max_level, bool is_private);
> void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>
> @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + WARN_ON_ONCE(1);
> + return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> TRACE_DEFINE_ENUM(RET_PF_RETRY);
> TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
> TRACE_DEFINE_ENUM(RET_PF_FIXED);
> TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 771210ce5181..8ba1a4afc546 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> continue;
>
> max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> - iter.gfn, PG_LEVEL_NUM);
> + iter.gfn, PG_LEVEL_NUM, false);
> if (max_mapping_level < iter.level)
> continue;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> }
> #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + int ret;
> + struct page *page;
> + pgoff_t index = gfn - slot->base_gfn +
> + (slot->restricted_offset >> PAGE_SHIFT);
> +
> + ret = restrictedmem_get_page(slot->restricted_file, index,
> + &page, order);
> + *pfn = page_to_pfn(page);
> + return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif
> --
> 2.25.1
>
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-02 6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-12-05 9:03 ` Fuad Tabba
@ 2022-12-08 8:37 ` Xiaoyao Li
2022-12-08 11:30 ` Chao Peng
2022-12-19 14:36 ` Borislav Petkov
2023-01-05 11:23 ` Jarkko Sakkinen
3 siblings, 1 reply; 153+ messages in thread
From: Xiaoyao Li @ 2022-12-08 8:37 UTC (permalink / raw)
To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On 12/2/2022 2:13 PM, Chao Peng wrote:
..
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>
From the patch implementation, I have no idea why
HAVE_KVM_RESTRICTED_MEM is needed.
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-07 17:16 ` Fuad Tabba
@ 2022-12-08 11:13 ` Chao Peng
2022-12-09 8:57 ` Fuad Tabba
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-08 11:13 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> Hi,
>
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Unmap the existing guest mappings when the memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages come from different backends; unmapping the existing ones
> > gives the page fault handler a chance to re-populate the mappings
> > according to the new attribute.
> >
> > Only architectures with private memory support need this, and a
> > supporting architecture is expected to override the weak
> > kvm_arch_has_private_mem().
>
> This kind of ties into the discussion of being able to share memory in
> place. For pKVM for example, shared and private memory would have the
> same backend, and the unmapping wouldn't be needed.
>
> So I guess that, instead of kvm_arch_has_private_mem(), can the check
> be done differently, e.g., with a different function, say
> kvm_arch_private_notify_attribute_change() (but maybe with a more
> friendly name than what I suggested :) )?
Besides controlling the unmapping here, kvm_arch_has_private_mem() is
also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
the unmapping is confirmed to be unnecessary for pKVM, but what about
KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
If the answer is the latter, then yes, we should use a different check
here that only applies to confidential usages.
Thanks,
Chao
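For reference, the patch09 gating being referred to is roughly of the
following shape. This is a sketch, not the literal patch09 hunk; the helper
and struct names follow the existing check_memory_region_flags() in
virt/kvm/kvm_main.c and may not match the series exactly:

    static int check_memory_region_flags(struct kvm *kvm,
                                         const struct kvm_userspace_memory_region *mem)
    {
            u64 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

            /* KVM_MEM_PRIVATE is only accepted on VMs with private memory. */
            if (kvm_arch_has_private_mem(kvm))
                    valid_flags |= KVM_MEM_PRIVATE;

    #ifdef __KVM_HAVE_READONLY_MEM
            valid_flags |= KVM_MEM_READONLY;
    #endif

            if (mem->flags & ~valid_flags)
                    return -EINVAL;

            return 0;
    }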
>
> Thanks,
> /fuad
>
> >
> > Also, during the memory attribute change and the unmapping time frame, a
> > page fault may occur in the same memory range and could establish an
> > incorrect page state, so invoke the kvm_mmu_invalidate_* helpers to make
> > the page fault handler retry during this time frame.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> > include/linux/kvm_host.h | 7 +-
> > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > 2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > bool may_block;
> > };
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > struct mmu_notifier mmu_notifier;
> > +#endif
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > gfn_t mmu_invalidate_range_start;
> > gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > int kvm_arch_post_init_vm(struct kvm *kvm);
> > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > }
> > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > + /*
> > + * The count increase must become visible at unlock time as no
> > + * spte can be established without taking the mmu_lock and
> > + * count is also read inside the mmu_lock critical section.
> > + */
> > + kvm->mmu_invalidate_in_progress++;
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = start;
> > + kvm->mmu_invalidate_range_end = end;
> > + } else {
> > + /*
> > + * Fully tracking multiple concurrent ranges has diminishing
> > + * returns. Keep things simple and just find the minimal range
> > + * which includes the current and new ranges. As there won't be
> > + * enough information to subtract a range after its invalidate
> > + * completes, any ranges invalidated concurrently will
> > + * accumulate and persist until all outstanding invalidates
> > + * complete.
> > + */
> > + kvm->mmu_invalidate_range_start =
> > + min(kvm->mmu_invalidate_range_start, start);
> > + kvm->mmu_invalidate_range_end =
> > + max(kvm->mmu_invalidate_range_end, end);
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > +{
> > + /*
> > + * This sequence increase will notify the kvm page fault that
> > + * the page that is going to be mapped in the spte could have
> > + * been freed.
> > + */
> > + kvm->mmu_invalidate_seq++;
> > + smp_wmb();
> > + /*
> > + * The above sequence increase must be visible before the
> > + * below count decrease, which is ensured by the smp_wmb above
> > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > + */
> > + kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > {
> > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > -{
> > - /*
> > - * The count increase must become visible at unlock time as no
> > - * spte can be established without taking the mmu_lock and
> > - * count is also read inside the mmu_lock critical section.
> > - */
> > - kvm->mmu_invalidate_in_progress++;
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > - }
> > -}
> > -
> > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = start;
> > - kvm->mmu_invalidate_range_end = end;
> > - } else {
> > - /*
> > - * Fully tracking multiple concurrent ranges has diminishing
> > - * returns. Keep things simple and just find the minimal range
> > - * which includes the current and new ranges. As there won't be
> > - * enough information to subtract a range after its invalidate
> > - * completes, any ranges invalidated concurrently will
> > - * accumulate and persist until all outstanding invalidates
> > - * complete.
> > - */
> > - kvm->mmu_invalidate_range_start =
> > - min(kvm->mmu_invalidate_range_start, start);
> > - kvm->mmu_invalidate_range_end =
> > - max(kvm->mmu_invalidate_range_end, end);
> > - }
> > -}
> > -
> > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > -{
> > - /*
> > - * This sequence increase will notify the kvm page fault that
> > - * the page that is going to be mapped in the spte could have
> > - * been freed.
> > - */
> > - kvm->mmu_invalidate_seq++;
> > - smp_wmb();
> > - /*
> > - * The above sequence increase must be visible before the
> > - * below count decrease, which is ensured by the smp_wmb above
> > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > - */
> > - kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > return 0;
> > }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > + return false;
> > +}
> > +
> > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > {
> > struct kvm *kvm = kvm_arch_alloc_vm();
> > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > return 0;
> > }
> >
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + struct kvm_gfn_range gfn_range;
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + struct kvm_memslot_iter iter;
> > + int i;
> > + int r = 0;
> > +
> > + gfn_range.pte = __pte(0);
> > + gfn_range.may_block = true;
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > + slot = iter.slot;
> > + gfn_range.start = max(start, slot->base_gfn);
> > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > + if (gfn_range.start >= gfn_range.end)
> > + continue;
> > + gfn_range.slot = slot;
> > +
> > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > + }
> > + }
> > +
> > + if (r)
> > + kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > struct kvm_memory_attributes *attrs)
> > {
> > gfn_t start, end;
> > unsigned long i;
> > void *entry;
> > + int idx;
> > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> >
> > - /* flags is currently not used. */
> > + /* 'flags' is currently not used. */
> > if (attrs->flags)
> > return -EINVAL;
> > if (attrs->attributes & ~supported_attrs)
> > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >
> > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + KVM_MMU_LOCK(kvm);
> > + kvm_mmu_invalidate_begin(kvm);
> > + kvm_mmu_invalidate_range_add(kvm, start, end);
> > + KVM_MMU_UNLOCK(kvm);
> > + }
> > +
> > mutex_lock(&kvm->lock);
> > for (i = start; i < end; i++)
> > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > break;
> > mutex_unlock(&kvm->lock);
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + idx = srcu_read_lock(&kvm->srcu);
> > + KVM_MMU_LOCK(kvm);
> > + if (i > start)
> > + kvm_unmap_mem_range(kvm, start, i);
> > + kvm_mmu_invalidate_end(kvm);
> > + KVM_MMU_UNLOCK(kvm);
> > + srcu_read_unlock(&kvm->srcu, idx);
> > + }
> > +
> > attrs->address = i << PAGE_SHIFT;
> > attrs->size = (end - i) << PAGE_SHIFT;
> >
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
2022-12-07 6:42 ` Isaku Yamahata
@ 2022-12-08 11:17 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-08 11:17 UTC (permalink / raw)
To: Isaku Yamahata
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Dec 06, 2022 at 10:42:24PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 08:02:24PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> > On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> > > On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > > A large page with mixed private/shared subpages can't be mapped as large
> > > > page since its sub private/shared pages are from different memory
> > > > backends and may also treated by architecture differently. When
> > > > private/shared memory are mixed in a large page, the current lpage_info
> > > > is not sufficient to decide whether the page can be mapped as large page
> > > > or not and additional private/shared mixed information is needed.
> > > >
> > > > Tracking this 'mixed' information with the current 'count' like
> > > > disallow_lpage is a bit challenge so reserve a bit in 'disallow_lpage'
> > > > to indicate a large page has mixed private/share subpages and update
> > > > this 'mixed' bit whenever the memory attribute is changed between
> > > > private and shared.
> > > >
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 8 ++
> > > > arch/x86/kvm/mmu/mmu.c | 134 +++++++++++++++++++++++++++++++-
> > > > arch/x86/kvm/x86.c | 2 +
> > > > include/linux/kvm_host.h | 19 +++++
> > > > virt/kvm/kvm_main.c | 9 ++-
> > > > 5 files changed, 169 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 283cbb83d6ae..7772ab37ac89 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -38,6 +38,7 @@
> > > > #include <asm/hyperv-tlfs.h>
> > > >
> > > > #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > > > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> > > >
> > > > #define KVM_MAX_VCPUS 1024
> > > >
> > > > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> > > > #endif
> > > > };
> > > >
> > > > +/*
> > > > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> > > > + * level. The remaining bits are used as a reference count.
> > > > + */
> > > > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
> > > > +#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
> > > > +
> > > > struct kvm_lpage_info {
> > > > int disallow_lpage;
> > > > };
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e2c70b5afa3e..2190fd8c95c0 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > > > {
> > > > struct kvm_lpage_info *linfo;
> > > > int i;
> > > > + int disallow_count;
> > > >
> > > > for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > > > linfo = lpage_info_slot(gfn, slot, i);
> > > > +
> > > > + disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > > > + WARN_ON(disallow_count + count < 0 ||
> > > > + disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > > > +
> > > > linfo->disallow_lpage += count;
> > > > - WARN_ON(linfo->disallow_lpage < 0);
> > > > }
> > > > }
> > > >
> > > > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > > > if (kvm->arch.nx_huge_page_recovery_thread)
> > > > kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> > > > }
> > > > +
> > > > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > > > +{
> > > > + return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > > +}
> > > > +
> > > > +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> > > > + int level, bool mixed)
> > > > +{
> > > > + struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> > > > +
> > > > + if (mixed)
> > > > + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > > + else
> > > > + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > > +}
> > > > +
> > > > +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> > > > +{
> > > > + bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > > > +
> > > > + if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> > > > + if (!expect_private)
> > > > + return false;
> > > > + } else if (expect_private)
> > > > + return false;
> > > > +
> > > > + return true;
> > > > +}
> > > > +
> > > > +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> > > > + gfn_t start, gfn_t end)
> > > > +{
> > > > + XA_STATE(xas, &kvm->mem_attr_array, start);
> > > > + gfn_t gfn = start;
> > > > + void *entry;
> > > > + bool mixed = false;
> > > > +
> > > > + rcu_read_lock();
> > > > + entry = xas_load(&xas);
> > > > + while (gfn < end) {
> > > > + if (xas_retry(&xas, entry))
> > > > + continue;
> > > > +
> > > > + KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > > > +
> > > > + if (!is_expected_attr_entry(entry, attrs)) {
> > > > + mixed = true;
> > > > + break;
> > > > + }
> > > > +
> > > > + entry = xas_next(&xas);
> > > > + gfn++;
> > > > + }
> > > > +
> > > > + rcu_read_unlock();
> > > > + return mixed;
> > > > +}
> > > > +
> > > > +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > + int level, unsigned long attrs,
> > > > + gfn_t start, gfn_t end)
> > > > +{
> > > > + unsigned long gfn;
> > > > +
> > > > + if (level == PG_LEVEL_2M)
> > > > + return mem_attrs_mixed_2m(kvm, attrs, start, end);
> > > > +
> > > > + for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
> > > > + if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> > > > + !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> > > > + attrs))
> > > > + return true;
> > > > + return false;
> > > > +}
> > > > +
> > > > +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> > > > + struct kvm_memory_slot *slot,
> > > > + unsigned long attrs,
> > > > + gfn_t start, gfn_t end)
> > > > +{
> > > > + unsigned long pages, mask;
> > > > + gfn_t gfn, gfn_end, first, last;
> > > > + int level;
> > > > + bool mixed;
> > > > +
> > > > + /*
> > > > + * The sequence matters here: we set the higher level basing on the
> > > > + * lower level's scanning result.
> > > > + */
> > > > + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > > > + pages = KVM_PAGES_PER_HPAGE(level);
> > > > + mask = ~(pages - 1);
> > > > + first = start & mask;
> > > > + last = (end - 1) & mask;
> > > > +
> > > > + /*
> > > > + * We only need to scan the head and tail page, for middle pages
> > > > + * we know they will not be mixed.
> > > > + */
> > > > + gfn = max(first, slot->base_gfn);
> > > > + gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> > > > + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> > > > + linfo_set_mixed(gfn, slot, level, mixed);
> > > > +
> > > > + if (first == last)
> > > > + return;
> > >
> > >
> > > continue.
> >
> > Ya!
> >
> > >
> > > > +
> > > > + for (gfn = first + pages; gfn < last; gfn += pages)
> > > > + linfo_set_mixed(gfn, slot, level, false);
> > > > +
> > > > + gfn = last;
> > > > + gfn_end = min(last + pages, slot->base_gfn + slot->npages);
> > >
> > > if (gfn == gfn_end) continue.
> >
> > Do you see a case where gfn can be equal to gfn_end? Though it does not
> > hurt to add a check.
>
> If last == base_gfn + npages, gfn == gfn_end can occur.
'end' is guaranteed <= base_gfn + npages in kvm_unmap_mem_range():
    gfn_range.end = min(end, slot->base_gfn + slot->npages);
And 'last' is defined as:
    last = (end - 1) & mask;
Then the math is:
    last = (end - 1) & mask
        <= end - 1
        <= base_gfn + npages - 1
        <  base_gfn + npages
Thanks,
Chao
>
>
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 9a07380f8d3c..5aefcff614d2 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> > > > if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> > > > linfo[lpages - 1].disallow_lpage = 1;
> > > > ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > > > + if (kvm_slot_can_be_private(slot))
> > > > + ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> > >
> > > Is there any alignment restriction? If no, It should be +=.
> > > In practice, alignment will hold though.
> >
> > All we need here is to check whether both userspace_addr and
> > restricted_offset are aligned to HPAGE_SIZE. '+=' can actually yield a
> > wrong value when userspace_addr + restricted_offset is aligned to
> > HPAGE_SIZE but the two are not individually aligned (e.g. two offsets of
> > 1M each within a 2M page sum to an aligned 2M).
>
> Ah, got it. The below comment explains it.
>
> > Thanks,
> > Chao
> > >
> > > Thanks,
> > >
> > > > /*
> > > > * If the gfn and userspace address are not aligned wrt each
> > > > * other, disable large page support for this slot.
> --
> Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-07 8:13 ` Yuan Yao
@ 2022-12-08 11:20 ` Chao Peng
2022-12-09 5:43 ` Yuan Yao
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-08 11:20 UTC (permalink / raw)
To: Yuan Yao
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Wed, Dec 07, 2022 at 04:13:14PM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> > Unmap the existing guest mappings when the memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages come from different backends; unmapping the existing ones
> > gives the page fault handler a chance to re-populate the mappings
> > according to the new attribute.
> >
> > Only architectures with private memory support need this, and a
> > supporting architecture is expected to override the weak
> > kvm_arch_has_private_mem().
> >
> > Also, during the memory attribute change and the unmapping time frame, a
> > page fault may occur in the same memory range and could establish an
> > incorrect page state, so invoke the kvm_mmu_invalidate_* helpers to make
> > the page fault handler retry during this time frame.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> > include/linux/kvm_host.h | 7 +-
> > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > 2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > bool may_block;
> > };
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > struct mmu_notifier mmu_notifier;
> > +#endif
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > gfn_t mmu_invalidate_range_start;
> > gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > int kvm_arch_post_init_vm(struct kvm *kvm);
> > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > }
> > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > + /*
> > + * The count increase must become visible at unlock time as no
> > + * spte can be established without taking the mmu_lock and
> > + * count is also read inside the mmu_lock critical section.
> > + */
> > + kvm->mmu_invalidate_in_progress++;
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = start;
> > + kvm->mmu_invalidate_range_end = end;
> > + } else {
> > + /*
> > + * Fully tracking multiple concurrent ranges has diminishing
> > + * returns. Keep things simple and just find the minimal range
> > + * which includes the current and new ranges. As there won't be
> > + * enough information to subtract a range after its invalidate
> > + * completes, any ranges invalidated concurrently will
> > + * accumulate and persist until all outstanding invalidates
> > + * complete.
> > + */
> > + kvm->mmu_invalidate_range_start =
> > + min(kvm->mmu_invalidate_range_start, start);
> > + kvm->mmu_invalidate_range_end =
> > + max(kvm->mmu_invalidate_range_end, end);
> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > +{
> > + /*
> > + * This sequence increase will notify the kvm page fault that
> > + * the page that is going to be mapped in the spte could have
> > + * been freed.
> > + */
> > + kvm->mmu_invalidate_seq++;
> > + smp_wmb();
> > + /*
> > + * The above sequence increase must be visible before the
> > + * below count decrease, which is ensured by the smp_wmb above
> > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > + */
> > + kvm->mmu_invalidate_in_progress--;
> > +}
> > +
> > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > {
> > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > -{
> > - /*
> > - * The count increase must become visible at unlock time as no
> > - * spte can be established without taking the mmu_lock and
> > - * count is also read inside the mmu_lock critical section.
> > - */
> > - kvm->mmu_invalidate_in_progress++;
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > - }
> > -}
> > -
> > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > -{
> > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > -
> > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > - kvm->mmu_invalidate_range_start = start;
> > - kvm->mmu_invalidate_range_end = end;
> > - } else {
> > - /*
> > - * Fully tracking multiple concurrent ranges has diminishing
> > - * returns. Keep things simple and just find the minimal range
> > - * which includes the current and new ranges. As there won't be
> > - * enough information to subtract a range after its invalidate
> > - * completes, any ranges invalidated concurrently will
> > - * accumulate and persist until all outstanding invalidates
> > - * complete.
> > - */
> > - kvm->mmu_invalidate_range_start =
> > - min(kvm->mmu_invalidate_range_start, start);
> > - kvm->mmu_invalidate_range_end =
> > - max(kvm->mmu_invalidate_range_end, end);
> > - }
> > -}
> > -
> > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > -{
> > - /*
> > - * This sequence increase will notify the kvm page fault that
> > - * the page that is going to be mapped in the spte could have
> > - * been freed.
> > - */
> > - kvm->mmu_invalidate_seq++;
> > - smp_wmb();
> > - /*
> > - * The above sequence increase must be visible before the
> > - * below count decrease, which is ensured by the smp_wmb above
> > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > - */
> > - kvm->mmu_invalidate_in_progress--;
> > -}
> > -
> > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > const struct mmu_notifier_range *range)
> > {
> > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > return 0;
> > }
> >
> > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > + return false;
> > +}
> > +
> > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > {
> > struct kvm *kvm = kvm_arch_alloc_vm();
> > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > return 0;
> > }
> >
> > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + struct kvm_gfn_range gfn_range;
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + struct kvm_memslot_iter iter;
> > + int i;
> > + int r = 0;
> > +
> > + gfn_range.pte = __pte(0);
> > + gfn_range.may_block = true;
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > + slot = iter.slot;
> > + gfn_range.start = max(start, slot->base_gfn);
> > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > + if (gfn_range.start >= gfn_range.end)
> > + continue;
> > + gfn_range.slot = slot;
> > +
> > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > + }
> > + }
> > +
> > + if (r)
> > + kvm_flush_remote_tlbs(kvm);
> > +}
> > +
> > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > struct kvm_memory_attributes *attrs)
> > {
> > gfn_t start, end;
> > unsigned long i;
> > void *entry;
> > + int idx;
> > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> >
> > - /* flags is currently not used. */
> > + /* 'flags' is currently not used. */
> > if (attrs->flags)
> > return -EINVAL;
> > if (attrs->attributes & ~supported_attrs)
> > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> >
> > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + KVM_MMU_LOCK(kvm);
> > + kvm_mmu_invalidate_begin(kvm);
> > + kvm_mmu_invalidate_range_add(kvm, start, end);
>
> Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but
> the invalidation shouldn't be necessary yet for attribute changes of:
>
> KVM_MEMORY_ATTRIBUTE_READ
> KVM_MEMORY_ATTRIBUTE_WRITE
> KVM_MEMORY_ATTRIBUTE_EXECUTE
The unmapping is only needed for confidential usages, which use
KVM_MEMORY_ATTRIBUTE_PRIVATE only; the other flags are defined here
for other usages like pKVM. As Fuad commented in a different reply, pKVM
supports in-place remapping, so unmapping is unnecessary there.
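Just to illustrate the point (nothing from this patch): if skipping the zap for
non-PRIVATE attribute changes ever did become necessary, it could be keyed on
whether KVM_MEMORY_ATTRIBUTE_PRIVATE actually flips for the range. A rough,
untested sketch, with a made-up helper name, reusing kvm_mem_is_private()
from patch 8/9:

static bool kvm_range_flips_private(struct kvm *kvm, gfn_t start, gfn_t end,
				    u64 new_attrs)
{
	bool new_private = new_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
	gfn_t gfn;

	for (gfn = start; gfn < end; gfn++) {
		/* kvm_mem_is_private() looks up kvm->mem_attr_array. */
		if (kvm_mem_is_private(kvm, gfn) != new_private)
			return true;
	}
	return false;
}

kvm_vm_ioctl_set_mem_attributes() would then only take the invalidate/unmap
path when this returns true.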
Thanks,
Chao
>
> > + KVM_MMU_UNLOCK(kvm);
> > + }
> > +
> > mutex_lock(&kvm->lock);
> > for (i = start; i < end; i++)
> > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > break;
> > mutex_unlock(&kvm->lock);
> >
> > + if (kvm_arch_has_private_mem(kvm)) {
> > + idx = srcu_read_lock(&kvm->srcu);
> > + KVM_MMU_LOCK(kvm);
> > + if (i > start)
> > + kvm_unmap_mem_range(kvm, start, i);
> > + kvm_mmu_invalidate_end(kvm);
>
> Ditto.
>
> > + KVM_MMU_UNLOCK(kvm);
> > + srcu_read_unlock(&kvm->srcu, idx);
> > + }
> > +
> > attrs->address = i << PAGE_SHIFT;
> > attrs->size = (end - i) << PAGE_SHIFT;
> >
> > --
> > 2.25.1
> >
> >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
2022-12-08 2:29 ` Yuan Yao
@ 2022-12-08 11:23 ` Chao Peng
2022-12-09 5:45 ` Yuan Yao
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-08 11:23 UTC (permalink / raw)
To: Yuan Yao
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Dec 08, 2022 at 10:29:18AM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the on-going fault is private or not. This patch adds a
> > 'is_private' field to kvm_page_fault to indicate this and architecture
> > code is expected to set it.
> >
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> > - For a successful match, private pfn is obtained with
> > restrictedmem_get_page() and shared pfn is obtained with existing
> > get_user_pages().
> > - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared
> > in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> > arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> > arch/x86/kvm/mmu/mmutrace.h | 1 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> > include/linux/kvm_host.h | 30 ++++++++++++++++
> > 5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level)
> > + int max_level, bool is_private)
> > {
> > struct kvm_lpage_info *linfo;
> > int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >
> > + if (is_private)
> > + return max_level;
>
> The lpage mixed information is already saved, so is it possible
> to query info->disallow_lpage without caring about 'is_private'?
Actually we already queried info->disallow_lpage just before this
point. The check is needed because later in the function we call
host_pfn_mapping_level(), which is specific to shared memory.
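For the private path the mapping level instead comes from the backing store
order returned by restrictedmem_get_page(). Roughly, with 4K base pages and
x86 constants (illustration only):

  order  0..8   ->  order_to_level() == PG_LEVEL_4K
  order  9..17  ->  order_to_level() == PG_LEVEL_2M
  order >= 18   ->  order_to_level() == PG_LEVEL_1G

and kvm_faultin_pfn_private() then clamps the fault with
fault->max_level = min(order_to_level(order), fault->max_level).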
Thanks,
Chao
>
> > +
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > * level, which will be used to do precise, accurate accounting.
> > */
> > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > - fault->gfn, fault->max_level);
> > + fault->gfn, fault->max_level,
> > + fault->is_private);
> > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > + return PG_LEVEL_1G;
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > + return PG_LEVEL_2M;
> > +
> > + return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > + if (fault->is_private)
> > + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > + else
> > + vcpu->run->memory.flags = 0;
> > + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > + vcpu->run->memory.size = PAGE_SIZE;
> > + return RET_PF_USER;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + int order;
> > + struct kvm_memory_slot *slot = fault->slot;
> > +
> > + if (!kvm_slot_can_be_private(slot))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > + return RET_PF_RETRY;
> > +
> > + fault->max_level = min(order_to_level(order), fault->max_level);
> > + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > + return RET_PF_CONTINUE;
> > +}
> > +
> > static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > struct kvm_memory_slot *slot = fault->slot;
> > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return RET_PF_EMULATE;
> > }
> >
> > + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (fault->is_private)
> > + return kvm_faultin_pfn_private(vcpu, fault);
> > +
> > async = false;
> > fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > fault->write, &fault->map_writable,
> > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > return -EIO;
> > }
> >
> > + if (r == RET_PF_USER)
> > + return 0;
> > +
> > if (r < 0)
> > return r;
> > if (r != RET_PF_EMULATE)
> > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > */
> > if (sp->role.direct &&
> > sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > - PG_LEVEL_NUM)) {
> > + PG_LEVEL_NUM,
> > + false)) {
> > kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> >
> > if (kvm_available_flush_tlb_with_range())
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index dbaf6755c5a7..5ccf08183b00 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> >
> > /* Derived from mmu and global state. */
> > const bool is_tdp;
> > + const bool is_private;
> > const bool nx_huge_page_workaround_enabled;
> >
> > /*
> > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > * RET_PF_RETRY: let CPU fault again on the address.
> > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > * RET_PF_FIXED: The faulting entry has been fixed.
> > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > *
> > @@ -253,6 +255,7 @@ enum {
> > RET_PF_RETRY,
> > RET_PF_EMULATE,
> > RET_PF_INVALID,
> > + RET_PF_USER,
> > RET_PF_FIXED,
> > RET_PF_SPURIOUS,
> > };
> > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level);
> > + int max_level, bool is_private);
> > void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> >
> > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + WARN_ON_ONCE(1);
> > + return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index ae86820cef69..2d7555381955 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 771210ce5181..8ba1a4afc546 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > continue;
> >
> > max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > - iter.gfn, PG_LEVEL_NUM);
> > + iter.gfn, PG_LEVEL_NUM, false);
> > if (max_mapping_level < iter.level)
> > continue;
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 25099c94e770..153842bb33df 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > }
> > #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return false;
> > +}
> > +
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + int ret;
> > + struct page *page;
> > + pgoff_t index = gfn - slot->base_gfn +
> > + (slot->restricted_offset >> PAGE_SHIFT);
> > +
> > + ret = restrictedmem_get_page(slot->restricted_file, index,
> > + &page, order);
> > + *pfn = page_to_pfn(page);
> > + return ret;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif
> > --
> > 2.25.1
> >
> >
^ permalink raw reply [flat|nested] 153+ messages in thread
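To make the last point of the changelog quoted above concrete, the userspace
side of a KVM_EXIT_MEMORY_FAULT could look roughly like the sketch below. This
is only an illustration: struct and ioctl names follow the uapi proposed by
this series, and all error handling is omitted.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Convert the faulting range in the host's view, then re-enter the guest. */
static void handle_memory_fault_exit(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs;

	memset(&attrs, 0, sizeof(attrs));
	attrs.address = run->memory.gpa;
	attrs.size = run->memory.size;
	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		attrs.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE; /* shared -> private */
	else
		attrs.attributes = 0; /* private -> shared */

	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	/* The vCPU is then re-run with KVM_RUN and the fault is retried. */
}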
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-08 8:37 ` Xiaoyao Li
@ 2022-12-08 11:30 ` Chao Peng
2022-12-13 12:04 ` Xiaoyao Li
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-08 11:30 UTC (permalink / raw)
To: Xiaoyao Li
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> On 12/2/2022 2:13 PM, Chao Peng wrote:
>
> ..
>
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
>
> From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> needed.
The reason is that we want KVM to further control the feature enabling. An
opt-in CONFIG_RESTRICTEDMEM can cause problems if the user sets it for
unsupported architectures.
Here is the original discussion:
https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
Thanks,
Chao
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-08 11:20 ` Chao Peng
@ 2022-12-09 5:43 ` Yuan Yao
0 siblings, 0 replies; 153+ messages in thread
From: Yuan Yao @ 2022-12-09 5:43 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Dec 08, 2022 at 07:20:43PM +0800, Chao Peng wrote:
> On Wed, Dec 07, 2022 at 04:13:14PM +0800, Yuan Yao wrote:
> > On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> > > Unmap the existing guest mappings when the memory attribute is changed
> > > between shared and private. This is needed because shared pages and
> > > private pages are from different backends; unmapping the existing ones
> > > gives the page fault handler a chance to re-populate the mappings
> > > according to the new attribute.
> > >
> > > Only an architecture that has private memory support needs this, and the
> > > supported architecture is expected to rewrite the weak
> > > kvm_arch_has_private_mem().
> > >
> > > Also, during the memory attribute change and the unmapping time frame,
> > > page faults may happen in the same memory range and can cause
> > > incorrect page state; invoke the kvm_mmu_invalidate_* helpers to let the
> > > page fault handler retry during this time frame.
> > >
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > > include/linux/kvm_host.h | 7 +-
> > > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > > 2 files changed, 116 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3d69484d2704..3331c0c92838 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > struct kvm_gfn_range {
> > > struct kvm_memory_slot *slot;
> > > gfn_t start;
> > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > bool may_block;
> > > };
> > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -785,11 +786,12 @@ struct kvm {
> > >
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > struct mmu_notifier mmu_notifier;
> > > +#endif
> > > unsigned long mmu_invalidate_seq;
> > > long mmu_invalidate_in_progress;
> > > gfn_t mmu_invalidate_range_start;
> > > gfn_t mmu_invalidate_range_end;
> > > -#endif
> > > +
> > > struct list_head devices;
> > > u64 manual_dirty_log_protect;
> > > struct dentry *debugfs_dentry;
> > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > int kvm_arch_post_init_vm(struct kvm *kvm);
> > > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > /*
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * The count increase must become visible at unlock time as no
> > > + * spte can be established without taking the mmu_lock and
> > > + * count is also read inside the mmu_lock critical section.
> > > + */
> > > + kvm->mmu_invalidate_in_progress++;
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = start;
> > > + kvm->mmu_invalidate_range_end = end;
> > > + } else {
> > > + /*
> > > + * Fully tracking multiple concurrent ranges has diminishing
> > > + * returns. Keep things simple and just find the minimal range
> > > + * which includes the current and new ranges. As there won't be
> > > + * enough information to subtract a range after its invalidate
> > > + * completes, any ranges invalidated concurrently will
> > > + * accumulate and persist until all outstanding invalidates
> > > + * complete.
> > > + */
> > > + kvm->mmu_invalidate_range_start =
> > > + min(kvm->mmu_invalidate_range_start, start);
> > > + kvm->mmu_invalidate_range_end =
> > > + max(kvm->mmu_invalidate_range_end, end);
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * This sequence increase will notify the kvm page fault that
> > > + * the page that is going to be mapped in the spte could have
> > > + * been freed.
> > > + */
> > > + kvm->mmu_invalidate_seq++;
> > > + smp_wmb();
> > > + /*
> > > + * The above sequence increase must be visible before the
> > > + * below count decrease, which is ensured by the smp_wmb above
> > > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > + */
> > > + kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > {
> > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * The count increase must become visible at unlock time as no
> > > - * spte can be established without taking the mmu_lock and
> > > - * count is also read inside the mmu_lock critical section.
> > > - */
> > > - kvm->mmu_invalidate_in_progress++;
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > - }
> > > -}
> > > -
> > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = start;
> > > - kvm->mmu_invalidate_range_end = end;
> > > - } else {
> > > - /*
> > > - * Fully tracking multiple concurrent ranges has diminishing
> > > - * returns. Keep things simple and just find the minimal range
> > > - * which includes the current and new ranges. As there won't be
> > > - * enough information to subtract a range after its invalidate
> > > - * completes, any ranges invalidated concurrently will
> > > - * accumulate and persist until all outstanding invalidates
> > > - * complete.
> > > - */
> > > - kvm->mmu_invalidate_range_start =
> > > - min(kvm->mmu_invalidate_range_start, start);
> > > - kvm->mmu_invalidate_range_end =
> > > - max(kvm->mmu_invalidate_range_end, end);
> > > - }
> > > -}
> > > -
> > > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > return 0;
> > > }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * This sequence increase will notify the kvm page fault that
> > > - * the page that is going to be mapped in the spte could have
> > > - * been freed.
> > > - */
> > > - kvm->mmu_invalidate_seq++;
> > > - smp_wmb();
> > > - /*
> > > - * The above sequence increase must be visible before the
> > > - * below count decrease, which is ensured by the smp_wmb above
> > > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > - */
> > > - kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > {
> > > struct kvm *kvm = kvm_arch_alloc_vm();
> > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + struct kvm_gfn_range gfn_range;
> > > + struct kvm_memory_slot *slot;
> > > + struct kvm_memslots *slots;
> > > + struct kvm_memslot_iter iter;
> > > + int i;
> > > + int r = 0;
> > > +
> > > + gfn_range.pte = __pte(0);
> > > + gfn_range.may_block = true;
> > > +
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + slots = __kvm_memslots(kvm, i);
> > > +
> > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > + slot = iter.slot;
> > > + gfn_range.start = max(start, slot->base_gfn);
> > > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > + if (gfn_range.start >= gfn_range.end)
> > > + continue;
> > > + gfn_range.slot = slot;
> > > +
> > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > + }
> > > + }
> > > +
> > > + if (r)
> > > + kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > struct kvm_memory_attributes *attrs)
> > > {
> > > gfn_t start, end;
> > > unsigned long i;
> > > void *entry;
> > > + int idx;
> > > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > >
> > > - /* flags is currently not used. */
> > > + /* 'flags' is currently not used. */
> > > if (attrs->flags)
> > > return -EINVAL;
> > > if (attrs->attributes & ~supported_attrs)
> > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >
> > > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + KVM_MMU_LOCK(kvm);
> > > + kvm_mmu_invalidate_begin(kvm);
> > > + kvm_mmu_invalidate_range_add(kvm, start, end);
> >
> > Nit: this works for KVM_MEMORY_ATTRIBUTE_PRIVATE, but
> > the invalidation shouldn't be necessary yet for attribute changes of:
> >
> > KVM_MEMORY_ATTRIBUTE_READ
> > KVM_MEMORY_ATTRIBUTE_WRITE
> > KVM_MEMORY_ATTRIBUTE_EXECUTE
>
> The unmapping is only needed for confidential usages, which use
> KVM_MEMORY_ATTRIBUTE_PRIVATE only; the other flags are defined here
> for other usages like pKVM. As Fuad commented in a different reply, pKVM
> supports in-place remapping, so unmapping is unnecessary there.
Ah, I see. That's fine with me, thanks.
>
> Thanks,
> Chao
> >
> > > + KVM_MMU_UNLOCK(kvm);
> > > + }
> > > +
> > > mutex_lock(&kvm->lock);
> > > for (i = start; i < end; i++)
> > > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > break;
> > > mutex_unlock(&kvm->lock);
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + idx = srcu_read_lock(&kvm->srcu);
> > > + KVM_MMU_LOCK(kvm);
> > > + if (i > start)
> > > + kvm_unmap_mem_range(kvm, start, i);
> > > + kvm_mmu_invalidate_end(kvm);
> >
> > Ditto.
> >
> > > + KVM_MMU_UNLOCK(kvm);
> > > + srcu_read_unlock(&kvm->srcu, idx);
> > > + }
> > > +
> > > attrs->address = i << PAGE_SHIFT;
> > > attrs->size = (end - i) << PAGE_SHIFT;
> > >
> > > --
> > > 2.25.1
> > >
> > >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
2022-12-08 11:23 ` Chao Peng
@ 2022-12-09 5:45 ` Yuan Yao
0 siblings, 0 replies; 153+ messages in thread
From: Yuan Yao @ 2022-12-09 5:45 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Dec 08, 2022 at 07:23:46PM +0800, Chao Peng wrote:
> On Thu, Dec 08, 2022 at 10:29:18AM +0800, Yuan Yao wrote:
> > On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> > > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > > hva-based shared memory. Architecture code (like TDX code) can tell
> > > whether the on-going fault is private or not. This patch adds a
> > > 'is_private' field to kvm_page_fault to indicate this and architecture
> > > code is expected to set it.
> > >
> > > To handle page fault for such memslot, the handling logic is different
> > > depending on whether the fault is private or shared. KVM checks if
> > > 'is_private' matches the host's view of the page (maintained in
> > > mem_attr_array).
> > > - For a successful match, private pfn is obtained with
> > > restrictedmem_get_page() and shared pfn is obtained with existing
> > > get_user_pages().
> > > - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > > userspace. Userspace then can convert memory between private/shared
> > > in host's view and retry the fault.
> > >
> > > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> > > arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> > > arch/x86/kvm/mmu/mmutrace.h | 1 +
> > > arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> > > include/linux/kvm_host.h | 30 ++++++++++++++++
> > > 5 files changed, 105 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2190fd8c95c0..b1953ebc012e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> > >
> > > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > const struct kvm_memory_slot *slot, gfn_t gfn,
> > > - int max_level)
> > > + int max_level, bool is_private)
> > > {
> > > struct kvm_lpage_info *linfo;
> > > int host_level;
> > > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > break;
> > > }
> > >
> > > + if (is_private)
> > > + return max_level;
> >
> > The lpage mixed information is already saved, so is it possible
> > to query info->disallow_lpage without caring about 'is_private'?
>
> Actually we already queried info->disallow_lpage just before this
> point. The check is needed because later in the function we call
> host_pfn_mapping_level(), which is specific to shared memory.
You're right. We can't get mapping level info for a private page from
host_pfn_mapping_level().
>
> Thanks,
> Chao
> >
> > > +
> > > if (max_level == PG_LEVEL_4K)
> > > return PG_LEVEL_4K;
> > >
> > > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > > * level, which will be used to do precise, accurate accounting.
> > > */
> > > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > > - fault->gfn, fault->max_level);
> > > + fault->gfn, fault->max_level,
> > > + fault->is_private);
> > > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > > return;
> > >
> > > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > > }
> > >
> > > +static inline u8 order_to_level(int order)
> > > +{
> > > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > > +
> > > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > > + return PG_LEVEL_1G;
> > > +
> > > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > > + return PG_LEVEL_2M;
> > > +
> > > + return PG_LEVEL_4K;
> > > +}
> > > +
> > > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > > + struct kvm_page_fault *fault)
> > > +{
> > > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > > + if (fault->is_private)
> > > + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > > + else
> > > + vcpu->run->memory.flags = 0;
> > > + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > > + vcpu->run->memory.size = PAGE_SIZE;
> > > + return RET_PF_USER;
> > > +}
> > > +
> > > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > + struct kvm_page_fault *fault)
> > > +{
> > > + int order;
> > > + struct kvm_memory_slot *slot = fault->slot;
> > > +
> > > + if (!kvm_slot_can_be_private(slot))
> > > + return kvm_do_memory_fault_exit(vcpu, fault);
> > > +
> > > + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > > + return RET_PF_RETRY;
> > > +
> > > + fault->max_level = min(order_to_level(order), fault->max_level);
> > > + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > > + return RET_PF_CONTINUE;
> > > +}
> > > +
> > > static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > {
> > > struct kvm_memory_slot *slot = fault->slot;
> > > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > return RET_PF_EMULATE;
> > > }
> > >
> > > + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > > + return kvm_do_memory_fault_exit(vcpu, fault);
> > > +
> > > + if (fault->is_private)
> > > + return kvm_faultin_pfn_private(vcpu, fault);
> > > +
> > > async = false;
> > > fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > > fault->write, &fault->map_writable,
> > > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > > return -EIO;
> > > }
> > >
> > > + if (r == RET_PF_USER)
> > > + return 0;
> > > +
> > > if (r < 0)
> > > return r;
> > > if (r != RET_PF_EMULATE)
> > > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > > */
> > > if (sp->role.direct &&
> > > sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > > - PG_LEVEL_NUM)) {
> > > + PG_LEVEL_NUM,
> > > + false)) {
> > > kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> > >
> > > if (kvm_available_flush_tlb_with_range())
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index dbaf6755c5a7..5ccf08183b00 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> > >
> > > /* Derived from mmu and global state. */
> > > const bool is_tdp;
> > > + const bool is_private;
> > > const bool nx_huge_page_workaround_enabled;
> > >
> > > /*
> > > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > > * RET_PF_RETRY: let CPU fault again on the address.
> > > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > > * RET_PF_FIXED: The faulting entry has been fixed.
> > > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > > *
> > > @@ -253,6 +255,7 @@ enum {
> > > RET_PF_RETRY,
> > > RET_PF_EMULATE,
> > > RET_PF_INVALID,
> > > + RET_PF_USER,
> > > RET_PF_FIXED,
> > > RET_PF_SPURIOUS,
> > > };
> > > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >
> > > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > const struct kvm_memory_slot *slot, gfn_t gfn,
> > > - int max_level);
> > > + int max_level, bool is_private);
> > > void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > > void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> > >
> > > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >
> > > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > +{
> > > + WARN_ON_ONCE(1);
> > > + return -EOPNOTSUPP;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > +
> > > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > > index ae86820cef69..2d7555381955 100644
> > > --- a/arch/x86/kvm/mmu/mmutrace.h
> > > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > > TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > > TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > > TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > > TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > > TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 771210ce5181..8ba1a4afc546 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > > continue;
> > >
> > > max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > > - iter.gfn, PG_LEVEL_NUM);
> > > + iter.gfn, PG_LEVEL_NUM, false);
> > > if (max_mapping_level < iter.level)
> > > continue;
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 25099c94e770..153842bb33df 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > }
> > > #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> > >
> > > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > > + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > > +}
> > > +#else
> > > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > +
> > > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > > +{
> > > + int ret;
> > > + struct page *page;
> > > + pgoff_t index = gfn - slot->base_gfn +
> > > + (slot->restricted_offset >> PAGE_SHIFT);
> > > +
> > > + ret = restrictedmem_get_page(slot->restricted_file, index,
> > > + &page, order);
> > > + *pfn = page_to_pfn(page);
> > > + return ret;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > > +
> > > #endif
> > > --
> > > 2.25.1
> > >
> > >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry
2022-12-06 15:48 ` Fuad Tabba
@ 2022-12-09 6:24 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-09 6:24 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Tue, Dec 06, 2022 at 03:48:50PM +0000, Fuad Tabba wrote:
...
> >
> > > > */
> > > > - if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > - hva >= kvm->mmu_invalidate_range_start &&
> > > > - hva < kvm->mmu_invalidate_range_end)
> > > > - return 1;
> > > > + if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > + /*
> > > > + * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> > > > + * but before updating the range is a KVM bug.
> > > > + */
> > > > + if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> > > > + kvm->mmu_invalidate_range_end == INVALID_GPA))
> > >
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> >
> > Hmm, INVALID_GPA is defined as zero for x86. I'm not 100% confident this is
> > the correct choice for other architectures, but after searching it has not
> > been used by other architectures, so it should be safe to make it common.
Yu has posted a patch:
https://lore.kernel.org/all/20221209023622.274715-1-yu.c.zhang@linux.intel.com/
There is already a GPA_INVALID in include/linux/kvm_types.h and I see ARM has
already been using it, so it sounds like that is exactly what I need.
Chao
>
> With this fixed,
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> And the necessary work to port to arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
^ permalink raw reply [flat|nested] 153+ messages in thread
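For context on the begin/end protocol discussed above, the consumer side in a
page fault handler follows the usual KVM pattern sketched below. This is a
simplified illustration, not code from the series: it uses the plain
sequence-count form via mmu_invalidate_retry(), while the range check that
this patch converts from hva to gfn refines the same pattern.

static kvm_pfn_t example_private_fault(struct kvm *kvm,
				       struct kvm_memory_slot *slot, gfn_t gfn)
{
	unsigned long mmu_seq;
	kvm_pfn_t pfn;
	int order;

retry:
	mmu_seq = kvm->mmu_invalidate_seq;
	/* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */
	smp_rmb();

	/* Resolve the pfn outside mmu_lock; this may sleep. */
	if (kvm_restricted_mem_get_pfn(slot, gfn, &pfn, &order))
		return KVM_PFN_ERR_FAULT;

	KVM_MMU_LOCK(kvm);
	/*
	 * If an invalidation (e.g. an attribute change) raced with the pfn
	 * lookup, the sequence count has moved on or an invalidation is
	 * still in progress: discard the pfn and start over (a real
	 * implementation would also drop the page reference here).
	 */
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		KVM_MMU_UNLOCK(kvm);
		goto retry;
	}
	/* ... install the mapping while holding mmu_lock ... */
	KVM_MMU_UNLOCK(kvm);
	return pfn;
}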
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-08 11:13 ` Chao Peng
@ 2022-12-09 8:57 ` Fuad Tabba
2022-12-12 7:22 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-09 8:57 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Thu, Dec 8, 2022 at 11:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> > Hi,
> >
> > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > Unmap the existing guest mappings when the memory attribute is changed
> > > between shared and private. This is needed because shared pages and
> > > private pages are from different backends; unmapping the existing ones
> > > gives the page fault handler a chance to re-populate the mappings
> > > according to the new attribute.
> > >
> > > Only an architecture that has private memory support needs this, and the
> > > supported architecture is expected to rewrite the weak
> > > kvm_arch_has_private_mem().
> >
> > This kind of ties into the discussion of being able to share memory in
> > place. For pKVM for example, shared and private memory would have the
> > same backend, and the unmapping wouldn't be needed.
> >
> > So I guess that, instead of kvm_arch_has_private_mem(), the check could
> > be done differently, e.g., with a different function, say
> > kvm_arch_private_notify_attribute_change() (but maybe with a more
> > friendly name than what I suggested :) )?
>
> Besides controlling the unmapping here, kvm_arch_has_private_mem() is
> also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
> unmapping is confirmed unnecessary for pKVM, but how about
> KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
> If the answer is the latter, then yes we should use a different check
> which only works for confidential usages here.
I think it makes sense for pKVM to use the same flag (KVM_MEM_PRIVATE)
and not to add another one.
Thank you,
/fuad
>
> Thanks,
> Chao
> >
> > Thanks,
> > /fuad
> >
> > >
> > > Also, during the memory attribute change and the unmapping time frame,
> > > page faults may happen in the same memory range and can cause
> > > incorrect page state; invoke the kvm_mmu_invalidate_* helpers to let the
> > > page fault handler retry during this time frame.
> > >
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > > include/linux/kvm_host.h | 7 +-
> > > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > > 2 files changed, 116 insertions(+), 59 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3d69484d2704..3331c0c92838 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > #endif
> > >
> > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > struct kvm_gfn_range {
> > > struct kvm_memory_slot *slot;
> > > gfn_t start;
> > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > bool may_block;
> > > };
> > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +
> > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > @@ -785,11 +786,12 @@ struct kvm {
> > >
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > struct mmu_notifier mmu_notifier;
> > > +#endif
> > > unsigned long mmu_invalidate_seq;
> > > long mmu_invalidate_in_progress;
> > > gfn_t mmu_invalidate_range_start;
> > > gfn_t mmu_invalidate_range_end;
> > > -#endif
> > > +
> > > struct list_head devices;
> > > u64 manual_dirty_log_protect;
> > > struct dentry *debugfs_dentry;
> > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > int kvm_arch_post_init_vm(struct kvm *kvm);
> > > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > >
> > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > /*
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > >
> > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * The count increase must become visible at unlock time as no
> > > + * spte can be established without taking the mmu_lock and
> > > + * count is also read inside the mmu_lock critical section.
> > > + */
> > > + kvm->mmu_invalidate_in_progress++;
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > +
> > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > + kvm->mmu_invalidate_range_start = start;
> > > + kvm->mmu_invalidate_range_end = end;
> > > + } else {
> > > + /*
> > > + * Fully tracking multiple concurrent ranges has diminishing
> > > + * returns. Keep things simple and just find the minimal range
> > > + * which includes the current and new ranges. As there won't be
> > > + * enough information to subtract a range after its invalidate
> > > + * completes, any ranges invalidated concurrently will
> > > + * accumulate and persist until all outstanding invalidates
> > > + * complete.
> > > + */
> > > + kvm->mmu_invalidate_range_start =
> > > + min(kvm->mmu_invalidate_range_start, start);
> > > + kvm->mmu_invalidate_range_end =
> > > + max(kvm->mmu_invalidate_range_end, end);
> > > + }
> > > +}
> > > +
> > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > +{
> > > + /*
> > > + * This sequence increase will notify the kvm page fault that
> > > + * the page that is going to be mapped in the spte could have
> > > + * been freed.
> > > + */
> > > + kvm->mmu_invalidate_seq++;
> > > + smp_wmb();
> > > + /*
> > > + * The above sequence increase must be visible before the
> > > + * below count decrease, which is ensured by the smp_wmb above
> > > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > + */
> > > + kvm->mmu_invalidate_in_progress--;
> > > +}
> > > +
> > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > {
> > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > }
> > >
> > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * The count increase must become visible at unlock time as no
> > > - * spte can be established without taking the mmu_lock and
> > > - * count is also read inside the mmu_lock critical section.
> > > - */
> > > - kvm->mmu_invalidate_in_progress++;
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > - }
> > > -}
> > > -
> > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > -{
> > > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > -
> > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > - kvm->mmu_invalidate_range_start = start;
> > > - kvm->mmu_invalidate_range_end = end;
> > > - } else {
> > > - /*
> > > - * Fully tracking multiple concurrent ranges has diminishing
> > > - * returns. Keep things simple and just find the minimal range
> > > - * which includes the current and new ranges. As there won't be
> > > - * enough information to subtract a range after its invalidate
> > > - * completes, any ranges invalidated concurrently will
> > > - * accumulate and persist until all outstanding invalidates
> > > - * complete.
> > > - */
> > > - kvm->mmu_invalidate_range_start =
> > > - min(kvm->mmu_invalidate_range_start, start);
> > > - kvm->mmu_invalidate_range_end =
> > > - max(kvm->mmu_invalidate_range_end, end);
> > > - }
> > > -}
> > > -
> > > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > return 0;
> > > }
> > >
> > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > -{
> > > - /*
> > > - * This sequence increase will notify the kvm page fault that
> > > - * the page that is going to be mapped in the spte could have
> > > - * been freed.
> > > - */
> > > - kvm->mmu_invalidate_seq++;
> > > - smp_wmb();
> > > - /*
> > > - * The above sequence increase must be visible before the
> > > - * below count decrease, which is ensured by the smp_wmb above
> > > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > - */
> > > - kvm->mmu_invalidate_in_progress--;
> > > -}
> > > -
> > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > const struct mmu_notifier_range *range)
> > > {
> > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > {
> > > struct kvm *kvm = kvm_arch_alloc_vm();
> > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > > return 0;
> > > }
> > >
> > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > + struct kvm_gfn_range gfn_range;
> > > + struct kvm_memory_slot *slot;
> > > + struct kvm_memslots *slots;
> > > + struct kvm_memslot_iter iter;
> > > + int i;
> > > + int r = 0;
> > > +
> > > + gfn_range.pte = __pte(0);
> > > + gfn_range.may_block = true;
> > > +
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + slots = __kvm_memslots(kvm, i);
> > > +
> > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > + slot = iter.slot;
> > > + gfn_range.start = max(start, slot->base_gfn);
> > > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > + if (gfn_range.start >= gfn_range.end)
> > > + continue;
> > > + gfn_range.slot = slot;
> > > +
> > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > + }
> > > + }
> > > +
> > > + if (r)
> > > + kvm_flush_remote_tlbs(kvm);
> > > +}
> > > +
> > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > struct kvm_memory_attributes *attrs)
> > > {
> > > gfn_t start, end;
> > > unsigned long i;
> > > void *entry;
> > > + int idx;
> > > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > >
> > > - /* flags is currently not used. */
> > > + /* 'flags' is currently not used. */
> > > if (attrs->flags)
> > > return -EINVAL;
> > > if (attrs->attributes & ~supported_attrs)
> > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > >
> > > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + KVM_MMU_LOCK(kvm);
> > > + kvm_mmu_invalidate_begin(kvm);
> > > + kvm_mmu_invalidate_range_add(kvm, start, end);
> > > + KVM_MMU_UNLOCK(kvm);
> > > + }
> > > +
> > > mutex_lock(&kvm->lock);
> > > for (i = start; i < end; i++)
> > > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > break;
> > > mutex_unlock(&kvm->lock);
> > >
> > > + if (kvm_arch_has_private_mem(kvm)) {
> > > + idx = srcu_read_lock(&kvm->srcu);
> > > + KVM_MMU_LOCK(kvm);
> > > + if (i > start)
> > > + kvm_unmap_mem_range(kvm, start, i);
> > > + kvm_mmu_invalidate_end(kvm);
> > > + KVM_MMU_UNLOCK(kvm);
> > > + srcu_read_unlock(&kvm->srcu, idx);
> > > + }
> > > +
> > > attrs->address = i << PAGE_SHIFT;
> > > attrs->size = (end - i) << PAGE_SHIFT;
> > >
> > > --
> > > 2.25.1
> > >
^ permalink raw reply [flat|nested] 153+ messages in thread
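As background for the discussion above: the weak kvm_arch_has_private_mem()
added by this patch just returns false, and an architecture that enables
private memory is expected to supply its own version. Conceptually it could be
as simple as the sketch below, where the per-VM field is hypothetical and only
meant to show the shape of the override.

bool kvm_arch_has_private_mem(struct kvm *kvm)
{
	/* Hypothetical per-VM state set up at VM creation time. */
	return kvm->arch.has_private_mem;
}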
* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
2022-12-02 6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
2022-12-08 2:29 ` Yuan Yao
@ 2022-12-09 9:01 ` Fuad Tabba
2022-12-12 7:23 ` Chao Peng
2023-01-13 23:29 ` Sean Christopherson
2 siblings, 1 reply; 153+ messages in thread
From: Fuad Tabba @ 2022-12-09 9:01 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> hva-based shared memory. Architecture code (like TDX code) can tell
> whether the on-going fault is private or not. This patch adds a
> 'is_private' field to kvm_page_fault to indicate this and architecture
> code is expected to set it.
>
> To handle page fault for such memslot, the handling logic is different
> depending on whether the fault is private or shared. KVM checks if
> 'is_private' matches the host's view of the page (maintained in
> mem_attr_array).
> - For a successful match, private pfn is obtained with
> restrictedmem_get_page() and shared pfn is obtained with existing
> get_user_pages().
> - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace then can convert memory between private/shared
> in host's view and retry the fault.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> include/linux/kvm_host.h | 30 ++++++++++++++++
> 5 files changed, 105 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2190fd8c95c0..b1953ebc012e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level)
> + int max_level, bool is_private)
> {
> struct kvm_lpage_info *linfo;
> int host_level;
> @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> break;
> }
>
> + if (is_private)
> + return max_level;
> +
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * level, which will be used to do precise, accurate accounting.
> */
> fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> - fault->gfn, fault->max_level);
> + fault->gfn, fault->max_level,
> + fault->is_private);
> if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> return;
>
> @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> }
>
> +static inline u8 order_to_level(int order)
> +{
> + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> + return PG_LEVEL_1G;
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> + return PG_LEVEL_2M;
> +
> + return PG_LEVEL_4K;
> +}
> +
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + if (fault->is_private)
> + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> + else
> + vcpu->run->memory.flags = 0;
> + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
nit: As in previous patches, use helpers (for this and other similar
shifts in this patch)?
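
(Presumably something like the existing gfn_to_gpa() helper from kvm_host.h,
so that the line above would read, for instance:

    vcpu->run->memory.gpa = gfn_to_gpa(fault->gfn);

Whether gfn_to_gpa() is the helper intended here is an assumption.)
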
> + vcpu->run->memory.size = PAGE_SIZE;
> + return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + int order;
> + struct kvm_memory_slot *slot = fault->slot;
> +
> + if (!kvm_slot_can_be_private(slot))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> + return RET_PF_RETRY;
> +
> + fault->max_level = min(order_to_level(order), fault->max_level);
> + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> + return RET_PF_CONTINUE;
> +}
> +
> static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_memory_slot *slot = fault->slot;
> @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return RET_PF_EMULATE;
> }
>
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
> + if (fault->is_private)
> + return kvm_faultin_pfn_private(vcpu, fault);
> +
> async = false;
> fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> fault->write, &fault->map_writable,
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> return -EIO;
> }
>
> + if (r == RET_PF_USER)
> + return 0;
> +
> if (r < 0)
> return r;
> if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> */
> if (sp->role.direct &&
> sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> - PG_LEVEL_NUM)) {
> + PG_LEVEL_NUM,
> + false)) {
> kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index dbaf6755c5a7..5ccf08183b00 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -189,6 +189,7 @@ struct kvm_page_fault {
>
> /* Derived from mmu and global state. */
> const bool is_tdp;
> + const bool is_private;
> const bool nx_huge_page_workaround_enabled;
>
> /*
> @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * RET_PF_RETRY: let CPU fault again on the address.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> + * RET_PF_USER: need to exit to userspace to handle this fault.
> * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> *
> @@ -253,6 +255,7 @@ enum {
> RET_PF_RETRY,
> RET_PF_EMULATE,
> RET_PF_INVALID,
> + RET_PF_USER,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
> @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
> int kvm_mmu_max_mapping_level(struct kvm *kvm,
> const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level);
> + int max_level, bool is_private);
> void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>
> @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + WARN_ON_ONCE(1);
> + return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index ae86820cef69..2d7555381955 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> TRACE_DEFINE_ENUM(RET_PF_RETRY);
> TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> TRACE_DEFINE_ENUM(RET_PF_INVALID);
> +TRACE_DEFINE_ENUM(RET_PF_USER);
> TRACE_DEFINE_ENUM(RET_PF_FIXED);
> TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 771210ce5181..8ba1a4afc546 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> continue;
>
> max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> - iter.gfn, PG_LEVEL_NUM);
> + iter.gfn, PG_LEVEL_NUM, false);
> if (max_mapping_level < iter.level)
> continue;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> }
> #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + int ret;
> + struct page *page;
> + pgoff_t index = gfn - slot->base_gfn +
> + (slot->restricted_offset >> PAGE_SHIFT);
> +
> + ret = restrictedmem_get_page(slot->restricted_file, index,
> + &page, order);
> + *pfn = page_to_pfn(page);
> + return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif
> --
> 2.25.1
>
With my limited understanding of x86 code:
Reviewed-by: Fuad Tabba <tabba@google.com>
The common code in kvm_host.h was used in the port to arm64, and the
x86 fault handling code was used as a guide to how it should be done
in pKVM (with similar code added there). So with these caveats in
mind:
Tested-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2022-12-02 6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2022-12-09 9:11 ` Fuad Tabba
2023-01-05 20:38 ` Vishal Annapurve
` (2 subsequent siblings)
3 siblings, 0 replies; 153+ messages in thread
From: Fuad Tabba @ 2022-12-09 9:11 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi,
On Fri, Dec 2, 2022 at 6:20 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Register/unregister private memslot to fd-based memory backing store
> restrictedmem and implement the callbacks for restrictedmem_notifier:
> - invalidate_start()/invalidate_end() to zap the existing memory
> mappings in the KVM page table.
> - error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
> with KVM_EXIT_SHUTDOWN.
>
> Expose KVM_MEM_PRIVATE for memslot and KVM_MEMORY_ATTRIBUTE_PRIVATE for
> KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace; both are
> controlled by kvm_arch_has_private_mem(), which should be overridden by
> architecture code.
>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
With the code to port it to pKVM/arm64:
Tested-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 13 +++
> include/linux/kvm_host.h | 3 +
> virt/kvm/kvm_main.c | 179 +++++++++++++++++++++++++++++++-
> 4 files changed, 191 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7772ab37ac89..27ef31133352 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -114,6 +114,7 @@
> KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_HV_TLB_FLUSH \
> KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_MEMORY_MCE KVM_ARCH_REQ(33)
>
> #define CR0_RESERVED_BITS \
> (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5aefcff614d2..c67e22f3e2ee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6587,6 +6587,13 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
> }
> #endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +void kvm_arch_memory_mce(struct kvm *kvm)
> +{
> + kvm_make_all_cpus_request(kvm, KVM_REQ_MEMORY_MCE);
> +}
> +#endif
> +
> static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
> {
> struct kvm_clock_data data = { 0 };
> @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
> if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +
> + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> + r = 0;
> + goto out;
> + }
> }
>
> if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 153842bb33df..f032d878e034 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -590,6 +590,7 @@ struct kvm_memory_slot {
> struct file *restricted_file;
> loff_t restricted_offset;
> struct restrictedmem_notifier notifier;
> + struct kvm *kvm;
> };
>
> static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> @@ -2363,6 +2364,8 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> *pfn = page_to_pfn(page);
> return ret;
> }
> +
> +void kvm_arch_memory_mce(struct kvm *kvm);
> #endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
>
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e107afea32f0..ac835fc77273 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -936,6 +936,121 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
> #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> + pgoff_t start, pgoff_t end,
> + gfn_t *gfn_start, gfn_t *gfn_end)
> +{
> + unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +
> + if (start > base_pgoff)
> + *gfn_start = slot->base_gfn + start - base_pgoff;
> + else
> + *gfn_start = slot->base_gfn;
> +
> + if (end < base_pgoff + slot->npages)
> + *gfn_end = slot->base_gfn + end - base_pgoff;
> + else
> + *gfn_end = slot->base_gfn + slot->npages;
> +
> + if (*gfn_start >= *gfn_end)
> + return false;
> +
> + return true;
> +}
> +
> +static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *notifier,
> + pgoff_t start, pgoff_t end)
> +{
> + struct kvm_memory_slot *slot = container_of(notifier,
> + struct kvm_memory_slot,
> + notifier);
> + struct kvm *kvm = slot->kvm;
> + gfn_t gfn_start, gfn_end;
> + struct kvm_gfn_range gfn_range;
> + int idx;
> +
> + if (!restrictedmem_range_is_valid(slot, start, end,
> + &gfn_start, &gfn_end))
> + return;
> +
> + gfn_range.start = gfn_start;
> + gfn_range.end = gfn_end;
> + gfn_range.slot = slot;
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> +
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> + if (kvm_unmap_gfn_range(kvm, &gfn_range))
> + kvm_flush_remote_tlbs(kvm);
> +
> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier *notifier,
> + pgoff_t start, pgoff_t end)
> +{
> + struct kvm_memory_slot *slot = container_of(notifier,
> + struct kvm_memory_slot,
> + notifier);
> + struct kvm *kvm = slot->kvm;
> + gfn_t gfn_start, gfn_end;
> +
> + if (!restrictedmem_range_is_valid(slot, start, end,
> + &gfn_start, &gfn_end))
> + return;
> +
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_end(kvm);
> + KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_restrictedmem_error(struct restrictedmem_notifier *notifier,
> + pgoff_t start, pgoff_t end)
> +{
> + struct kvm_memory_slot *slot = container_of(notifier,
> + struct kvm_memory_slot,
> + notifier);
> + kvm_arch_memory_mce(slot->kvm);
> +}
> +
> +static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
> + .invalidate_start = kvm_restrictedmem_invalidate_begin,
> + .invalidate_end = kvm_restrictedmem_invalidate_end,
> + .error = kvm_restrictedmem_error,
> +};
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
> +{
> + slot->notifier.ops = &kvm_restrictedmem_notifier_ops;
> + restrictedmem_register_notifier(slot->restricted_file, &slot->notifier);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
> +{
> + restrictedmem_unregister_notifier(slot->restricted_file,
> + &slot->notifier);
> +}
> +
> +#else /* !CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> +static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
> +{
> + WARN_ON_ONCE(1);
> +}
> +
> +static inline void kvm_restrictedmem_unregister(struct kvm_memory_slot *slot)
> +{
> + WARN_ON_ONCE(1);
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> static int kvm_pm_notifier_call(struct notifier_block *bl,
> unsigned long state,
> @@ -980,6 +1095,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> /* This does not remove the slot from struct kvm_memslots data structures */
> static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> {
> + if (slot->flags & KVM_MEM_PRIVATE) {
> + kvm_restrictedmem_unregister(slot);
> + fput(slot->restricted_file);
> + }
> +
> kvm_destroy_dirty_bitmap(slot);
>
> kvm_arch_free_memslot(kvm, slot);
> @@ -1551,10 +1671,14 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
> }
>
> -static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> +static int check_memory_region_flags(struct kvm *kvm,
> + const struct kvm_user_mem_region *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> + if (kvm_arch_has_private_mem(kvm))
> + valid_flags |= KVM_MEM_PRIVATE;
> +
> #ifdef __KVM_HAVE_READONLY_MEM
> valid_flags |= KVM_MEM_READONLY;
> #endif
> @@ -1630,6 +1754,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> {
> int r;
>
> + if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> + kvm_restrictedmem_register(new);
> +
> /*
> * If dirty logging is disabled, nullify the bitmap; the old bitmap
> * will be freed on "commit". If logging is enabled in both old and
> @@ -1658,6 +1785,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm,
> if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap))
> kvm_destroy_dirty_bitmap(new);
>
> + if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
> + kvm_restrictedmem_unregister(new);
> +
> return r;
> }
>
> @@ -1963,7 +2093,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> int as_id, id;
> int r;
>
> - r = check_memory_region_flags(mem);
> + r = check_memory_region_flags(kvm, mem);
> if (r)
> return r;
>
> @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> mem->memory_size))
> return -EINVAL;
> + if (mem->flags & KVM_MEM_PRIVATE &&
> + (mem->restricted_offset & (PAGE_SIZE - 1) ||
> + mem->restricted_offset > U64_MAX - mem->memory_size))
> + return -EINVAL;
> if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> return -EINVAL;
> if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> return -EINVAL;
> } else { /* Modify an existing slot. */
> + /* Private memslots are immutable, they can only be deleted. */
> + if (mem->flags & KVM_MEM_PRIVATE)
> + return -EINVAL;
> if ((mem->userspace_addr != old->userspace_addr) ||
> (npages != old->npages) ||
> ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> new->npages = npages;
> new->flags = mem->flags;
> new->userspace_addr = mem->userspace_addr;
> + if (mem->flags & KVM_MEM_PRIVATE) {
> + new->restricted_file = fget(mem->restricted_fd);
> + if (!new->restricted_file ||
> + !file_is_restrictedmem(new->restricted_file)) {
> + r = -EINVAL;
> + goto out;
> + }
> + new->restricted_offset = mem->restricted_offset;
> + }
> +
> + new->kvm = kvm;
>
> r = kvm_set_memslot(kvm, old, new, change);
> if (r)
> - kfree(new);
> + goto out;
> +
> + return 0;
> +
> +out:
> + if (new->restricted_file)
> + fput(new->restricted_file);
> + kfree(new);
> return r;
> }
> EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
> @@ -2351,6 +2506,8 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> #ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> {
> + if (kvm_arch_has_private_mem(kvm))
> + return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> return 0;
> }
>
> @@ -4822,16 +4979,28 @@ static long kvm_vm_ioctl(struct file *filp,
> }
> case KVM_SET_USER_MEMORY_REGION: {
> struct kvm_user_mem_region mem;
> - unsigned long size = sizeof(struct kvm_userspace_memory_region);
> + unsigned int flags_offset = offsetof(typeof(mem), flags);
> + unsigned long size;
> + u32 flags;
>
> kvm_sanity_check_user_mem_region_alias();
>
> + memset(&mem, 0, sizeof(mem));
> +
> r = -EFAULT;
> + if (get_user(flags, (u32 __user *)(argp + flags_offset)))
> + goto out;
> +
> + if (flags & KVM_MEM_PRIVATE)
> + size = sizeof(struct kvm_userspace_memory_region_ext);
> + else
> + size = sizeof(struct kvm_userspace_memory_region);
> +
> if (copy_from_user(&mem, argp, size))
> goto out;
>
> r = -EINVAL;
> - if (mem.flags & KVM_MEM_PRIVATE)
> + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE)
> goto out;
>
> r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-09 8:57 ` Fuad Tabba
@ 2022-12-12 7:22 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-12 7:22 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Fri, Dec 09, 2022 at 08:57:31AM +0000, Fuad Tabba wrote:
> Hi,
>
> On Thu, Dec 8, 2022 at 11:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > On Wed, Dec 07, 2022 at 05:16:34PM +0000, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > Unmap the existing guest mappings when memory attribute is changed
> > > > between shared and private. This is needed because shared pages and
> > > > private pages come from different backends; unmapping the existing ones
> > > > gives the page fault handler a chance to re-populate the mappings
> > > > according to the new attribute.
> > > >
> > > > Only architectures with private memory support need this, and a
> > > > supporting architecture is expected to override the weak
> > > > kvm_arch_has_private_mem().
> > >
> > > This kind of ties into the discussion of being able to share memory in
> > > place. For pKVM for example, shared and private memory would have the
> > > same backend, and the unmapping wouldn't be needed.
> > >
> > > So I guess that, instead of kvm_arch_has_private_mem(), can the check
> > > be done differently, e.g., with a different function, say
> > > kvm_arch_private_notify_attribute_change() (but maybe with a more
> > > friendly name than what I suggested :) )?
> >
> > Besides controlling the unmapping here, kvm_arch_has_private_mem() is
> > also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
> > unmapping is confirmed unnecessary for pKVM, but how about
> > KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
> > If the answer is the latter, then yes we should use a different check
> > which only works for confidential usages here.
>
> I think it makes sense for pKVM to use the same flag (KVM_MEM_PRIVATE)
> and not to add another one.
Thanks for the reply.
Chao
>
> Thank you,
> /fuad
>
>
>
> >
> > Thanks,
> > Chao
> > >
> > > Thanks,
> > > /fuad
> > >
> > > >
> > > > Also, while the memory attribute is being changed and the mappings are
> > > > being unmapped, a page fault may occur in the same memory range and can
> > > > leave the page in an incorrect state, so invoke the kvm_mmu_invalidate_*
> > > > helpers to make the page fault handler retry during this time frame.
> > > >
> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > > ---
> > > > include/linux/kvm_host.h | 7 +-
> > > > virt/kvm/kvm_main.c | 168 ++++++++++++++++++++++++++-------------
> > > > 2 files changed, 116 insertions(+), 59 deletions(-)
> > > >
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 3d69484d2704..3331c0c92838 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > > #endif
> > > >
> > > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > > struct kvm_gfn_range {
> > > > struct kvm_memory_slot *slot;
> > > > gfn_t start;
> > > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > > bool may_block;
> > > > };
> > > > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > +
> > > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > @@ -785,11 +786,12 @@ struct kvm {
> > > >
> > > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > struct mmu_notifier mmu_notifier;
> > > > +#endif
> > > > unsigned long mmu_invalidate_seq;
> > > > long mmu_invalidate_in_progress;
> > > > gfn_t mmu_invalidate_range_start;
> > > > gfn_t mmu_invalidate_range_end;
> > > > -#endif
> > > > +
> > > > struct list_head devices;
> > > > u64 manual_dirty_log_protect;
> > > > struct dentry *debugfs_dentry;
> > > > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> > > > int kvm_arch_post_init_vm(struct kvm *kvm);
> > > > void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> > > > int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > > > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> > > >
> > > > #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> > > > /*
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index ad55dfbc75d7..4e1e1e113bf0 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> > > > }
> > > > EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> > > >
> > > > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > > +{
> > > > + /*
> > > > + * The count increase must become visible at unlock time as no
> > > > + * spte can be established without taking the mmu_lock and
> > > > + * count is also read inside the mmu_lock critical section.
> > > > + */
> > > > + kvm->mmu_invalidate_in_progress++;
> > > > +
> > > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > > + kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > > + }
> > > > +}
> > > > +
> > > > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > +{
> > > > + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > > +
> > > > + if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > + kvm->mmu_invalidate_range_start = start;
> > > > + kvm->mmu_invalidate_range_end = end;
> > > > + } else {
> > > > + /*
> > > > + * Fully tracking multiple concurrent ranges has diminishing
> > > > + * returns. Keep things simple and just find the minimal range
> > > > + * which includes the current and new ranges. As there won't be
> > > > + * enough information to subtract a range after its invalidate
> > > > + * completes, any ranges invalidated concurrently will
> > > > + * accumulate and persist until all outstanding invalidates
> > > > + * complete.
> > > > + */
> > > > + kvm->mmu_invalidate_range_start =
> > > > + min(kvm->mmu_invalidate_range_start, start);
> > > > + kvm->mmu_invalidate_range_end =
> > > > + max(kvm->mmu_invalidate_range_end, end);
> > > > + }
> > > > +}
> > > > +
> > > > +void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > > +{
> > > > + /*
> > > > + * This sequence increase will notify the kvm page fault that
> > > > + * the page that is going to be mapped in the spte could have
> > > > + * been freed.
> > > > + */
> > > > + kvm->mmu_invalidate_seq++;
> > > > + smp_wmb();
> > > > + /*
> > > > + * The above sequence increase must be visible before the
> > > > + * below count decrease, which is ensured by the smp_wmb above
> > > > + * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > > + */
> > > > + kvm->mmu_invalidate_in_progress--;
> > > > +}
> > > > +
> > > > #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > > > {
> > > > @@ -714,45 +770,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > > > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > > > }
> > > >
> > > > -void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > > > -{
> > > > - /*
> > > > - * The count increase must become visible at unlock time as no
> > > > - * spte can be established without taking the mmu_lock and
> > > > - * count is also read inside the mmu_lock critical section.
> > > > - */
> > > > - kvm->mmu_invalidate_in_progress++;
> > > > -
> > > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > - kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > > - kvm->mmu_invalidate_range_end = INVALID_GPA;
> > > > - }
> > > > -}
> > > > -
> > > > -void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > -{
> > > > - WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > > > -
> > > > - if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > > - kvm->mmu_invalidate_range_start = start;
> > > > - kvm->mmu_invalidate_range_end = end;
> > > > - } else {
> > > > - /*
> > > > - * Fully tracking multiple concurrent ranges has diminishing
> > > > - * returns. Keep things simple and just find the minimal range
> > > > - * which includes the current and new ranges. As there won't be
> > > > - * enough information to subtract a range after its invalidate
> > > > - * completes, any ranges invalidated concurrently will
> > > > - * accumulate and persist until all outstanding invalidates
> > > > - * complete.
> > > > - */
> > > > - kvm->mmu_invalidate_range_start =
> > > > - min(kvm->mmu_invalidate_range_start, start);
> > > > - kvm->mmu_invalidate_range_end =
> > > > - max(kvm->mmu_invalidate_range_end, end);
> > > > - }
> > > > -}
> > > > -
> > > > static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> > > > @@ -806,23 +823,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > > > return 0;
> > > > }
> > > >
> > > > -void kvm_mmu_invalidate_end(struct kvm *kvm)
> > > > -{
> > > > - /*
> > > > - * This sequence increase will notify the kvm page fault that
> > > > - * the page that is going to be mapped in the spte could have
> > > > - * been freed.
> > > > - */
> > > > - kvm->mmu_invalidate_seq++;
> > > > - smp_wmb();
> > > > - /*
> > > > - * The above sequence increase must be visible before the
> > > > - * below count decrease, which is ensured by the smp_wmb above
> > > > - * in conjunction with the smp_rmb in mmu_invalidate_retry().
> > > > - */
> > > > - kvm->mmu_invalidate_in_progress--;
> > > > -}
> > > > -
> > > > static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > > const struct mmu_notifier_range *range)
> > > > {
> > > > @@ -1140,6 +1140,11 @@ int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > > > return 0;
> > > > }
> > > >
> > > > +bool __weak kvm_arch_has_private_mem(struct kvm *kvm)
> > > > +{
> > > > + return false;
> > > > +}
> > > > +
> > > > static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > > {
> > > > struct kvm *kvm = kvm_arch_alloc_vm();
> > > > @@ -2349,15 +2354,47 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > > > return 0;
> > > > }
> > > >
> > > > +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> > > > +{
> > > > + struct kvm_gfn_range gfn_range;
> > > > + struct kvm_memory_slot *slot;
> > > > + struct kvm_memslots *slots;
> > > > + struct kvm_memslot_iter iter;
> > > > + int i;
> > > > + int r = 0;
> > > > +
> > > > + gfn_range.pte = __pte(0);
> > > > + gfn_range.may_block = true;
> > > > +
> > > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > > + slots = __kvm_memslots(kvm, i);
> > > > +
> > > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > > + slot = iter.slot;
> > > > + gfn_range.start = max(start, slot->base_gfn);
> > > > + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> > > > + if (gfn_range.start >= gfn_range.end)
> > > > + continue;
> > > > + gfn_range.slot = slot;
> > > > +
> > > > + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> > > > + }
> > > > + }
> > > > +
> > > > + if (r)
> > > > + kvm_flush_remote_tlbs(kvm);
> > > > +}
> > > > +
> > > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > struct kvm_memory_attributes *attrs)
> > > > {
> > > > gfn_t start, end;
> > > > unsigned long i;
> > > > void *entry;
> > > > + int idx;
> > > > u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > >
> > > > - /* flags is currently not used. */
> > > > + /* 'flags' is currently not used. */
> > > > if (attrs->flags)
> > > > return -EINVAL;
> > > > if (attrs->attributes & ~supported_attrs)
> > > > @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > >
> > > > entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > >
> > > > + if (kvm_arch_has_private_mem(kvm)) {
> > > > + KVM_MMU_LOCK(kvm);
> > > > + kvm_mmu_invalidate_begin(kvm);
> > > > + kvm_mmu_invalidate_range_add(kvm, start, end);
> > > > + KVM_MMU_UNLOCK(kvm);
> > > > + }
> > > > +
> > > > mutex_lock(&kvm->lock);
> > > > for (i = start; i < end; i++)
> > > > if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > > @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > break;
> > > > mutex_unlock(&kvm->lock);
> > > >
> > > > + if (kvm_arch_has_private_mem(kvm)) {
> > > > + idx = srcu_read_lock(&kvm->srcu);
> > > > + KVM_MMU_LOCK(kvm);
> > > > + if (i > start)
> > > > + kvm_unmap_mem_range(kvm, start, i);
> > > > + kvm_mmu_invalidate_end(kvm);
> > > > + KVM_MMU_UNLOCK(kvm);
> > > > + srcu_read_unlock(&kvm->srcu, idx);
> > > > + }
> > > > +
> > > > attrs->address = i << PAGE_SHIFT;
> > > > attrs->size = (end - i) << PAGE_SHIFT;
> > > >
> > > > --
> > > > 2.25.1
> > > >
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
2022-12-09 9:01 ` Fuad Tabba
@ 2022-12-12 7:23 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-12 7:23 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
On Fri, Dec 09, 2022 at 09:01:04AM +0000, Fuad Tabba wrote:
> Hi,
>
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the ongoing fault is private or not. This patch adds an
> > 'is_private' field to kvm_page_fault to indicate this, and architecture
> > code is expected to set it.
> >
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> > - For a successful match, private pfn is obtained with
> > restrictedmem_get_page() and shared pfn is obtained with existing
> > get_user_pages().
> > - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared
> > in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++--
> > arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++-
> > arch/x86/kvm/mmu/mmutrace.h | 1 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> > include/linux/kvm_host.h | 30 ++++++++++++++++
> > 5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level)
> > + int max_level, bool is_private)
> > {
> > struct kvm_lpage_info *linfo;
> > int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >
> > + if (is_private)
> > + return max_level;
> > +
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > * level, which will be used to do precise, accurate accounting.
> > */
> > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > - fault->gfn, fault->max_level);
> > + fault->gfn, fault->max_level,
> > + fault->is_private);
> > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > + return PG_LEVEL_1G;
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > + return PG_LEVEL_2M;
> > +
> > + return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > + if (fault->is_private)
> > + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > + else
> > + vcpu->run->memory.flags = 0;
> > + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
>
> nit: As in previous patches, use helpers (for this and other similar
> shifts in this patch)?
Agreed.
>
> > + vcpu->run->memory.size = PAGE_SIZE;
> > + return RET_PF_USER;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > +{
> > + int order;
> > + struct kvm_memory_slot *slot = fault->slot;
> > +
> > + if (!kvm_slot_can_be_private(slot))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> > + return RET_PF_RETRY;
> > +
> > + fault->max_level = min(order_to_level(order), fault->max_level);
> > + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > + return RET_PF_CONTINUE;
> > +}
> > +
> > static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > struct kvm_memory_slot *slot = fault->slot;
> > @@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return RET_PF_EMULATE;
> > }
> >
> > + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > + return kvm_do_memory_fault_exit(vcpu, fault);
> > +
> > + if (fault->is_private)
> > + return kvm_faultin_pfn_private(vcpu, fault);
> > +
> > async = false;
> > fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
> > fault->write, &fault->map_writable,
> > @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> > return -EIO;
> > }
> >
> > + if (r == RET_PF_USER)
> > + return 0;
> > +
> > if (r < 0)
> > return r;
> > if (r != RET_PF_EMULATE)
> > @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > */
> > if (sp->role.direct &&
> > sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> > - PG_LEVEL_NUM)) {
> > + PG_LEVEL_NUM,
> > + false)) {
> > kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> >
> > if (kvm_available_flush_tlb_with_range())
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index dbaf6755c5a7..5ccf08183b00 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -189,6 +189,7 @@ struct kvm_page_fault {
> >
> > /* Derived from mmu and global state. */
> > const bool is_tdp;
> > + const bool is_private;
> > const bool nx_huge_page_workaround_enabled;
> >
> > /*
> > @@ -237,6 +238,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > * RET_PF_RETRY: let CPU fault again on the address.
> > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> > + * RET_PF_USER: need to exit to userspace to handle this fault.
> > * RET_PF_FIXED: The faulting entry has been fixed.
> > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> > *
> > @@ -253,6 +255,7 @@ enum {
> > RET_PF_RETRY,
> > RET_PF_EMULATE,
> > RET_PF_INVALID,
> > + RET_PF_USER,
> > RET_PF_FIXED,
> > RET_PF_SPURIOUS,
> > };
> > @@ -310,7 +313,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> > int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level);
> > + int max_level, bool is_private);
> > void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> >
> > @@ -319,4 +322,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +#ifndef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + WARN_ON_ONCE(1);
> > + return -EOPNOTSUPP;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index ae86820cef69..2d7555381955 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> > TRACE_DEFINE_ENUM(RET_PF_RETRY);
> > TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> > TRACE_DEFINE_ENUM(RET_PF_INVALID);
> > +TRACE_DEFINE_ENUM(RET_PF_USER);
> > TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 771210ce5181..8ba1a4afc546 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1768,7 +1768,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > continue;
> >
> > max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > - iter.gfn, PG_LEVEL_NUM);
> > + iter.gfn, PG_LEVEL_NUM, false);
> > if (max_mapping_level < iter.level)
> > continue;
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 25099c94e770..153842bb33df 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > }
> > #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> > + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > + return false;
> > +}
> > +
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> > +{
> > + int ret;
> > + struct page *page;
> > + pgoff_t index = gfn - slot->base_gfn +
> > + (slot->restricted_offset >> PAGE_SHIFT);
> > +
> > + ret = restrictedmem_get_page(slot->restricted_file, index,
> > + &page, order);
> > + *pfn = page_to_pfn(page);
> > + return ret;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> > +
> > #endif
> > --
> > 2.25.1
> >
>
> With my limited understanding of x86 code:
> Reviewed-by: Fuad Tabba <tabba@google.com>
>
> The common code in kvm_host.h was used in the port to arm64, and the
> x86 fault handling code was used as a guide to how it should be done
> in pKVM (with similar code added there). So with these caveats in
> mind:
> Tested-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-08 11:30 ` Chao Peng
@ 2022-12-13 12:04 ` Xiaoyao Li
2022-12-19 7:50 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Xiaoyao Li @ 2022-12-13 12:04 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On 12/8/2022 7:30 PM, Chao Peng wrote:
> On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
>> On 12/2/2022 2:13 PM, Chao Peng wrote:
>>
>> ..
>>
>>> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
>>> and right now it is selected on X86_64 only.
>>>
>>
>> From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
>> needed.
>
> The reason is that we want KVM to further control the feature enabling. An
> opt-in CONFIG_RESTRICTEDMEM can cause problems if the user sets it for
> unsupported architectures.
HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to
introduce it in the patch that actually uses it.
> Here is the original discussion:
> https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
>
> Thanks,
> Chao
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-02 6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
2022-12-06 14:57 ` Fuad Tabba
@ 2022-12-13 23:49 ` Huang, Kai
2022-12-19 7:53 ` Chao Peng
2023-01-13 21:54 ` Sean Christopherson
` (2 subsequent siblings)
4 siblings, 1 reply; 153+ messages in thread
From: Huang, Kai @ 2022-12-13 23:49 UTC (permalink / raw)
To: linux-api, linux-mm, chao.p.peng, qemu-devel, linux-kernel,
linux-arch, linux-doc, kvm, linux-fsdevel
Cc: tglx, jmattson, Lutomirski, Andy, pbonzini, ak, kirill.shutemov,
david, tabba, Hocko, Michal, michael.roth, corbet, bfields,
dhildenb, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
arnd, mail, qperret, Christopherson,,
Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
Hansen, Dave, akpm, linmiaohe, Wang, Wei W
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
>
I didn't dig into the full history, but I interpret this as: we don't support
page migration and swapping for the restricted memfd for now. IMHO "pages
marked as unmovable" can be confused with PageMovable(), which is a different
thing from this series. It's better to just say something like "those pages
cannot be migrated and swapped".
[...]
> +
> + /*
> + * These pages are currently unmovable so don't place them into movable
> + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> + */
> + mapping = memfd->f_mapping;
> + mapping_set_unevictable(mapping);
> + mapping_set_gfp_mask(mapping,
> + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
But, IIUC, removing the __GFP_MOVABLE flag here only makes page allocation
come from non-movable zones; it doesn't necessarily prevent a page from being
migrated. At first glance, you need to implement either a_ops->migrate_folio()
or just take a get_page() reference after faulting in the page to prevent that.
So I think the comment also needs improvement -- IMHO we can just call out that
currently those pages cannot be migrated or swapped, which is clearer (and the
latter justifies mapping_set_unevictable() clearly).
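
As a rough sketch of the a_ops->migrate_folio() option mentioned above (purely
illustrative, not something this series implements; the aops and function names
are made up, the callback signature follows struct address_space_operations as
of ~v6.1, and the -EBUSY convention mirrors what mm/secretmem.c does, IIRC):

    #include <linux/fs.h>
    #include <linux/migrate.h>

    /* Refuse every migration attempt for restrictedmem-backed pages. */
    static int restrictedmem_migrate_folio(struct address_space *mapping,
                                           struct folio *dst, struct folio *src,
                                           enum migrate_mode mode)
    {
            return -EBUSY;
    }

    static const struct address_space_operations restrictedmem_aops = {
            .migrate_folio  = restrictedmem_migrate_folio,
    };
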
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-02 6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
2022-12-07 8:13 ` Yuan Yao
2022-12-07 17:16 ` Fuad Tabba
@ 2022-12-13 23:51 ` Huang, Kai
2022-12-19 7:54 ` Chao Peng
2023-01-13 22:50 ` Sean Christopherson
3 siblings, 1 reply; 153+ messages in thread
From: Huang, Kai @ 2022-12-13 23:51 UTC (permalink / raw)
To: linux-api, linux-mm, chao.p.peng, qemu-devel, linux-kernel,
linux-arch, linux-doc, kvm, linux-fsdevel
Cc: tglx, jmattson, Lutomirski, Andy, pbonzini, ak, kirill.shutemov,
david, tabba, Hocko, Michal, michael.roth, corbet, bfields,
dhildenb, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
arnd, mail, qperret, Christopherson,,
Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
Hansen, Dave, akpm, linmiaohe, Wang, Wei W
On Fri, 2022-12-02 at 14:13 +0800, Chao Peng wrote:
>
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */
> if (attrs->flags)
> return -EINVAL;
Unintended code change.
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
2022-12-06 13:34 ` Fabiano Rosas
2022-12-06 15:07 ` Fuad Tabba
@ 2022-12-16 15:09 ` Borislav Petkov
2022-12-19 8:15 ` Chao Peng
2022-12-28 8:28 ` Chenyi Qiang
` (3 subsequent siblings)
6 siblings, 1 reply; 153+ messages in thread
From: Borislav Petkov @ 2022-12-16 15:09 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_init(&kvm->mem_attr_array);
> +#endif
if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
...
would at least remove the ugly ifdeffery.
Or you could create wrapper functions for that xa_init() and
xa_destroy() and put the ifdeffery in there.
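
A sketch of that wrapper approach (the helper names are invented here, not
taken from the series), so kvm_create_vm()/kvm_destroy_vm() can call them
unconditionally:

    static inline void kvm_init_mem_attr_array(struct kvm *kvm)
    {
    #ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
            xa_init(&kvm->mem_attr_array);
    #endif
    }

    static inline void kvm_destroy_mem_attr_array(struct kvm *kvm)
    {
    #ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
            xa_destroy(&kvm->mem_attr_array);
    #endif
    }
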
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
I guess that function should have a verb in the name:
kvm_get_supported_mem_attributes()
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> + struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)
> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;
Dunno, shouldn't those issue some sort of an error message so that the
caller knows where it failed? Or at least return different retvals which
signal what the problem is?
> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> + mutex_lock(&kvm->lock);
> + for (i = start; i < end; i++)
> + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT)))
> + break;
> + mutex_unlock(&kvm->lock);
> +
> + attrs->address = i << PAGE_SHIFT;
> + attrs->size = (end - i) << PAGE_SHIFT;
> +
> + return 0;
> +}
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-13 12:04 ` Xiaoyao Li
@ 2022-12-19 7:50 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-19 7:50 UTC (permalink / raw)
To: Xiaoyao Li
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Dec 13, 2022 at 08:04:14PM +0800, Xiaoyao Li wrote:
> On 12/8/2022 7:30 PM, Chao Peng wrote:
> > On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> > > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > >
> > > ..
> > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > >
> > > From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> > > needed.
> >
> > The reason is that we want KVM to further control the feature enabling. An
> > opt-in CONFIG_RESTRICTEDMEM can cause problems if the user sets it for
> > unsupported architectures.
>
> HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to introduce
> it in the patch that actually uses it.
It's being 'used' in this patch by reverse-selecting RESTRICTEDMEM in
arch/x86/kvm/Kconfig; this gives people a sense of where
restrictedmem_notifier comes from. Introducing the config together with the
other private/restricted memslot pieces can also help future supporting
architectures better identify what they need to do. But those points are
trivial, and moving it to patch 08 also sounds good to me.
Thanks,
Chao
>
> > Here is the original discussion:
> > https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
> >
> > Thanks,
> > Chao
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-13 23:49 ` Huang, Kai
@ 2022-12-19 7:53 ` Chao Peng
2022-12-19 8:48 ` Huang, Kai
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-19 7:53 UTC (permalink / raw)
To: Huang, Kai
Cc: linux-api, linux-mm, qemu-devel, linux-kernel, linux-arch,
linux-doc, kvm, linux-fsdevel, tglx, jmattson, Lutomirski, Andy,
pbonzini, ak, kirill.shutemov, david, tabba, Hocko, Michal,
michael.roth, corbet, bfields, dhildenb, x86, bp, vannapurve,
rppt, shuah, vkuznets, vbabka, arnd, mail, qperret,
Christopherson,,
Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
Hansen, Dave, akpm, linmiaohe, Wang, Wei W
On Tue, Dec 13, 2022 at 11:49:13PM +0000, Huang, Kai wrote:
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> >
> I didn't dig full histroy, but I interpret this as we don't support page
> migration and swapping for restricted memfd for now. IMHO "page marked as
> unmovable" can be confused with PageMovable(), which is a different thing from
> this series. It's better to just say something like "those pages cannot be
> migrated and swapped".
Yes, if that helps clarify things.
>
> [...]
>
> > +
> > + /*
> > + * These pages are currently unmovable so don't place them into movable
> > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > + */
> > + mapping = memfd->f_mapping;
> > + mapping_set_unevictable(mapping);
> > + mapping_set_gfp_mask(mapping,
> > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
>
> But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from non-
> movable zones, but doesn't necessarily prevent page from being migrated. My
> first glance is you need to implement either a_ops->migrate_folio() or just
> get_page() after faulting in the page to prevent.
The current API restrictedmem_get_page() already does this: after the
caller calls it, the caller holds a reference to the page. The caller then
decides when to call put_page() appropriately.
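For reference, a rough sketch of the caller pattern being described; the
restrictedmem_get_page() signature follows this series as far as I can tell,
while the error handling and surrounding context are purely illustrative:

	struct page *page;
	int order;

	if (restrictedmem_get_page(file, offset, &page, &order))
		return -EFAULT;		/* illustrative error handling */
	/*
	 * The page now carries the extra reference taken by
	 * restrictedmem_get_page(); keep it while the pfn is mapped in the
	 * secondary MMU and drop it when the mapping is torn down.
	 */
	/* ... later, on teardown ... */
	put_page(page);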
>
> So I think the comment also needs improvement -- IMHO we can just call out
> currently those pages cannot be migrated and swapped, which is clearer (and the
> latter justifies mapping_set_unevictable() clearly).
Good to me.
Thanks,
Chao
>
>
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-13 23:51 ` Huang, Kai
@ 2022-12-19 7:54 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-19 7:54 UTC (permalink / raw)
To: Huang, Kai
Cc: linux-api, linux-mm, qemu-devel, linux-kernel, linux-arch,
linux-doc, kvm, linux-fsdevel, tglx, jmattson, Lutomirski, Andy,
pbonzini, ak, kirill.shutemov, david, tabba, Hocko, Michal,
michael.roth, corbet, bfields, dhildenb, x86, bp, vannapurve,
rppt, shuah, vkuznets, vbabka, arnd, mail, qperret,
Christopherson,,
Sean, ddutile, naoya.horiguchi, aarcange, wanpengli, yu.c.zhang,
hughd, mingo, hpa, Nakajima, Jun, jlayton, joro, steven.price,
Hansen, Dave, akpm, linmiaohe, Wang, Wei W
On Tue, Dec 13, 2022 at 11:51:25PM +0000, Huang, Kai wrote:
> On Fri, 2022-12-02 at 14:13 +0800, Chao Peng wrote:
> >
> > - /* flags is currently not used. */
> > + /* 'flags' is currently not used. */
> > if (attrs->flags)
> > return -EINVAL;
>
> Unintended code change.
Yeah!
Chao
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-16 15:09 ` Borislav Petkov
@ 2022-12-19 8:15 ` Chao Peng
2022-12-19 10:17 ` Borislav Petkov
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-19 8:15 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 16, 2022 at 04:09:06PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..7f0f5e9f2406 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > spin_lock_init(&kvm->mn_invalidate_lock);
> > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > + xa_init(&kvm->mem_attr_array);
> > +#endif
>
> if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
> ...
>
> would at least remove the ugly ifdeffery.
>
> Or you could create wrapper functions for that xa_init() and
> xa_destroy() and put the ifdeffery in there.
Agreed.
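For illustration, a minimal sketch of the wrapper idea (the helper names are
hypothetical, not taken from the series):

#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
static inline void kvm_init_mem_attr_array(struct kvm *kvm)
{
	xa_init(&kvm->mem_attr_array);
}

static inline void kvm_destroy_mem_attr_array(struct kvm *kvm)
{
	xa_destroy(&kvm->mem_attr_array);
}
#else
static inline void kvm_init_mem_attr_array(struct kvm *kvm) {}
static inline void kvm_destroy_mem_attr_array(struct kvm *kvm) {}
#endif

kvm_create_vm() and kvm_destroy_vm() would then call these helpers
unconditionally, keeping the ifdeffery in one place.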
>
> > @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > }
> > #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
>
> I guess that function should have a verb in the name:
>
> kvm_get_supported_mem_attributes()
Right!
>
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > + struct kvm_memory_attributes *attrs)
> > +{
> > + gfn_t start, end;
> > + unsigned long i;
> > + void *entry;
> > + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > + /* flags is currently not used. */
> > + if (attrs->flags)
> > + return -EINVAL;
> > + if (attrs->attributes & ~supported_attrs)
> > + return -EINVAL;
> > + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > + return -EINVAL;
> > + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > + return -EINVAL;
>
> Dunno, shouldn't those issue some sort of an error message so that the
> caller knows where it failed? Or at least return different retvals which
> signal what the problem is?
Tightening up the error numbers a bit:
if (attrs->flags)
return -ENXIO;
if (attrs->attributes & ~supported_attrs)
return -EOPNOTSUPP;
if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
attrs->size == 0)
return -EINVAL;
if (attrs->address + attrs->size < attrs->address)
return -E2BIG;
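For context, an illustrative userspace call against this ioctl; struct
kvm_memory_attributes and KVM_SET_MEMORY_ATTRIBUTES are the names this series
introduces, and KVM_MEMORY_ATTRIBUTE_PRIVATE is assumed from the same series.
The error values follow the scheme proposed above:

	struct kvm_memory_attributes attrs = {
		.address    = 0x100000,	/* must be page-aligned */
		.size       = 0x200000,	/* must be non-zero and page-aligned */
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags      = 0,	/* non-zero would now fail with -ENXIO */
	};

	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
		perror("KVM_SET_MEMORY_ATTRIBUTES");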
Chao
>
> > + start = attrs->address >> PAGE_SHIFT;
> > + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > + mutex_lock(&kvm->lock);
> > + for (i = start; i < end; i++)
> > + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > + GFP_KERNEL_ACCOUNT)))
> > + break;
> > + mutex_unlock(&kvm->lock);
> > +
> > + attrs->address = i << PAGE_SHIFT;
> > + attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > + return 0;
> > +}
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-19 7:53 ` Chao Peng
@ 2022-12-19 8:48 ` Huang, Kai
2022-12-20 7:22 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Huang, Kai @ 2022-12-19 8:48 UTC (permalink / raw)
To: chao.p.peng
Cc: tglx, linux-arch, kvm, Wang, Wei W, jmattson, Lutomirski, Andy,
ak, kirill.shutemov, david, qemu-devel, tabba, Hocko, Michal,
michael.roth, corbet, linux-fsdevel, dhildenb, bfields,
linux-kernel, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
mail, linux-api, qperret, arnd, pbonzini, ddutile,
naoya.horiguchi, Christopherson,,
Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
Nakajima, Jun, jlayton, joro, linux-mm, steven.price, Hansen,
Dave, linux-doc, akpm, linmiaohe
On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> >
> > [...]
> >
> > > +
> > > + /*
> > > + * These pages are currently unmovable so don't place them into
> > > movable
> > > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > + */
> > > + mapping = memfd->f_mapping;
> > > + mapping_set_unevictable(mapping);
> > > + mapping_set_gfp_mask(mapping,
> > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> >
> > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > non-
> > movable zones, but doesn't necessarily prevent page from being migrated. My
> > first glance is you need to implement either a_ops->migrate_folio() or just
> > get_page() after faulting in the page to prevent.
>
> The current api restrictedmem_get_page() already does this, after the
> caller calling it, it holds a reference to the page. The caller then
> decides when to call put_page() appropriately.
I tried to dig up some history. Perhaps I am missing something, but it seems
Kirill said in v9 that this code doesn't prevent page migration, and that we
need to increase the page refcount in restrictedmem_get_page():
https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
But looking at this series, it seems restrictedmem_get_page() in this v10 is
identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
Anyway, if this is not fixed, then it should be fixed. Otherwise, a comment at
the place where the page refcount is increased would help people understand
that page migration is actually prevented.
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-19 8:15 ` Chao Peng
@ 2022-12-19 10:17 ` Borislav Petkov
2022-12-20 7:24 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Borislav Petkov @ 2022-12-19 10:17 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> Tamping down with error number a bit:
>
> if (attrs->flags)
> return -ENXIO;
> if (attrs->attributes & ~supported_attrs)
> return -EOPNOTSUPP;
> if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
> attrs->size == 0)
> return -EINVAL;
> if (attrs->address + attrs->size < attrs->address)
> return -E2BIG;
Yap, better.
I guess you should add those to the documentation of the ioctl too
so that people can find out why it fails. Or, well, they can look
at the code directly too but still... imagine some blurb about
user-friendliness here...
:-)
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-02 6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-12-05 9:03 ` Fuad Tabba
2022-12-08 8:37 ` Xiaoyao Li
@ 2022-12-19 14:36 ` Borislav Petkov
2022-12-20 7:43 ` Chao Peng
2023-01-05 11:23 ` Jarkko Sakkinen
3 siblings, 1 reply; 153+ messages in thread
From: Borislav Petkov @ 2022-12-19 14:36 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
valueless?
I can't parse that.
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
bookmarked?
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a
"When un use, ..."
...
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index a8e379a3afee..690cb21010e7 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -50,6 +50,8 @@ config KVM
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> select HAVE_KVM_MEMORY_ATTRIBUTES
> + select HAVE_KVM_RESTRICTED_MEM if X86_64
> + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
Those deps here look weird.
RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
it.
Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
X86_64 - you need that functionality when the respective guest support
is enabled in KVM.
Then, looking forward into your patchset, I'm not sure you even
need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
less Kconfig items because we have waay too many.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-19 8:48 ` Huang, Kai
@ 2022-12-20 7:22 ` Chao Peng
2022-12-20 8:33 ` Huang, Kai
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-20 7:22 UTC (permalink / raw)
To: Huang, Kai
Cc: tglx, linux-arch, kvm, Wang, Wei W, jmattson, Lutomirski, Andy,
ak, kirill.shutemov, david, qemu-devel, tabba, Hocko, Michal,
michael.roth, corbet, linux-fsdevel, dhildenb, bfields,
linux-kernel, x86, bp, vannapurve, rppt, shuah, vkuznets, vbabka,
mail, linux-api, qperret, arnd, pbonzini, ddutile,
naoya.horiguchi, Christopherson,,
Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
Nakajima, Jun, jlayton, joro, linux-mm, steven.price, Hansen,
Dave, linux-doc, akpm, linmiaohe
On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > >
> > > [...]
> > >
> > > > +
> > > > + /*
> > > > + * These pages are currently unmovable so don't place them into
> > > > movable
> > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > + */
> > > > + mapping = memfd->f_mapping;
> > > > + mapping_set_unevictable(mapping);
> > > > + mapping_set_gfp_mask(mapping,
> > > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > >
> > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > non-
> > > movable zones, but doesn't necessarily prevent page from being migrated. My
> > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > get_page() after faulting in the page to prevent.
> >
> > The current api restrictedmem_get_page() already does this, after the
> > caller calling it, it holds a reference to the page. The caller then
> > decides when to call put_page() appropriately.
>
> I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> said in v9 that this code doesn't prevent page migration, and we need to
> increase page refcount in restrictedmem_get_page():
>
> https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
>
> But looking at this series it seems restrictedmem_get_page() in this v10 is
> identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
restrictedmem_get_page() has increased the page refcount since several
versions ago, so no change is needed in v10. You probably missed my reply:
https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
The current solution is clear: unless we have a better approach, we will
let the restrictedmem user (KVM in this case) hold the refcount to
prevent page migration.
Thanks,
Chao
>
> Anyway if this is not fixed, then it should be fixed. Otherwise, a comment at
> the place where page refcount is increased will be helpful to help people
> understand page migration is actually prevented.
>
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-19 10:17 ` Borislav Petkov
@ 2022-12-20 7:24 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-20 7:24 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Mon, Dec 19, 2022 at 11:17:22AM +0100, Borislav Petkov wrote:
> On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> > Tamping down with error number a bit:
> >
> > if (attrs->flags)
> > return -ENXIO;
> > if (attrs->attributes & ~supported_attrs)
> > return -EOPNOTSUPP;
> > if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
> > attrs->size == 0)
> > return -EINVAL;
> > if (attrs->address + attrs->size < attrs->address)
> > return -E2BIG;
>
> Yap, better.
>
> I guess you should add those to the documentation of the ioctl too
> so that people can find out why it fails. Or, well, they can look
> at the code directly too but still... imagine some blurb about
> user-friendliness here...
Thanks for the reminder. Yes, the KVM API documentation is the right place to
put this.
Thanks,
Chao
>
> :-)
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-19 14:36 ` Borislav Petkov
@ 2022-12-20 7:43 ` Chao Peng
2022-12-20 9:55 ` Borislav Petkov
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2022-12-20 7:43 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Mon, Dec 19, 2022 at 03:36:28PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
>
> valueless?
>
> I can't parse that.
It's unnecessary and ...
>
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
>
> bookmarked?
userspace is restricted from accessing the memory content in the fd.
>
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr(hva). When use, a
>
> "When un use, ..."
When both userspace_addr and restricted_fd/offset are used, ...
>
> ...
>
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> > select INTERVAL_TREE
> > select HAVE_KVM_PM_NOTIFIER if PM
> > select HAVE_KVM_MEMORY_ATTRIBUTES
> > + select HAVE_KVM_RESTRICTED_MEM if X86_64
> > + select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
>
> Those deps here look weird.
>
> RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
> it.
RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.
>
> Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
> X86_64 - you need that functionality when the respective guest support
> is enabled in KVM.
Letting the actual feature (e.g. TDX or pKVM) select it or add the dependency
sounds like a viable and clearer solution. Sean, let me know your opinion.
>
> Then, looking forward into your patchset, I'm not sure you even
> need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
> CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
> less Kconfig items because we have waay too many.
The only reason to add another HAVE_KVM_RESTRICTED_MEM is that some code only
works for 64-bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
that.
[*] https://lore.kernel.org/all/YkJLFu98hZOvTSrL@google.com/
Thanks,
Chao
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-20 7:22 ` Chao Peng
@ 2022-12-20 8:33 ` Huang, Kai
2022-12-21 13:39 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Huang, Kai @ 2022-12-20 8:33 UTC (permalink / raw)
To: chao.p.peng
Cc: tglx, linux-arch, kvm, jmattson, Lutomirski, Andy, ak,
kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
michael.roth, corbet, bfields, dhildenb, linux-kernel,
linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
mail, ddutile, qperret, arnd, pbonzini, vannapurve,
naoya.horiguchi, Christopherson,,
Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe
On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > >
> > > > [...]
> > > >
> > > > > +
> > > > > + /*
> > > > > + * These pages are currently unmovable so don't place them into
> > > > > movable
> > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > + */
> > > > > + mapping = memfd->f_mapping;
> > > > > + mapping_set_unevictable(mapping);
> > > > > + mapping_set_gfp_mask(mapping,
> > > > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > >
> > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > non-
> > > > movable zones, but doesn't necessarily prevent page from being migrated. My
> > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > get_page() after faulting in the page to prevent.
> > >
> > > The current api restrictedmem_get_page() already does this, after the
> > > caller calling it, it holds a reference to the page. The caller then
> > > decides when to call put_page() appropriately.
> >
> > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > said in v9 that this code doesn't prevent page migration, and we need to
> > increase page refcount in restrictedmem_get_page():
> >
> > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> >
> > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
>
> restrictedmem_get_page() increases page refcount several versions ago so
> no change in v10 is needed. You probably missed my reply:
>
> https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
But for the non-restricted-mem case, it is correct for KVM to decrease the
page's refcount after setting up the mapping in the secondary MMU; otherwise
the page will be pinned by KVM for a normal VM (since KVM uses GUP to get the
page).
So what we are expecting is: if the page comes from restricted mem, KVM cannot
decrease the refcount; otherwise (a normal page obtained via GUP) KVM should.
>
> The current solution is clear: unless we have better approach, we will
> let restrictedmem user (KVM in this case) to hold the refcount to
> prevent page migration.
>
OK. Will leave it to others :)
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-20 7:43 ` Chao Peng
@ 2022-12-20 9:55 ` Borislav Petkov
2022-12-21 13:42 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Borislav Petkov @ 2022-12-20 9:55 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.
Which basically means that RESTRICTEDMEM should simply depend on KVM.
Because you can't know upfront whether KVM will run a TDX guest or a SNP
guest and so on.
Which then means that RESTRICTEDMEM will practically end up always
enabled in KVM HV configs.
> The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> that.
This is what I mean by "we have too many Kconfig items". :-\
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-20 8:33 ` Huang, Kai
@ 2022-12-21 13:39 ` Chao Peng
2022-12-22 0:37 ` Huang, Kai
2022-12-22 18:15 ` Sean Christopherson
0 siblings, 2 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-21 13:39 UTC (permalink / raw)
To: Huang, Kai
Cc: tglx, linux-arch, kvm, jmattson, Lutomirski, Andy, ak,
kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
michael.roth, corbet, bfields, dhildenb, linux-kernel,
linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
mail, ddutile, qperret, arnd, pbonzini, vannapurve,
naoya.horiguchi, Christopherson,,
Sean, wanpengli, yu.c.zhang, hughd, aarcange, mingo, hpa,
Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe
On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > +
> > > > > > + /*
> > > > > > + * These pages are currently unmovable so don't place them into
> > > > > > movable
> > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > + */
> > > > > > + mapping = memfd->f_mapping;
> > > > > > + mapping_set_unevictable(mapping);
> > > > > > + mapping_set_gfp_mask(mapping,
> > > > > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > >
> > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > > non-
> > > > > movable zones, but doesn't necessarily prevent page from being migrated. My
> > > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > > get_page() after faulting in the page to prevent.
> > > >
> > > > The current api restrictedmem_get_page() already does this, after the
> > > > caller calling it, it holds a reference to the page. The caller then
> > > > decides when to call put_page() appropriately.
> > >
> > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > > said in v9 that this code doesn't prevent page migration, and we need to
> > > increase page refcount in restrictedmem_get_page():
> > >
> > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > >
> > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> >
> > restrictedmem_get_page() increases page refcount several versions ago so
> > no change in v10 is needed. You probably missed my reply:
> >
> > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
>
> But for non-restricted-mem case, it is correct for KVM to decrease page's
> refcount after setting up mapping in the secondary mmu, otherwise the page will
> be pinned by KVM for normal VM (since KVM uses GUP to get the page).
That's true. Actually it is even true for the restrictedmem case; most likely
we will still need the kvm_release_pfn_clean() in KVM generic code. On one
hand, other restrictedmem users like pKVM may not require page pinning
at all. On the other hand, see below.
>
> So what we are expecting is: for KVM if the page comes from restricted mem, then
> KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
I argue that this page pinning (or page migration prevention) is not
tied to where the page comes from, but rather to how the page will
be used. Whether the page is restrictedmem-backed or GUP()-backed, once
it's used by the current version of TDX the page pinning is needed. So
such page migration prevention is really a TDX thing, not even a KVM-generic
thing (that's why I think we don't need to change the existing logic of
kvm_release_pfn_clean()). Wouldn't it be better to let the TDX code (or
whoever requires it) increase/decrease the refcount when it populates/drops
the secure EPT entries? This is exactly what the current TDX code does:
get_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217
put_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
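For readers without the tree handy, a rough sketch of the pattern those links
show; the function names and signatures here are illustrative, not the actual
TDX code:

	static int tdx_map_private_page(kvm_pfn_t pfn)
	{
		/* Pin the page while it is mapped in the secure EPT. */
		get_page(pfn_to_page(pfn));
		/* ... SEAMCALL that adds the page to the secure EPT ... */
		return 0;
	}

	static void tdx_unmap_private_page(kvm_pfn_t pfn)
	{
		/* ... SEAMCALL that removes the page from the secure EPT ... */
		put_page(pfn_to_page(pfn));
	}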
Thanks,
Chao
>
> >
> > The current solution is clear: unless we have better approach, we will
> > let restrictedmem user (KVM in this case) to hold the refcount to
> > prevent page migration.
> >
>
> OK. Will leave to others :)
>
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-20 9:55 ` Borislav Petkov
@ 2022-12-21 13:42 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-21 13:42 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Dec 20, 2022 at 10:55:44AM +0100, Borislav Petkov wrote:
> On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> > RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.
>
> Which basically means that RESTRICTEDMEM should simply depend on KVM.
> Because you can't know upfront whether KVM will run a TDX guest or a SNP
> guest and so on.
>
> Which then means that RESTRICTEDMEM will practically end up always
> enabled in KVM HV configs.
That's right, CONFIG_RESTRICTEDMEM is always selected for supported KVM
architectures (currently x86_64).
>
> > The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> > works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> > that.
>
> This is what I mean with "we have too many Kconfig items". :-\
Yes, I agree. One way to remove this is probably to additionally check
CONFIG_64BIT instead.
Thanks,
Chao
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-21 13:39 ` Chao Peng
@ 2022-12-22 0:37 ` Huang, Kai
2022-12-23 8:20 ` Chao Peng
2023-01-23 14:03 ` Vlastimil Babka
2022-12-22 18:15 ` Sean Christopherson
1 sibling, 2 replies; 153+ messages in thread
From: Huang, Kai @ 2022-12-22 0:37 UTC (permalink / raw)
To: chao.p.peng
Cc: tglx, linux-arch, kvm, jmattson, Hocko, Michal, pbonzini, ak,
Lutomirski, Andy, linux-fsdevel, tabba, david, michael.roth,
kirill.shutemov, corbet, qemu-devel, dhildenb, bfields,
linux-kernel, x86, bp, ddutile, rppt, shuah, vkuznets, vbabka,
mail, naoya.horiguchi, qperret, arnd, linux-api, yu.c.zhang,
Christopherson,,
Sean, wanpengli, vannapurve, hughd, aarcange, mingo, hpa,
Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe
On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > [...]
> > > > > > > > > > > >
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > + /*
> > > > > > > > > > > > > > + * These pages are currently unmovable so don't place them into
> > > > > > > > > > > > > > movable
> > > > > > > > > > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > + mapping = memfd->f_mapping;
> > > > > > > > > > > > > > + mapping_set_unevictable(mapping);
> > > > > > > > > > > > > > + mapping_set_gfp_mask(mapping,
> > > > > > > > > > > > > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > > > > > > > > >
> > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > > > > > > > > > non-
> > > > > > > > > > > > movable zones, but doesn't necessarily prevent page from being migrated. My
> > > > > > > > > > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > > > > > > > > > get_page() after faulting in the page to prevent.
> > > > > > > > > >
> > > > > > > > > > The current api restrictedmem_get_page() already does this, after the
> > > > > > > > > > caller calling it, it holds a reference to the page. The caller then
> > > > > > > > > > decides when to call put_page() appropriately.
> > > > > > > >
> > > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > > > > > > > said in v9 that this code doesn't prevent page migration, and we need to
> > > > > > > > increase page refcount in restrictedmem_get_page():
> > > > > > > >
> > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > > > > > > >
> > > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> > > > > >
> > > > > > restrictedmem_get_page() increases page refcount several versions ago so
> > > > > > no change in v10 is needed. You probably missed my reply:
> > > > > >
> > > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
> > > >
> > > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> >
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
OK. Agreed.
> >
> > > >
> > > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> >
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()).
> >
This essentially boils down to who "owns" page migration handling, and sadly,
page migration is kinda "owned" by the core kernel, i.e. KVM cannot handle
page migration by itself -- it's just a passive receiver.
For normal pages, page migration is done entirely by the core kernel (i.e. it
unmaps the page from the VMA, allocates a new page, and uses migrate_page() or
a_ops->migrate_page() to actually migrate the page).
In the case of TDX, conceptually it should be done the same way. The more
important thing is: yes, KVM can use get_page() to prevent page migration, but
when KVM wants to support it, KVM cannot just remove get_page(), as the core
kernel will still just do migrate_page(), which won't work for TDX (given
restricted_memfd doesn't have a_ops->migrate_page() implemented).
So I think the restricted_memfd filesystem should own page migration handling
(i.e. by implementing a_ops->migrate_page() to either just reject page
migration or somehow support it).
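For illustration, one minimal way the filesystem side could refuse migration
until a dedicated API exists; this is a sketch only, and whether and how to
wire it up is exactly what is being discussed here:

	static int restrictedmem_migrate_folio(struct address_space *mapping,
					       struct folio *dst,
					       struct folio *src,
					       enum migrate_mode mode)
	{
		/* restrictedmem pages are not migratable for now. */
		return -EBUSY;
	}

	static const struct address_space_operations restrictedmem_aops = {
		.migrate_folio	= restrictedmem_migrate_folio,
		/* other callbacks elided */
	};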
To support page migration, it may require KVM's help in the case of TDX (the
TDH.MEM.PAGE.RELOCATE SEAMCALL requires the "GPA" and "level" of the EPT
mapping, which are only available in KVM), but that doesn't make KVM own the
handling of page migration.
> > Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
> >
> > get_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217
> >
> > put_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
> >
As explained above, I think doing so in KVM is wrong: KVM can prevent
migration by using get_page(), but you cannot simply remove it later to
support page migration.
Sean also said a similar thing when reviewing the v8 KVM TDX series, and I agree:
https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@google.com/
https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@intel.com/
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-21 13:39 ` Chao Peng
2022-12-22 0:37 ` Huang, Kai
@ 2022-12-22 18:15 ` Sean Christopherson
2022-12-23 0:50 ` Huang, Kai
` (2 more replies)
1 sibling, 3 replies; 153+ messages in thread
From: Sean Christopherson @ 2022-12-22 18:15 UTC (permalink / raw)
To: Chao Peng
Cc: Huang, Kai, tglx, linux-arch, kvm, jmattson, Lutomirski, Andy,
ak, kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
michael.roth, corbet, bfields, dhildenb, linux-kernel,
linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
mail, ddutile, qperret, arnd, pbonzini, vannapurve,
naoya.horiguchi, wanpengli, yu.c.zhang, hughd, aarcange, mingo,
hpa, Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe
On Wed, Dec 21, 2022, Chao Peng wrote:
> On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
>
> That's true. Actually even true for restrictedmem case, most likely we
> will still need the kvm_release_pfn_clean() for KVM generic code. On one
> side, other restrictedmem users like pKVM may not require page pinning
> at all. On the other side, see below.
>
> >
> > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
No, requiring the user (KVM) to guard against lack of support for page migration
in restricted mem is a terrible API. It's totally fine for restricted mem to not
support page migration until there's a use case, but punting the problem to KVM
is not acceptable. Restricted mem itself doesn't yet support page migration,
e.g. explosions would occur even if KVM wanted to allow migration since there is
no notification to invalidate existing mappings.
> I argue that this page pinning (or page migration prevention) is not
> tied to where the page comes from, instead related to how the page will
> be used. Whether the page is restrictedmem backed or GUP() backed, once
> it's used by current version of TDX then the page pinning is needed. So
> such page migration prevention is really TDX thing, even not KVM generic
> thing (that's why I think we don't need change the existing logic of
> kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> requires that) to increase/decrease the refcount when it populates/drops
> the secure EPT entries? This is exactly what the current TDX code does:
I agree that whether or not migration is supported should be controllable by the
user, but I strongly disagree on punting refcount management to KVM (or TDX).
The whole point of restricted mem is to support technologies like TDX and SNP,
accommodating their special needs for things like page migration should be
part of the API, not some footnote in the documentation.
It's not difficult to let the user communicate support for page migration, e.g.
if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
to signal support (or lack thereof) for page migration. NULL == no migration,
non-NULL == migration allowed.
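A hypothetical sketch of that idea; the invalidate callbacks approximate the
ops this series already defines, while the migrate hook and its prototype are
purely illustrative:

	struct restrictedmem_notifier_ops {
		void (*invalidate_start)(struct restrictedmem_notifier *notifier,
					 pgoff_t start, pgoff_t end);
		void (*invalidate_end)(struct restrictedmem_notifier *notifier,
				       pgoff_t start, pgoff_t end);
		/* NULL == the consumer cannot handle migration of its pages. */
		int (*migrate)(struct restrictedmem_notifier *notifier,
			       pgoff_t start, pgoff_t end);
	};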
We know that supporting page migration in TDX and SNP is possible, and we know
that page migration will require a dedicated API since the backing store can't
memcpy() the page. I don't see any reason to ignore that eventuality.
But again, unless I'm missing something, that's a future problem because restricted
mem doesn't yet support page migration regardless of the downstream user.
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-22 18:15 ` Sean Christopherson
@ 2022-12-23 0:50 ` Huang, Kai
2022-12-23 8:24 ` Chao Peng
2023-01-23 15:43 ` Kirill A. Shutemov
2 siblings, 0 replies; 153+ messages in thread
From: Huang, Kai @ 2022-12-23 0:50 UTC (permalink / raw)
To: Christopherson,, Sean, chao.p.peng
Cc: tglx, linux-arch, kvm, jmattson, Hocko, Michal, pbonzini, ak,
Lutomirski, Andy, linux-fsdevel, tabba, david, michael.roth,
kirill.shutemov, corbet, qemu-devel, dhildenb, bfields,
linux-kernel, x86, bp, ddutile, rppt, shuah, vkuznets, vbabka,
mail, naoya.horiguchi, qperret, arnd, linux-api, yu.c.zhang,
aarcange, wanpengli, vannapurve, hughd, mingo, hpa, Nakajima,
Jun, jlayton, joro, linux-mm, Wang, Wei W, steven.price,
linux-doc, Hansen, Dave, akpm, linmiaohe
On Thu, 2022-12-22 at 18:15 +0000, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> >
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> >
> > >
> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
>
> No, requiring the user (KVM) to guard against lack of support for page migration
> in restricted mem is a terrible API. It's totally fine for restricted mem to not
> support page migration until there's a use case, but punting the problem to KVM
> is not acceptable. Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there is
> no notification to invalidate existing mappings.
>
>
Yes totally agree (I also replied separately).
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-22 0:37 ` Huang, Kai
@ 2022-12-23 8:20 ` Chao Peng
2023-01-23 14:03 ` Vlastimil Babka
1 sibling, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-23 8:20 UTC (permalink / raw)
To: Huang, Kai
Cc: tglx, linux-arch, kvm, jmattson, Hocko, Michal, pbonzini, ak,
Lutomirski, Andy, linux-fsdevel, tabba, david, michael.roth,
kirill.shutemov, corbet, qemu-devel, dhildenb, bfields,
linux-kernel, x86, bp, ddutile, rppt, shuah, vkuznets, vbabka,
mail, naoya.horiguchi, qperret, arnd, linux-api, yu.c.zhang,
Christopherson,,
Sean, wanpengli, vannapurve, hughd, aarcange, mingo, hpa,
Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe
On Thu, Dec 22, 2022 at 12:37:19AM +0000, Huang, Kai wrote:
> On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote:
> > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > [...]
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + /*
> > > > > > > > > > > > > > > + * These pages are currently unmovable so don't place them into
> > > > > > > > > > > > > > > movable
> > > > > > > > > > > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > + mapping = memfd->f_mapping;
> > > > > > > > > > > > > > > + mapping_set_unevictable(mapping);
> > > > > > > > > > > > > > > + mapping_set_gfp_mask(mapping,
> > > > > > > > > > > > > > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > > > > > > > > > >
> > > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > > > > > > > > > > non-
> > > > > > > > > > > > > movable zones, but doesn't necessarily prevent page from being migrated. My
> > > > > > > > > > > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > > > > > > > > > > get_page() after faulting in the page to prevent.
> > > > > > > > > > >
> > > > > > > > > > > The current api restrictedmem_get_page() already does this, after the
> > > > > > > > > > > caller calling it, it holds a reference to the page. The caller then
> > > > > > > > > > > decides when to call put_page() appropriately.
> > > > > > > > >
> > > > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > > > > > > > > said in v9 that this code doesn't prevent page migration, and we need to
> > > > > > > > > increase page refcount in restrictedmem_get_page():
> > > > > > > > >
> > > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > > > > > > > >
> > > > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> > > > > > >
> > > > > > > restrictedmem_get_page() increases page refcount several versions ago so
> > > > > > > no change in v10 is needed. You probably missed my reply:
> > > > > > >
> > > > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
> > > > >
> > > > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > >
> > > That's true. Actually even true for restrictedmem case, most likely we
> > > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > > side, other restrictedmem users like pKVM may not require page pinning
> > > at all. On the other side, see below.
>
> OK. Agreed.
>
> > >
> > > > >
> > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> > >
> > > I argue that this page pinning (or page migration prevention) is not
> > > tied to where the page comes from, instead related to how the page will
> > > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > > it's used by current version of TDX then the page pinning is needed. So
> > > such page migration prevention is really TDX thing, even not KVM generic
> > > thing (that's why I think we don't need change the existing logic of
> > > kvm_release_pfn_clean()).
> > >
>
> This essentially boils down to who "owns" page migration handling, and sadly,
> page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
> migration by itself -- it's just a passive receiver.
No, I'm not talking about the page migration handling itself; I know page
migration requires coordination from both core-mm and KVM. I'm more
concerned about the page migration prevention here. This is something we
need to address for TDX before page migration is supported.
>
> For normal pages, page migration is totally done by the core-kernel (i.e. it
> unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
> >migrate_page() to actually migrate the page).
>
> In the sense of TDX, conceptually it should be done in the same way. The more
> important thing is: yes KVM can use get_page() to prevent page migration, but
> when KVM wants to support it, KVM cannot just remove get_page(), as the core-
> kernel will still just do migrate_page() which won't work for TDX (given
> restricted_memfd doesn't have a_ops->migrate_page() implemented).
>
> So I think the restricted_memfd filesystem should own page migration handling,
> (i.e. by implementing a_ops->migrate_page() to either just reject page migration
> or somehow support it).
>
> To support page migration, it may require KVM's help in case of TDX (the
> TDH.MEM.PAGE.RELOCATE SEAMCALL requires "GPA" and "level" of EPT mapping, which
> are only available in KVM), but that doesn't make KVM to own the handling of
> page migration.
>
>
> > > Wouldn't better to let TDX code (or who
> > > requires that) to increase/decrease the refcount when it populates/drops
> > > the secure EPT entries? This is exactly what the current TDX code does:
> > >
> > > get_page():
> > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217
> > >
> > > put_page():
> > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
> > >
>
> As explained above, I think doing so in KVM is wrong: it can prevent by using
> get_page(), but you cannot simply remove it to support page migration.
Removing get_page() is definitely not enough for page migration support.
But the key question is: for page migration prevention, do we really have
an alternative other than get_page()?
Thanks,
Chao
>
> Sean also said similar thing when reviewing v8 KVM TDX series and I also agree:
>
> https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@google.com/
> https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@intel.com/
>
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-22 18:15 ` Sean Christopherson
2022-12-23 0:50 ` Huang, Kai
@ 2022-12-23 8:24 ` Chao Peng
2023-01-23 15:43 ` Kirill A. Shutemov
2 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2022-12-23 8:24 UTC (permalink / raw)
To: Sean Christopherson
Cc: Huang, Kai, tglx, linux-arch, kvm, jmattson, Lutomirski, Andy,
ak, kirill.shutemov, Hocko, Michal, qemu-devel, tabba, david,
michael.roth, corbet, bfields, dhildenb, linux-kernel,
linux-fsdevel, x86, bp, linux-api, rppt, shuah, vkuznets, vbabka,
mail, ddutile, qperret, arnd, pbonzini, vannapurve,
naoya.horiguchi, wanpengli, yu.c.zhang, hughd, aarcange, mingo,
hpa, Nakajima, Jun, jlayton, joro, linux-mm, Wang, Wei W,
steven.price, linux-doc, Hansen, Dave, akpm, linmiaohe
On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> >
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> >
> > >
> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
>
> No, requiring the user (KVM) to guard against lack of support for page migration
> in restricted mem is a terrible API. It's totally fine for restricted mem to not
> support page migration until there's a use case, but punting the problem to KVM
> is not acceptable. Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there is
> no notification to invalidate existing mappings.
>
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
>
> I agree that whether or not migration is supported should be controllable by the
> user, but I strongly disagree on punting refcount management to KVM (or TDX).
> The whole point of restricted mem is to support technologies like TDX and SNP,
> accomodating their special needs for things like page migration should be part of
> the API, not some footnote in the documenation.
I never doubted that page migration should be part of the restrictedmem API,
but that's not in the initial implementation, as we all agreed? Then before
that API is introduced, we need to find a solution to prevent page migration
for TDX. Other than refcount management, do we have any other workable
solution?
>
> It's not difficult to let the user communicate support for page migration, e.g.
> if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
> to signal support (or lack thereof) for page migration. NULL == no migration,
> non-NULL == migration allowed.
I know.
>
> We know that supporting page migration in TDX and SNP is possible, and we know
> that page migration will require a dedicated API since the backing store can't
> memcpy() the page. I don't see any reason to ignore that eventuality.
No, I'm not ignoring it. It's just about the short-term page migration
prevention before that dedicated API is introduced.
>
> But again, unless I'm missing something, that's a future problem because restricted
> mem doesn't yet support page migration regardless of the downstream user.
It's true that page migration support itself is a future problem, but page
migration prevention is not a future problem, since TDX pages need to be
pinned before page migration gets supported.
Thanks,
Chao
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
` (2 preceding siblings ...)
2022-12-16 15:09 ` Borislav Petkov
@ 2022-12-28 8:28 ` Chenyi Qiang
2023-01-03 1:39 ` Chao Peng
2023-01-13 22:02 ` Sean Christopherson
` (2 subsequent siblings)
6 siblings, 1 reply; 153+ messages in thread
From: Chenyi Qiang @ 2022-12-28 8:28 UTC (permalink / raw)
To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On 12/2/2022 2:13 PM, Chao Peng wrote:
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
> - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> a guest memory range.
> - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
> Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> arch/x86/kvm/Kconfig | 1 +
> include/linux/kvm_host.h | 3 ++
> include/uapi/linux/kvm.h | 17 ++++++++
> virt/kvm/Kconfig | 3 ++
> virt/kvm/kvm_main.c | 76 ++++++++++++++++++++++++++++++++++
> 6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> The "pad" and "reserved" fields may be used for future extensions and should be
> set to 0s by userspace.
>
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> + #define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> + #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> + #define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> + struct kvm_memory_attributes {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
> 5. The kvm_run structure
> ========================
>
> @@ -8270,6 +8323,16 @@ structure.
> When getting the Modified Change Topology Report value, the attr->addr
> must point to a byte where the value will be stored or retrieved from.
>
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
> 9. Known KVM API problems
> =========================
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
> select SRCU
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> + select HAVE_KVM_MEMORY_ATTRIBUTES
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + struct xarray mem_attr_array;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_S390_CPU_TOPOLOGY 222
> #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
> /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES _IOR(KVMIO, 0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES _IOWR(KVMIO, 0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
> config HAVE_KVM_DIRTY_RING
> bool
>
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> + bool
> +
> # Only strongly ordered architectures can select this, as it doesn't
> # put any explicit constraint on userspace ordering. They can also
> # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_init(&kvm->mem_attr_array);
> +#endif
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_destroy(&kvm->mem_attr_array);
> +#endif
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> + return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> + struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)
> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;
> +
> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
Because guest memory defaults to private, and now this patch stores the
attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of _SHARED, it
would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of boot
time. Maybe it can be optimized somehow in other places? e.g. set mem
attr in advance.
> + mutex_lock(&kvm->lock);
> + for (i = start; i < end; i++)
> + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT)))
> + break;
> + mutex_unlock(&kvm->lock);
> +
> + attrs->address = i << PAGE_SHIFT;
> + attrs->size = (end - i) << PAGE_SHIFT;
> +
> + return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> return __gfn_to_memslot(kvm_memslots(kvm), gfn);
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-28 8:28 ` Chenyi Qiang
@ 2023-01-03 1:39 ` Chao Peng
2023-01-03 3:32 ` Wang, Wei W
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2023-01-03 1:39 UTC (permalink / raw)
To: Chenyi Qiang
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Wed, Dec 28, 2022 at 04:28:01PM +0800, Chenyi Qiang wrote:
...
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > + struct kvm_memory_attributes *attrs)
> > +{
> > + gfn_t start, end;
> > + unsigned long i;
> > + void *entry;
> > + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > + /* flags is currently not used. */
> > + if (attrs->flags)
> > + return -EINVAL;
> > + if (attrs->attributes & ~supported_attrs)
> > + return -EINVAL;
> > + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > + return -EINVAL;
> > + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > + return -EINVAL;
> > +
> > + start = attrs->address >> PAGE_SHIFT;
> > + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
>
> Because guest memory defaults to private, and now this patch stores the
> attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of _SHARED, it
> would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of boot
> time. Maybe it can be optimized somehow in other places? e.g. set mem
> attr in advance.
KVM defaults to 'shared' because this ioctl can also potentially be used
by normal VMs, and 'shared' is a value that is meaningful for both normal
VMs and confidential VMs. As for the additional KVM_EXIT_MEMORY_FAULT exits
during boot time, yes, setting all memory to 'private' for confidential VMs
through this ioctl in userspace before guest launch is an approach for KVM
userspace to 'override' the KVM default and reduce the number of implicit
conversions.
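E.g. a rough userspace sketch (vm_fd and GUEST_MEM_SIZE are assumptions of
the example, not part of this series; error handling via err(3) for brevity):

	struct kvm_memory_attributes attrs = {
		.address    = 0,
		.size       = GUEST_MEM_SIZE,	/* must be page aligned */
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags      = 0,
	};

	/* Mark the whole guest range private before launching the guest. */
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		err(1, "KVM_SET_MEMORY_ATTRIBUTES");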
Thanks,
Chao
>
> > + mutex_lock(&kvm->lock);
> > + for (i = start; i < end; i++)
> > + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > + GFP_KERNEL_ACCOUNT)))
> > + break;
> > + mutex_unlock(&kvm->lock);
> > +
> > + attrs->address = i << PAGE_SHIFT;
> > + attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > + return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> > struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> > {
> > return __gfn_to_memslot(kvm_memslots(kvm), gfn);
^ permalink raw reply [flat|nested] 153+ messages in thread
* RE: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2023-01-03 1:39 ` Chao Peng
@ 2023-01-03 3:32 ` Wang, Wei W
2023-01-03 23:06 ` Sean Christopherson
0 siblings, 1 reply; 153+ messages in thread
From: Wang, Wei W @ 2023-01-03 3:32 UTC (permalink / raw)
To: Chao Peng, Qiang, Chenyi
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Christopherson,,
Sean, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
Lutomirski, Andy, Nakajima, Jun, Hansen, Dave, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
Hocko, Michal
On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > Because guest memory defaults to private, and now this patch stores
> > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> _SHARED,
> > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > mem attr in advance.
>
> KVM defaults to 'shared' because this ioctl can also be potentially used by
> normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> confidential VMs.
Do you mean a normal VM could have pages marked private? What's the usage?
(If all the pages are just marked shared for normal VMs, then why do we need it)
> As for more KVM_EXIT_MEMORY_FAULT exits during the
> booting time, yes, setting all memory to 'private' for confidential VMs through
> this ioctl in userspace before guest launch is an approach for KVM userspace to
> 'override' the KVM default and reduce the number of implicit conversions.
Most pages of a confidential VM are likely to be private pages. It seems more efficient
(and not difficult to check vm_type) to have KVM default to "private" for confidential VMs
and default to "shared" for normal VMs.
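Roughly the idea below; kvm_is_confidential_vm() is of course hypothetical at
this point, since no such predicate (nor vm_type) exists yet in this series:

static u64 kvm_default_mem_attributes(struct kvm *kvm)
{
	/* Sketch only: the confidential-VM check is a placeholder. */
	return kvm_is_confidential_vm(kvm) ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0;
}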
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2023-01-03 3:32 ` Wang, Wei W
@ 2023-01-03 23:06 ` Sean Christopherson
2023-01-05 4:39 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Sean Christopherson @ 2023-01-03 23:06 UTC (permalink / raw)
To: Wang, Wei W
Cc: Chao Peng, Qiang, Chenyi, kvm, linux-kernel, linux-mm,
linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
Yu Zhang, Kirill A . Shutemov, Lutomirski, Andy, Nakajima, Jun,
Hansen, Dave, ak, david, aarcange, ddutile, dhildenb,
Quentin Perret, tabba, Michael Roth, Hocko, Michal
On Tue, Jan 03, 2023, Wang, Wei W wrote:
> On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > Because guest memory defaults to private, and now this patch stores
> > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > _SHARED,
> > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > mem attr in advance.
> >
> > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > confidential VMs.
>
> Do you mean a normal VM could have pages marked private? What's the usage?
> (If all the pages are just marked shared for normal VMs, then why do we need it)
No, there are potential use cases for per-page attribute/permissions, e.g. to
make select pages read-only, exec-only, no-exec, etc...
> > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > booting time, yes, setting all memory to 'private' for confidential VMs through
> > this ioctl in userspace before guest launch is an approach for KVM userspace to
> > 'override' the KVM default and reduce the number of implicit conversions.
>
> Most pages of a confidential VM are likely to be private pages. It seems more efficient
> (and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
> and defaults to "shared" for normal VMs.
If done right, the default shouldn't matter all that much for efficiency. KVM
needs to be able to efficiently track large ranges regardless of the default,
otherwise the memory overhead and, presumably, the cost of lookups will be painful.
E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
entries.
Looks like that behavior was changed in v8 in response to feedback[*] that doing
xa_store_range() on a subset of an existing range (entry) would overwrite the
entire existing range (entry), not just the smaller subset. xa_store_range() does
appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
splitting an existing entry isn't super complex.
Using xa_store() for the very initial implementation is ok, and probably a good
idea since it's more obviously correct and will give us a bisection point. But
we definitely want a more performant implementation sooner than later. The hardest
part will likely be merging existing entries, but that can be done separately too,
and is probably lower priority.
E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
metadata in multi-index entries, and finally (3) support merging adjacent entries
with identical values.
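For step (2), an untested sketch of what a range-based store could look like
(kvm_set_mem_attributes_range() is just a name for the sketch; this needs
CONFIG_XARRAY_MULTI and still has the "overwrites the whole existing entry"
caveat discussed above):

static int kvm_set_mem_attributes_range(struct kvm *kvm, gfn_t start,
					gfn_t end, u64 attrs)
{
	void *entry = attrs ? xa_mk_value(attrs) : NULL;

	/* One multi-index entry covers [start, end - 1], not one per 4KiB page. */
	return xa_err(xa_store_range(&kvm->mem_attr_array, start, end - 1,
				     entry, GFP_KERNEL_ACCOUNT));
}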
[*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=RBWF6-v2CSpabOCpKq5rPz+e-9co7EisoVQ@mail.gmail.com
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2023-01-03 23:06 ` Sean Christopherson
@ 2023-01-05 4:39 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2023-01-05 4:39 UTC (permalink / raw)
To: Sean Christopherson
Cc: Wang, Wei W, Qiang, Chenyi, kvm, linux-kernel, linux-mm,
linux-fsdevel, linux-arch, linux-api, linux-doc, qemu-devel,
Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86,
H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
Maciej S . Szmigiero, Vlastimil Babka, Vishal Annapurve,
Yu Zhang, Kirill A . Shutemov, Lutomirski, Andy, Nakajima, Jun,
Hansen, Dave, ak, david, aarcange, ddutile, dhildenb,
Quentin Perret, tabba, Michael Roth, Hocko, Michal
On Tue, Jan 03, 2023 at 11:06:37PM +0000, Sean Christopherson wrote:
> On Tue, Jan 03, 2023, Wang, Wei W wrote:
> > On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > > Because guest memory defaults to private, and now this patch stores
> > > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > > _SHARED,
> > > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > > mem attr in advance.
> > >
> > > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > > confidential VMs.
> >
> > Do you mean a normal VM could have pages marked private? What's the usage?
> > (If all the pages are just marked shared for normal VMs, then why do we need it)
>
> No, there are potential use cases for per-page attribute/permissions, e.g. to
> make select pages read-only, exec-only, no-exec, etc...
Right, normal VMs are not likely to use the private/shared bit. I'm not sure
about pKVM, but perhaps we shouldn't call it a 'normal' VM in this context.
But since the ioctl can be used by normal VMs for other bits (read-only,
exec-only, no-exec, etc), a default of 'private' looks strange for them.
That's why I default it to 'shared'; for a confidential guest, we can issue
another call to this ioctl to set all the memory to 'private' before guest
boot, if a default of 'private' is needed for the guest.
Like Wei mentioned, it's also possible to make the default depend on vm_type,
but that looks awkward to me from the API definition as well as the
implementation perspective, and vm_type has not been introduced at this
point in the series.
>
> > > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > > booting time, yes, setting all memory to 'private' for confidential VMs through
> > > this ioctl in userspace before guest launch is an approach for KVM userspace to
> > > 'override' the KVM default and reduce the number of implicit conversions.
> >
> > Most pages of a confidential VM are likely to be private pages. It seems more efficient
> > (and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
> > and defaults to "shared" for normal VMs.
>
> If done right, the default shouldn't matter all that much for efficiency. KVM
> needs to be able to efficiently track large ranges regardless of the default,
> otherwise the memory overhead and, presumably, the cost of lookups will be painful.
> E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
> entries.
I agree, KVM should have the ability to track large ranges efficiently.
>
> Looks like that behavior was changed in v8 in response to feedback[*] that doing
> xa_store_range() on a subset of an existing range (entry) would overwrite the
> entire existing range (entry), not just the smaller subset. xa_store_range() does
> appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
> splitting an existing entry isn't super complex.
Yes, xa_store_range() initially looked like a perfect match for us, but the
'overwriting the entire entry' behavior makes it incorrect for us when
storing a subset of an existing large entry. The xarray lib has utilities
for splitting; the hard part is merging existing entries, as you also
said below. Thanks for pointing out the __filemap_add_folio() example,
the splitting part does not look too complex there.
>
> Using xa_store() for the very initial implementation is ok, and probably a good
> idea since it's more obviously correct and will give us a bisection point. But
> we definitely want a more performant implementation sooner than later. The hardest
> part will likely be merging existing entries, but that can be done separately too,
> and is probably lower priority.
>
> E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
> metadata in multi-index entries, and finally (3) support merging adjacent entries
> with identical values.
This path looks good to me.
Thanks,
Chao
>
> [*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=RBWF6-v2CSpabOCpKq5rPz+e-9co7EisoVQ@mail.gmail.com
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2022-12-02 6:13 ` [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory Chao Peng
` (2 preceding siblings ...)
2022-12-19 14:36 ` Borislav Petkov
@ 2023-01-05 11:23 ` Jarkko Sakkinen
2023-01-06 9:40 ` Chao Peng
3 siblings, 1 reply; 153+ messages in thread
From: Jarkko Sakkinen @ 2023-01-05 11:23 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> In memory encryption usage, guest memory may be encrypted with special
> key and can be accessed only by the guest itself. We call such memory
> private memory. It's valueless and sometimes can cause problem to allow
> userspace to access guest private memory. This new KVM memslot extension
> allows guest private memory being provided through a restrictedmem
> backed file descriptor(fd) and userspace is restricted to access the
> bookmarked memory in the fd.
>
> This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> additional KVM memslot fields restricted_fd/restricted_offset to allow
> userspace to instruct KVM to provide guest memory through restricted_fd.
> 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> and the size is 'memory_size'.
>
> The extended memslot can still have the userspace_addr(hva). When use, a
> single memslot can maintain both private memory through restricted_fd
> and shared memory through userspace_addr. Whether the private or shared
> part is visible to guest is maintained by other KVM code.
>
> A restrictedmem_notifier field is also added to the memslot structure to
> allow the restricted_fd's backing store to notify KVM the memory change,
> KVM then can invalidate its page table entries or handle memory errors.
>
> Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> and right now it is selected on X86_64 only.
>
> To make future maintenance easy, internally use a binary compatible
> alias struct kvm_user_mem_region to handle both the normal and the
> '_ext' variants.
Feels a bit hacky IMHO, and more like a completely new feature than
an extension.
Why not just add a new ioctl? The commit message does not address
the most essential design here.
BR, Jarkko
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2022-12-02 6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-12-09 9:11 ` Fuad Tabba
@ 2023-01-05 20:38 ` Vishal Annapurve
2023-01-06 4:13 ` Chao Peng
2023-01-14 0:01 ` Sean Christopherson
2023-03-07 19:14 ` Ackerley Tng
3 siblings, 1 reply; 153+ messages in thread
From: Vishal Annapurve @ 2023-01-05 20:38 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, luto,
jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Dec 1, 2022 at 10:20 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> + pgoff_t start, pgoff_t end,
> + gfn_t *gfn_start, gfn_t *gfn_end)
> +{
> + unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +
> + if (start > base_pgoff)
> + *gfn_start = slot->base_gfn + start - base_pgoff;
There should be a check for overflow here in case start is a very big
value. An additional check can look like:

	if (start >= base_pgoff + slot->npages)
		return false;
> + else
> + *gfn_start = slot->base_gfn;
> +
> + if (end < base_pgoff + slot->npages)
> + *gfn_end = slot->base_gfn + end - base_pgoff;
If "end" is smaller than base_pgoff, this can cause overflow and
return the range as valid. There should be additional check:
if (end < base_pgoff)
return false;
> + else
> + *gfn_end = slot->base_gfn + slot->npages;
> +
> + if (*gfn_start >= *gfn_end)
> + return false;
> +
> + return true;
> +}
> +
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2023-01-05 20:38 ` Vishal Annapurve
@ 2023-01-06 4:13 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2023-01-06 4:13 UTC (permalink / raw)
To: Vishal Annapurve
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Yu Zhang, Kirill A . Shutemov, luto,
jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Jan 05, 2023 at 12:38:30PM -0800, Vishal Annapurve wrote:
> On Thu, Dec 1, 2022 at 10:20 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> > + pgoff_t start, pgoff_t end,
> > + gfn_t *gfn_start, gfn_t *gfn_end)
> > +{
> > + unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> > +
> > + if (start > base_pgoff)
> > + *gfn_start = slot->base_gfn + start - base_pgoff;
>
> There should be a check for overflow here in case start is a very big
> value. Additional check can look like:
> if (start >= base_pgoff + slot->npages)
> return false;
>
> > + else
> > + *gfn_start = slot->base_gfn;
> > +
> > + if (end < base_pgoff + slot->npages)
> > + *gfn_end = slot->base_gfn + end - base_pgoff;
>
> If "end" is smaller than base_pgoff, this can cause overflow and
> return the range as valid. There should be additional check:
> if (end < base_pgoff)
> return false;
Thanks! Both are good catches. The improved code:
static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
					 pgoff_t start, pgoff_t end,
					 gfn_t *gfn_start, gfn_t *gfn_end)
{
	unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;

	if (start >= base_pgoff + slot->npages)
		return false;
	else if (start <= base_pgoff)
		*gfn_start = slot->base_gfn;
	else
		*gfn_start = start - base_pgoff + slot->base_gfn;

	if (end <= base_pgoff)
		return false;
	else if (end >= base_pgoff + slot->npages)
		*gfn_end = slot->base_gfn + slot->npages;
	else
		*gfn_end = end - base_pgoff + slot->base_gfn;

	if (*gfn_start >= *gfn_end)
		return false;

	return true;
}
Thanks,
Chao
>
>
> > + else
> > + *gfn_end = slot->base_gfn + slot->npages;
> > +
> > + if (*gfn_start >= *gfn_end)
> > + return false;
> > +
> > + return true;
> > +}
> > +
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2023-01-05 11:23 ` Jarkko Sakkinen
@ 2023-01-06 9:40 ` Chao Peng
2023-01-09 19:32 ` Sean Christopherson
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2023-01-06 9:40 UTC (permalink / raw)
To: Jarkko Sakkinen
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> >
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> >
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries or handle memory errors.
> >
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
>
> Feels bit hacky IMHO, and more like a completely new feature than
> an extension.
>
> Why not just add a new ioctl? The commit message does not address
> the most essential design here.
Yes, people can always choose to add a new ioctl for this kind of change,
and the balance point here is that we also want to avoid 'too many ioctls'
if the functionalities are similar. The '_ext' variant reuses all the
existing fields in the 'normal' variant and, most importantly, KVM can
internally reuse most of the code. I certainly can add some words to
the commit message to explain this design choice.
Thanks,
Chao
>
> BR, Jarkko
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2023-01-06 9:40 ` Chao Peng
@ 2023-01-09 19:32 ` Sean Christopherson
2023-01-10 9:14 ` Chao Peng
2023-01-20 23:28 ` Jarkko Sakkinen
0 siblings, 2 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-09 19:32 UTC (permalink / raw)
To: Chao Peng
Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Jan 06, 2023, Chao Peng wrote:
> On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> >
> > Feels bit hacky IMHO, and more like a completely new feature than
> > an extension.
> >
> > Why not just add a new ioctl? The commit message does not address
> > the most essential design here.
>
> Yes, people can always choose to add a new ioctl for this kind of change
> and the balance point here is we want to also avoid 'too many ioctls' if
> the functionalities are similar. The '_ext' variant reuses all the
> existing fields in the 'normal' variant and most importantly KVM
> internally can reuse most of the code. I certainly can add some words in
> the commit message to explain this design choice.
After seeing the userspace side of this, I agree with Jarkko; overloading
KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
itself.
It feels absolutely ridiculous, but I think the best option is to do:
#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
				     struct kvm_userspace_memory_region2)

/* for KVM_SET_USER_MEMORY_REGION2 */
struct kvm_user_mem_region2 {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size;
	__u64 userspace_addr;
	__u64 restricted_offset;
	__u32 restricted_fd;
	__u32 pad1;
	__u64 pad2[14];
};
And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
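From the userspace side, usage would then look something like this sketch
(KVM_MEM_PRIVATE comes from this series; vm_fd, shared_mem, restricted_fd and
GUEST_MEM_SIZE are assumptions of the example):

	struct kvm_userspace_memory_region2 region = {
		.slot              = 0,
		.flags             = KVM_MEM_PRIVATE,
		.guest_phys_addr   = 0,
		.memory_size       = GUEST_MEM_SIZE,
		.userspace_addr    = (__u64)shared_mem,	/* shared half */
		.restricted_fd     = restricted_fd,	/* private half */
		.restricted_offset = 0,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
		err(1, "KVM_SET_USER_MEMORY_REGION2");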
Regarding the userspace side of things, please include Vishal's selftests in v11,
it's impossible to properly review the uAPI changes without seeing the userspace
side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
massage it into a set of patches that you can incorporate into your series.
[*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2023-01-09 19:32 ` Sean Christopherson
@ 2023-01-10 9:14 ` Chao Peng
2023-01-10 22:51 ` Vishal Annapurve
` (2 more replies)
2023-01-20 23:28 ` Jarkko Sakkinen
1 sibling, 3 replies; 153+ messages in thread
From: Chao Peng @ 2023-01-10 9:14 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > >
> > > Feels bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > >
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> >
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar. The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
>
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> itself.
How is the size validation bogus? I don't quite follow. Then we will use
kvm_userspace_memory_region2 as the KVM internal alias, right?
I see similar examples that use different functions to handle different
versions, but it does look easier if we use an alias for this function.
>
> It feels absolutely ridiculous, but I think the best option is to do:
>
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> struct kvm_userspace_memory_region2)
Just curious, is 0x49 a safe number we can use?
>
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_user_mem_region2 {
> __u32 slot;
> __u32 flags;
> __u64 guest_phys_addr;
> __u64 memory_size;
> __u64 userspace_addr;
> __u64 restricted_offset;
> __u32 restricted_fd;
> __u32 pad1;
> __u64 pad2[14];
> }
>
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.
Okay, I agree that from the KVM userspace API perspective this is more
consistent with similar existing examples; I see several of them.
I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
ioctl.
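E.g. advertised from kvm_vm_ioctl_check_extension_generic(), roughly as below;
the capability name/number is just a placeholder, nothing is defined yet:

	case KVM_CAP_SET_USER_MEMORY_REGION2:
		return 1;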
>
> Regarding the userspace side of things, please include Vishal's selftests in v11,
> it's impossible to properly review the uAPI changes without seeing the userspace
> side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.
Previously I included Vishal's selftests in the github repo, but did not
include them in this patch series. It's OK for me to incorporate them
directly into this series so they can be reviewed together, if Vishal is
fine with that.
Chao
>
> [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2023-01-10 9:14 ` Chao Peng
@ 2023-01-10 22:51 ` Vishal Annapurve
2023-01-13 22:37 ` Sean Christopherson
2023-01-20 23:42 ` Jarkko Sakkinen
2 siblings, 0 replies; 153+ messages in thread
From: Vishal Annapurve @ 2023-01-10 22:51 UTC (permalink / raw)
To: Chao Peng
Cc: Sean Christopherson, Jarkko Sakkinen, kvm, linux-kernel,
linux-mm, linux-fsdevel, linux-arch, linux-api, linux-doc,
qemu-devel, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Arnd Bergmann, Naoya Horiguchi,
Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
J . Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
Steven Price, Maciej S . Szmigiero, Vlastimil Babka, Yu Zhang,
Kirill A . Shutemov, luto, jun.nakajima, dave.hansen, ak, david,
aarcange, ddutile, dhildenb, Quentin Perret, tabba, Michael Roth,
mhocko, wei.w.wang
On Tue, Jan 10, 2023 at 1:19 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Regarding the userspace side of things, please include Vishal's selftests in v11,
> > it's impossible to properly review the uAPI changes without seeing the userspace
> > side of things. I'm in the process of reviewing Vishal's v2[*], I'll try to
> > massage it into a set of patches that you can incorporate into your series.
>
> Previously I included Vishal's selftests in the github repo, but not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review together if Vishal is fine.
>
Yeah, I am ok with incorporating selftest patches into this series and
reviewing them together.
Regards,
Vishal
> Chao
> >
> > [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapurve@google.com
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2022-12-02 6:13 ` [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
2022-12-06 14:57 ` Fuad Tabba
2022-12-13 23:49 ` Huang, Kai
@ 2023-01-13 21:54 ` Sean Christopherson
2023-01-17 12:41 ` Chao Peng
2023-02-22 2:07 ` Alexey Kardashevskiy
2023-01-30 5:26 ` Ackerley Tng
2023-02-16 9:51 ` Nikunj A. Dadhania
4 siblings, 2 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 21:54 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> The system call is currently wired up for x86 arch.
Building on other architectures (except for arm64 for some reason) yields:
CALL /.../scripts/checksyscalls.sh
<stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp]
Do we care? It's the only such warning, which makes me think we either need to
wire this up for all architectures, or explicitly document that it's unsupported.
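For reference, wiring it up generically would look roughly like the below;
the syscall number is a placeholder, not necessarily what this series should use:

  /* include/uapi/asm-generic/unistd.h (sketch) */
  #define __NR_memfd_restricted 451
  __SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)

  /* plus a "451  common  memfd_restricted  sys_memfd_restricted" entry in
   * each architecture's syscall*.tbl. */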
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
...
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..c2700c5daa43
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
Missing
#define _LINUX_RESTRICTEDMEM_H
which causes fireworks if restrictedmem.h is included more than once.
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
...
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> + struct page **pagep, int *order)
> +{
> + return -1;
This should be a proper -errno, though in the current incarnation of things it's
a moot point because no stub is needed. KVM can (and should) easily provide its
own stub for this one.
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> + return false;
> +}
> +
> +static inline void restrictedmem_error_page(struct page *page,
> + struct address_space *mapping)
> +{
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
...
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..56953c204e5c
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,318 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
Any objection to simply calling this "restrictedmem"? And then using either "rm"
or "rmem" for local variable names? I kept reading "data" as the underlying data
being written to the page, as opposed to the metadata describing the restrictedmem
instance.
> + struct mutex lock;
> + struct file *memfd;
> + struct list_head notifiers;
> +};
> +
> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> + pgoff_t start, pgoff_t end)
> +{
> + struct restrictedmem_notifier *notifier;
> +
> + mutex_lock(&data->lock);
This can be a r/w semaphore instead of a mutex, that way punching holes at multiple
points in the file can at least run the notifiers in parallel. The actual allocation
by shmem will still be serialized, but I think it's worth the simple optimization
since zapping and flushing in KVM may be somewhat slow.
> + list_for_each_entry(notifier, &data->notifiers, list) {
> + notifier->ops->invalidate_start(notifier, start, end);
Two major design issues that we overlooked long ago:
1. Blindly invoking notifiers will not scale. E.g. if userspace configures a
VM with a large number of convertible memslots that are all backed by a
single large restrictedmem instance, then converting a single page will
result in a linear walk through all memslots. I don't expect anyone to
actually do something silly like that, but I also never expected there to be
a legitimate usecase for thousands of memslots.
2. This approach fails to provide the ability for KVM to ensure a guest has
exclusive access to a page. As discussed in the past, the kernel can rely
on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
only for SNP and TDX VMs. For VMs where userspace is trusted to some extent,
e.g. SEV, there is value in ensuring a 1:1 association.
And probably more importantly, relying on hardware for SNP and TDX yields a
poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a
page is exclusive to a guest, i.e. if userspace can hand out the same page
from a restrictedmem instance to multiple VMs, then failure will occur only
when KVM tries to assign the page to the second VM. That will happen deep
in KVM, which means KVM needs to gracefully handle such errors, and it means
that KVM's ABI effectively allows plumbing garbage into its memslots.
Rather than use a simple list of notifiers, this appears to be yet another
opportunity to use an xarray. Supporting sharing of restrictedmem will be
non-trivial, but IMO we should punt that to the future since it's still unclear
exactly how sharing will work.
An xarray will solve #1 by notifying only the consumers (memslots) that are bound
to the affected range.
And for #2, it's relatively straightforward (knock wood) to detect existing
entries, i.e. if the user wants exclusive access to memory, then the bind operation
can be rejected if there's an existing entry.
VERY lightly tested code snippet at the bottom (will provide link to fully worked
code in cover letter).
> +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> + loff_t offset, loff_t len)
> +{
> + int ret;
> + pgoff_t start, end;
> + struct file *memfd = data->memfd;
> +
> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> + return -EINVAL;
> +
> + start = offset >> PAGE_SHIFT;
> + end = (offset + len) >> PAGE_SHIFT;
> +
> + restrictedmem_invalidate_start(data, start, end);
> + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> + restrictedmem_invalidate_end(data, start, end);
The lock needs to be held for the entire duration of the hole punch, i.e. needs to
be taken before invalidate_start() and released after invalidate_end(). If a user
(un)binds/(un)registers after invalidate_start(), it will see an unpaired notification,
e.g. could leave KVM with incorrect notifier counts.
> +
> + return ret;
> +}
What I ended up with for an xarray-based implementation. I'm very flexible on
names and whatnot, these are just what made sense to me.
static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
				     loff_t offset, loff_t len)
{
	struct restrictedmem_notifier *notifier;
	struct file *memfd = rm->memfd;
	unsigned long index;
	pgoff_t start, end;
	int ret;

	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
		return -EINVAL;

	start = offset >> PAGE_SHIFT;
	end = (offset + len) >> PAGE_SHIFT;

	/*
	 * Bindings must be stable across invalidation to ensure the start+end
	 * are balanced.
	 */
	down_read(&rm->lock);

	xa_for_each_range(&rm->bindings, index, notifier, start, end)
		notifier->ops->invalidate_start(notifier, start, end);

	ret = memfd->f_op->fallocate(memfd, mode, offset, len);

	xa_for_each_range(&rm->bindings, index, notifier, start, end)
		notifier->ops->invalidate_end(notifier, start, end);

	up_read(&rm->lock);

	return ret;
}

int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
		       struct restrictedmem_notifier *notifier, bool exclusive)
{
	struct restrictedmem *rm = file->f_mapping->private_data;
	int ret = -EINVAL;

	down_write(&rm->lock);

	/* Non-exclusive mappings are not yet implemented. */
	if (!exclusive)
		goto out_unlock;

	if (!xa_empty(&rm->bindings)) {
		if (exclusive != rm->exclusive)
			goto out_unlock;

		if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
			goto out_unlock;
	}

	xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
	rm->exclusive = exclusive;
	ret = 0;
out_unlock:
	up_write(&rm->lock);
	return ret;
}
EXPORT_SYMBOL_GPL(restrictedmem_bind);

void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
			  struct restrictedmem_notifier *notifier)
{
	struct restrictedmem *rm = file->f_mapping->private_data;

	down_write(&rm->lock);
	xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
	synchronize_rcu();
	up_write(&rm->lock);
}
EXPORT_SYMBOL_GPL(restrictedmem_unbind);
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
` (3 preceding siblings ...)
2022-12-28 8:28 ` Chenyi Qiang
@ 2023-01-13 22:02 ` Sean Christopherson
2023-01-17 3:21 ` Binbin Wu
2023-02-09 7:25 ` Isaku Yamahata
6 siblings, 0 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 22:02 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
> select SRCU
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> + select HAVE_KVM_MEMORY_ATTRIBUTES
I would prefer to call this KVM_GENERIC_MEMORY_ATTRIBUTES. Similar to
KVM_GENERIC_HARDWARE_ENABLING, ARM does need/have hardware enabling, it just
doesn't want KVM's generic implementation. In this case, pKVM does support memory
attributes, but uses stage-2 tables to track ownership and doesn't need/want the
overhead of the generic implementation.
> help
...
> +#define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
I think we should carve out bits 0-2 for RWX, but I don't think we should actually
define them until they're actually accepted by KVM.
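For illustration only, the uapi header could then carry just the PRIVATE bit plus
a comment reserving the low bits (sketch, not the actual patch):

        /* Bits 0-2 are reserved for KVM_MEMORY_ATTRIBUTE_{READ,WRITE,EXECUTE}. */
        #define KVM_MEMORY_ATTRIBUTE_PRIVATE	(1ULL << 3)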
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> + struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)
Nit, no need for "supported_attrs", just consume kvm_supported_mem_attributes()
directly.
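E.g. (sketch):

        if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
                return -EINVAL;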
> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;
> +
> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> + mutex_lock(&kvm->lock);
Peeking forward multiple patches, this needs to take kvm->slots_lock, not kvm->lock.
There's a bug in the lpage_disallowed patch that I believe can most easily be
solved by making this mutually exclusive with memslot changes.
When a memslot is created, KVM needs to walk through the attributes to detect
whether or not the attributes are identical for the entire slot. To avoid races,
that means taking slots_lock.
The alternative would be to query the attributes when adjusting the hugepage level
and avoid lpage_disallowed entirely, but in the (very brief) time I've thought
about this I haven't come up with a way to do that in a performant manner.
> + for (i = start; i < end; i++)
Curly braces needed on the for-loop.
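Putting the slots_lock comment and the braces nit together, a rough, untested
sketch of the update loop (names as in this patch):

        mutex_lock(&kvm->slots_lock);
        for (i = start; i < end; i++) {
                if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
                                    GFP_KERNEL_ACCOUNT)))
                        break;
        }
        mutex_unlock(&kvm->slots_lock);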
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2023-01-10 9:14 ` Chao Peng
2023-01-10 22:51 ` Vishal Annapurve
@ 2023-01-13 22:37 ` Sean Christopherson
2023-01-17 12:42 ` Chao Peng
2023-01-20 23:42 ` Jarkko Sakkinen
2 siblings, 1 reply; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 22:37 UTC (permalink / raw)
To: Chao Peng
Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Jan 10, 2023, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > >
> > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > >
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > >
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar. The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> >
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > itself.
>
> How is the size validation being bogus? I don't quite follow.
The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
in this case) in the ioctl() number, and that information is visible to userspace
via _IOC_SIZE(). Passing a larger struct can mess up sanity checks,
e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
passed an "extended" struct.
#define kvm_do_ioctl(fd, cmd, arg) \
({ \
kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd)); \
ioctl(fd, cmd, arg); \
})
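For reference, a trivial userspace snippet (illustrative only, not from the
selftests) shows the payload size that's baked into the ioctl number:

        #include <stdio.h>
        #include <linux/kvm.h>

        int main(void)
        {
                /* _IOC_SIZE() recovers the payload size encoded in the ioctl number. */
                printf("_IOC_SIZE(KVM_SET_USER_MEMORY_REGION) = %u\n",
                       (unsigned int)_IOC_SIZE(KVM_SET_USER_MEMORY_REGION));
                printf("sizeof(struct kvm_userspace_memory_region) = %zu\n",
                       sizeof(struct kvm_userspace_memory_region));
                return 0;
        }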
> Then we will use kvm_userspace_memory_region2 as the KVM internal alias,
> right?
Yep.
> I see similar examples use different functions to handle different versions
> but it does look easier if we use alias for this function.
>
> >
> > It feels absolutely ridiculous, but I think the best option is to do:
> >
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> > struct kvm_userspace_memory_region2)
>
> Just interesting, is 0x49 a safe number we can use?
Yes? So long as it's not used by KVM, it's safe. AFAICT, it's unused.
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes
2022-12-02 6:13 ` [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes Chao Peng
` (2 preceding siblings ...)
2022-12-13 23:51 ` Huang, Kai
@ 2023-01-13 22:50 ` Sean Christopherson
3 siblings, 0 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 22:50 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -785,11 +786,12 @@ struct kvm {
>
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> struct mmu_notifier mmu_notifier;
> +#endif
> unsigned long mmu_invalidate_seq;
> long mmu_invalidate_in_progress;
> gfn_t mmu_invalidate_range_start;
> gfn_t mmu_invalidate_range_end;
> -#endif
Blech. The existing code is a bit ugly, and trying to extend it for this use case
makes things even worse.
Rather than use the base MMU_NOTIFIER Kconfig and an arbitrary define, I think we
should first add a proper Kconfig, e.g. KVM_GENERIC_MMU_NOTIFIER, to replace the
combination. E.g.
config KVM_GENERIC_MMU_NOTIFIER
        select MMU_NOTIFIER
        bool
and then all architectures that currently #define KVM_ARCH_WANT_MMU_NOTIFIER can
simply select the Kconfig, which is everything except s390. "GENERIC" again because
s390 does select MMU_NOTIFIER and actually registers its own notifier for s390's
version of protected VMs (at least, I think that's what its "pv" stands for).
And then later down the line in this series, when the attributes and private mem
needs to tie into the notifiers, we can do:
config KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_GENERIC_MMU_NOTIFIER
        bool
I.e. that way this patch doesn't need to partially expose KVM's notifier stuff
and can instead just keep the soon-to-be-existing KVM_GENERIC_MMU_NOTIFIER.
Taking a dependency on KVM_GENERIC_MMU_NOTIFIER for KVM_GENERIC_MEMORY_ATTRIBUTES
makes sense, because AFAICT, changing any type of attribute, e.g. RWX bits, is
going to necessitate unmapping the affected gfn range.
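For illustration, with that Kconfig in place the invalidation fields can stay
under a single guard (sketch, not the actual diff):

#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
        struct mmu_notifier mmu_notifier;
        unsigned long mmu_invalidate_seq;
        long mmu_invalidate_in_progress;
        gfn_t mmu_invalidate_range_start;
        gfn_t mmu_invalidate_range_end;
#endif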
> struct list_head devices;
> u64 manual_dirty_log_protect;
> struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
> int kvm_arch_post_init_vm(struct kvm *kvm);
> void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);
The reference to private memory belongs in a later patch. More below.
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;
The return from kvm_unmap_gfn_range() is a bool, this should be:
bool flush = false;
> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> struct kvm_memory_attributes *attrs)
> {
> gfn_t start, end;
> unsigned long i;
> void *entry;
> + int idx;
> u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */
Kind of a spurious change.
> if (attrs->flags)
> return -EINVAL;
> if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>
> + if (kvm_arch_has_private_mem(kvm)) {
I think we should assume that any future attributes will necessitate unmapping
and invalidation, i.e. drop the private mem check. That allows introducing
kvm_arch_has_private_mem() in a later patch that is more directly related to
private memory.
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, start, end);
> + KVM_MMU_UNLOCK(kvm);
> + }
> +
> mutex_lock(&kvm->lock);
> for (i = start; i < end; i++)
> if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> break;
> mutex_unlock(&kvm->lock);
>
> + if (kvm_arch_has_private_mem(kvm)) {
> + idx = srcu_read_lock(&kvm->srcu);
Mostly for reference, this goes away if slots_lock is used instead of kvm->lock.
> + KVM_MMU_LOCK(kvm);
> + if (i > start)
> + kvm_unmap_mem_range(kvm, start, i);
> + kvm_mmu_invalidate_end(kvm);
> + KVM_MMU_UNLOCK(kvm);
> + srcu_read_unlock(&kvm->srcu, idx);
> + }
> +
> attrs->address = i << PAGE_SHIFT;
> attrs->size = (end - i) << PAGE_SHIFT;
>
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
2022-12-02 6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
2022-12-05 22:49 ` Isaku Yamahata
@ 2023-01-13 23:12 ` Sean Christopherson
2023-01-13 23:16 ` Sean Christopherson
2 siblings, 0 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:12 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 283cbb83d6ae..7772ab37ac89 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -38,6 +38,7 @@
> #include <asm/hyperv-tlfs.h>
>
> #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
No need for this, I think we should just make it mandatory to implement the
arch hook when CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES=y. If another arch gains
support for mem attributes and doesn't need the hook, then we can simply add a
weak helper (or maybe add a #define then if we feel that's the way to go).
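E.g. such a weak stub (purely illustrative) would look like:

        void __weak kvm_arch_set_memory_attributes(struct kvm *kvm,
                                                   struct kvm_memory_slot *slot,
                                                   unsigned long attrs,
                                                   gfn_t start, gfn_t end)
        {
        }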
> #define KVM_MAX_VCPUS 1024
>
> @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> #endif
> };
>
> +/*
> + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> + * level. The remaining bits are used as a reference count.
> + */
> +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
Similar to the need to unmap, I think we should just say "mixed" and ignore the
private vs. shared, i.e. make this a flag for all memory attributes.
> +#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
"MAX" is technically correct, but it's more of a mask. I think we can make it a
moot point though. There's no need to mask the count, we just want to assert that
adjusting the count doesn't change the flag.
I would also say throw these defines into mmu.c, at least pending the bug fix
for kvm_alloc_memslot_metadata() (more on that below).
> struct kvm_lpage_info {
> int disallow_lpage;
> };
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2c70b5afa3e..2190fd8c95c0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> {
> struct kvm_lpage_info *linfo;
> int i;
> + int disallow_count;
>
> for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> linfo = lpage_info_slot(gfn, slot, i);
> +
> + disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> + WARN_ON(disallow_count + count < 0 ||
> + disallow_count > KVM_LPAGE_COUNT_MAX - count);
> +
> linfo->disallow_lpage += count;
> - WARN_ON(linfo->disallow_lpage < 0);
It's been a long week so don't trust my math, but I believe this can simply be:
old = linfo->disallow_lpage;
linfo->disallow_lpage += count;
WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> }
> }
>
> @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> if (kvm->arch.nx_huge_page_recovery_thread)
> kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> }
> +
> +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> +{
> + return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> + int level, bool mixed)
> +{
> + struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> +
> + if (mixed)
> + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> + else
> + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> +{
> + bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +
> + if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> + if (!expect_private)
> + return false;
> + } else if (expect_private)
> + return false;
This is messy. If we drop the private vs. shared specificity, this can go away if
we add a helper to get attributes
static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
{
        return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
}
and then we can do
        if (KVM_BUG_ON(gfn != xas.xa_index, kvm) ||
            attrs != kvm_get_memory_attributes(kvm, gfn)) {
                mixed = true;
                break;
        }

and

        if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
            attrs != kvm_get_memory_attributes(kvm, gfn))
                return true;
> +
> + return true;
> +}
> +
> +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + XA_STATE(xas, &kvm->mem_attr_array, start);
> + gfn_t gfn = start;
> + void *entry;
> + bool mixed = false;
> +
> + rcu_read_lock();
> + entry = xas_load(&xas);
> + while (gfn < end) {
> + if (xas_retry(&xas, entry))
> + continue;
> +
> + KVM_BUG_ON(gfn != xas.xa_index, kvm);
As above, I think it's worth bailing immediately if there's a mismatch.
> +
> + if (!is_expected_attr_entry(entry, attrs)) {
> + mixed = true;
> + break;
> + }
> +
> + entry = xas_next(&xas);
> + gfn++;
> + }
> +
> + rcu_read_unlock();
> + return mixed;
> +}
> +
> +static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
s/mem_attrs_mixed/has_mixed_attrs to make it clear this is querying, not setting.
And has_mixed_attrs_2m() above.
> + int level, unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + unsigned long gfn;
> +
> + if (level == PG_LEVEL_2M)
> + return mem_attrs_mixed_2m(kvm, attrs, start, end);
> +
> + for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
Curly braces needed on the for-loop.
> + if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
> + !is_expected_attr_entry(xa_load(&kvm->mem_attr_array, gfn),
> + attrs))
> + return true;
> + return false;
> +}
> +
> +static void kvm_update_lpage_private_shared_mixed(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + unsigned long pages, mask;
> + gfn_t gfn, gfn_end, first, last;
> + int level;
> + bool mixed;
> +
> + /*
> + * The sequence matters here: we set the higher level basing on the
> + * lower level's scanning result.
> + */
> + for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> + pages = KVM_PAGES_PER_HPAGE(level);
> + mask = ~(pages - 1);
> + first = start & mask;
> + last = (end - 1) & mask;
> +
> + /*
> + * We only need to scan the head and tail page, for middle pages
> + * we know they will not be mixed.
> + */
> + gfn = max(first, slot->base_gfn);
> + gfn_end = min(first + pages, slot->base_gfn + slot->npages);
> + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> + linfo_set_mixed(gfn, slot, level, mixed);
> +
> + if (first == last)
> + return;
> +
> + for (gfn = first + pages; gfn < last; gfn += pages)
> + linfo_set_mixed(gfn, slot, level, false);
> +
> + gfn = last;
> + gfn_end = min(last + pages, slot->base_gfn + slot->npages);
> + mixed = mem_attrs_mixed(kvm, slot, level, attrs, gfn, gfn_end);
> + linfo_set_mixed(gfn, slot, level, mixed);
> + }
> +}
> +
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + if (kvm_slot_can_be_private(slot))
Make this an early return optimization, with a comment explaining that KVM x86
doesn't yet support other attributes.
/*
* KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
* the slot if the slot will never consume the PRIVATE attribute.
*/
if (!kvm_slot_can_be_private(slot))
return;
> + kvm_update_lpage_private_shared_mixed(kvm, slot, attrs,
> + start, end);
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a07380f8d3c..5aefcff614d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> linfo[lpages - 1].disallow_lpage = 1;
> ugfn = slot->userspace_addr >> PAGE_SHIFT;
> + if (kvm_slot_can_be_private(slot))
> + ugfn |= slot->restricted_offset >> PAGE_SHIFT;
I would rather reject the memslot if the gfn has lesser alignment than the offset.
I'm totally ok with this approach _if_ there's a use case. Until such a use case
presents itself, I would rather be conservative from a uAPI perspective.
> /*
> * If the gfn and userspace address are not aligned wrt each
> * other, disable large page support for this slot.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3331c0c92838..25099c94e770 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -592,6 +592,11 @@ struct kvm_memory_slot {
> struct restrictedmem_notifier notifier;
> };
>
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> + return slot && (slot->flags & KVM_MEM_PRIVATE);
KVM_MEM_PRIVATE should really be defined only when private memory is exposed to
userspace. For this patch, even though it means we have untestable code, I think
it makes sense to "return false".
> +}
> +
> static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> {
> return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -2316,4 +2321,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +#ifdef __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> +void kvm_arch_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end);
> +#else
> +static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> +}
> +#endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
As above, no stub is necessary.
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4e1e1e113bf0..e107afea32f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2354,7 +2354,8 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> -static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
Feedback for an earlier patch (to avoid churn): this should be kvm_mem_attrs_changed()
or so now that this does more than just unmap.
> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
> + unsigned long attrs)
Weird nit. I think we should keep the prototypes for kvm_mem_attrs_changed()
and kvm_arch_set_memory_attributes() somewhat similar, i.e. squeeze in @attrs
before @start.
> {
> struct kvm_gfn_range gfn_range;
> struct kvm_memory_slot *slot;
> @@ -2378,6 +2379,10 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> gfn_range.slot = slot;
>
> r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> +
> + kvm_arch_set_memory_attributes(kvm, slot, attrs,
> + gfn_range.start,
> + gfn_range.end);
> }
> }
>
> @@ -2427,7 +2432,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> idx = srcu_read_lock(&kvm->srcu);
> KVM_MMU_LOCK(kvm);
> if (i > start)
> - kvm_unmap_mem_range(kvm, start, i);
> + kvm_unmap_mem_range(kvm, start, i, attrs->attributes);
> kvm_mmu_invalidate_end(kvm);
> KVM_MMU_UNLOCK(kvm);
> srcu_read_unlock(&kvm->srcu, idx);
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit
2022-12-02 6:13 ` [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
2022-12-06 15:47 ` Fuad Tabba
@ 2023-01-13 23:13 ` Sean Christopherson
1 sibling, 0 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:13 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 99352170c130..d9edb14ce30b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6634,6 +6634,28 @@ array field represents return values. The userspace should update the return
> values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
Unless there's a reason not to, we should use bit 3 to match the attributes.
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed
2022-12-02 6:13 ` [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed Chao Peng
2022-12-05 22:49 ` Isaku Yamahata
2023-01-13 23:12 ` Sean Christopherson
@ 2023-01-13 23:16 ` Sean Christopherson
2023-01-28 13:54 ` Chao Peng
2 siblings, 1 reply; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:16 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a07380f8d3c..5aefcff614d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
> if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> linfo[lpages - 1].disallow_lpage = 1;
> ugfn = slot->userspace_addr >> PAGE_SHIFT;
> + if (kvm_slot_can_be_private(slot))
> + ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> /*
> * If the gfn and userspace address are not aligned wrt each
> * other, disable large page support for this slot.
Forgot to talk about the bug. This code needs to handle the scenario where a
memslot is created with existing, non-uniform attributes. It might be a bit ugly
(I didn't even try to write the code), but it's definitely possible, and since
memslot updates are already slow I think it's best to handle things here.
In the meantime, I added this so we don't forget to fix it before merging.
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
pr_crit_once("FIXME: Walk the memory attributes of the slot and set the mixed status appropriately");
#endif
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 8/9] KVM: Handle page fault for private memory
2022-12-02 6:13 ` [PATCH v10 8/9] KVM: Handle page fault for private memory Chao Peng
2022-12-08 2:29 ` Yuan Yao
2022-12-09 9:01 ` Fuad Tabba
@ 2023-01-13 23:29 ` Sean Christopherson
2 siblings, 0 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-13 23:29 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> return -EIO;
> }
>
> + if (r == RET_PF_USER)
> + return 0;
> +
> if (r < 0)
> return r;
> if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> */
> if (sp->role.direct &&
> sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
> - PG_LEVEL_NUM)) {
> + PG_LEVEL_NUM,
> + false)) {
Passing %false is incorrect. It might not cause problems because KVM currently
doesn't allow modifying private memslots (that likely needs to change to allow
dirty logging), but it's wrong since nothing guarantees KVM is operating on SPTEs
for shared memory.
One option would be to take the patches from the TDX series that add a "private"
flag to the shadow page role, but I'd rather not add the role until it's truly
necessary.
For now, I think we can do this without impacting performance of guests that don't
support private memory.
int kvm_mmu_max_mapping_level(struct kvm *kvm,
                              const struct kvm_memory_slot *slot, gfn_t gfn,
                              int max_level)
{
        bool is_private = kvm_slot_can_be_private(slot) &&
                          kvm_mem_is_private(kvm, gfn);

        return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
}
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void kvm_arch_set_memory_attributes(struct kvm *kvm,
> }
> #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
This code, i.e. the generic KVM changes, belongs in a separate patch. It'll be
small, but I want to separate x86's page fault changes from the restrictedmem
support being added to common KVM.
This should also short-circuit based on CONFIG_HAVE_KVM_RESTRICTED_MEM, though
I would name that CONFIG_KVM_PRIVATE_MEMORY since in KVM's world, it's all about
private vs. shared at this time.
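E.g. a rough sketch, keeping this under the existing attributes #ifdef and
assuming the CONFIG_KVM_PRIVATE_MEMORY name floated above:

        static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
        {
                /* Compile-time false for builds without private memory support. */
                return IS_ENABLED(CONFIG_KVM_PRIVATE_MEMORY) &&
                       xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
                       KVM_MEMORY_ATTRIBUTE_PRIVATE;
        }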
> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> + KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + int ret;
> + struct page *page;
> + pgoff_t index = gfn - slot->base_gfn +
> + (slot->restricted_offset >> PAGE_SHIFT);
> +
> + ret = restrictedmem_get_page(slot->restricted_file, index,
> + &page, order);
This needs to handle errors. If "ret" is non-zero, "page" is garbage.
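E.g. something like (sketch):

        ret = restrictedmem_get_page(slot->restricted_file, index, &page, order);
        if (ret)
                return ret;

        *pfn = page_to_pfn(page);
        return 0;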
> + *pfn = page_to_pfn(page);
> + return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
> #endif
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2022-12-02 6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-12-09 9:11 ` Fuad Tabba
2023-01-05 20:38 ` Vishal Annapurve
@ 2023-01-14 0:01 ` Sean Christopherson
2023-01-17 13:12 ` Chao Peng
2023-01-28 14:00 ` Chao Peng
2023-03-07 19:14 ` Ackerley Tng
3 siblings, 2 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-14 0:01 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
> if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +
> + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
Synthesizing triple fault shutdown is not the right approach. Even with TDX's
MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
guest have a paravirt interface for handling memory errors without killing the
host.
> + r = 0;
> + goto out;
> + }
> }
> @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> mem->memory_size))
> return -EINVAL;
> + if (mem->flags & KVM_MEM_PRIVATE &&
> + (mem->restricted_offset & (PAGE_SIZE - 1) ||
Align indentation.
> + mem->restricted_offset > U64_MAX - mem->memory_size))
Strongly prefer to use similar logic to existing code that detects wraps:
mem->restricted_offset + mem->memory_size < mem->restricted_offset
This is also where I'd like to add the "gfn is aligned to offset" check, though
my brain is too fried to figure that out right now.
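I.e. something like (sketch, still missing the alignment check):

        if (mem->flags & KVM_MEM_PRIVATE &&
            (mem->restricted_offset & (PAGE_SIZE - 1) ||
             mem->restricted_offset + mem->memory_size < mem->restricted_offset))
                return -EINVAL;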
> + return -EINVAL;
> if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> return -EINVAL;
> if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> return -EINVAL;
> } else { /* Modify an existing slot. */
> + /* Private memslots are immutable, they can only be deleted. */
I'm 99% certain I suggested this, but if we're going to make these memslots
immutable, then we should straight up disallow dirty logging, otherwise we'll
end up with a bizarre uAPI.
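E.g. a sketch of flat-out rejecting the combination (illustrative only):

        if (mem->flags & KVM_MEM_PRIVATE &&
            mem->flags & KVM_MEM_LOG_DIRTY_PAGES)
                return -EINVAL;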
> + if (mem->flags & KVM_MEM_PRIVATE)
> + return -EINVAL;
> if ((mem->userspace_addr != old->userspace_addr) ||
> (npages != old->npages) ||
> ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> new->npages = npages;
> new->flags = mem->flags;
> new->userspace_addr = mem->userspace_addr;
> + if (mem->flags & KVM_MEM_PRIVATE) {
> + new->restricted_file = fget(mem->restricted_fd);
> + if (!new->restricted_file ||
> + !file_is_restrictedmem(new->restricted_file)) {
> + r = -EINVAL;
> + goto out;
> + }
> + new->restricted_offset = mem->restricted_offset;
> + }
> +
> + new->kvm = kvm;
Set this above, just so that the code flows better.
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
2022-12-02 6:13 [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Chao Peng
` (8 preceding siblings ...)
2022-12-02 6:13 ` [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
@ 2023-01-14 0:37 ` Sean Christopherson
2023-01-16 13:48 ` Kirill A. Shutemov
` (4 more replies)
2023-02-16 5:13 ` Mike Rapoport
10 siblings, 5 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-14 0:37 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Dec 02, 2022, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
>
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
> - 01: mm change, target for mm tree
> - 02-09: KVM change, target for KVM tree
A version with all of my feedback, plus reworked versions of Vishal's selftest,
is available here:
git@github.com:sean-jc/linux.git x86/upm_base_support
It compiles and passes the selftest, but it's otherwise barely tested. There are
a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
a WIP.
As for next steps, can you (handwaving all of the TDX folks) take a look at what
I pushed and see if there's anything horrifically broken, and that it still works
for TDX?
Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
(and I mean that).
On my side, the two things on my mind are (a) tests and (b) downstream dependencies
(SEV and TDX). For tests, I want to build a lists of tests that are required for
merging so that the criteria for merging are clear, and so that if the list is large
(haven't thought much yet), the work of writing and running tests can be distributed.
Regarding downstream dependencies, before this lands, I want to pull in all the
TDX and SNP series and see how everything fits together. Specifically, I want to
make sure that we don't end up with a uAPI that necessitates ugly code, and that we
don't miss an opportunity to make things simpler. The patches in the SNP series to
add "legacy" SEV support for UPM in particular made me slightly rethink some minor
details. Nothing remotely major, but something that needs attention since it'll
be uAPI.
I'm off Monday, so it'll be at least Tuesday before I make any more progress on
my side.
Thanks!
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
2023-01-14 0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
@ 2023-01-16 13:48 ` Kirill A. Shutemov
2023-01-17 13:19 ` Chao Peng
` (3 subsequent siblings)
4 siblings, 0 replies; 153+ messages in thread
From: Kirill A. Shutemov @ 2023-01-16 13:48 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
Minor build fix:
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6eb5336ccc65..4a9e9fa2552a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7211,8 +7211,8 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
int level;
bool mixed;
- lockdep_assert_held_write(kvm->mmu_lock);
- lockdep_assert_held(kvm->slots_lock);
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ lockdep_assert_held(&kvm->slots_lock);
/*
* KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 467916943c73..4ef60ba7eb1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2304,7 +2304,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
{
- lockdep_assert_held(kvm->mmu_lock);
+ lockdep_assert_held(&kvm->mmu_lock);
return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
}
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply related [flat|nested] 153+ messages in thread
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2022-12-02 6:13 ` [PATCH v10 2/9] KVM: Introduce per-page memory attributes Chao Peng
` (4 preceding siblings ...)
2023-01-13 22:02 ` Sean Christopherson
@ 2023-01-17 3:21 ` Binbin Wu
2023-01-17 13:30 ` Chao Peng
2023-02-09 7:25 ` Isaku Yamahata
6 siblings, 1 reply; 153+ messages in thread
From: Binbin Wu @ 2023-01-17 3:21 UTC (permalink / raw)
To: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On 12/2/2022 2:13 PM, Chao Peng wrote:
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
> - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> a guest memory range.
> - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
> Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> arch/x86/kvm/Kconfig | 1 +
> include/linux/kvm_host.h | 3 ++
> include/uapi/linux/kvm.h | 17 ++++++++
Do the changes introduced in this file also need to be added to
tools/include/uapi/linux/kvm.h?
> virt/kvm/Kconfig | 3 ++
> virt/kvm/kvm_main.c | 76 ++++++++++++++++++++++++++++++++++
> 6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> The "pad" and "reserved" fields may be used for future extensions and should be
> set to 0s by userspace.
>
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> + #define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> + #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> + #define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> + struct kvm_memory_attributes {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
> 5. The kvm_run structure
> ========================
>
> @@ -8270,6 +8323,16 @@ structure.
> When getting the Modified Change Topology Report value, the attr->addr
> must point to a byte where the value will be stored or retrieved from.
>
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
> 9. Known KVM API problems
> =========================
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
> select SRCU
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> + select HAVE_KVM_MEMORY_ATTRIBUTES
> help
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions. You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>
> #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + struct xarray mem_attr_array;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_S390_CPU_TOPOLOGY 222
> #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
> /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES _IOR(KVMIO, 0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES _IOWR(KVMIO, 0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
> config HAVE_KVM_DIRTY_RING
> bool
>
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> + bool
> +
> # Only strongly ordered architectures can select this, as it doesn't
> # put any explicit constraint on userspace ordering. They can also
> # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_init(&kvm->mem_attr_array);
> +#endif
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + xa_destroy(&kvm->mem_attr_array);
> +#endif
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> + return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> + struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)
> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;
> +
> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> + mutex_lock(&kvm->lock);
> + for (i = start; i < end; i++)
> + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT)))
> + break;
> + mutex_unlock(&kvm->lock);
> +
> + attrs->address = i << PAGE_SHIFT;
> + attrs->size = (end - i) << PAGE_SHIFT;
> +
> + return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #ifdef CONFIG_HAVE_KVM_MSI
> case KVM_CAP_SIGNAL_MSI:
> #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
> #ifdef CONFIG_HAVE_KVM_IRQFD
> case KVM_CAP_IRQFD:
> case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> break;
> }
> #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> + case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> + u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> + r = -EFAULT;
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + goto out;
> + r = 0;
> + break;
> + }
> + case KVM_SET_MEMORY_ATTRIBUTES: {
> + struct kvm_memory_attributes attrs;
> +
> + r = -EFAULT;
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> + if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> + r = -EFAULT;
> + break;
> + }
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> case KVM_CREATE_DEVICE: {
> struct kvm_create_device cd;
>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2023-01-13 21:54 ` Sean Christopherson
@ 2023-01-17 12:41 ` Chao Peng
2023-01-17 16:34 ` Sean Christopherson
2023-02-22 2:07 ` Alexey Kardashevskiy
1 sibling, 1 reply; 153+ messages in thread
From: Chao Peng @ 2023-01-17 12:41 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > The system call is currently wired up for x86 arch.
>
> Building on other architectures (except for arm64 for some reason) yields:
>
> CALL /.../scripts/checksyscalls.sh
> <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp]
>
> Do we care? It's the only such warning, which makes me think we either need to
> wire this up for all architectures, or explicitly document that it's unsupported.
I'm a bit conservative and prefer enabling it only on x86, where we know the
exact usecase. As for the warning, we can get rid of it by changing
scripts/checksyscalls.sh, just like __IGNORE_memfd_secret:
https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org
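A sketch of the kind of change, mirroring the memfd_secret precedent (the exact
placement in the script's __IGNORE_* list is an assumption, not taken from this
series):
	#define __IGNORE_memfd_restricted
i.e. teach scripts/checksyscalls.sh to stop warning about memfd_restricted on
architectures that don't wire it up.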
>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
>
> ...
>
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..c2700c5daa43
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
>
> Missing
>
> #define _LINUX_RESTRICTEDMEM_H
>
> which causes fireworks if restrictedmem.h is included more than once.
>
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
>
> ...
>
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > + struct page **pagep, int *order)
> > +{
> > + return -1;
>
> This should be a proper -errno, though in the current incarnation of things it's
> a moot point because no stub is needed. KVM can (and should) easily provide its
> own stub for this one.
>
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void restrictedmem_error_page(struct page *page,
> > + struct address_space *mapping)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
>
> ...
>
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..56953c204e5c
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,318 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
>
> Any objection to simply calling this "restrictedmem"? And then using either "rm"
> or "rmem" for local variable names? I kept reading "data" as the underyling data
> being written to the page, as opposed to the metadata describing the restrictedmem
> instance.
>
> > + struct mutex lock;
> > + struct file *memfd;
> > + struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + struct restrictedmem_notifier *notifier;
> > +
> > + mutex_lock(&data->lock);
>
> This can be an r/w semaphore instead of a mutex; that way punching holes at multiple
> points in the file can at least run the notifiers in parallel. The actual allocation
> by shmem will still be serialized, but I think it's worth the simple optimization
> since zapping and flushing in KVM may be somewhat slow.
>
> > + list_for_each_entry(notifier, &data->notifiers, list) {
> > + notifier->ops->invalidate_start(notifier, start, end);
>
> Two major design issues that we overlooked long ago:
>
> 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a
> VM with a large number of convertible memslots that are all backed by a
> single large restrictedmem instance, then converting a single page will
> result in a linear walk through all memslots. I don't expect anyone to
> actually do something silly like that, but I also never expected there to be
> a legitimate usecase for thousands of memslots.
>
> 2. This approach fails to provide the ability for KVM to ensure a guest has
> exclusive access to a page. As discussed in the past, the kernel can rely
> on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> only for SNP and TDX VMs. For VMs where userspace is trusted to some extent,
> e.g. SEV, there is value in ensuring a 1:1 association.
>
> And probably more importantly, relying on hardware for SNP and TDX yields a
> poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a
> page is exclusive to a guest, i.e. if userspace can hand out the same page
> from a restrictedmem instance to multiple VMs, then failure will occur only
> when KVM tries to assign the page to the second VM. That will happen deep
> in KVM, which means KVM needs to gracefully handle such errors, and it means
> that KVM's ABI effectively allows plumbing garbage into its memslots.
It may not be a valid usage, but in my TDX environment I do hit the issue
below.
kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
Slot#2 ('SMRAM') is actually an alias into system memory (Slot#0) in QEMU,
and slot#2 fails due to the exclusive check below.
Currently I changed the QEMU code to mark these alias slots as shared
instead of private, but I'm not 100% confident this is the correct fix.
>
> Rather than use a simple list of notifiers, this appears to be yet another
> opportunity to use an xarray. Supporting sharing of restrictedmem will be
> non-trivial, but IMO we should punt that to the future since it's still unclear
> exactly how sharing will work.
>
> An xarray will solve #1 by notifying only the consumers (memslots) that are bound
> to the affected range.
>
> And for #2, it's relatively straightforward (knock wood) to detect existing
> entries, i.e. if the user wants exclusive access to memory, then the bind operation
> can be rejected if there's an existing entry.
>
> VERY lightly tested code snippet at the bottom (will provide link to fully worked
> code in cover letter).
>
>
> > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> > + loff_t offset, loff_t len)
> > +{
> > + int ret;
> > + pgoff_t start, end;
> > + struct file *memfd = data->memfd;
> > +
> > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > + return -EINVAL;
> > +
> > + start = offset >> PAGE_SHIFT;
> > + end = (offset + len) >> PAGE_SHIFT;
> > +
> > + restrictedmem_invalidate_start(data, start, end);
> > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > + restrictedmem_invalidate_end(data, start, end);
>
> The lock needs to be held for the entire duration of the hole punch, i.e. needs to
> be taken before invalidate_start() and released after invalidate_end(). If a user
> (un)binds/(un)registers after invalidate_start(), it will see an unpaired notification,
> e.g. could leave KVM with incorrect notifier counts.
>
> > +
> > + return ret;
> > +}
>
> What I ended up with for an xarray-based implementation. I'm very flexible on
> names and whatnot, these are just what made sense to me.
>
> static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode,
> loff_t offset, loff_t len)
> {
> struct restrictedmem_notifier *notifier;
> struct file *memfd = rm->memfd;
> unsigned long index;
> pgoff_t start, end;
> int ret;
>
> if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> return -EINVAL;
>
> start = offset >> PAGE_SHIFT;
> end = (offset + len) >> PAGE_SHIFT;
>
> /*
> * Bindings must be stable across invalidation to ensure the start+end
> * are balanced.
> */
> down_read(&rm->lock);
>
> xa_for_each_range(&rm->bindings, index, notifier, start, end)
> notifier->ops->invalidate_start(notifier, start, end);
>
> ret = memfd->f_op->fallocate(memfd, mode, offset, len);
>
> xa_for_each_range(&rm->bindings, index, notifier, start, end)
> notifier->ops->invalidate_end(notifier, start, end);
>
> up_read(&rm->lock);
>
> return ret;
> }
>
> int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> struct restrictedmem_notifier *notifier, bool exclusive)
> {
> struct restrictedmem *rm = file->f_mapping->private_data;
> int ret = -EINVAL;
>
> down_write(&rm->lock);
>
> /* Non-exclusive mappings are not yet implemented. */
> if (!exclusive)
> goto out_unlock;
>
> if (!xa_empty(&rm->bindings)) {
> if (exclusive != rm->exclusive)
> goto out_unlock;
>
> if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
> goto out_unlock;
> }
>
> xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> rm->exclusive = exclusive;
> ret = 0;
> out_unlock:
> up_write(&rm->lock);
> return ret;
> }
> EXPORT_SYMBOL_GPL(restrictedmem_bind);
>
> void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> struct restrictedmem_notifier *notifier)
> {
> struct restrictedmem *rm = file->f_mapping->private_data;
>
> down_write(&rm->lock);
> xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> synchronize_rcu();
> up_write(&rm->lock);
> }
> EXPORT_SYMBOL_GPL(restrictedmem_unbind);
* Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory
2023-01-13 22:37 ` Sean Christopherson
@ 2023-01-17 12:42 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2023-01-17 12:42 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jarkko Sakkinen, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Fri, Jan 13, 2023 at 10:37:39PM +0000, Sean Christopherson wrote:
> On Tue, Jan 10, 2023, Chao Peng wrote:
> > On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > > On Thu, Jan 05, 2023 at 11:23:01AM +0000, Jarkko Sakkinen wrote:
> > > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > > To make future maintenance easy, internally use a binary compatible
> > > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > > '_ext' variants.
> > > > >
> > > > > Feels a bit hacky IMHO, and more like a completely new feature than
> > > > > an extension.
> > > > >
> > > > > Why not just add a new ioctl? The commit message does not address
> > > > > the most essential design here.
> > > >
> > > > Yes, people can always choose to add a new ioctl for this kind of change
> > > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > > the functionalities are similar. The '_ext' variant reuses all the
> > > > existing fields in the 'normal' variant and most importantly KVM
> > > > internally can reuse most of the code. I certainly can add some words in
> > > > the commit message to explain this design choice.
> > >
> > > After seeing the userspace side of this, I agree with Jarkko; overloading
> > > KVM_SET_USER_MEMORY_REGION is a hack. E.g. the size validation ends up being
> > > bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
> > > itself.
> >
> > How is the size validation being bogus? I don't quite follow.
>
> The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
> in this case) in the ioctl() number, and that information is visible to userspace
> via _IOC_SIZE(). Attempting to pass a larger size can mess up sanity checks,
> e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
> passed an "extended" struct.
>
> #define kvm_do_ioctl(fd, cmd, arg) \
> ({ \
> kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd)); \
> ioctl(fd, cmd, arg); \
> })
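For reference, a minimal, hedged illustration of what the ioctl number encodes
(not code from the series; it only restates the _IOC_SIZE() point above):
/*
 * The ioctl number bakes in the size of the base struct, so any larger
 * "extended" struct no longer matches _IOC_SIZE() and trips size-based
 * asserts like the kvm_static_assert() above.
 */
#include <linux/kvm.h>
_Static_assert(_IOC_SIZE(KVM_SET_USER_MEMORY_REGION) ==
	       sizeof(struct kvm_userspace_memory_region),
	       "the ioctl number embeds only the base struct's size");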
Got it. Thanks for the explanation.
Chao
* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2023-01-14 0:01 ` Sean Christopherson
@ 2023-01-17 13:12 ` Chao Peng
2023-01-17 19:35 ` Sean Christopherson
2023-01-28 14:00 ` Chao Peng
1 sibling, 1 reply; 153+ messages in thread
From: Chao Peng @ 2023-01-17 13:12 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >
> > if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> > static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > +
> > + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>
> Synthesizing triple fault shutdown is not the right approach. Even with TDX's
> MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> guest have a paravirt interface for handling memory errors without killing the
> host.
Agree, shutdown is not the correct choice. I see you made the change below:
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)
The MCE may happen in any thread other than the KVM thread, so sending a
signal to the 'current' thread may not be the expected behavior. Also, how
can userspace tell whether the MCE is on a shared page or a private page? Do we care?
>
> > + r = 0;
> > + goto out;
> > + }
> > }
>
>
> > @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> > mem->memory_size))
> > return -EINVAL;
> > + if (mem->flags & KVM_MEM_PRIVATE &&
> > + (mem->restricted_offset & (PAGE_SIZE - 1) ||
>
> Align indentation.
>
> > + mem->restricted_offset > U64_MAX - mem->memory_size))
>
> Strongly prefer to use similar logic to existing code that detects wraps:
>
> mem->restricted_offset + mem->memory_size < mem->restricted_offset
>
> This is also where I'd like to add the "gfn is aligned to offset" check, though
> my brain is too fried to figure that out right now.
>
> > + return -EINVAL;
> > if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > return -EINVAL;
> > if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > return -EINVAL;
> > } else { /* Modify an existing slot. */
> > + /* Private memslots are immutable, they can only be deleted. */
>
> I'm 99% certain I suggested this, but if we're going to make these memslots
> immutable, then we should straight up disallow dirty logging, otherwise we'll
> end up with a bizarre uAPI.
But in my mind dirty logging will be needed in the near future, once
live migration gets supported?
>
> > + if (mem->flags & KVM_MEM_PRIVATE)
> > + return -EINVAL;
> > if ((mem->userspace_addr != old->userspace_addr) ||
> > (npages != old->npages) ||
> > ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > new->npages = npages;
> > new->flags = mem->flags;
> > new->userspace_addr = mem->userspace_addr;
> > + if (mem->flags & KVM_MEM_PRIVATE) {
> > + new->restricted_file = fget(mem->restricted_fd);
> > + if (!new->restricted_file ||
> > + !file_is_restrictedmem(new->restricted_file)) {
> > + r = -EINVAL;
> > + goto out;
> > + }
> > + new->restricted_offset = mem->restricted_offset;
I see you changed the slot->restricted_offset type from loff_t to gfn_t and
used pgoff_t when doing the restrictedmem_bind/unbind(). Using a page
index is reasonable internally in KVM and sounds simpler than loff_t. But
we also need to initialize it to a page index here, as well as make the
changes in another two cases below. This is needed when restricted_offset != 0.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 547b92215002..49e375e78f30 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn,
int *order)
{
- pgoff_t index = gfn - slot->base_gfn +
- (slot->restricted_offset >> PAGE_SHIFT);
+ pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;
struct page *page;
int ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 01db35ddd5b3..7439bdcb0d04 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
pgoff_t start, pgoff_t end,
gfn_t *gfn_start, gfn_t *gfn_end)
{
- unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+ unsigned long base_pgoff = slot->restricted_offset;
if (start > base_pgoff)
*gfn_start = slot->base_gfn + start - base_pgoff;
@@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
r = -EINVAL;
goto out;
}
- new->restricted_offset = mem->restricted_offset;
+ new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;
}
r = kvm_set_memslot(kvm, old, new, change);
Chao
> > + }
> > +
> > + new->kvm = kvm;
>
> Set this above, just so that the code flows better.
* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
2023-01-14 0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
2023-01-16 13:48 ` Kirill A. Shutemov
@ 2023-01-17 13:19 ` Chao Peng
2023-01-17 14:32 ` Fuad Tabba
` (2 subsequent siblings)
4 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2023-01-17 13:19 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Sat, Jan 14, 2023 at 12:37:59AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
Thanks very much for doing this. Almost all of your comments are well
received, except for two cases that need more discussion, which I have
replied to individually.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
I have integrated this into my local TDX repo with some changes (as I
replied individually), and the new code basically still works with TDX.
I have also asked other TDX folks to take a look.
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).
>
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a lists of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together. Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler. The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details. Nothing remotely major, but something that needs attention since it'll
> be uAPI.
>
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.
Appreciate your effort. As for the next steps, if you see something we
can do in parallel, feel free to let me know.
Thanks,
Chao
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2023-01-17 3:21 ` Binbin Wu
@ 2023-01-17 13:30 ` Chao Peng
2023-01-17 17:25 ` Sean Christopherson
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2023-01-17 13:30 UTC (permalink / raw)
To: Binbin Wu
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
>
> On 12/2/2022 2:13 PM, Chao Peng wrote:
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> > - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > a guest memory range.
> > - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> > Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> > arch/x86/kvm/Kconfig | 1 +
> > include/linux/kvm_host.h | 3 ++
> > include/uapi/linux/kvm.h | 17 ++++++++
>
> Should the changes introduced in this file also need to be added in
> tools/include/uapi/linux/kvm.h ?
Yes, I think so. But I'm hesitant about whether to include it in this patch
or not. I see many commits syncing the kernel kvm.h to the tools copy. Looks
like that is done periodically and with a 'pull' model.
Chao
* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
2023-01-14 0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
2023-01-16 13:48 ` Kirill A. Shutemov
2023-01-17 13:19 ` Chao Peng
@ 2023-01-17 14:32 ` Fuad Tabba
2023-01-19 11:13 ` Isaku Yamahata
2023-01-24 16:08 ` Liam Merwick
4 siblings, 0 replies; 153+ messages in thread
From: Fuad Tabba @ 2023-01-17 14:32 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, Michael Roth, mhocko, wei.w.wang
Hi Sean,
On Sat, Jan 14, 2023 at 12:38 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).
Thanks for sharing this. I've had a look at the patches, and have
ported them to work with pKVM. At a high level, the new interface
seems fine and it works with the arm64/pKVM port. I have a couple of
comments regarding some of the details, but they can wait until v11 is
posted.
Cheers,
/fuad
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a lists of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together. Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler. The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details. Nothing remotely major, but something that needs attention since it'll
> be uAPI.
>
> I'm off Monday, so it'll be at least Tuesday before I make any more progress on
> my side.
>
> Thanks!
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2023-01-17 12:41 ` Chao Peng
@ 2023-01-17 16:34 ` Sean Christopherson
2023-01-18 8:16 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Sean Christopherson @ 2023-01-17 16:34 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Jan 17, 2023, Chao Peng wrote:
> On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > + list_for_each_entry(notifier, &data->notifiers, list) {
> > > + notifier->ops->invalidate_start(notifier, start, end);
> >
> > Two major design issues that we overlooked long ago:
> >
> > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a
> > VM with a large number of convertible memslots that are all backed by a
> > single large restrictedmem instance, then converting a single page will
> > result in a linear walk through all memslots. I don't expect anyone to
> > actually do something silly like that, but I also never expected there to be
> > a legitimate usecase for thousands of memslots.
> >
> > 2. This approach fails to provide the ability for KVM to ensure a guest has
> > exclusive access to a page. As discussed in the past, the kernel can rely
> > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent,
> > e.g. SEV, there is value in ensuring a 1:1 association.
> >
> > And probably more importantly, relying on hardware for SNP and TDX yields a
> > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a
> > page is exclusive to a guest, i.e. if userspace can hand out the same page
> > from a restrictedmem instance to multiple VMs, then failure will occur only
> > when KVM tries to assign the page to the second VM. That will happen deep
> > in KVM, which means KVM needs to gracefully handle such errors, and it means
> > that KVM's ABI effectively allows plumbing garbage into its memslots.
>
> It may not be a valid usage, but in my TDX environment I do meet below
> issue.
>
> kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
>
> Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
> and slot#2 fails due to below exclusive check.
>
> Currently I changed QEMU code to mark these alias slots as shared
> instead of private but I'm not 100% confident this is correct fix.
That's a QEMU bug of sorts. SMM is mutually exclusive with TDX, QEMU shouldn't
be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
Actually, KVM should enforce that by disallowing SMM memslots for TDX guests.
Ditto for SNP guests and UPM-backed SEV and SEV-ES guests. I think it probably
even makes sense to introduce that restriction in the base UPM support, e.g.
something like the below. That would unnecessarily prevent emulating SMM for
KVM_X86_PROTECTED_VM types that aren't encrypted, but IMO that's an acceptable
limitation until there's an actual use case for KVM_X86_PROTECTED_VM guests beyond
SEV (my thought is that KVM_X86_PROTECTED_VM will mostly be a vehicle for selftests
and UPM-based SEV and SEV-ES guests).
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48b7bdad1e0a..0a8aac821cb0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4357,6 +4357,14 @@ bool kvm_arch_has_private_mem(struct kvm *kvm)
return kvm->arch.vm_type != KVM_X86_DEFAULT_VM;
}
+int kvm_arch_nr_address_spaces(struct kvm *kvm)
+{
+ if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM)
+ return 1;
+
+ return KVM_ADDRESS_SPACE_NUM;
+}
+
static bool kvm_is_vm_type_supported(unsigned long type)
{
return type == KVM_X86_DEFAULT_VM ||
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 97801d81ee42..e0a3fc819fe5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2126,7 +2126,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
mem->restricted_offset + mem->memory_size < mem->restricted_offset ||
0 /* TODO: require gfn be aligned with restricted offset */))
return -EINVAL;
- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+ if (as_id >= kvm_arch_nr_address_spaces(kvm) || id >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
return -EINVAL;
* Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes
2023-01-17 13:30 ` Chao Peng
@ 2023-01-17 17:25 ` Sean Christopherson
0 siblings, 0 replies; 153+ messages in thread
From: Sean Christopherson @ 2023-01-17 17:25 UTC (permalink / raw)
To: Chao Peng
Cc: Binbin Wu, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Jan 17, 2023, Chao Peng wrote:
> On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
> >
> > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > > In confidential computing usages, whether a page is private or shared is
> > > necessary information for KVM to perform operations like page fault
> > > handling, page zapping etc. There are other potential use cases for
> > > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > > or exec-only, etc.) without having to modify memslots.
> > >
> > > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > > userspace to operate on the per-page memory attributes.
> > > - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > > a guest memory range.
> > > - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > > memory attributes.
> > >
> > > KVM internally uses xarray to store the per-page memory attributes.
> > >
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > > ---
> > > Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> > > arch/x86/kvm/Kconfig | 1 +
> > > include/linux/kvm_host.h | 3 ++
> > > include/uapi/linux/kvm.h | 17 ++++++++
> >
> > Should the changes introduced in this file also need to be added in
> > tools/include/uapi/linux/kvm.h ?
>
> Yes I think.
I'm not sure how Paolo or others feel, but my preference is to never update KVM's
uapi headers in tools/ in KVM's tree. Nothing KVM-related in tools/ actually
relies on the headers being copied into tools/, e.g. KVM selftests pulls KVM's
headers from the .../usr/include/ directory that's populated by `make headers_install`.
Perf's tooling is what actually "needs" the headers to be copied into tools/, so
my preference is to let the tools/perf maintainers deal with the headache of keeping
everything up-to-date.
> But I'm hesitate to include in this patch or not. I see many commits sync
> kernel kvm.h to tools's copy. Looks that is done periodically and with a
> 'pull' model.
* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2023-01-17 13:12 ` Chao Peng
@ 2023-01-17 19:35 ` Sean Christopherson
2023-01-18 8:23 ` Chao Peng
0 siblings, 1 reply; 153+ messages in thread
From: Sean Christopherson @ 2023-01-17 19:35 UTC (permalink / raw)
To: Chao Peng
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Jan 17, 2023, Chao Peng wrote:
> On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >
> > > if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> > > static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > > +
> > > + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > > + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> >
> > Synthesizing triple fault shutdown is not the right approach. Even with TDX's
> > MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> > guest have a paravirt interface for handling memory errors without killing the
> > host.
>
> Agree shutdown is not the correct choice. I see you made below change:
>
> send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)
>
> The MCE may happen in any thread than KVM thread, sending siginal to
> 'current' thread may not be the expected behavior.
This is already true today, e.g. a #MC in memory that is mapped into the guest can
be triggered by a host access. Hrm, but in this case we actually have a KVM
instance, and we know that the #MC is relevant to the KVM instance, so I agree
that signaling 'current' is kludgy.
> Also how userspace can tell is the MCE on the shared page or private page?
> Do we care?
We care. I was originally thinking we could require userspace to keep track of
things, but that's quite prescriptive and flawed, e.g. could race with conversions.
One option would be to use KVM_EXIT_MEMORY_FAULT, and then wire up a generic (not x86
specific) KVM request to exit to userspace, e.g.
/* KVM_EXIT_MEMORY_FAULT */
struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
#define KVM_MEMORY_EXIT_FLAG_HW_ERROR (1ULL << 4)
__u64 flags;
__u64 gpa;
__u64 size;
} memory;
But I'm not sure that's the correct approach. It kinda feels like we're reinventing
the wheel. It seems like restrictedmem_get_page() _must_ be able to reject attempts
to get a poisoned page, i.e. restrictedmem_get_page() should yield KVM_PFN_ERR_HWPOISON.
Assuming that's the case, then I believe KVM simply needs to zap SPTEs in response
to an error notification in order to force vCPUs to fault on the poisoned page.
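A minimal sketch of that direction (illustrative only: it assumes
restrictedmem_get_page() reports poison as -EHWPOISON, and the helper name here
is made up; slot->restricted_file and the pgoff_t index follow the series):
static inline kvm_pfn_t kvm_restricted_mem_pfn(struct kvm_memory_slot *slot,
					       pgoff_t index, int *order)
{
	struct page *page;
	int ret;
	/* Surface poisoned private pages the same way as poisoned shared pages. */
	ret = restrictedmem_get_page(slot->restricted_file, index, &page, order);
	if (ret == -EHWPOISON)
		return KVM_PFN_ERR_HWPOISON;
	if (ret)
		return KVM_PFN_ERR_FAULT;
	return page_to_pfn(page);
}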
> > > + return -EINVAL;
> > > if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > > return -EINVAL;
> > > if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > > return -EINVAL;
> > > } else { /* Modify an existing slot. */
> > > + /* Private memslots are immutable, they can only be deleted. */
> >
> > I'm 99% certain I suggested this, but if we're going to make these memslots
> > immutable, then we should straight up disallow dirty logging, otherwise we'll
> > end up with a bizarre uAPI.
>
> But in my mind dirty logging will be needed in the very short time, when
> live migration gets supported?
Ya, but if/when live migration support is added, private memslots will no longer
be immutable as userspace will want to enable dirty logging only when a VM is
being migrated, i.e. something will need to change.
Given that it looks like we have clear line of sight to SEV+UPM guests, my
preference would be to allow toggling dirty logging from the get-go. It doesn't
necessarily have to be in the first patch, e.g. KVM could initially reject
KVM_MEM_LOG_DIRTY_PAGES + KVM_MEM_PRIVATE and then add support separately to make
the series easier to review, test, and bisect.
static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_userspace_memory_region2 *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
if (kvm_arch_has_private_mem(kvm) &&
!(mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
valid_flags |= KVM_MEM_PRIVATE;
...
}
> > > + if (mem->flags & KVM_MEM_PRIVATE)
> > > + return -EINVAL;
> > > if ((mem->userspace_addr != old->userspace_addr) ||
> > > (npages != old->npages) ||
> > > ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > new->npages = npages;
> > > new->flags = mem->flags;
> > > new->userspace_addr = mem->userspace_addr;
> > > + if (mem->flags & KVM_MEM_PRIVATE) {
> > > + new->restricted_file = fget(mem->restricted_fd);
> > > + if (!new->restricted_file ||
> > > + !file_is_restrictedmem(new->restricted_file)) {
> > > + r = -EINVAL;
> > > + goto out;
> > > + }
> > > + new->restricted_offset = mem->restricted_offset;
>
> I see you changed slot->restricted_offset type from loff_t to gfn_t and
> used pgoff_t when doing the restrictedmem_bind/unbind(). Using page
> index is reasonable KVM internally and sounds simpler than loff_t. But
> we also need initialize it to page index here as well as changes in
> another two cases. This is needed when restricted_offset != 0.
Oof. I'm pretty sure I completely missed that loff_t is used for byte offsets,
whereas pgoff_t is a frame index.
Given that the restrictedmem APIs take pgoff_t, I definitely think it makes sense
to use the index, but I'm very tempted to store pgoff_t instead of gfn_t, and name
the field "index" to help connect the dots to the rest of the kernel, where
"pgoff_t index" is quite common.
And looking at those bits again, we should wrap all of the restrictedmem fields
with CONFIG_KVM_PRIVATE_MEM. It'll require minor tweaks to __kvm_set_memory_region(),
but I think it will yield cleaner code (and internal APIs) overall.
And wrap the three fields in an anonymous struct? E.g. this is a little more
verbose (restrictedmem instead of restricted), but at first glance it doesn't seem
to cause widespread line length issues.
#ifdef CONFIG_KVM_PRIVATE_MEM
struct {
struct file *file;
pgoff_t index;
struct restrictedmem_notifier notifier;
} restrictedmem;
#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 547b92215002..49e375e78f30 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn,
> int *order)
> {
> - pgoff_t index = gfn - slot->base_gfn +
> - (slot->restricted_offset >> PAGE_SHIFT);
> + pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;
> struct page *page;
> int ret;
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 01db35ddd5b3..7439bdcb0d04 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> pgoff_t start, pgoff_t end,
> gfn_t *gfn_start, gfn_t *gfn_end)
> {
> - unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> + unsigned long base_pgoff = slot->restricted_offset;
>
> if (start > base_pgoff)
> *gfn_start = slot->base_gfn + start - base_pgoff;
> @@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> r = -EINVAL;
> goto out;
> }
> - new->restricted_offset = mem->restricted_offset;
> + new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;
> }
>
> r = kvm_set_memslot(kvm, old, new, change);
>
> Chao
> > > + }
> > > +
> > > + new->kvm = kvm;
> >
> > Set this above, just so that the code flows better.
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2023-01-17 16:34 ` Sean Christopherson
@ 2023-01-18 8:16 ` Chao Peng
2023-01-18 10:17 ` Isaku Yamahata
0 siblings, 1 reply; 153+ messages in thread
From: Chao Peng @ 2023-01-18 8:16 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Chao Peng wrote:
> > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > + list_for_each_entry(notifier, &data->notifiers, list) {
> > > > + notifier->ops->invalidate_start(notifier, start, end);
> > >
> > > Two major design issues that we overlooked long ago:
> > >
> > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a
> > > VM with a large number of convertible memslots that are all backed by a
> > > single large restrictedmem instance, then converting a single page will
> > > result in a linear walk through all memslots. I don't expect anyone to
> > > actually do something silly like that, but I also never expected there to be
> > > a legitimate usecase for thousands of memslots.
> > >
> > > 2. This approach fails to provide the ability for KVM to ensure a guest has
> > > exclusive access to a page. As discussed in the past, the kernel can rely
> > > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent,
> > > e.g. SEV, there is value in ensuring a 1:1 association.
> > >
> > > And probably more importantly, relying on hardware for SNP and TDX yields a
> > > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a
> > > page is exclusive to a guest, i.e. if userspace can hand out the same page
> > > from a restrictedmem instance to multiple VMs, then failure will occur only
> > > when KVM tries to assign the page to the second VM. That will happen deep
> > > in KVM, which means KVM needs to gracefully handle such errors, and it means
> > > that KVM's ABI effectively allows plumbing garbage into its memslots.
> >
> > It may not be a valid usage, but in my TDX environment I do meet below
> > issue.
> >
> > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> >
> > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
> > and slot#2 fails due to below exclusive check.
> >
> > Currently I changed QEMU code to mark these alias slots as shared
> > instead of private but I'm not 100% confident this is correct fix.
>
> That's a QEMU bug of sorts. SMM is mutually exclusive with TDX, QEMU shouldn't
> be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
Thanks for the confirmation. As long as we only bind one notifier for
each address, using an xarray does make things simple.
Chao
* Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
2023-01-17 19:35 ` Sean Christopherson
@ 2023-01-18 8:23 ` Chao Peng
0 siblings, 0 replies; 153+ messages in thread
From: Chao Peng @ 2023-01-18 8:23 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-arch,
linux-api, linux-doc, qemu-devel, Paolo Bonzini, Jonathan Corbet,
Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin, Hugh Dickins,
Jeff Layton, J . Bruce Fields, Andrew Morton, Shuah Khan,
Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Tue, Jan 17, 2023 at 07:35:58PM +0000, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Chao Peng wrote:
> > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > >
> > > > if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> > > > static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > > > +
> > > > + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > > > + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> > >
> > > Synthesizing triple fault shutdown is not the right approach. Even with TDX's
> > > MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> > > guest have a paravirt interface for handling memory errors without killing the
> > > host.
> >
> > Agree shutdown is not the correct choice. I see you made below change:
> >
> > send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)
> >
> > The MCE may happen in any thread than KVM thread, sending siginal to
> > 'current' thread may not be the expected behavior.
>
> This is already true today, e.g. a #MC in memory that is mapped into the guest can
> be triggered by a host access. Hrm, but in this case we actually have a KVM
> instance, and we know that the #MC is relevant to the KVM instance, so I agree
> that signaling 'current' is kludgy.
>
> > Also how userspace can tell is the MCE on the shared page or private page?
> > Do we care?
>
> We care. I was originally thinking we could require userspace to keep track of
> things, but that's quite prescriptive and flawed, e.g. could race with conversions.
>
> One option would be to KVM_EXIT_MEMORY_FAULT, and then wire up a generic (not x86
> specific) KVM request to exit to userspace, e.g.
>
> /* KVM_EXIT_MEMORY_FAULT */
> struct {
> #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
> #define KVM_MEMORY_EXIT_FLAG_HW_ERROR (1ULL << 4)
> __u64 flags;
> __u64 gpa;
> __u64 size;
> } memory;
>
> But I'm not sure that's the correct approach. It kinda feels like we're reinventing
> the wheel. It seems like restrictedmem_get_page() _must_ be able to reject attempts
> to get a poisoned page, i.e. restrictedmem_get_page() should yield KVM_PFN_ERR_HWPOISON.
Yes, I see there is -EHWPOISON handling in hva_to_pfn() for shared
memory. It makes sense to do something similar for private pages.
> Assuming that's the case, then I believe KVM simply needs to zap SPTEs in response
> to an error notification in order to force vCPUs to fault on the poisoned page.
Agree, this is what we should do anyway.
>
> > > > + return -EINVAL;
> > > > if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > > > return -EINVAL;
> > > > if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > > > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > > if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > > > return -EINVAL;
> > > > } else { /* Modify an existing slot. */
> > > > + /* Private memslots are immutable, they can only be deleted. */
> > >
> > > I'm 99% certain I suggested this, but if we're going to make these memslots
> > > immutable, then we should straight up disallow dirty logging, otherwise we'll
> > > end up with a bizarre uAPI.
> >
> > But in my mind dirty logging will be needed in the very short time, when
> > live migration gets supported?
>
> Ya, but if/when live migration support is added, private memslots will no longer
> be immutable as userspace will want to enable dirty logging only when a VM is
> being migrated, i.e. something will need to change.
>
> Given that it looks like we have clear line of sight to SEV+UPM guests, my
> preference would be to allow toggling dirty logging from the get-go. It doesn't
> necessarily have to be in the first patch, e.g. KVM could initially reject
> KVM_MEM_LOG_DIRTY_PAGES + KVM_MEM_PRIVATE and then add support separately to make
> the series easier to review, test, and bisect.
>
> static int check_memory_region_flags(struct kvm *kvm,
> const struct kvm_userspace_memory_region2 *mem)
> {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
>
> if (kvm_arch_has_private_mem(kvm) &&
> ~(mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
> valid_flags |= KVM_MEM_PRIVATE;
Adding this limitation is OK with me. It's not too hard to remove it when
live migration gets added.
>
>
> ...
> }
>
> > > > + if (mem->flags & KVM_MEM_PRIVATE)
> > > > + return -EINVAL;
> > > > if ((mem->userspace_addr != old->userspace_addr) ||
> > > > (npages != old->npages) ||
> > > > ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > > > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > > new->npages = npages;
> > > > new->flags = mem->flags;
> > > > new->userspace_addr = mem->userspace_addr;
> > > > + if (mem->flags & KVM_MEM_PRIVATE) {
> > > > + new->restricted_file = fget(mem->restricted_fd);
> > > > + if (!new->restricted_file ||
> > > > + !file_is_restrictedmem(new->restricted_file)) {
> > > > + r = -EINVAL;
> > > > + goto out;
> > > > + }
> > > > + new->restricted_offset = mem->restricted_offset;
> >
> > I see you changed slot->restricted_offset type from loff_t to gfn_t and
> > used pgoff_t when doing the restrictedmem_bind/unbind(). Using page
> > index is reasonable KVM internally and sounds simpler than loff_t. But
> > we also need initialize it to page index here as well as changes in
> > another two cases. This is needed when restricted_offset != 0.
>
> Oof. I'm pretty sure I completely missed that loff_t is used for byte offsets,
> whereas pgoff_t is a frame index.
>
> Given that the restrictmem APIs take pgoff_t, I definitely think it makes sense
> to the index, but I'm very tempted to store pgoff_t instead of gfn_t, and name
> the field "index" to help connect the dots to the rest of kernel, where "pgoff_t index"
> is quite common.
>
> And looking at those bits again, we should wrap all of the restrictedmem fields
> with CONFIG_KVM_PRIVATE_MEM. It'll require minor tweaks to __kvm_set_memory_region(),
> but I think will yield cleaner code (and internal APIs) overall.
>
> And wrap the three fields in an anonymous struct? E.g. this is a little more
> verbose (restrictedmem instead of restricted), but at first glance it doesn't seem
> to cause widespread line length issues.
>
> #ifdef CONFIG_KVM_PRIVATE_MEM
> struct {
> struct file *file;
> pgoff_t index;
> struct restrictedmem_notifier notifier;
> } restrictedmem;
> #endif
Looks better.
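With that layout, the byte-offset vs. page-index handling we discussed would
roughly become the sketch below (the helper name is just illustrative, not
necessarily what I'll send):

#ifdef CONFIG_KVM_PRIVATE_MEM
	/* uAPI still passes a byte offset; store it as a page index internally. */
	new->restrictedmem.index = mem->restricted_offset >> PAGE_SHIFT;
#endif

static inline pgoff_t kvm_restrictedmem_index(const struct kvm_memory_slot *slot,
					      gfn_t gfn)
{
	/* Both sides are frame granular, so no PAGE_SHIFT conversion needed. */
	return gfn - slot->base_gfn + slot->restrictedmem.index;
}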
Thanks,
Chao
>
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 547b92215002..49e375e78f30 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> > gfn_t gfn, kvm_pfn_t *pfn,
> > int *order)
> > {
> > - pgoff_t index = gfn - slot->base_gfn +
> > - (slot->restricted_offset >> PAGE_SHIFT);
> > + pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;
> > struct page *page;
> > int ret;
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 01db35ddd5b3..7439bdcb0d04 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> > pgoff_t start, pgoff_t end,
> > gfn_t *gfn_start, gfn_t *gfn_end)
> > {
> > - unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> > + unsigned long base_pgoff = slot->restricted_offset;
> >
> > if (start > base_pgoff)
> > *gfn_start = slot->base_gfn + start - base_pgoff;
> > @@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > r = -EINVAL;
> > goto out;
> > }
> > - new->restricted_offset = mem->restricted_offset;
> > + new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;
> > }
> >
> > r = kvm_set_memslot(kvm, old, new, change);
> >
> > Chao
> > > > + }
> > > > +
> > > > + new->kvm = kvm;
> > >
> > > Set this above, just so that the code flows better.
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
2023-01-18 8:16 ` Chao Peng
@ 2023-01-18 10:17 ` Isaku Yamahata
0 siblings, 0 replies; 153+ messages in thread
From: Isaku Yamahata @ 2023-01-18 10:17 UTC (permalink / raw)
To: Chao Peng
Cc: Sean Christopherson, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang, isaku.yamahata
On Wed, Jan 18, 2023 at 04:16:41PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:
> On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Chao Peng wrote:
> > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > > + list_for_each_entry(notifier, &data->notifiers, list) {
> > > > > + notifier->ops->invalidate_start(notifier, start, end);
> > > >
> > > > Two major design issues that we overlooked long ago:
> > > >
> > > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a
> > > > VM with a large number of convertible memslots that are all backed by a
> > > > single large restrictedmem instance, then converting a single page will
> > > > result in a linear walk through all memslots. I don't expect anyone to
> > > > actually do something silly like that, but I also never expected there to be
> > > > a legitimate usecase for thousands of memslots.
> > > >
> > > > 2. This approach fails to provide the ability for KVM to ensure a guest has
> > > > exclusive access to a page. As discussed in the past, the kernel can rely
> > > > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > > > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent,
> > > > e.g. SEV, there is value in ensuring a 1:1 association.
> > > >
> > > > And probably more importantly, relying on hardware for SNP and TDX yields a
> > > > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a
> > > > page is exclusive to a guest, i.e. if userspace can hand out the same page
> > > > from a restrictedmem instance to multiple VMs, then failure will occur only
> > > > when KVM tries to assign the page to the second VM. That will happen deep
> > > > in KVM, which means KVM needs to gracefully handle such errors, and it means
> > > > that KVM's ABI effectively allows plumbing garbage into its memslots.
> > >
> > > It may not be a valid usage, but in my TDX environment I do hit the
> > > issue below.
> > >
> > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> > >
> > > Slot#2 ('SMRAM') is actually an alias into system memory (Slot#0) in QEMU,
> > > and slot#2 fails due to the exclusive check below.
> > >
> > > Currently I have changed the QEMU code to mark these alias slots as shared
> > > instead of private, but I'm not 100% confident this is the correct fix.
> >
> > That's a QEMU bug of sorts. SMM is mutually exclusive with TDX; QEMU shouldn't
> > be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
>
> Thanks for the confirmation. As long as we only bind one notifier for
> each address, using an xarray does make things simple.
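Just to make sure I follow the xarray idea, I read it as roughly the sketch
below (purely illustrative, not code from the series): keep the bindings keyed
by page index so that an invalidation only visits the entries covering
[start, end) instead of walking every registered notifier.

/* Assumes data->bindings was set up with xa_init() at creation time. */
struct restrictedmem_data {
	struct xarray bindings;	/* page index -> struct restrictedmem_notifier * */
};

static int restrictedmem_bind_range(struct restrictedmem_data *data,
				    pgoff_t start, pgoff_t end,
				    struct restrictedmem_notifier *notifier)
{
	return xa_err(xa_store_range(&data->bindings, start, end - 1,
				     notifier, GFP_KERNEL));
}

static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
					   pgoff_t start, pgoff_t end)
{
	struct restrictedmem_notifier *notifier;
	unsigned long index;

	/* Only the bindings overlapping [start, end) are visited. */
	xa_for_each_range(&data->bindings, index, notifier, start, end - 1)
		notifier->ops->invalidate_start(notifier, start, end);
}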
In the past, I had patches for QEMU to disable PAM and SMRAM, but they were
dropped for simplicity because SMRAM/PAM are disabled in the reset state, with an
unused memslot registered. The TDX guest BIOS (TDVF or EDK2) doesn't enable them.
Now we can revive those patches.
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply [flat|nested] 153+ messages in thread
* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
2023-01-14 0:37 ` [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM Sean Christopherson
` (2 preceding siblings ...)
2023-01-17 14:32 ` Fuad Tabba
@ 2023-01-19 11:13 ` Isaku Yamahata
2023-01-19 15:25 ` Sean Christopherson
2023-01-24 16:08 ` Liam Merwick
4 siblings, 1 reply; 153+ messages in thread
From: Isaku Yamahata @ 2023-01-19 11:13 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang, isaku.yamahata
On Sat, Jan 14, 2023 at 12:37:59AM +0000,
Sean Christopherson <seanjc@google.com> wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> > - 01: mm change, target for mm tree
> > - 02-09: KVM change, target for KVM tree
>
> A version with all of my feedback, plus reworked versions of Vishal's selftest,
> is available here:
>
> git@github.com:sean-jc/linux.git x86/upm_base_support
>
> It compiles and passes the selftest, but it's otherwise barely tested. There are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> a WIP.
>
> As for next steps, can you (handwaving all of the TDX folks) take a look at what
> I pushed and see if there's anything horrifically broken, and that it still works
> for TDX?
>
> Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> (and I mean that).
>
> On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> (SEV and TDX). For tests, I want to build a list of tests that are required for
> merging so that the criteria for merging are clear, and so that if the list is large
> (haven't thought much yet), the work of writing and running tests can be distributed.
>
> Regarding downstream dependencies, before this lands, I want to pull in all the
> TDX and SNP series and see how everything fits together. Specifically, I want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> don't miss an opportunity to make things simpler. The patches in the SNP series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> details. Nothing remotely major, but something that needs attention since it'll
> be uAPI.
Although I'm still debugging with TDX KVM, I needed the following.
kvm_faultin_pfn() is called without mmu_lock held; the race to change
private/shared is handled by mmu_seq. Maybe a dedicated function is needed
just for kvm_faultin_pfn().
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02be5e1cba1e..38699ca75ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2322,7 +2322,7 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
{
- lockdep_assert_held(&kvm->mmu_lock);
+ // lockdep_assert_held(&kvm->mmu_lock);
return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
}
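If we instead go the dedicated-function route mentioned above, I was picturing
something like the sketch below (the helper name is made up):

/*
 * Sketch only: a lockless variant for the fault path.  kvm_faultin_pfn()
 * runs without mmu_lock and relies on mmu_seq to catch racing attribute
 * changes, so no lockdep assertion here; the locked helper keeps its
 * lockdep_assert_held() for all other callers.
 */
static inline unsigned long kvm_get_memory_attributes_faultin(struct kvm *kvm,
							       gfn_t gfn)
{
	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
}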
--
Isaku Yamahata <isaku.yamahata@gmail.com>
^ permalink raw reply related [flat|nested] 153+ messages in thread
* Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM
2023-01-19 11:13 ` Isaku Yamahata
@ 2023-01-19 15:25 ` Sean Christopherson
2023-01-19 22:37 ` Isaku Yamahata
0 siblings, 1 reply; 153+ messages in thread
From: Sean Christopherson @ 2023-01-19 15:25 UTC (permalink / raw)
To: Isaku Yamahata
Cc: Chao Peng, kvm, linux-kernel, linux-mm, linux-fsdevel,
linux-arch, linux-api, linux-doc, qemu-devel, Paolo Bonzini,
Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86, H . Peter Anvin,
Hugh Dickins, Jeff Layton, J . Bruce Fields, Andrew Morton,
Shuah Khan, Mike Rapoport, Steven Price, Maciej S . Szmigiero,
Vlastimil Babka, Vishal Annapurve, Yu Zhang, Kirill A . Shutemov,
luto, jun.nakajima, dave.hansen, ak, david, aarcange, ddutile,
dhildenb, Quentin Perret, tabba, Michael Roth, mhocko,
wei.w.wang
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Sat, Jan 14, 2023 at 12:37:59AM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
>
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > >
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > > - 01: mm change, target for mm tree
> > > - 02-09: KVM change, target for KVM tree
> >
> > A version with all of my feedback, plus reworked versions of Vishal's selftest,
> > is available here:
> >
> > git@github.com:sean-jc/linux.git x86/upm_base_support
> >
> > It compiles and passes the selftest, but it's otherwise barely tested. There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
> > a WIP.
> >
> > As for next steps, can you (handwaving all of the TDX folks) take a look at what
> > I pushed and see if there's anything horrifically broken, and that it still works
> > for TDX?
> >
> > Fuad (and pKVM folks) same ask for you with respect to pKVM. Absolutely no rush
> > (and I mean that).
> >
> > On my side, the two things on my mind are (a) tests and (b) downstream dependencies
> > (SEV and TDX). For tests, I want to build a list of tests that are required for
> > merging so that the criteria for merging are clear, and so that if the list is large
> > (haven't thought much yet), the work of writing and running tests can be distributed.
> >
> > Regarding downstream dependencies, before this lands, I want to pull in all the
> > TDX and SNP series and see how everything fits together. Specifically, I want to
> > make sure that we don't end up with a uAPI that necessitates ugly code, and that we
> > don't miss an opportunity to make things simpler. The patches in the SNP series to
> > add "legacy" SEV support for UPM in particular made me slightly rethink some minor
> > details. Nothing remotely major, but something that needs attention since it'll
> > be uAPI.
>
> Although I'm still debugging with TDX KVM, I needed the following.
> kvm_faultin_pfn() is called without mmu_lock held; the race to change
> private/shared is handled by mmu_seq. Maybe a dedicated function is needed
> just for kvm_faultin_pfn().
Gah, you're not on the other thread where this was discussed[*]. Simply deleting
the lockdep assertion is safe: for guest types that rely on the attributes to
define shared vs. private, KVM rechecks the attributes under the protection of
mmu_seq.
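FWIW, that's just the usual fault-path pattern, roughly (simplified sketch,
not the exact code in the branch):

	/* Snapshot the invalidation counter before reading anything racy. */
	fault->mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* Read the attributes (private vs. shared) without holding mmu_lock... */
	fault->is_private = kvm_get_memory_attributes(kvm, fault->gfn) &
			    KVM_MEMORY_ATTRIBUTE_PRIVATE;

	/* ...faultin the pfn, also without mmu_lock... */

	write_lock(&kvm->mmu_lock);
	/* ...and retry if the attributes (or mappings) may have changed. */
	if (mmu_invalidate_retry(kvm, fault->mmu_seq))
		goto out_retry;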
I'll get a fixed version pushed out today.
[*] https://lore.kernel.org/all/Y8gpl+LwSuSgBFks@google.com