* [PATCH v3 00/15] KVM: mm: fd-based approach for supporting KVM guest private memory
@ 2021-12-21 15:11 ` Chao Peng
  0 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

This is the third version of this series, which tries to implement
fd-based KVM guest private memory.

In general, this patch series introduces an fd-based memslot which
provides guest memory through a memfd file descriptor fd[offset,size]
instead of hva/size. The fd can then be created from a supported memory
filesystem like tmpfs/hugetlbfs, which we refer to as the memory
backend. KVM and the memory backend exchange callbacks when such a
memslot gets created. At runtime KVM calls into the callbacks provided
by the backend to get the pfn from fd+offset. The memory backend in
turn calls into KVM callbacks when userspace fallocates/punches holes
on the fd, to notify KVM to map/unmap secondary MMU page tables.

Compared to the existing hva-based memslot, this new type of memslot
allows guest memory to be unmapped from host userspace (e.g. QEMU) and
even from the kernel itself, which reduces the attack surface and
protects against userspace bugs.

Based on this fd-based memslot, we can build guest private memory for
confidential computing environments such as Intel TDX and AMD SEV.
When supported, the memory backend can provide stronger enforcement on
the fd, and KVM can use a single memslot to hold both the private and
the shared parts of the guest memory.

Memfd/shmem extension
---------------------
Introduce a new MFD_INACCESSIBLE flag for memfd_create(); a file
created with this flag cannot be accessed via read(), write(), mmap(),
etc.
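
For example, a VMM could create the backend fd as follows (a minimal
userspace sketch; MFD_INACCESSIBLE is introduced by this series, so it
is defined by hand here):

  #define _GNU_SOURCE
  #include <sys/mman.h>           /* memfd_create() */
  #include <unistd.h>

  #ifndef MFD_INACCESSIBLE
  #define MFD_INACCESSIBLE 0x0008U  /* from this series' uapi change */
  #endif

  int create_private_backend(size_t size) /* size must be page-aligned */
  {
          int fd = memfd_create("kvm-private-mem", MFD_INACCESSIBLE);

          if (fd < 0)
                  return -1;
          /* read()/write()/mmap() on fd now fail; sizing still works. */
          if (ftruncate(fd, size) < 0) {
                  close(fd);
                  return -1;
          }
          return fd;
  }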

In addition, two sets of callbacks are introduced as the new MEMFD_OPS
(a registration sketch follows the list):
  - memfd_falloc_notifier: memfd -> KVM notifiers, called when memory
    gets allocated/invalidated through fallocate().
  - memfd_pfn_ops: KVM -> memfd callbacks, used to get a pfn from
    fd+offset.
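
A consumer such as KVM would pair the two directions roughly as below
(a sketch based on the patch 03 interfaces; the kvm_memfd_* callback
names are placeholders for what later patches add):

  static const struct memfd_falloc_notifier kvm_memfd_notifier = {
          .invalidate_page_range  = kvm_memfd_invalidate_page_range,
          .fallocate              = kvm_memfd_fallocate,
          .get_owner              = kvm_memfd_get_owner,
          .put_owner              = kvm_memfd_put_owner,
  };

  /* Called when an fd-based memslot is created/deleted. */
  static int kvm_memfd_pair(struct kvm *kvm, struct file *file,
                            const struct memfd_pfn_ops **pfn_ops)
  {
          return memfd_register_falloc_notifier(file_inode(file), kvm,
                                                &kvm_memfd_notifier,
                                                pfn_ops);
  }

  static void kvm_memfd_unpair(struct file *file)
  {
          memfd_unregister_falloc_notifier(file_inode(file));
  }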

Memslot extension
-----------------
Add the private fd and the offset into that fd to the existing 'shared'
memslot so that both private and shared guest memory can live in one
single memslot. A page in the memslot is either private or shared. A
page is private only when it is already allocated in the backend fd; in
all other cases it is treated as shared, which includes pages already
mapped as shared as well as pages that have not been mapped yet. This
means the memory backend is the single source of truth for which pages
are private.
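
From userspace the combined memslot could be registered as below (a
sketch; shared_hva, private_fd and vm_fd are assumed to be set up
already, the kvm_userspace_memory_region_ext layout is from patch 04,
and its wiring into KVM_SET_USER_MEMORY_REGION comes with patch 12):

  struct kvm_userspace_memory_region_ext region = {
          .slot            = 0,
          .flags           = KVM_MEM_PRIVATE,
          .guest_phys_addr = 0,
          .memory_size     = 2UL << 30,
          .userspace_addr  = (__u64)shared_hva, /* shared part: hva */
          .ofs             = 0,                 /* private part: fd offset */
          .fd              = private_fd,        /* from memfd_create() */
  };

  if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
          err(1, "KVM_SET_USER_MEMORY_REGION");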

Private memory map/unmap and conversion
---------------------------------------
Userspace's map/unmap operations are done via the fallocate() system
call on the backend fd.
  - map: default fallocate() with mode=0.
  - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
The map/unmap triggers the memfd_falloc_notifier above to let KVM
map/unmap the secondary MMU page tables.
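
In userspace terms the conversion is just (a sketch; note that the
kernel requires FALLOC_FL_KEEP_SIZE together with FALLOC_FL_PUNCH_HOLE,
and patch 01 requires page-aligned offset/len on such fds):

  /* Convert a range to private: allocate it in the backend fd. */
  fallocate(private_fd, 0, offset, len);

  /* Convert it back to shared: punch a hole in the backend fd. */
  fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            offset, len);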

Test
----
This code has been tested with the latest TDX code patches hosted at
https://github.com/intel/tdx/tree/kvm-upstream, with minimal TDX
adaptation and QEMU support.

Example QEMU command line:
-object tdx-guest,id=tdx \
-object memory-backend-memfd-private,id=ram1,size=2G \
-machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1

Changelog
---------
v3:
  - Added locking protection when calling
    invalidate_page_range/fallocate callbacks.
  - Changed the memslot structure to keep using userspace_addr (hva)
    for shared memory.
  - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
  - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
  - Commit message improvement.
  - Many small fixes for comments from the last version.

Links of previous discussions
-----------------------------
[1]
https://lkml.kernel.org/kvm/51a6f74f-6c05-74b9-3fd7-b7cd900fb8cc@redhat.com/
[2]
https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@linux.intel.com/


Chao Peng (13):
  mm/memfd: Introduce MFD_INACCESSIBLE flag
  KVM: Extend the memslot to support fd-based private memory
  KVM: Implement fd-based memory using MEMFD_OPS interfaces
  KVM: Refactor hva based memory invalidation code
  KVM: Special handling for fd-based memory invalidation
  KVM: Split out common memory invalidation code
  KVM: Implement fd-based memory invalidation
  KVM: Add kvm_map_gfn_range
  KVM: Implement fd-based memory fallocation
  KVM: Add KVM_EXIT_MEMORY_ERROR exit
  KVM: Handle page fault for private memory
  KVM: Use kvm_userspace_memory_region_ext
  KVM: Register/unregister private memory slot to memfd

Kirill A. Shutemov (2):
  mm/shmem: Introduce F_SEAL_INACCESSIBLE
  mm/memfd: Introduce MEMFD_OPS

 arch/arm64/kvm/mmu.c               |  14 +-
 arch/mips/kvm/mips.c               |  14 +-
 arch/powerpc/include/asm/kvm_ppc.h |  28 ++--
 arch/powerpc/kvm/book3s.c          |  14 +-
 arch/powerpc/kvm/book3s_hv.c       |  14 +-
 arch/powerpc/kvm/book3s_pr.c       |  14 +-
 arch/powerpc/kvm/booke.c           |  14 +-
 arch/powerpc/kvm/powerpc.c         |  14 +-
 arch/riscv/kvm/mmu.c               |  14 +-
 arch/s390/kvm/kvm-s390.c           |  14 +-
 arch/x86/kvm/Kconfig               |   1 +
 arch/x86/kvm/Makefile              |   3 +-
 arch/x86/kvm/mmu/mmu.c             | 112 +++++++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h     |  11 +-
 arch/x86/kvm/x86.c                 |  16 +-
 include/linux/kvm_host.h           |  57 +++++--
 include/linux/memfd.h              |  22 +++
 include/linux/shmem_fs.h           |  16 ++
 include/uapi/linux/fcntl.h         |   1 +
 include/uapi/linux/kvm.h           |  27 ++++
 include/uapi/linux/memfd.h         |   1 +
 mm/Kconfig                         |   4 +
 mm/memfd.c                         |  33 +++-
 mm/shmem.c                         | 195 ++++++++++++++++++++++-
 virt/kvm/kvm_main.c                | 239 +++++++++++++++++++++--------
 virt/kvm/memfd.c                   | 100 ++++++++++++
 26 files changed, 822 insertions(+), 170 deletions(-)
 create mode 100644 virt/kvm/memfd.c

-- 
2.17.1



* [PATCH v3 01/15] mm/shmem: Introduce F_SEAL_INACCESSIBLE
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Introduce a new seal F_SEAL_INACCESSIBLE, indicating that the content
of the file is inaccessible from userspace in any possible way, e.g.
via read(), write() or mmap().

It provides the semantics required for KVM guest private memory
support: a file descriptor with this seal set is going to be used as
the source of guest memory in confidential computing environments such
as Intel TDX and AMD SEV, and may not be accessible from host
userspace.

At this time only shmem implements this seal.
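
As a usage sketch, the seal can also be applied through the normal
sealing interface on an empty, sealable memfd (assuming
F_SEAL_INACCESSIBLE is accepted by F_ADD_SEALS, which depends on
memfd's seal mask outside this hunk):

  int fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);

  /* Only allowed while i_size == 0, per the read_iter comment below. */
  if (fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE) < 0)
          err(1, "F_ADD_SEALS");

  /* The file can still be sized afterwards (page-aligned only). */
  if (ftruncate(fd, 2UL << 30) < 0)
          err(1, "ftruncate");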

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/fcntl.h |  1 +
 mm/shmem.c                 | 37 +++++++++++++++++++++++++++++++++++--
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..e2bad051936f 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW	0x0004	/* prevent file from growing */
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
 #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
+#define F_SEAL_INACCESSIBLE	0x0020  /* prevent contents from being accessed */
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/shmem.c b/mm/shmem.c
index 18f93c2d68f1..faa7e9b1b9bc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1098,6 +1098,10 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
 			return -EPERM;
 
+		if ((info->seals & F_SEAL_INACCESSIBLE) &&
+		    (newsize & ~PAGE_MASK))
+			return -EINVAL;
+
 		if (newsize != oldsize) {
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
 					oldsize, newsize);
@@ -1364,6 +1368,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		goto redirty;
 	if (!total_swap_pages)
 		goto redirty;
+	if (info->seals & F_SEAL_INACCESSIBLE)
+		goto redirty;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -2262,6 +2268,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	if (info->seals & F_SEAL_INACCESSIBLE)
+		return -EPERM;
+
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
 
@@ -2459,12 +2468,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	pgoff_t index = pos >> PAGE_SHIFT;
 
 	/* i_rwsem is held by caller */
-	if (unlikely(info->seals & (F_SEAL_GROW |
-				   F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+	if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+				    F_SEAL_FUTURE_WRITE |
+				    F_SEAL_INACCESSIBLE))) {
 		if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
 			return -EPERM;
 		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
 			return -EPERM;
+		if (info->seals & F_SEAL_INACCESSIBLE)
+			return -EPERM;
 	}
 
 	return shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2538,6 +2550,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
 			break;
+
+		/*
+		 * inode_lock protects setting up seals as well as write to
+		 * i_size. Setting F_SEAL_INACCESSIBLE only allowed with
+		 * i_size == 0.
+		 *
+		 * Check F_SEAL_INACCESSIBLE after i_size. It effectively
+		 * serialize read vs. setting F_SEAL_INACCESSIBLE without
+		 * taking inode_lock in read path.
+		 */
+		if (SHMEM_I(inode)->seals & F_SEAL_INACCESSIBLE) {
+			error = -EPERM;
+			break;
+		}
+
 		if (index == end_index) {
 			nr = i_size & ~PAGE_MASK;
 			if (nr <= offset)
@@ -2663,6 +2690,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto out;
 		}
 
+		if ((info->seals & F_SEAL_INACCESSIBLE) &&
+		    (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+			error = -EINVAL;
+			goto out;
+		}
+
 		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
-- 
2.17.1



* [PATCH v3 02/15] mm/memfd: Introduce MFD_INACCESSIBLE flag
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

Introduce a new memfd_create() flag indicating that the content of the
created memfd is inaccessible from userspace. It does this by
force-setting the F_SEAL_INACCESSIBLE seal when the file is created.
It also sets F_SEAL_SEAL to prevent future sealing, which means it
cannot coexist with MFD_ALLOW_SEALING.
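
Since memfd_create() fails with EINVAL for unknown flags, userspace
can probe for support (a sketch):

  int fd = memfd_create("probe", MFD_INACCESSIBLE);

  if (fd < 0 && errno == EINVAL) {
          /* Kernel without MFD_INACCESSIBLE, or an invalid flag
           * combination (e.g. together with MFD_ALLOW_SEALING). */
  }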

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 | 12 +++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 9f80f162791a..c898a007fb76 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,7 +245,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -267,6 +268,10 @@ SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -315,6 +320,11 @@ SYSCALL_DEFINE2(memfd_create,
 		*file_seals &= ~F_SEAL_SEAL;
 	}
 
+	if (flags & MFD_INACCESSIBLE) {
+		file_seals = memfd_file_seals_ptr(file);
+		*file_seals |= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;
+	}
+
 	fd_install(fd, file);
 	kfree(name);
 	return fd;
-- 
2.17.1



* [PATCH v3 03/15] mm/memfd: Introduce MEMFD_OPS
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This patch introduces a new MEMFD_OPS facility around files created by
memfd_create() to allow a third kernel component to make use of memory
bookmarked in a memfd and get notified when the memory in the file is
allocated/invalidated. It will be used by KVM, which uses a memfd file
descriptor as the guest memory backend and MEMFD_OPS to interact with
the memfd subsystem. In the future there might be other consumers
(e.g. VFIO with encrypted device memory).

It consists of two sets of callbacks:
  - memfd_falloc_notifier: callbacks provided by KVM and called by
    memfd when memory gets allocated/invalidated through the
    fallocate() system call.
  - memfd_pfn_ops: callbacks provided by memfd and called by KVM to
    request memory pages from memfd.

Locking is needed for the above callbacks to prevent race conditions;
a sketch of the owner side follows the list below.
  - get_owner/put_owner are used to ensure the owner is still alive in
    the invalidate_page_range/fallocate callback handlers, using a
    reference mechanism.
  - The page is kept locked between get_lock_pfn/put_unlock_pfn to
    ensure the pfn is still valid when it is used (e.g. when the KVM
    page fault handler uses it to establish the mapping in the
    secondary MMU page tables).
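
As an illustration of the get_owner/put_owner contract, the owner
(KVM) is expected to back them with a reference count, conceptually
(a sketch; the exact helpers used are up to the later KVM patches):

  static bool kvm_memfd_get_owner(void *owner)
  {
          struct kvm *kvm = owner;

          /* Fail if the VM is already on its way down. */
          return refcount_inc_not_zero(&kvm->users_count);
  }

  static void kvm_memfd_put_owner(void *owner)
  {
          kvm_put_kvm(owner);
  }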

Userspace is in charge of the guest memory lifecycle: it can allocate
memory with fallocate() or punch holes to free memory from the guest.

The file descriptor is passed down to KVM as the guest memory backend.
KVM registers itself as the owner of the memfd via
memfd_register_falloc_notifier() and provides the
memfd_falloc_notifier callbacks that need to be called on fallocate()
and hole punching.

memfd_register_falloc_notifier() returns the memfd_pfn_ops callbacks
that KVM then uses to request pages from the memfd.

At this time only shmem is supported.
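
The pfn side is then used by the consumer along these lines (a sketch
following the memfd_pfn_ops signatures above):

  int order;
  long pfn = pfn_ops->get_lock_pfn(inode, offset, &order);

  if (pfn < 0)
          return pfn;     /* no page allocated at this offset */

  /*
   * The page is locked here, so the pfn (covering 1 << order base
   * pages) stays valid while it is installed into the secondary
   * MMU page tables.
   */
  pfn_ops->put_unlock_pfn(pfn);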

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfd.h    |  22 ++++++
 include/linux/shmem_fs.h |  16 ++++
 mm/Kconfig               |   4 +
 mm/memfd.c               |  21 ++++++
 mm/shmem.c               | 158 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 221 insertions(+)

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..0007073b53dc 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -13,4 +13,26 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+#ifdef CONFIG_MEMFD_OPS
+struct memfd_falloc_notifier {
+	void (*invalidate_page_range)(struct inode *inode, void *owner,
+				      pgoff_t start, pgoff_t end);
+	void (*fallocate)(struct inode *inode, void *owner,
+			  pgoff_t start, pgoff_t end);
+	bool (*get_owner)(void *owner);
+	void (*put_owner)(void *owner);
+};
+
+struct memfd_pfn_ops {
+	long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
+	void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
+extern int memfd_register_falloc_notifier(struct inode *inode, void *owner,
+				const struct memfd_falloc_notifier *notifier,
+				const struct memfd_pfn_ops **pfn_ops);
+extern void memfd_unregister_falloc_notifier(struct inode *inode);
+#endif
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 166158b6e917..503adc63728c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,11 @@
 
 /* inode in-kernel data */
 
+#ifdef CONFIG_MEMFD_OPS
+struct memfd_falloc_notifier;
+struct memfd_pfn_ops;
+#endif
+
 struct shmem_inode_info {
 	spinlock_t		lock;
 	unsigned int		seals;		/* shmem seals */
@@ -24,6 +29,10 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
+#ifdef CONFIG_MEMFD_OPS
+	void			*owner;
+	const struct memfd_falloc_notifier *falloc_notifier;
+#endif
 	struct inode		vfs_inode;
 };
 
@@ -96,6 +105,13 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 						pgoff_t start, pgoff_t end);
 
+#ifdef CONFIG_MEMFD_OPS
+extern int shmem_register_falloc_notifier(struct inode *inode, void *owner,
+				const struct memfd_falloc_notifier *notifier,
+				const struct memfd_pfn_ops **pfn_ops);
+extern void shmem_unregister_falloc_notifier(struct inode *inode);
+#endif
+
 /* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
diff --git a/mm/Kconfig b/mm/Kconfig
index 28edafc820ad..9989904d1b56 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,10 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config MEMFD_OPS
+	bool
+	depends on MEMFD_CREATE
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/memfd.c b/mm/memfd.c
index c898a007fb76..41861870fc21 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,6 +130,27 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 	return NULL;
 }
 
+#ifdef CONFIG_MEMFD_OPS
+int memfd_register_falloc_notifier(struct inode *inode, void *owner,
+				   const struct memfd_falloc_notifier *notifier,
+				   const struct memfd_pfn_ops **pfn_ops)
+{
+	if (shmem_mapping(inode->i_mapping))
+		return shmem_register_falloc_notifier(inode, owner,
+						      notifier, pfn_ops);
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(memfd_register_falloc_notifier);
+
+void memfd_unregister_falloc_notifier(struct inode *inode)
+{
+	if (shmem_mapping(inode->i_mapping))
+		shmem_unregister_falloc_notifier(inode);
+}
+EXPORT_SYMBOL_GPL(memfd_unregister_falloc_notifier);
+#endif
+
 #define F_ALL_SEALS (F_SEAL_SEAL | \
 		     F_SEAL_SHRINK | \
 		     F_SEAL_GROW | \
diff --git a/mm/shmem.c b/mm/shmem.c
index faa7e9b1b9bc..4d8a75c4d037 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -78,6 +78,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/userfaultfd_k.h>
 #include <linux/rmap.h>
 #include <linux/uuid.h>
+#include <linux/memfd.h>
 
 #include <linux/uaccess.h>
 
@@ -906,6 +907,68 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
 	return split_huge_page(page) >= 0;
 }
 
+static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFD_OPS
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	const struct memfd_falloc_notifier *notifier;
+	void *owner;
+	bool ret;
+
+	if (!info->falloc_notifier)
+		return;
+
+	spin_lock(&info->lock);
+	notifier = info->falloc_notifier;
+	if (!notifier) {
+		spin_unlock(&info->lock);
+		return;
+	}
+
+	owner = info->owner;
+	ret = notifier->get_owner(owner);
+	spin_unlock(&info->lock);
+	if (!ret)
+		return;
+
+	notifier->fallocate(inode, owner, start, end);
+	notifier->put_owner(owner);
+#endif
+}
+
+static void notify_invalidate_page(struct inode *inode, struct page *page,
+				   pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFD_OPS
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	const struct memfd_falloc_notifier *notifier;
+	void *owner;
+	bool ret;
+
+	if (!info->falloc_notifier)
+		return;
+
+	spin_lock(&info->lock);
+	notifier = info->falloc_notifier;
+	if (!notifier) {
+		spin_unlock(&info->lock);
+		return;
+	}
+
+	owner = info->owner;
+	ret = notifier->get_owner(owner);
+	spin_unlock(&info->lock);
+	if (!ret)
+		return;
+
+	start = max(start, page->index);
+	end = min(end, page->index + thp_nr_pages(page));
+
+	notifier->invalidate_page_range(inode, owner, start, end);
+	notifier->put_owner(owner);
+#endif
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -949,6 +1012,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			index += thp_nr_pages(page) - 1;
 
+			notify_invalidate_page(inode, page, start, end);
+
 			if (!unfalloc || !PageUptodate(page))
 				truncate_inode_page(mapping, page);
 			unlock_page(page);
@@ -1025,6 +1090,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					index--;
 					break;
 				}
+
+				notify_invalidate_page(inode, page, start, end);
+
 				VM_BUG_ON_PAGE(PageWriteback(page), page);
 				if (shmem_punch_compound(page, start, end))
 					truncate_inode_page(mapping, page);
@@ -2815,6 +2883,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
 		i_size_write(inode, offset + len);
 	inode->i_ctime = current_time(inode);
+	notify_fallocate(inode, start, end);
 undone:
 	spin_lock(&inode->i_lock);
 	inode->i_private = NULL;
@@ -3784,6 +3853,20 @@ static void shmem_destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping, struct page *newpage,
+		       struct page *page, enum migrate_mode mode)
+{
+#ifdef CONFIG_MEMFD_OPS
+	struct inode *inode = mapping->host;
+
+	if (SHMEM_I(inode)->owner)
+		return -EOPNOTSUPP;
+#endif
+	return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
@@ -3798,6 +3881,81 @@ const struct address_space_operations shmem_aops = {
 };
 EXPORT_SYMBOL(shmem_aops);
 
+#ifdef CONFIG_MEMFD_OPS
+static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
+	if (ret)
+		return ret;
+
+	*order = thp_order(compound_head(page));
+
+	return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	set_page_dirty(page);
+	unlock_page(page);
+	put_page(page);
+}
+
+static const struct memfd_pfn_ops shmem_pfn_ops = {
+	.get_lock_pfn = shmem_get_lock_pfn,
+	.put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_falloc_notifier(struct inode *inode, void *owner,
+				const struct memfd_falloc_notifier *notifier,
+				const struct memfd_pfn_ops **pfn_ops)
+{
+	gfp_t gfp;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (!inode || !owner || !notifier || !pfn_ops ||
+	    !notifier->invalidate_page_range ||
+	    !notifier->fallocate ||
+	    !notifier->get_owner ||
+	    !notifier->put_owner)
+		return -EINVAL;
+
+	spin_lock(&info->lock);
+	if (info->owner && info->owner != owner) {
+		spin_unlock(&info->lock);
+		return -EPERM;
+	}
+
+	info->owner = owner;
+	info->falloc_notifier = notifier;
+	spin_unlock(&info->lock);
+
+	gfp = mapping_gfp_mask(inode->i_mapping);
+	gfp &= ~__GFP_MOVABLE;
+	mapping_set_gfp_mask(inode->i_mapping, gfp);
+	mapping_set_unevictable(inode->i_mapping);
+
+	*pfn_ops = &shmem_pfn_ops;
+	return 0;
+}
+
+void shmem_unregister_falloc_notifier(struct inode *inode)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	spin_lock(&info->lock);
+	info->owner = NULL;
+	info->falloc_notifier = NULL;
+	spin_unlock(&info->lock);
+}
+#endif
+
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 	.get_unmapped_area = shmem_get_unmapped_area,
-- 
2.17.1



* [PATCH v3 04/15] KVM: Extend the memslot to support fd-based private memory
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

Extend the memslot definition to provide fd-based private memory
support by adding two new fields (fd/ofs). The memslot can then
maintain memory for both shared and private pages in a single memslot.
Shared pages are provided in the existing way, using the
userspace_addr (hva) field and get_user_pages(), while private pages
are provided through the new fields (fd/ofs). Since there is no 'hva'
concept anymore for private memory, we cannot call get_user_pages() to
get a pfn; instead we rely on the newly introduced MEMFD_OPS callbacks
to do the same job.

This new extension is indicated by a new flag KVM_MEM_PRIVATE.
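
In the fault path this enables a split along the lines of the sketch
below, where kvm_memfd_get_pfn() is a hypothetical stand-in for the
memfd-backed lookup added later in the series:

  static kvm_pfn_t fault_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
                             int *order)
  {
          if (kvm_slot_is_private(slot))
                  /* private: resolve via the paired memfd backend */
                  return kvm_memfd_get_pfn(slot, gfn, order);

          /* shared: keep the existing userspace_addr/GUP path */
          return __gfn_to_pfn_memslot(slot, gfn, false, NULL,
                                      true, NULL, NULL);
  }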

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  9 +++++++++
 include/uapi/linux/kvm.h | 12 ++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 865a677baf52..96e46b288ecd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -433,8 +433,17 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+	struct file *file;
+	u64 file_ofs;
 };
 
+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+	if (slot && (slot->flags & KVM_MEM_PRIVATE))
+		return true;
+	return false;
+}
+
 static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e3706e574bd2..a2c1fb8c9843 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,17 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr; /* hva */
+	__u64 ofs; /* offset into fd */
+	__u32 fd;
+	__u32 padding[5];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +121,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
-- 
2.17.1



* [PATCH v3 05/15] KVM: Implement fd-based memory using MEMFD_OPS interfaces
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

This patch adds a new memfd facility in KVM, using MEMFD_OPS to provide
guest memory from a file descriptor created in userspace with
memfd_create() instead of the traditional userspace hva. It mainly
provides two kinds of functions:
  - Pair/unpair an fd-based memslot with the memory backend that owns
    the file descriptor when such a memslot gets created/deleted.
  - Get/put a pfn, to be used in the KVM page fault handler, from/to
    the paired memory backend.

At pairing time, KVM and the memfd subsystem exchange callbacks through
which each side can call into the other. These callbacks are the main
places where fd-based guest memory provisioning is implemented (see the
usage sketch after this list).
KVM->memfd:
  - get_pfn: get and lock a page at the specified offset in the fd.
  - put_pfn: put and unlock the pfn.
    Note: the page needs to stay locked between get_pfn/put_pfn so the
    pfn remains valid while KVM uses it to establish the mapping in the
    secondary MMU page table.
memfd->KVM:
  - invalidate_page_range: called when userspace punches a hole in the
    fd; KVM should unmap the related pages in the secondary MMU.
  - fallocate: called when userspace fallocates space in the fd; KVM
    can map the related pages in the secondary MMU.
  - get/put_owner: used to ensure the guest is still alive, via a
    reference mechanism, when calling the above invalidate/fallocate
    callbacks.
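
As an illustration only (not part of this patch), the KVM->memfd pair
could be consumed in a page-fault path roughly as below; the error
handling and the surrounding fault-handler context are simplified
assumptions:

  long pfn;
  int order;

  /* Look up and lock the backing page for this private gfn. */
  pfn = kvm_memfd_get_pfn(slot, gfn, &order);
  if (pfn < 0)            /* a negative return is assumed to mean failure */
          return -EFAULT;

  /* ... install pfn into the secondary MMU page table ... */

  /* Drop the reference and unlock the page. */
  kvm_memfd_put_pfn(pfn);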

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/Kconfig     |  1 +
 arch/x86/kvm/Makefile    |  3 +-
 include/linux/kvm_host.h |  8 ++++
 virt/kvm/memfd.c         | 95 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 virt/kvm/memfd.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 619186138176..b90ba95db5f3 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_VFIO
 	select SRCU
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select MEMFD_OPS
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 51b2d5fdaeed..87e49d1d9980 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -11,7 +11,8 @@ KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
-				$(KVM)/dirty_ring.o $(KVM)/binary_stats.o
+				$(KVM)/dirty_ring.o $(KVM)/binary_stats.o \
+				$(KVM)/memfd.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 96e46b288ecd..b0b63c9a160f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -773,6 +773,14 @@ static inline void kvm_irqfd_exit(void)
 {
 }
 #endif
+
+int kvm_memfd_register(struct kvm *kvm,
+		       const struct kvm_userspace_memory_region_ext *mem,
+		       struct kvm_memory_slot *slot);
+void kvm_memfd_unregister(struct kvm *kvm, struct kvm_memory_slot *slot);
+long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order);
+void kvm_memfd_put_pfn(kvm_pfn_t pfn);
+
 int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		  struct module *module);
 void kvm_exit(void);
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
new file mode 100644
index 000000000000..96a1a5bee0f7
--- /dev/null
+++ b/virt/kvm/memfd.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * memfd.c: routines for fd based guest memory
+ * Copyright (c) 2021, Intel Corporation.
+ *
+ * Author:
+ *	Chao Peng <chao.p.peng@linux.intel.com>
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/memfd.h>
+
+#ifdef CONFIG_MEMFD_OPS
+static const struct memfd_pfn_ops *memfd_ops;
+
+static void memfd_invalidate_page_range(struct inode *inode, void *owner,
+					pgoff_t start, pgoff_t end)
+{
+}
+
+static void memfd_fallocate(struct inode *inode, void *owner,
+			    pgoff_t start, pgoff_t end)
+{
+}
+
+static bool memfd_get_owner(void *owner)
+{
+	return kvm_get_kvm_safe(owner);
+}
+
+static void memfd_put_owner(void *owner)
+{
+	kvm_put_kvm(owner);
+}
+
+static const struct memfd_falloc_notifier memfd_notifier = {
+	.invalidate_page_range = memfd_invalidate_page_range,
+	.fallocate = memfd_fallocate,
+	.get_owner = memfd_get_owner,
+	.put_owner = memfd_put_owner,
+};
+#endif
+
+long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order)
+{
+#ifdef CONFIG_MEMFD_OPS
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->file_ofs >> PAGE_SHIFT);
+
+	return memfd_ops->get_lock_pfn(slot->file->f_inode, index, order);
+#else
+	return -1;
+#endif
+}
+
+void kvm_memfd_put_pfn(kvm_pfn_t pfn)
+{
+#ifdef CONFIG_MEMFD_OPS
+	memfd_ops->put_unlock_pfn(pfn);
+#endif
+}
+
+int kvm_memfd_register(struct kvm *kvm,
+		       const struct kvm_userspace_memory_region_ext *mem,
+		       struct kvm_memory_slot *slot)
+{
+#ifdef CONFIG_MEMFD_OPS
+	int ret;
+	struct fd fd = fdget(mem->fd);
+
+	if (!fd.file)
+		return -EINVAL;
+
+	ret = memfd_register_falloc_notifier(fd.file->f_inode, kvm,
+				   &memfd_notifier, &memfd_ops);
+	if (ret)
+		return ret;
+
+	slot->file = fd.file;
+	slot->file_ofs = mem->ofs;
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
+void kvm_memfd_unregister(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+#ifdef CONFIG_MEMFD_OPS
+	if (slot->file) {
+		fput(slot->file);
+		slot->file = NULL;
+	}
+#endif
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 06/15] KVM: Refactor hva based memory invalidation code
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

The purpose of this patch is to let fd-based memslots reuse the same
mmu_notifier-based guest memory invalidation code for private pages.

No functional change except renaming 'hva' to the more neutral
'useraddr' so that it can also cover the 'offset' in an fd where
private pages live.
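
As a worked example (illustrative values only): with slot->base_gfn =
0x100, slot->userspace_addr = 0x7f0000000000 and 4KiB pages, an hva of
0x7f0000003000 yields gfn_offset = 0x3000 >> 12 = 3, i.e. gfn 0x103;
with addr_is_hva == false the same arithmetic is applied against
slot->file_ofs instead, so an fd offset plays the role of the hva.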

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  8 +++++--
 virt/kvm/kvm_main.c      | 47 +++++++++++++++++++++-------------------
 2 files changed, 31 insertions(+), 24 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b0b63c9a160f..7279f46f35d3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1327,9 +1327,13 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
 }
 
 static inline gfn_t
-hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
+useraddr_to_gfn_memslot(unsigned long useraddr, struct kvm_memory_slot *slot,
+			bool addr_is_hva)
 {
-	gfn_t gfn_offset = (hva - slot->userspace_addr) >> PAGE_SHIFT;
+	unsigned long useraddr_base = addr_is_hva ? slot->userspace_addr
+						  : slot->file_ofs;
+
+	gfn_t gfn_offset = (useraddr - useraddr_base) >> PAGE_SHIFT;
 
 	return slot->base_gfn + gfn_offset;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 68018ee7f0cd..856f89ed8ab5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -471,16 +471,16 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 			     unsigned long end);
 
-struct kvm_hva_range {
+struct kvm_useraddr_range {
 	unsigned long start;
 	unsigned long end;
 	pte_t pte;
-	hva_handler_t handler;
+	gfn_handler_t handler;
 	on_lock_fn_t on_lock;
 	bool flush_on_ret;
 	bool may_block;
@@ -499,8 +499,8 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
-						  const struct kvm_hva_range *range)
+static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
+					const struct kvm_useraddr_range *range)
 {
 	bool ret = false, locked = false;
 	struct kvm_gfn_range gfn_range;
@@ -518,12 +518,12 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(slot, slots) {
-			unsigned long hva_start, hva_end;
+			unsigned long useraddr_start, useraddr_end;
 
-			hva_start = max(range->start, slot->userspace_addr);
-			hva_end = min(range->end, slot->userspace_addr +
+			useraddr_start = max(range->start, slot->userspace_addr);
+			useraddr_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
-			if (hva_start >= hva_end)
+			if (useraddr_start >= useraddr_end)
 				continue;
 
 			/*
@@ -536,11 +536,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.may_block = range->may_block;
 
 			/*
-			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
+			 * {gfn(page) | page intersects with [useraddr_start, useraddr_end)} =
 			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
 			 */
-			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
-			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
+			gfn_range.start = useraddr_to_gfn_memslot(useraddr_start,
+								  slot, true);
+			gfn_range.end = useraddr_to_gfn_memslot(
+						useraddr_end + PAGE_SIZE - 1,
+						slot, true);
 			gfn_range.slot = slot;
 
 			if (!locked) {
@@ -571,10 +574,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
 						pte_t pte,
-						hva_handler_t handler)
+						gfn_handler_t handler)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range range = {
+	const struct kvm_useraddr_range range = {
 		.start		= start,
 		.end		= end,
 		.pte		= pte,
@@ -584,16 +587,16 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.may_block	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range);
+	return __kvm_handle_useraddr_range(kvm, &range);
 }
 
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
 							 unsigned long start,
 							 unsigned long end,
-							 hva_handler_t handler)
+							 gfn_handler_t handler)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range range = {
+	const struct kvm_useraddr_range range = {
 		.start		= start,
 		.end		= end,
 		.pte		= __pte(0),
@@ -603,7 +606,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.may_block	= false,
 	};
 
-	return __kvm_handle_hva_range(kvm, &range);
+	return __kvm_handle_useraddr_range(kvm, &range);
 }
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
@@ -661,7 +664,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
+	const struct kvm_useraddr_range useraddr_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
@@ -685,7 +688,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	kvm->mn_active_invalidate_count++;
 	spin_unlock(&kvm->mn_invalidate_lock);
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	__kvm_handle_useraddr_range(kvm, &useraddr_range);
 
 	return 0;
 }
@@ -712,7 +715,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
+	const struct kvm_useraddr_range useraddr_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
@@ -723,7 +726,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	};
 	bool wake;
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	__kvm_handle_useraddr_range(kvm, &useraddr_range);
 
 	/* Pairs with the increment in range_start(). */
 	spin_lock(&kvm->mn_invalidate_lock);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 07/15] KVM: Special handling for fd-based memory invalidation
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

For fd-based guest memory, the memory backend (e.g. the fd provider)
should notify KVM to unmap/invalidate the private memory from KVM's
secondary MMU when userspace punches a hole in the fd (e.g. when
userspace converts private memory to shared memory).

To support fd-based memory invalidation, the existing hva-based memory
invalidation needs to be extended. A new 'inode' for the fd is passed
in from memfd_falloc_notifier, and 'start/end' then represent the
start/end offset in the fd instead of an hva range. During the
invalidation KVM needs to check this inode against the one in the
memslot: only when the 'inode' in the memslot equals the passed-in
'inode' should KVM invalidate the mapping.
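
A minimal sketch of how an fd-based invalidation can be expressed with
the extended range struct (the field values follow a later patch in
this series; treating this as a standalone caller is an assumption):

  const struct kvm_useraddr_range range = {
          .start         = start,    /* offset in the fd, not an hva */
          .end           = end,
          .inode         = inode,    /* only slots backed by this inode match */
          .pte           = __pte(0),
          .handler       = kvm_unmap_gfn_range,
          .on_lock       = (void *)kvm_null_fn,
          .flush_on_ret  = true,
          .may_block     = false,
  };

  __kvm_handle_useraddr_range(kvm, &range);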

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 virt/kvm/kvm_main.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 856f89ed8ab5..0f2d1002f6a7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -479,6 +479,7 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 struct kvm_useraddr_range {
 	unsigned long start;
 	unsigned long end;
+	struct inode *inode;
 	pte_t pte;
 	gfn_handler_t handler;
 	on_lock_fn_t on_lock;
@@ -519,9 +520,19 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(slot, slots) {
 			unsigned long useraddr_start, useraddr_end;
+			unsigned long useraddr_base;
 
-			useraddr_start = max(range->start, slot->userspace_addr);
-			useraddr_end = min(range->end, slot->userspace_addr +
+			if (range->inode) {
+				if (!slot->file ||
+				    slot->file->f_inode != range->inode)
+					continue;
+				useraddr_base = slot->file_ofs;
+			} else
+				useraddr_base = slot->userspace_addr;
+
+
+			useraddr_start = max(range->start, useraddr_base);
+			useraddr_end = min(range->end, useraddr_base +
 						  (slot->npages << PAGE_SHIFT));
 			if (useraddr_start >= useraddr_end)
 				continue;
@@ -540,10 +551,10 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
 			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
 			 */
 			gfn_range.start = useraddr_to_gfn_memslot(useraddr_start,
-								  slot, true);
+							slot, !range->inode);
 			gfn_range.end = useraddr_to_gfn_memslot(
 						useraddr_end + PAGE_SIZE - 1,
-						slot, true);
+						slot, !range->inode);
 			gfn_range.slot = slot;
 
 			if (!locked) {
@@ -585,6 +596,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= false,
+		.inode		= NULL,
 	};
 
 	return __kvm_handle_useraddr_range(kvm, &range);
@@ -604,6 +616,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= false,
+		.inode		= NULL,
 	};
 
 	return __kvm_handle_useraddr_range(kvm, &range);
@@ -672,6 +685,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.on_lock	= kvm_inc_notifier_count,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
+		.inode		= NULL,
 	};
 
 	trace_kvm_unmap_hva_range(range->start, range->end);
@@ -723,6 +737,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		.on_lock	= kvm_dec_notifier_count,
 		.flush_on_ret	= false,
 		.may_block	= mmu_notifier_range_blockable(range),
+		.inode		= NULL,
 	};
 	bool wake;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 08/15] KVM: Split out common memory invalidation code
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

When fd-based memory is enabled, there will be two types of memory
invalidation:
  - memory invalidation from the native MMU through the mmu_notifier
    callback, for hva-based memory, and,
  - memory invalidation from memfd through the memfd_notifier callback,
    for fd-based memory.

Some code can be shared between these two types of memory invalidation.
This patch moves that shared code into one place so that it can be used
under CONFIG_MMU_NOTIFIER as well as CONFIG_MEMFD_OPS.
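
The resulting guard layout (completed by the next patch, which widens
the first guard to also cover CONFIG_MEMFD_OPS) looks roughly like:

  #if defined(CONFIG_MEMFD_OPS) || \
          (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER))
  /* shared code: __kvm_handle_useraddr_range() and its range structs */
  #endif

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
  /* mmu_notifier-only wrappers and callbacks */
  #endif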

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 virt/kvm/kvm_main.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f2d1002f6a7..59f01e68337b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -454,22 +454,6 @@ void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 EXPORT_SYMBOL_GPL(kvm_vcpu_destroy);
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
-{
-	return container_of(mn, struct kvm, mmu_notifier);
-}
-
-static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
-					      struct mm_struct *mm,
-					      unsigned long start, unsigned long end)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int idx;
-
-	idx = srcu_read_lock(&kvm->srcu);
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
-	srcu_read_unlock(&kvm->srcu, idx);
-}
 
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
@@ -580,6 +564,25 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
 	/* The notifiers are averse to booleans. :-( */
 	return (int)ret;
 }
+#endif
+
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
+{
+	return container_of(mn, struct kvm, mmu_notifier);
+}
+
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+					      struct mm_struct *mm,
+					      unsigned long start, unsigned long end)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	int idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
 
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 09/15] KVM: Implement fd-based memory invalidation
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

KVM gets notified when userspace punches a hole in an fd which is used
for guest memory. KVM should then invalidate the mapping in the
secondary MMU page tables. This is the same logic as mmu_notifier
invalidation, except that fd-related information is carried around to
indicate the memory range. KVM can hence reuse most of the existing
mmu_notifier invalidation code, including looping through the memslots
and then calling into kvm_unmap_gfn_range(), which should do whatever
is needed for fd-based memory unmapping (e.g. for private memory
managed by TDX it may need to call into the SEAM module).
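
From userspace the trigger is simply a hole punch on the backing fd,
e.g. (illustrative; 'memfd' is assumed to be the fd backing the slot):

  /* Convert a private page back to shared: unmap it from the guest. */
  fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            offset, PAGE_SIZE);

This reaches KVM via memfd_invalidate_page_range() and from there
kvm_memfd_invalidate_range() below.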

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  8 ++++-
 virt/kvm/kvm_main.c      | 69 +++++++++++++++++++++++++++++++---------
 virt/kvm/memfd.c         |  2 ++
 3 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7279f46f35d3..d9573305e273 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -229,7 +229,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFD_OPS)
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -1874,4 +1874,10 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_MEMFD_OPS
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+			       unsigned long start, unsigned long end);
+#endif /* CONFIG_MEMFD_OPS */
+
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 59f01e68337b..d84cb867b686 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -453,7 +453,8 @@ void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_destroy);
 
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+#if defined(CONFIG_MEMFD_OPS) ||\
+	(defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER))
 
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
@@ -564,6 +565,30 @@ static __always_inline int __kvm_handle_useraddr_range(struct kvm *kvm,
 	/* The notifiers are averse to booleans. :-( */
 	return (int)ret;
 }
+
+static void mn_active_invalidate_count_inc(struct kvm *kvm)
+{
+	spin_lock(&kvm->mn_invalidate_lock);
+	kvm->mn_active_invalidate_count++;
+	spin_unlock(&kvm->mn_invalidate_lock);
+
+}
+
+static void mn_active_invalidate_count_dec(struct kvm *kvm)
+{
+	bool wake;
+
+	spin_lock(&kvm->mn_invalidate_lock);
+	wake = (--kvm->mn_active_invalidate_count == 0);
+	spin_unlock(&kvm->mn_invalidate_lock);
+
+	/*
+	 * There can only be one waiter, since the wait happens under
+	 * slots_lock.
+	 */
+	if (wake)
+		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+}
 #endif
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
@@ -701,9 +726,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 *
 	 * Pairs with the decrement in range_end().
 	 */
-	spin_lock(&kvm->mn_invalidate_lock);
-	kvm->mn_active_invalidate_count++;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	mn_active_invalidate_count_inc(kvm);
 
 	__kvm_handle_useraddr_range(kvm, &useraddr_range);
 
@@ -742,21 +765,11 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		.may_block	= mmu_notifier_range_blockable(range),
 		.inode		= NULL,
 	};
-	bool wake;
 
 	__kvm_handle_useraddr_range(kvm, &useraddr_range);
 
 	/* Pairs with the increment in range_start(). */
-	spin_lock(&kvm->mn_invalidate_lock);
-	wake = (--kvm->mn_active_invalidate_count == 0);
-	spin_unlock(&kvm->mn_invalidate_lock);
-
-	/*
-	 * There can only be one waiter, since the wait happens under
-	 * slots_lock.
-	 */
-	if (wake)
-		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+	mn_active_invalidate_count_dec(kvm);
 
 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
@@ -841,6 +854,32 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_MEMFD_OPS
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+			       unsigned long start, unsigned long end)
+{
+	int ret;
+	const struct kvm_useraddr_range useraddr_range = {
+		.start		= start,
+		.end		= end,
+		.pte		= __pte(0),
+		.handler	= kvm_unmap_gfn_range,
+		.on_lock	= (void *)kvm_null_fn,
+		.flush_on_ret	= true,
+		.may_block	= false,
+		.inode		= inode,
+	};
+
+
+	/* Prevent memslot modification */
+	mn_active_invalidate_count_inc(kvm);
+	ret = __kvm_handle_useraddr_range(kvm, &useraddr_range);
+	mn_active_invalidate_count_dec(kvm);
+
+	return ret;
+}
+#endif /* CONFIG_MEMFD_OPS */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
index 96a1a5bee0f7..d092a9b6f496 100644
--- a/virt/kvm/memfd.c
+++ b/virt/kvm/memfd.c
@@ -16,6 +16,8 @@ static const struct memfd_pfn_ops *memfd_ops;
 static void memfd_invalidate_page_range(struct inode *inode, void *owner,
 					pgoff_t start, pgoff_t end)
 {
+	kvm_memfd_invalidate_range(owner, inode, start >> PAGE_SHIFT,
+						 end >> PAGE_SHIFT);
 }
 
 static void memfd_fallocate(struct inode *inode, void *owner,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 10/15] KVM: Add kvm_map_gfn_range
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

This new function establishes the mapping in the KVM page tables for a
given gfn range. It can be used in the memory fallocate callback for
memfd-based memory, to establish the mapping in the KVM secondary MMU
when pages are allocated in the memory backend.
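
A sketch of the intended caller (the actual wiring lands later in the
series; the inode-to-gfn translation shown here is an assumption based
on the memslot fields introduced earlier):

  static void memfd_fallocate(struct inode *inode, void *owner,
                              pgoff_t start, pgoff_t end)
  {
          /* Find the memslot backed by this inode, translate the fd
           * offsets [start, end) into a gfn range, then fault the
           * pages into the secondary MMU. 'slot', 'gfn_start' and
           * 'gfn_end' are placeholders for that translation.
           */
          struct kvm_gfn_range range = {
                  .slot  = slot,
                  .start = gfn_start,
                  .end   = gfn_end,
          };

          kvm_map_gfn_range(owner, &range);
  }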

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c   | 47 ++++++++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      |  5 +++++
 3 files changed, 54 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2da19356679d..a7006e1ac2d2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1781,6 +1781,53 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
 	return ret;
 }
 
+bool kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	struct kvm_vcpu *vcpu;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+	int idx;
+	bool ret = true;
+
+	/* Need vcpu context for kvm_mmu_do_page_fault. */
+	vcpu = kvm_get_vcpu(kvm, 0);
+	if (mutex_lock_killable(&vcpu->mutex))
+		return false;
+
+	vcpu_load(vcpu);
+	idx = srcu_read_lock(&kvm->srcu);
+
+	kvm_mmu_reload(vcpu);
+
+	gfn = range->start;
+	while (gfn < range->end) {
+		if (signal_pending(current)) {
+			ret = false;
+			break;
+		}
+
+		if (need_resched())
+			cond_resched();
+
+		pfn = kvm_mmu_do_page_fault(vcpu, gfn << PAGE_SHIFT,
+					PFERR_WRITE_MASK | PFERR_USER_MASK,
+					false);
+		if (is_error_noslot_pfn(pfn) || kvm->vm_bugged) {
+			ret = false;
+			break;
+		}
+
+		gfn++;
+	}
+
+	srcu_read_unlock(&kvm->srcu, idx);
+	vcpu_put(vcpu);
+
+	mutex_unlock(&vcpu->mutex);
+
+	return ret;
+}
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d9573305e273..9c02fb53b8ab 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -237,6 +237,8 @@ struct kvm_gfn_range {
 	pte_t pte;
 	bool may_block;
 };
+
+bool kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d84cb867b686..b9855b2fdaae 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -456,6 +456,11 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_destroy);
 #if defined(CONFIG_MEMFD_OPS) ||\
 	(defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER))
 
+bool __weak kvm_map_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	return false;
+}
+
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread


* [PATCH v3 11/15] KVM: Implement fd-based memory fallocation
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

KVM gets notified through memfd_notifier when userspace allocates space
via fallocate() on the fd that backs guest memory. KVM can set up the
mapping in the secondary MMU page tables at this time. This patch adds a
function in KVM to map pfns to gfns when the pages are allocated in the
memory backend.

While it is possible to postpone the secondary MMU mapping to the KVM
page fault handler, we can reduce some VM exits by also mapping the
secondary page tables when a page is mapped in the primary MMU.

It reuses the same code as kvm_memfd_invalidate_range, except with
kvm_map_gfn_range as the handler.
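
The refactoring makes the locking/flush boilerplate in
kvm_memfd_handle_range reusable for any gfn_handler_t. As a hedged
sketch (the handler and wrapper below are illustrative only, not part
of the series), another consumer could be plugged in the same way:

	/* Illustrative only: log each gfn range instead of mapping it. */
	static bool kvm_log_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		pr_debug("memfd range: gfn 0x%llx - 0x%llx\n",
			 range->start, range->end);
		return false;	/* no TLB flush requested */
	}

	int kvm_memfd_log_range(struct kvm *kvm, struct inode *inode,
				unsigned long start, unsigned long end)
	{
		return kvm_memfd_handle_range(kvm, inode, start, end,
					      kvm_log_gfn_range);
	}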

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 22 +++++++++++++++++++---
 virt/kvm/memfd.c         |  2 ++
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9c02fb53b8ab..1f69d76983a2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1879,6 +1879,8 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 #ifdef CONFIG_MEMFD_OPS
 int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
 			       unsigned long start, unsigned long end);
+int kvm_memfd_fallocate_range(struct kvm *kvm, struct inode *inode,
+			      unsigned long start, unsigned long end);
 #endif /* CONFIG_MEMFD_OPS */
 
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b9855b2fdaae..1b7cf05759c7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -860,15 +860,17 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
 #ifdef CONFIG_MEMFD_OPS
-int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
-			       unsigned long start, unsigned long end)
+int kvm_memfd_handle_range(struct kvm *kvm, struct inode *inode,
+			   unsigned long start, unsigned long end,
+			   gfn_handler_t handler)
 {
 	int ret;
 	const struct kvm_useraddr_range useraddr_range = {
 		.start		= start,
 		.end		= end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= false,
@@ -883,6 +885,20 @@ int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
 
 	return ret;
 }
+
+int kvm_memfd_invalidate_range(struct kvm *kvm, struct inode *inode,
+			       unsigned long start, unsigned long end)
+{
+	return kvm_memfd_handle_range(kvm, inode, start, end,
+				      kvm_unmap_gfn_range);
+}
+
+int kvm_memfd_fallocate_range(struct kvm *kvm, struct inode *inode,
+			      unsigned long start, unsigned long end)
+{
+	return kvm_memfd_handle_range(kvm, inode, start, end,
+				      kvm_map_gfn_range);
+}
 #endif /* CONFIG_MEMFD_OPS */
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
diff --git a/virt/kvm/memfd.c b/virt/kvm/memfd.c
index d092a9b6f496..e7a2ab790cc6 100644
--- a/virt/kvm/memfd.c
+++ b/virt/kvm/memfd.c
@@ -23,6 +23,8 @@ static void memfd_invalidate_page_range(struct inode *inode, void *owner,
 static void memfd_fallocate(struct inode *inode, void *owner,
 			    pgoff_t start, pgoff_t end)
 {
+	kvm_memfd_fallocate_range(owner, inode, start >> PAGE_SHIFT,
+						end >> PAGE_SHIFT);
 }
 
 static bool memfd_get_owner(void *owner)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread


* [PATCH v3 12/15] KVM: Add KVM_EXIT_MEMORY_ERROR exit
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

This new exit allows userspace to handle memory-related errors.
Currently it supports two error types (KVM_EXIT_MEM_MAP_SHARED/PRIVATE),
which are used for shared memory <-> private memory conversion in
memory encryption usage.

After private memory is enabled, there are two places in KVM that can
exit to userspace to trigger a private <-> shared conversion:
  - explicit conversion: happens when the guest explicitly calls into
    KVM to map a range (as private or shared); KVM then exits to
    userspace to do the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler.
    * If the fault is due to a private memory access, KVM causes a
      userspace exit with a shared->private conversion request when the
      page has not been allocated in the private memory backend.
    * If the fault is due to a shared memory access, KVM causes a
      userspace exit with a private->shared conversion request when the
      page has already been allocated in the private memory backend.
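
For illustration, a VMM run loop could service this exit roughly as
below; priv_fd (the private memfd) and gpa_to_memfd_offset() are
assumptions of the sketch, not part of this patch:

	case KVM_EXIT_MEMORY_ERROR: {
		struct kvm_memory_exit *mem = &run->mem;
		off_t off = gpa_to_memfd_offset(mem->u.map.gpa);	/* VMM-specific */
		int ret;

		if (mem->type == KVM_EXIT_MEM_MAP_PRIVATE)
			/* shared -> private: back the range in the private fd */
			ret = fallocate(priv_fd, 0, off, mem->u.map.size);
		else	/* KVM_EXIT_MEM_MAP_SHARED: private -> shared */
			ret = fallocate(priv_fd,
					FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
					off, mem->u.map.size);
		if (ret)
			err(1, "conversion fallocate");
		break;
	}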

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/kvm.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a2c1fb8c9843..d0c2af431cd9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -297,6 +297,18 @@ struct kvm_tdx_exit {
 	} u;
 };
 
+struct kvm_memory_exit {
+#define KVM_EXIT_MEM_MAP_SHARED         1
+#define KVM_EXIT_MEM_MAP_PRIVATE        2
+	__u32 type;
+	union {
+		struct {
+			__u64 gpa;
+			__u64 size;
+		} map;
+	} u;
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -336,6 +348,7 @@ struct kvm_tdx_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_MEMORY_ERROR     36
 #define KVM_EXIT_TDX              50	/* dump number to avoid conflict. */
 
 /* For KVM_EXIT_INTERNAL_ERROR */
@@ -554,6 +567,8 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_MEMORY_ERROR */
+		struct kvm_memory_exit mem;
 		/* KVM_EXIT_TDX_VMCALL */
 		struct kvm_tdx_exit tdx;
 		/* Fix the size of the union. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread


* [PATCH v3 13/15] KVM: Handle page fault for private memory
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

When a page fault from the secondary page table occurs in a memslot
with KVM_MEM_PRIVATE while the guest is running, we need to take
different paths for private access and shared access.

  - For private access, KVM checks if the page is already allocated in
    the memory backend; if yes, KVM establishes the mapping, otherwise
    it exits to userspace to convert a shared page to a private one.

  - For shared access, KVM also checks if the page is already allocated
    in the memory backend; if yes, it exits to userspace to convert a
    private page to a shared one, otherwise the page is treated as
    traditional hva-based shared memory and KVM lets the existing code
    obtain a pfn with get_user_pages() and establish the mapping.

The above code assumes private memory is persistent and pre-allocated
in the memory backend, so KVM can use this information as an indicator
of whether a page is private or shared. The check is performed by
calling kvm_memfd_get_pfn(), which is currently implemented as a
pagecache search but could in theory be implemented differently (e.g.
with a dedicated lookup when the page is not mapped into the host
pagecache at all).
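
For reference, the fault-time decisions above boil down to:

	access type | allocated in private backend | action
	------------+------------------------------+-----------------------------------
	private     | yes                          | map the private pfn
	private     | no                           | exit with KVM_EXIT_MEM_MAP_PRIVATE
	shared      | yes                          | exit with KVM_EXIT_MEM_MAP_SHARED
	shared      | no                           | fall back to the hva-based path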

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 64 +++++++++++++++++++++++++++++++---
 arch/x86/kvm/mmu/paging_tmpl.h | 11 ++++--
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a7006e1ac2d2..7fc29f85313d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3156,6 +3156,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (kvm_slot_is_private(slot))
+		return max_level;
+
 	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
 	return min(host_level, max_level);
 }
@@ -4359,7 +4362,50 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
-static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
+static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				    struct kvm_page_fault *fault,
+				    bool *is_private_pfn, int *r)
+{
+	int order;
+	int mem_convert_type;
+	struct kvm_memory_slot *slot = fault->slot;
+	long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
+
+	if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
+		if (pfn < 0)
+			mem_convert_type = KVM_EXIT_MEM_MAP_PRIVATE;
+		else {
+			fault->pfn = pfn;
+			if (slot->flags & KVM_MEM_READONLY)
+				fault->map_writable = false;
+			else
+				fault->map_writable = true;
+
+			if (order == 0)
+				fault->max_level = PG_LEVEL_4K;
+			*is_private_pfn = true;
+			*r = RET_PF_FIXED;
+			return true;
+		}
+	} else {
+		if (pfn < 0)
+			return false;
+
+		kvm_memfd_put_pfn(pfn);
+		mem_convert_type = KVM_EXIT_MEM_MAP_SHARED;
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
+	vcpu->run->mem.type = mem_convert_type;
+	vcpu->run->mem.u.map.gpa = fault->gfn << PAGE_SHIFT;
+	vcpu->run->mem.u.map.size = PAGE_SIZE;
+	fault->pfn = -1;
+	*r = -1;
+	return true;
+}
+
+static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+			    bool *is_private_pfn, int *r)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
@@ -4400,6 +4446,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		}
 	}
 
+	if (kvm_slot_is_private(slot) &&
+	    kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r))
+		return *r == RET_PF_FIXED ? false : true;
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
 					  fault->write, &fault->map_writable,
@@ -4446,6 +4496,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
 
 	unsigned long mmu_seq;
+	bool is_private_pfn = false;
 	int r;
 
 	fault->gfn = kvm_gfn_unalias(vcpu->kvm, fault->addr);
@@ -4465,7 +4516,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (kvm_faultin_pfn(vcpu, fault, &r))
+	if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
 		return r;
 
 	if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
@@ -4504,7 +4555,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	else
 		write_lock(&vcpu->kvm->mmu_lock);
 
-	if (is_page_fault_stale(vcpu, fault, mmu_seq))
+	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
 		goto out_unlock;
 
 	r = make_mmu_pages_available(vcpu);
@@ -4522,7 +4573,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+
+	if (is_private_pfn)
+		kvm_memfd_put_pfn(fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
+
 	return r;
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 6d343a399913..ebd5c923f844 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -842,6 +842,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	int r;
 	unsigned long mmu_seq;
 	bool is_self_change_mapping;
+	bool is_private_pfn = false;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
 	WARN_ON_ONCE(fault->is_tdp);
@@ -890,7 +892,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (kvm_faultin_pfn(vcpu, fault, &r))
+	if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
 		return r;
 
 	if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
@@ -918,7 +920,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
 
-	if (is_page_fault_stale(vcpu, fault, mmu_seq))
+	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
 		goto out_unlock;
 
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
@@ -930,7 +932,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (is_private_pfn)
+		kvm_memfd_put_pfn(fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread


* [PATCH v3 14/15] KVM: Use kvm_userspace_memory_region_ext
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

Use the new extended memslot structure kvm_userspace_memory_region_ext,
which includes two additional fields (fd/ofs) compared to the current
kvm_userspace_memory_region. The fd/ofs fields are copied from
userspace only when KVM_MEM_PRIVATE is set.

Internally, KVM changes all existing uses of kvm_userspace_memory_region
to kvm_userspace_memory_region_ext, since the new extended structure
covers all the existing fields in kvm_userspace_memory_region.
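
For userspace, registration with the extended structure could look
roughly like this; the flat field layout and the ofs-before-fd ordering
are assumptions inferred from the ioctl copy logic in this patch:

	struct kvm_userspace_memory_region_ext region = {
		.slot		 = 0,
		.flags		 = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0,
		.memory_size	 = 2ULL << 30,		/* 2G */
		.userspace_addr	 = (__u64)shared_va,	/* shared part */
		.ofs		 = 0,			/* offset into the private memfd */
		.fd		 = priv_fd,		/* MFD_INACCESSIBLE memfd */
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
		err(1, "KVM_SET_USER_MEMORY_REGION");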

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/arm64/kvm/mmu.c               | 14 +++++++-------
 arch/mips/kvm/mips.c               | 14 +++++++-------
 arch/powerpc/include/asm/kvm_ppc.h | 28 ++++++++++++++--------------
 arch/powerpc/kvm/book3s.c          | 14 +++++++-------
 arch/powerpc/kvm/book3s_hv.c       | 14 +++++++-------
 arch/powerpc/kvm/book3s_pr.c       | 14 +++++++-------
 arch/powerpc/kvm/booke.c           | 14 +++++++-------
 arch/powerpc/kvm/powerpc.c         | 14 +++++++-------
 arch/riscv/kvm/mmu.c               | 14 +++++++-------
 arch/s390/kvm/kvm-s390.c           | 14 +++++++-------
 arch/x86/kvm/x86.c                 | 16 ++++++++--------
 include/linux/kvm_host.h           | 18 +++++++++---------
 virt/kvm/kvm_main.c                | 23 +++++++++++++++--------
 13 files changed, 109 insertions(+), 102 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 326cdfec74a1..395e52314834 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1463,10 +1463,10 @@ int kvm_mmu_init(u32 *hyp_va_bits)
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				   const struct kvm_userspace_memory_region *mem,
-				   struct kvm_memory_slot *old,
-				   const struct kvm_memory_slot *new,
-				   enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	/*
 	 * At this point memslot has been committed and there is an
@@ -1486,9 +1486,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				   struct kvm_memory_slot *memslot,
-				   const struct kvm_userspace_memory_region *mem,
-				   enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	hva_t hva = mem->userspace_addr;
 	hva_t reg_end = hva + mem->memory_size;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index aa20d074d388..28c58562266c 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -233,18 +233,18 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				   struct kvm_memory_slot *memslot,
-				   const struct kvm_userspace_memory_region *mem,
-				   enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return 0;
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				   const struct kvm_userspace_memory_region *mem,
-				   struct kvm_memory_slot *old,
-				   const struct kvm_memory_slot *new,
-				   enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	int needs_flush;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 671fbd1a765e..7cdc756a94a0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -200,14 +200,14 @@ extern void kvmppc_core_destroy_vm(struct kvm *kvm);
 extern void kvmppc_core_free_memslot(struct kvm *kvm,
 				     struct kvm_memory_slot *slot);
 extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change);
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change);
 extern void kvmppc_core_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change);
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change);
 extern int kvm_vm_ioctl_get_smmu_info(struct kvm *kvm,
 				      struct kvm_ppc_smmu_info *info);
 extern void kvmppc_core_flush_memslot(struct kvm *kvm,
@@ -274,14 +274,14 @@ struct kvmppc_ops {
 	int (*get_dirty_log)(struct kvm *kvm, struct kvm_dirty_log *log);
 	void (*flush_memslot)(struct kvm *kvm, struct kvm_memory_slot *memslot);
 	int (*prepare_memory_region)(struct kvm *kvm,
-				     struct kvm_memory_slot *memslot,
-				     const struct kvm_userspace_memory_region *mem,
-				     enum kvm_mr_change change);
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change);
 	void (*commit_memory_region)(struct kvm *kvm,
-				     const struct kvm_userspace_memory_region *mem,
-				     const struct kvm_memory_slot *old,
-				     const struct kvm_memory_slot *new,
-				     enum kvm_mr_change change);
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change);
 	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index b785f6772391..6b4bf08e7c8b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -847,19 +847,19 @@ void kvmppc_core_flush_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot)
 }
 
 int kvmppc_core_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return kvm->arch.kvm_ops->prepare_memory_region(kvm, memslot, mem,
 							change);
 }
 
 void kvmppc_core_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7b74fc0a986b..3b7be7894c48 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4854,9 +4854,9 @@ static void kvmppc_core_free_memslot_hv(struct kvm_memory_slot *slot)
 }
 
 static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
-					struct kvm_memory_slot *slot,
-					const struct kvm_userspace_memory_region *mem,
-					enum kvm_mr_change change)
+			struct kvm_memory_slot *slot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	unsigned long npages = mem->memory_size >> PAGE_SHIFT;
 
@@ -4871,10 +4871,10 @@ static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
 }
 
 static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	unsigned long npages = mem->memory_size >> PAGE_SHIFT;
 
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 6bc9425acb32..4dd06b24c1b6 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1899,18 +1899,18 @@ static void kvmppc_core_flush_memslot_pr(struct kvm *kvm,
 }
 
 static int kvmppc_core_prepare_memory_region_pr(struct kvm *kvm,
-					struct kvm_memory_slot *memslot,
-					const struct kvm_userspace_memory_region *mem,
-					enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return 0;
 }
 
 static void kvmppc_core_commit_memory_region_pr(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	return;
 }
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 8c15c90dd3a9..f2d1acd782bf 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1821,18 +1821,18 @@ void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 }
 
 int kvmppc_core_prepare_memory_region(struct kvm *kvm,
-				      struct kvm_memory_slot *memslot,
-				      const struct kvm_userspace_memory_region *mem,
-				      enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return 0;
 }
 
 void kvmppc_core_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 }
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index a72920f4f221..aaff1476af52 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -706,18 +706,18 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				   struct kvm_memory_slot *memslot,
-				   const struct kvm_userspace_memory_region *mem,
-				   enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return kvmppc_core_prepare_memory_region(kvm, memslot, mem, change);
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				   const struct kvm_userspace_memory_region *mem,
-				   struct kvm_memory_slot *old,
-				   const struct kvm_memory_slot *new,
-				   enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	kvmppc_core_commit_memory_region(kvm, mem, old, new, change);
 }
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index fc058ff5f4b6..83b784336018 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -462,10 +462,10 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	/*
 	 * At this point memslot has been committed and there is an
@@ -477,9 +477,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	hva_t hva = mem->userspace_addr;
 	hva_t reg_end = hva + mem->memory_size;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 14a18ba5ff2c..3c1b8e0136dc 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -5020,9 +5020,9 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
 
 /* Section: memory related */
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				   struct kvm_memory_slot *memslot,
-				   const struct kvm_userspace_memory_region *mem,
-				   enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	/* A few sanity checks. We can have memory slots which have to be
 	   located/ended at a segment boundary (1MB). The memory in userland is
@@ -5045,10 +5045,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	int rc = 0;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d3ce7b356e8..6b6f819fdb0b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11583,7 +11583,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_userspace_memory_region_ext m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
@@ -11765,9 +11765,9 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
 		return kvm_alloc_memslot_metadata(kvm, memslot,
@@ -11863,10 +11863,10 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	if (!kvm->arch.n_requested_mmu_pages)
 		kvm_mmu_change_mmu_pages(kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1f69d76983a2..0c53df0a6b2e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -847,20 +847,20 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_userspace_memory_region_ext *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_userspace_memory_region_ext *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change);
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change);
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change);
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change);
 /* flush all memory translations */
 void kvm_arch_flush_shadow_all(struct kvm *kvm);
 /* flush memory translations pointing to 'slot' */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1b7cf05759c7..79313c549fb9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1510,7 +1510,7 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
 }
 
 static int check_memory_region_flags(struct kvm *kvm,
-				     const struct kvm_userspace_memory_region *mem)
+			     const struct kvm_userspace_memory_region_ext *mem)
 {
 	u32 valid_flags = 0;
 
@@ -1622,7 +1622,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
 }
 
 static int kvm_set_memslot(struct kvm *kvm,
-			   const struct kvm_userspace_memory_region *mem,
+			   const struct kvm_userspace_memory_region_ext *mem,
 			   struct kvm_memory_slot *new, int as_id,
 			   enum kvm_mr_change change)
 {
@@ -1737,7 +1737,7 @@ static int kvm_set_memslot(struct kvm *kvm,
 }
 
 static int kvm_delete_memslot(struct kvm *kvm,
-			      const struct kvm_userspace_memory_region *mem,
+			      const struct kvm_userspace_memory_region_ext *mem,
 			      struct kvm_memory_slot *old, int as_id)
 {
 	struct kvm_memory_slot new;
@@ -1765,7 +1765,7 @@ static int kvm_delete_memslot(struct kvm *kvm,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_userspace_memory_region_ext *mem)
 {
 	struct kvm_memory_slot old, new;
 	struct kvm_memory_slot *tmp;
@@ -1884,7 +1884,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_userspace_memory_region_ext *mem)
 {
 	int r;
 
@@ -1896,7 +1896,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+				struct kvm_userspace_memory_region_ext *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4393,12 +4393,19 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_userspace_memory_region_ext kvm_userspace_mem;
 
 		r = -EFAULT;
 		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+				sizeof(struct kvm_userspace_memory_region)))
 			goto out;
+		if (kvm_userspace_mem.flags & KVM_MEM_PRIVATE) {
+			int offset = offsetof(
+				struct kvm_userspace_memory_region_ext, ofs);
+			if (copy_from_user(&kvm_userspace_mem.ofs, argp + offset,
+					   sizeof(kvm_userspace_mem) - offset))
+				goto out;
+		}
 
 		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
 		break;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change);
 	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index b785f6772391..6b4bf08e7c8b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -847,19 +847,19 @@ void kvmppc_core_flush_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot)
 }
 
 int kvmppc_core_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return kvm->arch.kvm_ops->prepare_memory_region(kvm, memslot, mem,
 							change);
 }
 
 void kvmppc_core_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7b74fc0a986b..3b7be7894c48 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4854,9 +4854,9 @@ static void kvmppc_core_free_memslot_hv(struct kvm_memory_slot *slot)
 }
 
 static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
-					struct kvm_memory_slot *slot,
-					const struct kvm_userspace_memory_region *mem,
-					enum kvm_mr_change change)
+			struct kvm_memory_slot *slot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	unsigned long npages = mem->memory_size >> PAGE_SHIFT;
 
@@ -4871,10 +4871,10 @@ static int kvmppc_core_prepare_memory_region_hv(struct kvm *kvm,
 }
 
 static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	unsigned long npages = mem->memory_size >> PAGE_SHIFT;
 
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 6bc9425acb32..4dd06b24c1b6 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1899,18 +1899,18 @@ static void kvmppc_core_flush_memslot_pr(struct kvm *kvm,
 }
 
 static int kvmppc_core_prepare_memory_region_pr(struct kvm *kvm,
-					struct kvm_memory_slot *memslot,
-					const struct kvm_userspace_memory_region *mem,
-					enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return 0;
 }
 
 static void kvmppc_core_commit_memory_region_pr(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	return;
 }
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 8c15c90dd3a9..f2d1acd782bf 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1821,18 +1821,18 @@ void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 }
 
 int kvmppc_core_prepare_memory_region(struct kvm *kvm,
-				      struct kvm_memory_slot *memslot,
-				      const struct kvm_userspace_memory_region *mem,
-				      enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return 0;
 }
 
 void kvmppc_core_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				const struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			const struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 }
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index a72920f4f221..aaff1476af52 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -706,18 +706,18 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				   struct kvm_memory_slot *memslot,
-				   const struct kvm_userspace_memory_region *mem,
-				   enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	return kvmppc_core_prepare_memory_region(kvm, memslot, mem, change);
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				   const struct kvm_userspace_memory_region *mem,
-				   struct kvm_memory_slot *old,
-				   const struct kvm_memory_slot *new,
-				   enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	kvmppc_core_commit_memory_region(kvm, mem, old, new, change);
 }
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index fc058ff5f4b6..83b784336018 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -462,10 +462,10 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	/*
 	 * At this point memslot has been committed and there is an
@@ -477,9 +477,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	hva_t hva = mem->userspace_addr;
 	hva_t reg_end = hva + mem->memory_size;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 14a18ba5ff2c..3c1b8e0136dc 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -5020,9 +5020,9 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
 
 /* Section: memory related */
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				   struct kvm_memory_slot *memslot,
-				   const struct kvm_userspace_memory_region *mem,
-				   enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	/* A few sanity checks. We can have memory slots which have to be
 	   located/ended at a segment boundary (1MB). The memory in userland is
@@ -5045,10 +5045,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	int rc = 0;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d3ce7b356e8..6b6f819fdb0b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11583,7 +11583,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_userspace_memory_region_ext m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
@@ -11765,9 +11765,9 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change)
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change)
 {
 	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
 		return kvm_alloc_memslot_metadata(kvm, memslot,
@@ -11863,10 +11863,10 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 }
 
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change)
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change)
 {
 	if (!kvm->arch.n_requested_mmu_pages)
 		kvm_mmu_change_mmu_pages(kvm,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1f69d76983a2..0c53df0a6b2e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -847,20 +847,20 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_userspace_memory_region_ext *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_userspace_memory_region_ext *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
-				struct kvm_memory_slot *memslot,
-				const struct kvm_userspace_memory_region *mem,
-				enum kvm_mr_change change);
+			struct kvm_memory_slot *memslot,
+			const struct kvm_userspace_memory_region_ext *mem,
+			enum kvm_mr_change change);
 void kvm_arch_commit_memory_region(struct kvm *kvm,
-				const struct kvm_userspace_memory_region *mem,
-				struct kvm_memory_slot *old,
-				const struct kvm_memory_slot *new,
-				enum kvm_mr_change change);
+			const struct kvm_userspace_memory_region_ext *mem,
+			struct kvm_memory_slot *old,
+			const struct kvm_memory_slot *new,
+			enum kvm_mr_change change);
 /* flush all memory translations */
 void kvm_arch_flush_shadow_all(struct kvm *kvm);
 /* flush memory translations pointing to 'slot' */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1b7cf05759c7..79313c549fb9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1510,7 +1510,7 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
 }
 
 static int check_memory_region_flags(struct kvm *kvm,
-				     const struct kvm_userspace_memory_region *mem)
+			     const struct kvm_userspace_memory_region_ext *mem)
 {
 	u32 valid_flags = 0;
 
@@ -1622,7 +1622,7 @@ static struct kvm_memslots *kvm_dup_memslots(struct kvm_memslots *old,
 }
 
 static int kvm_set_memslot(struct kvm *kvm,
-			   const struct kvm_userspace_memory_region *mem,
+			   const struct kvm_userspace_memory_region_ext *mem,
 			   struct kvm_memory_slot *new, int as_id,
 			   enum kvm_mr_change change)
 {
@@ -1737,7 +1737,7 @@ static int kvm_set_memslot(struct kvm *kvm,
 }
 
 static int kvm_delete_memslot(struct kvm *kvm,
-			      const struct kvm_userspace_memory_region *mem,
+			      const struct kvm_userspace_memory_region_ext *mem,
 			      struct kvm_memory_slot *old, int as_id)
 {
 	struct kvm_memory_slot new;
@@ -1765,7 +1765,7 @@ static int kvm_delete_memslot(struct kvm *kvm,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem)
+			    const struct kvm_userspace_memory_region_ext *mem)
 {
 	struct kvm_memory_slot old, new;
 	struct kvm_memory_slot *tmp;
@@ -1884,7 +1884,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem)
+			  const struct kvm_userspace_memory_region_ext *mem)
 {
 	int r;
 
@@ -1896,7 +1896,7 @@ int kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-					  struct kvm_userspace_memory_region *mem)
+				struct kvm_userspace_memory_region_ext *mem)
 {
 	if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
 		return -EINVAL;
@@ -4393,12 +4393,19 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_USER_MEMORY_REGION: {
-		struct kvm_userspace_memory_region kvm_userspace_mem;
+		struct kvm_userspace_memory_region_ext kvm_userspace_mem;
 
 		r = -EFAULT;
 		if (copy_from_user(&kvm_userspace_mem, argp,
-						sizeof(kvm_userspace_mem)))
+				sizeof(struct kvm_userspace_memory_region)))
 			goto out;
+		if (kvm_userspace_mem.flags & KVM_MEM_PRIVATE) {
+			int offset = offsetof(
+				struct kvm_userspace_memory_region_ext, ofs);
+			if (copy_from_user(&kvm_userspace_mem.ofs, argp + offset,
+					   sizeof(kvm_userspace_mem) - offset))
+				goto out;
+		}
 
 		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
 		break;
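
A hypothetical userspace snippet exercising the two-step copy above (the
KVM_MEM_PRIVATE and MFD_INACCESSIBLE definitions come from earlier
patches in this series, the struct layout follows the sketch in the
commit message, and all names/values are illustrative):

	int vm_fd = ...;	/* obtained via KVM_CREATE_VM */
	int private_fd = memfd_create("guest-private", MFD_INACCESSIBLE);

	struct kvm_userspace_memory_region_ext region = {
		.slot            = 0,
		.flags           = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0,
		.memory_size     = 2ULL << 30,		/* 2G */
		.userspace_addr  = (__u64)shared_hva,	/* hva of the shared part */
		.ofs             = 0,			/* offset into private_fd */
		.fd              = private_fd,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region))
		perror("KVM_SET_USER_MEMORY_REGION");

Because KVM_MEM_PRIVATE is set, the handler first copies the base
kvm_userspace_memory_region portion and then the ofs/fd tail in a second
copy_from_user().
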
-- 
2.17.1




* [PATCH v3 15/15] KVM: Register/unregister private memory slot to memfd
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:11   ` Chao Peng
  -1 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-21 15:11 UTC (permalink / raw)
  To: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Hugh Dickins, Jeff Layton, J . Bruce Fields,
	Andrew Morton, Yu Zhang, Chao Peng, Kirill A . Shutemov, luto,
	john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

Expose the KVM_MEM_PRIVATE flag and register/unregister the private
memory slot with the backing memfd when userspace sets the flag.

KVM_MEM_PRIVATE is disallowed by default, but architecture code can
turn it on by implementing kvm_arch_private_memory_supported().
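
As a sketch, an architecture could opt in with something like the
following (hypothetical: the vm_type field and the KVM_X86_TDX_VM
constant come from the TDX enabling work, not from this patch):

/* Hypothetical x86 override of the __weak default added below. */
bool kvm_arch_private_memory_supported(struct kvm *kvm)
{
	/* Allow private memslots only for protected VM types. */
	return kvm->arch.vm_type == KVM_X86_TDX_VM;
}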

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 35 +++++++++++++++++++++++++++++++----
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0c53df0a6b2e..0f0e24f19892 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1096,6 +1096,8 @@ int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
 bool kvm_arch_dirty_log_supported(struct kvm *kvm);
+bool kvm_arch_private_memory_supported(struct kvm *kvm);
+
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 79313c549fb9..6eb0d86abdcf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1509,6 +1509,11 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
 	return true;
 }
 
+bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
+{
+	return false;
+}
+
 static int check_memory_region_flags(struct kvm *kvm,
 			     const struct kvm_userspace_memory_region_ext *mem)
 {
@@ -1517,6 +1522,9 @@ static int check_memory_region_flags(struct kvm *kvm,
 	if (kvm_arch_dirty_log_supported(kvm))
 		valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_private_memory_supported(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1708,9 +1716,21 @@ static int kvm_set_memslot(struct kvm *kvm,
 	/* Copy the arch-specific data, again after (re)acquiring slots_arch_lock. */
 	memcpy(&new->arch, &old.arch, sizeof(old.arch));
 
+	if (mem->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) {
+		r = kvm_memfd_register(kvm, mem, new);
+		if (r)
+			goto out_slots;
+	}
+
 	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
-	if (r)
+	if (r) {
+		if (mem->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE)
+			kvm_memfd_unregister(kvm, new);
 		goto out_slots;
+	}
+
+	if (mem->flags & KVM_MEM_PRIVATE && change == KVM_MR_DELETE)
+		kvm_memfd_unregister(kvm, new);
 
 	update_memslots(slots, new, change);
 	slots = install_new_memslots(kvm, as_id, slots);
@@ -1786,10 +1806,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		return -EINVAL;
-	/* We can read the guest memory with __xxx_user() later on. */
 	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
-	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
-	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
+		return -EINVAL;
+	/* We can read the guest memory with __xxx_user() later on. */
+	if (!(mem->flags & KVM_MEM_PRIVATE) &&
+	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
@@ -1821,6 +1843,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new.npages = mem->memory_size >> PAGE_SHIFT;
 	new.flags = mem->flags;
 	new.userspace_addr = mem->userspace_addr;
+	new.file = NULL;
+	new.file_ofs = 0;
 
 	if (new.npages > KVM_MEM_MAX_NR_PAGES)
 		return -EINVAL;
@@ -1829,6 +1853,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		change = KVM_MR_CREATE;
 		new.dirty_bitmap = NULL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((new.userspace_addr != old.userspace_addr) ||
 		    (new.npages != old.npages) ||
 		    ((new.flags ^ old.flags) & KVM_MEM_READONLY))
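
For context, kvm_memfd_register()/kvm_memfd_unregister() are introduced
by an earlier patch in this series and are not shown in this excerpt;
their prototypes, inferred from the call sites in kvm_set_memslot()
above, would be roughly:

int kvm_memfd_register(struct kvm *kvm,
		       const struct kvm_userspace_memory_region_ext *mem,
		       struct kvm_memory_slot *slot);
void kvm_memfd_unregister(struct kvm *kvm, struct kvm_memory_slot *slot);

Note the ordering in kvm_set_memslot(): on KVM_MR_CREATE the slot is
registered before kvm_arch_prepare_memory_region() so that a prepare
failure can unwind the registration, and on KVM_MR_DELETE the slot is
unregistered before the updated memslots are installed.
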
-- 
2.17.1




* Re: [PATCH v3 00/15] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-12-21 15:11 ` Chao Peng
@ 2021-12-21 15:44 ` Sean Christopherson
  2021-12-22  1:22     ` Chao Peng
  -1 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2021-12-21 15:44 UTC (permalink / raw)
  To: Chao Peng
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Yu Zhang, Kirill A . Shutemov,
	luto, john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

On Tue, Dec 21, 2021, Chao Peng wrote:
> This is the third version of this series which try to implement the
> fd-based KVM guest private memory.

...

> Test
> ----
> This code has been tested with latest TDX code patches hosted at
> (https://github.com/intel/tdx/tree/kvm-upstream) with minimal TDX
> adaption and QEMU support.
> 
> Example QEMU command line:
> -object tdx-guest,id=tdx \
> -object memory-backend-memfd-private,id=ram1,size=2G \
> -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
> 
> Changelog
> ----------
> v3:
>   - Added locking protection when calling
>     invalidate_page_range/fallocate callbacks.
>   - Changed memslot structure to keep use useraddr for shared memory.
>   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
>   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
>   - Commit message improvement.
>   - Many small fixes for comments from the last version.

Can you rebase on top of kvm/queue and send a new version?  There's a massive
overhaul of KVM's memslots code that's queued for 5.17, and the KVM core changes
in this series conflict mightily.

It's ok if the private memslot support isn't tested exactly as-is, it's not like
any of us reviewers can test it anyways, but I would like to be able to apply
cleanly and verify that the series doesn't break existing functionality.

This version also appears to be based on an internal development branch, e.g. patch
12/15 has some bits from the TDX series.

@@ -336,6 +348,7 @@ struct kvm_tdx_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_MEMORY_ERROR     36
 #define KVM_EXIT_TDX              50   /* dump number to avoid conflict. */

 /* For KVM_EXIT_INTERNAL_ERROR */
@@ -554,6 +567,8 @@ struct kvm_run {
                        unsigned long args[6];
                        unsigned long ret[2];
                } riscv_sbi;
+               /* KVM_EXIT_MEMORY_ERROR */
+               struct kvm_memory_exit mem;
                /* KVM_EXIT_TDX_VMCALL */
                struct kvm_tdx_exit tdx;
                /* Fix the size of the union. */



* Re: [PATCH v3 00/15] KVM: mm: fd-based approach for supporting KVM guest private memory
  2021-12-21 15:44 ` [PATCH v3 00/15] KVM: mm: fd-based approach for supporting KVM guest private memory Sean Christopherson
@ 2021-12-22  1:22     ` Chao Peng
  0 siblings, 0 replies; 35+ messages in thread
From: Chao Peng @ 2021-12-22  1:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, qemu-devel,
	Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H . Peter Anvin, Hugh Dickins, Jeff Layton,
	J . Bruce Fields, Andrew Morton, Yu Zhang, Kirill A . Shutemov,
	luto, john.ji, susie.li, jun.nakajima, dave.hansen, ak, david

On Tue, Dec 21, 2021 at 03:44:40PM +0000, Sean Christopherson wrote:
> On Tue, Dec 21, 2021, Chao Peng wrote:
> > This is the third version of this series which try to implement the
> > fd-based KVM guest private memory.
> 
> ...
> 
> > Test
> > ----
> > This code has been tested with latest TDX code patches hosted at
> > (https://github.com/intel/tdx/tree/kvm-upstream) with minimal TDX
> > adaption and QEMU support.
> > 
> > Example QEMU command line:
> > -object tdx-guest,id=tdx \
> > -object memory-backend-memfd-private,id=ram1,size=2G \
> > -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
> > 
> > Changelog
> > ----------
> > v3:
> >   - Added locking protection when calling
> >     invalidate_page_range/fallocate callbacks.
> >   - Changed memslot structure to keep use useraddr for shared memory.
> >   - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
> >   - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
> >   - Commit message improvement.
> >   - Many small fixes for comments from the last version.
> 
> Can you rebase on top of kvm/queue and send a new version?  There's a massive
> overhaul of KVM's memslots code that's queued for 5.17, and the KVM core changes
> in this series conflict mightily.

Sure, will do the rebase and send a new version.

> 
> It's ok if the private memslot support isn't tested exactly as-is, it's not like
> any of us reviewers can test it anyways, but I would like to be able to apply
> cleanly and verify that the series doesn't break existing functionality.

Good, that will make things easier for me if it is acceptable (i.e. test
on the relatively new TDX codebase but send out the patches based on the
latest kvm/queue, where the new functionality has not been verified).
This gets rid of the 'chicken and egg' dependency between this series
and the TDX patchset.

> 
> This version also appears to be based on an internal development branch, e.g. patch
> 12/15 has some bits from the TDX series.

Right, it's based on the latest TDX code at
https://github.com/intel/tdx/tree/kvm-upstream. I did this because it is
the only way I can test the code.

Thanks,
Chao
> 
> @@ -336,6 +348,7 @@ struct kvm_tdx_exit {
>  #define KVM_EXIT_X86_BUS_LOCK     33
>  #define KVM_EXIT_XEN              34
>  #define KVM_EXIT_RISCV_SBI        35
> +#define KVM_EXIT_MEMORY_ERROR     36
>  #define KVM_EXIT_TDX              50   /* dump number to avoid conflict. */
> 
>  /* For KVM_EXIT_INTERNAL_ERROR */
> @@ -554,6 +567,8 @@ struct kvm_run {
>                         unsigned long args[6];
>                         unsigned long ret[2];
>                 } riscv_sbi;
> +               /* KVM_EXIT_MEMORY_ERROR */
> +               struct kvm_memory_exit mem;
>                 /* KVM_EXIT_TDX_VMCALL */
>                 struct kvm_tdx_exit tdx;
>                 /* Fix the size of the union. */


