From: Chao Peng <chao.p.peng@linux.intel.com>
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	qemu-devel@nongnu.org
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Sean Christopherson <seanjc@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
	Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>,
	"J . Bruce Fields" <bfields@fieldses.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	Chao Peng <chao.p.peng@linux.intel.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	luto@kernel.org, john.ji@intel.com, susie.li@intel.com,
	jun.nakajima@intel.com, dave.hansen@intel.com,
	ak@linux.intel.com, david@redhat.com
Subject: [PATCH v3 kvm/queue 03/16] mm/memfd: Introduce MEMFD_OPS
Date: Thu, 23 Dec 2021 20:29:58 +0800
Message-ID: <20211223123011.41044-4-chao.p.peng@linux.intel.com>
In-Reply-To: <20211223123011.41044-1-chao.p.peng@linux.intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This patch introduces a new MEMFD_OPS facility around files created by
memfd_create(), allowing a third kernel component to make use of memory
backed by a memfd and to be notified when memory in the file is
allocated or invalidated. KVM will use a memfd file descriptor as the
guest memory backend and will use MEMFD_OPS to interact with the memfd
subsystem. In the future there might be other consumers (e.g. VFIO with
encrypted device memory).

It consists of two sets of callbacks (a consumer-side sketch follows
the list):
  - memfd_falloc_notifier: callbacks provided by KVM and called by
    memfd when memory is allocated or invalidated through fallocate()
    and hole punching.
  - memfd_pfn_ops: callbacks provided by memfd and called by KVM to
    request memory pages from the memfd.
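
For illustration only (not part of this patch), the consumer side of
the notifier could be filled in roughly as below; every kvm_memfd_*
name is hypothetical and the callback bodies are stubs:

/* Hypothetical KVM-side sketch; none of these symbols exist yet. */
static bool kvm_memfd_get_owner(void *owner)
{
	struct kvm *kvm = owner;

	/* Pin the VM; refuse if it is already being torn down. */
	return refcount_inc_not_zero(&kvm->users_count);
}

static void kvm_memfd_put_owner(void *owner)
{
	kvm_put_kvm(owner);
}

static void kvm_memfd_invalidate(struct inode *inode, void *owner,
				 pgoff_t start, pgoff_t end)
{
	/* Zap secondary-MMU mappings backed by pages in [start, end). */
}

static void kvm_memfd_fallocate(struct inode *inode, void *owner,
				pgoff_t start, pgoff_t end)
{
	/* React to newly allocated pages in [start, end). */
}

static const struct memfd_falloc_notifier kvm_memfd_notifier = {
	.invalidate_page_range	= kvm_memfd_invalidate,
	.fallocate		= kvm_memfd_fallocate,
	.get_owner		= kvm_memfd_get_owner,
	.put_owner		= kvm_memfd_put_owner,
};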

Locking is needed around the above callbacks to prevent race
conditions; a usage sketch follows the list.
  - get_owner()/put_owner() are used to ensure the owner is still
    alive in the invalidate_page_range/fallocate callback handlers,
    using a reference mechanism.
  - The page is locked between get_lock_pfn() and put_unlock_pfn() to
    ensure the pfn is still valid when it is used (e.g. when the KVM
    page fault handler uses it to establish the mapping in the
    secondary MMU page tables).
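
As a sketch of the second point, a page-fault path would hold the page
lock across any use of the pfn; install_private_spte() and the pgoff
computation are placeholders, and error handling is trimmed:

	long pfn;
	int order;

	pfn = pfn_ops->get_lock_pfn(inode, pgoff, &order);
	if (pfn < 0)
		return pfn;	/* e.g. the range was never allocated */

	/* Page is locked: the pfn cannot be invalidated under us. */
	install_private_spte(vcpu, gfn, pfn, order);	/* placeholder */

	/* Dirties, unlocks and releases the page. */
	pfn_ops->put_unlock_pfn(pfn);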

Userspace is in charge of the guest memory lifecycle: it can allocate
memory with fallocate() and punch a hole to free memory back from the
guest.

The file descriptor is passed down to KVM as the guest memory backend.
KVM registers itself as the owner of the memfd via
memfd_register_falloc_notifier() and provides the memfd_falloc_notifier
callbacks to be called on fallocate() and hole punching.

memfd_register_falloc_notifier() also hands back the memfd_pfn_ops
callbacks that KVM then uses to request pages from the memfd; a
registration sketch follows.
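
Registration by the consumer might then look like the following
(sketch, reusing the hypothetical kvm_memfd_notifier from above):

	const struct memfd_pfn_ops *pfn_ops;
	struct fd f = fdget(fd);	/* fd: the memfd from userspace */
	int ret;

	ret = memfd_register_falloc_notifier(file_inode(f.file), kvm,
					     &kvm_memfd_notifier, &pfn_ops);
	if (ret) {
		fdput(f);
		return ret;	/* -EINVAL for non-shmem, -EPERM if already owned */
	}

	/* ... and on teardown: */
	memfd_unregister_falloc_notifier(file_inode(f.file));
	fdput(f);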

At this time only shmem is supported.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfd.h    |  22 ++++++
 include/linux/shmem_fs.h |  16 ++++
 mm/Kconfig               |   4 +
 mm/memfd.c               |  21 ++++++
 mm/shmem.c               | 158 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 221 insertions(+)

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..0007073b53dc 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -13,4 +13,26 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif
 
+#ifdef CONFIG_MEMFD_OPS
+struct memfd_falloc_notifier {
+	void (*invalidate_page_range)(struct inode *inode, void *owner,
+				      pgoff_t start, pgoff_t end);
+	void (*fallocate)(struct inode *inode, void *owner,
+			  pgoff_t start, pgoff_t end);
+	bool (*get_owner)(void *owner);
+	void (*put_owner)(void *owner);
+};
+
+struct memfd_pfn_ops {
+	long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
+	void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
+extern int memfd_register_falloc_notifier(struct inode *inode, void *owner,
+				const struct memfd_falloc_notifier *notifier,
+				const struct memfd_pfn_ops **pfn_ops);
+extern void memfd_unregister_falloc_notifier(struct inode *inode);
+#endif
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 166158b6e917..503adc63728c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,11 @@
 
 /* inode in-kernel data */
 
+#ifdef CONFIG_MEMFD_OPS
+struct memfd_falloc_notifier;
+struct memfd_pfn_ops;
+#endif
+
 struct shmem_inode_info {
 	spinlock_t		lock;
 	unsigned int		seals;		/* shmem seals */
@@ -24,6 +29,10 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
+#ifdef CONFIG_MEMFD_OPS
+	void			*owner;
+	const struct memfd_falloc_notifier *falloc_notifier;
+#endif
 	struct inode		vfs_inode;
 };
 
@@ -96,6 +105,13 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 						pgoff_t start, pgoff_t end);
 
+#ifdef CONFIG_MEMFD_OPS
+extern int shmem_register_falloc_notifier(struct inode *inode, void *owner,
+				const struct memfd_falloc_notifier *notifier,
+				const struct memfd_pfn_ops **pfn_ops);
+extern void shmem_unregister_falloc_notifier(struct inode *inode);
+#endif
+
 /* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
diff --git a/mm/Kconfig b/mm/Kconfig
index 28edafc820ad..9989904d1b56 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,10 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config MEMFD_OPS
+	bool
+	depends on MEMFD_CREATE
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/memfd.c b/mm/memfd.c
index c898a007fb76..41861870fc21 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,6 +130,27 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 	return NULL;
 }
 
+#ifdef CONFIG_MEMFD_OPS
+int memfd_register_falloc_notifier(struct inode *inode, void *owner,
+				   const struct memfd_falloc_notifier *notifier,
+				   const struct memfd_pfn_ops **pfn_ops)
+{
+	if (shmem_mapping(inode->i_mapping))
+		return shmem_register_falloc_notifier(inode, owner,
+						      notifier, pfn_ops);
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(memfd_register_falloc_notifier);
+
+void memfd_unregister_falloc_notifier(struct inode *inode)
+{
+	if (shmem_mapping(inode->i_mapping))
+		shmem_unregister_falloc_notifier(inode);
+}
+EXPORT_SYMBOL_GPL(memfd_unregister_falloc_notifier);
+#endif
+
 #define F_ALL_SEALS (F_SEAL_SEAL | \
 		     F_SEAL_SHRINK | \
 		     F_SEAL_GROW | \
diff --git a/mm/shmem.c b/mm/shmem.c
index faa7e9b1b9bc..4d8a75c4d037 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -78,6 +78,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/userfaultfd_k.h>
 #include <linux/rmap.h>
 #include <linux/uuid.h>
+#include <linux/memfd.h>
 
 #include <linux/uaccess.h>
 
@@ -906,6 +907,68 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
 	return split_huge_page(page) >= 0;
 }
 
+static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFD_OPS
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	const struct memfd_falloc_notifier *notifier;
+	void *owner;
+	bool ret;
+
+	if (!info->falloc_notifier)
+		return;
+
+	spin_lock(&info->lock);
+	notifier = info->falloc_notifier;
+	if (!notifier) {
+		spin_unlock(&info->lock);
+		return;
+	}
+
+	owner = info->owner;
+	ret = notifier->get_owner(owner);
+	spin_unlock(&info->lock);
+	if (!ret)
+		return;
+
+	notifier->fallocate(inode, owner, start, end);
+	notifier->put_owner(owner);
+#endif
+}
+
+static void notify_invalidate_page(struct inode *inode, struct page *page,
+				   pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFD_OPS
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	const struct memfd_falloc_notifier *notifier;
+	void *owner;
+	bool ret;
+
+	if (!info->falloc_notifier)
+		return;
+
+	spin_lock(&info->lock);
+	notifier = info->falloc_notifier;
+	if (!notifier) {
+		spin_unlock(&info->lock);
+		return;
+	}
+
+	owner = info->owner;
+	ret = notifier->get_owner(owner);
+	spin_unlock(&info->lock);
+	if (!ret)
+		return;
+
+	start = max(start, page->index);
+	end = min(end, page->index + thp_nr_pages(page));
+
+	notifier->invalidate_page_range(inode, owner, start, end);
+	notifier->put_owner(owner);
+#endif
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -949,6 +1012,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			index += thp_nr_pages(page) - 1;
 
+			notify_invalidate_page(inode, page, start, end);
+
 			if (!unfalloc || !PageUptodate(page))
 				truncate_inode_page(mapping, page);
 			unlock_page(page);
@@ -1025,6 +1090,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					index--;
 					break;
 				}
+
+				notify_invalidate_page(inode, page, start, end);
+
 				VM_BUG_ON_PAGE(PageWriteback(page), page);
 				if (shmem_punch_compound(page, start, end))
 					truncate_inode_page(mapping, page);
@@ -2815,6 +2883,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
 		i_size_write(inode, offset + len);
 	inode->i_ctime = current_time(inode);
+	notify_fallocate(inode, start, end);
 undone:
 	spin_lock(&inode->i_lock);
 	inode->i_private = NULL;
@@ -3784,6 +3853,20 @@ static void shmem_destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+#ifdef CONFIG_MIGRATION
+int shmem_migrate_page(struct address_space *mapping, struct page *newpage,
+		       struct page *page, enum migrate_mode mode)
+{
+#ifdef CONFIG_MEMFD_OPS
+	struct inode *inode = mapping->host;
+
+	if (SHMEM_I(inode)->owner)
+		return -EOPNOTSUPP;
+#endif
+	return migrate_page(mapping, newpage, page, mode);
+}
+#endif
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
@@ -3798,6 +3881,81 @@ const struct address_space_operations shmem_aops = {
 };
 EXPORT_SYMBOL(shmem_aops);
 
+#ifdef CONFIG_MEMFD_OPS
+static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
+	if (ret)
+		return ret;
+
+	*order = thp_order(compound_head(page));
+
+	return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	set_page_dirty(page);
+	unlock_page(page);
+	put_page(page);
+}
+
+static const struct memfd_pfn_ops shmem_pfn_ops = {
+	.get_lock_pfn = shmem_get_lock_pfn,
+	.put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_register_falloc_notifier(struct inode *inode, void *owner,
+				const struct memfd_falloc_notifier *notifier,
+				const struct memfd_pfn_ops **pfn_ops)
+{
+	gfp_t gfp;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (!inode || !owner || !notifier || !pfn_ops ||
+	    !notifier->invalidate_page_range ||
+	    !notifier->fallocate ||
+	    !notifier->get_owner ||
+	    !notifier->put_owner)
+		return -EINVAL;
+
+	spin_lock(&info->lock);
+	if (info->owner && info->owner != owner) {
+		spin_unlock(&info->lock);
+		return -EPERM;
+	}
+
+	info->owner = owner;
+	info->falloc_notifier = notifier;
+	spin_unlock(&info->lock);
+
+	gfp = mapping_gfp_mask(inode->i_mapping);
+	gfp &= ~__GFP_MOVABLE;
+	mapping_set_gfp_mask(inode->i_mapping, gfp);
+	mapping_set_unevictable(inode->i_mapping);
+
+	*pfn_ops = &shmem_pfn_ops;
+	return 0;
+}
+
+void shmem_unregister_falloc_notifier(struct inode *inode)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	spin_lock(&info->lock);
+	info->owner = NULL;
+	info->falloc_notifier = NULL;
+	spin_unlock(&info->lock);
+}
+#endif
+
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 	.get_unmapped_area = shmem_get_unmapped_area,
-- 
2.17.1


Thread overview: 97+ messages
2021-12-23 12:29 [PATCH v3 kvm/queue 00/16] KVM: mm: fd-based approach for supporting KVM guest private memory Chao Peng
2021-12-23 12:29 ` [PATCH v3 kvm/queue 01/16] mm/shmem: Introduce F_SEAL_INACCESSIBLE Chao Peng
2022-01-04 14:22   ` David Hildenbrand
2022-01-06 13:06     ` Chao Peng
2022-01-13 15:56       ` David Hildenbrand
2021-12-23 12:29 ` [PATCH v3 kvm/queue 02/16] mm/memfd: Introduce MFD_INACCESSIBLE flag Chao Peng
2021-12-23 12:29 ` [PATCH v3 kvm/queue 03/16] mm/memfd: Introduce MEMFD_OPS Chao Peng [this message]
2021-12-24  3:53   ` Robert Hoo
2021-12-31  2:38     ` Chao Peng
2022-01-04 17:38       ` Sean Christopherson
2022-01-05  6:07         ` Chao Peng
2021-12-23 12:29 ` [PATCH v3 kvm/queue 04/16] KVM: Extend the memslot to support fd-based private memory Chao Peng
2021-12-23 17:35   ` Sean Christopherson
2021-12-31  2:53     ` Chao Peng
2022-01-04 17:34       ` Sean Christopherson
2021-12-23 12:30 ` [PATCH v3 kvm/queue 05/16] KVM: Maintain ofs_tree for fast memslot lookup by file offset Chao Peng
2021-12-23 18:02   ` Sean Christopherson
2021-12-24  3:54     ` Chao Peng
2021-12-27 23:50       ` Yao Yuan
2021-12-28 21:48       ` Sean Christopherson
2021-12-31  2:26         ` Chao Peng
2022-01-04 17:43           ` Sean Christopherson
2022-01-05  6:09             ` Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 06/16] KVM: Implement fd-based memory using MEMFD_OPS interfaces Chao Peng
2021-12-23 18:34   ` Sean Christopherson
2021-12-23 23:09     ` Paolo Bonzini
2021-12-24  4:25       ` Chao Peng
2021-12-28 22:14         ` Sean Christopherson
2021-12-24  4:12     ` Chao Peng
2021-12-24  4:22     ` Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 07/16] KVM: Refactor hva based memory invalidation code Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 08/16] KVM: Special handling for fd-based memory invalidation Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 09/16] KVM: Split out common memory invalidation code Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 10/16] KVM: Implement fd-based memory invalidation Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 11/16] KVM: Add kvm_map_gfn_range Chao Peng
2021-12-23 18:06   ` Sean Christopherson
2021-12-24  4:13     ` Chao Peng
2021-12-31  2:33       ` Chao Peng
2022-01-04 17:31         ` Sean Christopherson
2022-01-05  6:14           ` Chao Peng
2022-01-05 17:03             ` Sean Christopherson
2022-01-06 12:35               ` Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 12/16] KVM: Implement fd-based memory fallocation Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 13/16] KVM: Add KVM_EXIT_MEMORY_ERROR exit Chao Peng
2021-12-23 18:28   ` Sean Christopherson
2021-12-23 12:30 ` [PATCH v3 kvm/queue 14/16] KVM: Handle page fault for private memory Chao Peng
2022-01-04  1:46   ` Yan Zhao
2022-01-04  9:10     ` Chao Peng
2022-01-04 10:06       ` Yan Zhao
2022-01-05  6:28         ` Chao Peng
2022-01-05  7:53           ` Yan Zhao
2022-01-05 20:52             ` Sean Christopherson
2022-01-14  5:53               ` Yan Zhao
2021-12-23 12:30 ` [PATCH v3 kvm/queue 15/16] KVM: Use kvm_userspace_memory_region_ext Chao Peng
2021-12-23 12:30 ` [PATCH v3 kvm/queue 16/16] KVM: Register/unregister private memory slot to memfd Chao Peng
