* [PATCH v2 0/9] Add support for shared PTEs across processes
@ 2022-06-29 22:53 Khalid Aziz
  2022-06-29 22:53 ` [PATCH v2 1/9] mm: Add msharefs filesystem Khalid Aziz
                   ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin


Memory pages shared between processes require a page table entry
(PTE) in each process. Each of these PTEs consumes some memory and,
as long as the number of mappings being maintained is small enough,
the space consumed by page tables is not objectionable. When very
few memory pages are shared between processes, the number of page
table entries (PTEs) to maintain is mostly constrained by the number
of pages of memory on the system. As the number of shared pages and
the number of times pages are shared go up, the amount of memory
consumed by page tables starts to become significant. This issue
does not apply to threads: any number of threads can share the same
pages inside a process while sharing the same PTEs. Extending this
same model to sharing pages across processes can eliminate this
issue for cross-process sharing as well.

Some field deployments commonly see memory pages shared across
thousands of processes. On x86_64, each page requires a PTE that is
only 8 bytes long, which is very small compared to the 4K page size.
When 2000 processes map the same page in their address space, each
one of them requires 8 bytes for its PTE, and together that adds up
to 8K of memory just to hold the PTEs for one 4K page. On a database
server with a 300GB SGA, a system crash from an out-of-memory
condition was seen when 1500+ clients tried to share this SGA, even
though the system had 512GB of memory. On this server, the worst
case scenario of all 1500 processes mapping every page of the SGA
would have required 878GB+ just for the PTEs. If these PTEs could be
shared, the amount of memory saved would be very significant.

This patch series implements a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. It adds a new
in-memory filesystem - msharefs. A file created on msharefs creates
a new shared region where all processes sharing that region share
the PTEs as well. A process can create a new file on msharefs and
then mmap it, which assigns a starting address and size to this
mshare'd region. Another process that has the right permission to
open the file on msharefs can then mmap this file in its address
space at the same virtual address and size and share this region
through shared PTEs. An unlink() on the file marks the mshare'd
region for deletion once there are no more users of the region. When
the mshare region is deleted, all the pages used by the region are
freed.


API
===

mshare does not introduce a new API. It instead uses existing APIs
to implement page table sharing. The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
	mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address of a region must be aligned to a fixed boundary, and the
   region size must be a multiple of that same value. This value can
   be obtained by reading the file /sys/fs/mshare/mshare_info, which
   returns the number in text format.

3. For the process creating mshare region:
	a. Create a file on /sys/fs/mshare, for example -
		fd = open("/sys/fs/mshare/shareme",
				O_RDWR|O_CREAT|O_EXCL, 0600);
	
	b. mmap this file to establish starting address and size - 
		mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

	c. Read from and write to the mshare'd region normally.

4. For processes attaching to mshare'd region:
	a. Open the file on msharefs, for example -
		fd = open("/sys/fs/mshare/shareme", O_RDWR);

	b. Get information about mshare'd region from the file:
		struct mshare_info {
			unsigned long start;
			unsigned long size;
		} m_info;

		read(fd, &m_info, sizeof(m_info));
	
	c. mmap the mshare'd region -
		mmap(m_info.start, m_info.size,
			PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

5. To delete the mshare region -
		unlink("/sys/fs/mshare/shareme");



Example Code
============

A snippet of the code that a donor process would run looks like this:

-----------------
	char req[128] = { 0 };
	char *addr;
	unsigned long start, size, alignsize;
	int fd;

	fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
	read(fd, req, sizeof(req) - 1);
	alignsize = atol(req);
	close(fd);
	fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
	start = alignsize * 4;
	size = alignsize * 2;
	addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		perror("ERROR: mmap failed");

	strncpy(addr, "Some random shared text",
			sizeof("Some random shared text"));
-----------------


A snippet of the code that a consumer process would execute looks
like this:

-----------------
	struct mshare_info {
		unsigned long start;
		unsigned long size;
	} minfo;
	char *addr;
	ssize_t count;
	int fd;

	fd = open("/sys/fs/mshare/shareme", O_RDONLY);

	if ((count = read(fd, &minfo, sizeof(minfo))) > 0)
		printf("INFO: %lu bytes shared at addr 0x%lx\n",
				minfo.size, minfo.start);

	addr = mmap((void *)minfo.start, minfo.size,
			PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	printf("Consumer mmap at %p:\n", addr);
	printf("%s\n", addr);
	printf("\nDone\n");

-----------------



v1 -> v2:
	- Eliminated mshare and mshare_unlink system calls and
	  replaced API with standard mmap and unlink (Based upon
	  v1 patch discussions and LSF/MM discussions)
	- All fd based API (based upon feedback and suggestions from
	  Andy Lutomirski, Eric Biederman, Kirill and others)
	- Added a file /sys/fs/mshare/mshare_info to provide
	  alignment and size requirement info (based upon feedback
	  from Dave Hansen, Mark Hemment and discussions at LSF/MM)
	- Addressed TODOs in v1
	- Added support for directories in msharefs
	- Added locks around any time vma is touched (Dave Hansen)
	- Eliminated the need to point vm_mm in original vmas to the
	  newly synthesized mshare mm
	- Ensured mmap_read_unlock is called for correct mm in
	  handle_mm_fault (Dave Hansen)

Khalid Aziz (9):
  mm: Add msharefs filesystem
  mm/mshare: pre-populate msharefs with information file
  mm/mshare: make msharefs writable and support directories
  mm/mshare: Add a read operation for msharefs files
  mm/mshare: Add vm flag for shared PTE
  mm/mshare: Add mmap operation
  mm/mshare: Add unlink and munmap support
  mm/mshare: Add basic page table sharing support
  mm/mshare: Enable mshare region mapping across processes

 Documentation/filesystems/msharefs.rst |  19 +
 include/linux/mm.h                     |  10 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/linux/magic.h             |   1 +
 include/uapi/linux/mman.h              |   5 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |   7 +
 mm/memory.c                            | 101 ++++-
 mm/mshare.c                            | 575 +++++++++++++++++++++++++
 9 files changed, 719 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

-- 
2.32.0


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 1/9] mm: Add msharefs filesystem
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-06-30 21:53   ` Darrick J. Wong
  2022-06-30 22:57   ` Al Viro
  2022-06-29 22:53 ` [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file Khalid Aziz
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Add a ram-based filesystem that contains page table sharing
information and files that enable processes to share page tables.
This patch adds the basic filesystem that can be mounted.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 Documentation/filesystems/msharefs.rst |  19 +++++
 include/uapi/linux/magic.h             |   1 +
 mm/Makefile                            |   2 +-
 mm/mshare.c                            | 103 +++++++++++++++++++++++++
 4 files changed, 124 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..fd161f67045d
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+msharefs - a filesystem to support shared page tables
+=====================================================
+
+msharefs is a ram-based filesystem that allows multiple processes to
+share page table entries for shared pages.
+
+msharefs is typically mounted like this::
+
+	mount -t msharefs none /sys/fs/mshare
+
+A process creates a new shared address range by creating a file
+under msharefs and mmap'ing it. Another process, if permissions
+allow, can open this file to query and map the address range
+shared through it. An unlink() on the file marks the mshare'd
+region for deletion; the region and its pages are freed once
+there are no more users of the region.
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f724129c0425..2a57a6ec6f3e 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -105,5 +105,6 @@
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define MSHARE_MAGIC		0x4d534852	/* "MSHR" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..51a2ab9080d9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,7 +37,7 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
 CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o mshare.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
 			   pgtable-generic.o rmap.o vmalloc.o
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..c8fab3869bab
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Enable cooperating processes to share page tables between
+ * them to reduce the extra memory consumed by multiple copies
+ * of page tables.
+ *
+ * This code adds an in-memory filesystem - msharefs.
+ * msharefs is used to manage page table sharing
+ *
+ *
+ * Copyright (C) 2022 Oracle Corp. All rights reserved.
+ * Author:	Khalid Aziz <khalid.aziz@oracle.com>
+ *
+ */
+
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <linux/pseudo_fs.h>
+#include <linux/fileattr.h>
+#include <uapi/linux/magic.h>
+#include <uapi/linux/limits.h>
+
+static struct super_block *msharefs_sb;
+
+static const struct file_operations msharefs_file_operations = {
+	.open	= simple_open,
+	.llseek	= no_llseek,
+};
+
+static int
+msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
+{
+	unsigned long hash = init_name_hash(dentry);
+	const unsigned char *s = qstr->name;
+	unsigned int len = qstr->len;
+
+	while (len--)
+		hash = partial_name_hash(*s++, hash);
+	qstr->hash = end_name_hash(hash);
+	return 0;
+}
+
+static const struct dentry_operations msharefs_d_ops = {
+	.d_hash = msharefs_d_hash,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	static const struct tree_descr empty_descr = {""};
+	int err;
+
+	sb->s_d_op = &msharefs_d_ops;
+	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
+	if (err)
+		return err;
+
+	msharefs_sb = sb;
+	return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+	.get_tree	= msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &msharefs_context_ops;
+	return 0;
+}
+
+static struct file_system_type mshare_fs = {
+	.name			= "msharefs",
+	.init_fs_context	= mshare_init_fs_context,
+	.kill_sb		= kill_litter_super,
+};
+
+static int
+mshare_init(void)
+{
+	int ret = 0;
+
+	ret = sysfs_create_mount_point(fs_kobj, "mshare");
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&mshare_fs);
+	if (ret)
+		sysfs_remove_mount_point(fs_kobj, "mshare");
+
+	return ret;
+}
+
+fs_initcall(mshare_init);
-- 
2.32.0



* [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
  2022-06-29 22:53 ` [PATCH v2 1/9] mm: Add msharefs filesystem Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-06-30 21:37   ` Darrick J. Wong
  2022-06-30 23:01   ` Al Viro
  2022-06-29 22:53 ` [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories Khalid Aziz
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Users of the mshare feature for sharing page tables need to know the
size and alignment requirements for shared regions. Pre-populate
msharefs with a file, mshare_info, that provides this information.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 62 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 14 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index c8fab3869bab..3e448e11c742 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -25,8 +25,8 @@
 static struct super_block *msharefs_sb;
 
 static const struct file_operations msharefs_file_operations = {
-	.open	= simple_open,
-	.llseek	= no_llseek,
+	.open		= simple_open,
+	.llseek		= no_llseek,
 };
 
 static int
@@ -42,23 +42,52 @@ msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
 	return 0;
 }
 
+static void
+mshare_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
 static const struct dentry_operations msharefs_d_ops = {
 	.d_hash = msharefs_d_hash,
 };
 
+static ssize_t
+mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
+		loff_t *ppos)
+{
+	char s[80];
+
+	sprintf(s, "%ld", PGDIR_SIZE);
+	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
+}
+
+static const struct file_operations mshare_info_ops = {
+	.read   = mshare_info_read,
+	.llseek	= noop_llseek,
+};
+
+static const struct super_operations mshare_s_ops = {
+	.statfs	     = simple_statfs,
+	.evict_inode = mshare_evict_inode,
+};
+
 static int
 msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
-	static const struct tree_descr empty_descr = {""};
+	static const struct tree_descr mshare_files[] = {
+		[2] = { "mshare_info", &mshare_info_ops, 0444},
+		{""},
+	};
 	int err;
 
-	sb->s_d_op = &msharefs_d_ops;
-	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
-	if (err)
-		return err;
-
-	msharefs_sb = sb;
-	return 0;
+	err = simple_fill_super(sb, MSHARE_MAGIC, mshare_files);
+	if (!err) {
+		msharefs_sb = sb;
+		sb->s_d_op = &msharefs_d_ops;
+		sb->s_op = &mshare_s_ops;
+	}
+	return err;
 }
 
 static int
@@ -84,20 +113,25 @@ static struct file_system_type mshare_fs = {
 	.kill_sb		= kill_litter_super,
 };
 
-static int
+static int __init
 mshare_init(void)
 {
 	int ret = 0;
 
 	ret = sysfs_create_mount_point(fs_kobj, "mshare");
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = register_filesystem(&mshare_fs);
-	if (ret)
+	if (ret) {
 		sysfs_remove_mount_point(fs_kobj, "mshare");
+		goto out;
+	}
+
+	return 0;
 
+out:
 	return ret;
 }
 
-fs_initcall(mshare_init);
+core_initcall(mshare_init);
-- 
2.32.0



* [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
  2022-06-29 22:53 ` [PATCH v2 1/9] mm: Add msharefs filesystem Khalid Aziz
  2022-06-29 22:53 ` [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-06-30 21:34   ` Darrick J. Wong
  2022-06-30 23:09   ` Al Viro
  2022-06-29 22:53 ` [PATCH v2 4/9] mm/mshare: Add a read operation for msharefs files Khalid Aziz
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Make the msharefs filesystem writable and allow creating directories
to support better access control for mshare'd regions defined in
msharefs.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 195 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 186 insertions(+), 9 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 3e448e11c742..2d5924d39221 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -21,11 +21,21 @@
 #include <linux/fileattr.h>
 #include <uapi/linux/magic.h>
 #include <uapi/linux/limits.h>
+#include <uapi/linux/mman.h>
 
 static struct super_block *msharefs_sb;
 
+static const struct inode_operations msharefs_dir_inode_ops;
+static const struct inode_operations msharefs_file_inode_ops;
+
+static int
+msharefs_open(struct inode *inode, struct file *file)
+{
+	return simple_open(inode, file);
+}
+
 static const struct file_operations msharefs_file_operations = {
-	.open		= simple_open,
+	.open		= msharefs_open,
 	.llseek		= no_llseek,
 };
 
@@ -42,6 +52,113 @@ msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
 	return 0;
 }
 
+static struct dentry
+*msharefs_alloc_dentry(struct dentry *parent, const char *name)
+{
+	struct dentry *d;
+	struct qstr q;
+	int err;
+
+	q.name = name;
+	q.len = strlen(name);
+
+	err = msharefs_d_hash(parent, &q);
+	if (err)
+		return ERR_PTR(err);
+
+	d = d_alloc(parent, &q);
+	if (d)
+		return d;
+
+	return ERR_PTR(-ENOMEM);
+}
+
+static struct inode
+*msharefs_get_inode(struct super_block *sb, const struct inode *dir,
+			umode_t mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (inode) {
+		inode->i_ino = get_next_ino();
+		inode_init_owner(&init_user_ns, inode, dir, mode);
+
+		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+
+		switch (mode & S_IFMT) {
+		case S_IFREG:
+			inode->i_op = &msharefs_file_inode_ops;
+			inode->i_fop = &msharefs_file_operations;
+			break;
+		case S_IFDIR:
+			inode->i_op = &msharefs_dir_inode_ops;
+			inode->i_fop = &simple_dir_operations;
+			inc_nlink(inode);
+			break;
+		case S_IFLNK:
+			inode->i_op = &page_symlink_inode_operations;
+			break;
+		default:
+			discard_new_inode(inode);
+			inode = NULL;
+			break;
+		}
+	}
+
+	return inode;
+}
+
+static int
+msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
+		struct dentry *dentry, umode_t mode, dev_t dev)
+{
+	struct inode *inode;
+	int err = 0;
+
+	inode = msharefs_get_inode(dir->i_sb, dir, mode);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+	dir->i_mtime = dir->i_ctime = current_time(dir);
+
+	return err;
+}
+
+static int
+msharefs_create(struct user_namespace *mnt_userns, struct inode *dir,
+		struct dentry *dentry, umode_t mode, bool excl)
+{
+	return msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFREG, 0);
+}
+
+static int
+msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
+		struct dentry *dentry, umode_t mode)
+{
+	int ret = msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFDIR, 0);
+
+	if (!ret)
+		inc_nlink(dir);
+	return ret;
+}
+
+static const struct inode_operations msharefs_file_inode_ops = {
+	.setattr	= simple_setattr,
+	.getattr	= simple_getattr,
+};
+static const struct inode_operations msharefs_dir_inode_ops = {
+	.create		= msharefs_create,
+	.lookup		= simple_lookup,
+	.link		= simple_link,
+	.unlink		= simple_unlink,
+	.mkdir		= msharefs_mkdir,
+	.rmdir		= simple_rmdir,
+	.mknod		= msharefs_mknod,
+	.rename		= simple_rename,
+};
+
 static void
 mshare_evict_inode(struct inode *inode)
 {
@@ -58,7 +175,7 @@ mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
 {
 	char s[80];
 
-	sprintf(s, "%ld", PGDIR_SIZE);
+	sprintf(s, "%ld\n", PGDIR_SIZE);
 	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
 }
 
@@ -72,6 +189,38 @@ static const struct super_operations mshare_s_ops = {
 	.evict_inode = mshare_evict_inode,
 };
 
+static int
+prepopulate_files(struct super_block *s, struct inode *dir,
+			struct dentry *root, const struct tree_descr *files)
+{
+	int i;
+	struct inode *inode;
+	struct dentry *dentry;
+
+	for (i = 0; !files->name || files->name[0]; i++, files++) {
+		if (!files->name)
+			continue;
+
+		dentry = msharefs_alloc_dentry(root, files->name);
+		if (!dentry)
+			return -ENOMEM;
+
+		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode);
+		if (!inode) {
+			dput(dentry);
+			return -ENOMEM;
+		}
+		inode->i_mode = S_IFREG | files->mode;
+		inode->i_atime = inode->i_mtime = inode->i_ctime
+			= current_time(inode);
+		inode->i_fop = files->ops;
+		inode->i_ino = i;
+		d_add(dentry, inode);
+	}
+
+	return 0;
+}
+
 static int
 msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
@@ -79,21 +228,49 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 		[2] = { "mshare_info", &mshare_info_ops, 0444},
 		{""},
 	};
-	int err;
+	struct inode *inode;
+	struct dentry *root;
+	int err = 0;
 
-	err = simple_fill_super(sb, MSHARE_MAGIC, mshare_files);
-	if (!err) {
-		msharefs_sb = sb;
-		sb->s_d_op = &msharefs_d_ops;
-		sb->s_op = &mshare_s_ops;
+	sb->s_blocksize		= PAGE_SIZE;
+	sb->s_blocksize_bits	= PAGE_SHIFT;
+	sb->s_magic		= MSHARE_MAGIC;
+	sb->s_op		= &mshare_s_ops;
+	sb->s_d_op		= &msharefs_d_ops;
+	sb->s_time_gran		= 1;
+
+	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777);
+	if (!inode) {
+		err = -ENOMEM;
+		goto out;
 	}
+	inode->i_ino = 1;
+	root = d_make_root(inode);
+	if (!root) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = prepopulate_files(sb, inode, root, mshare_files);
+	if (err < 0)
+		goto clean_root;
+
+	sb->s_root = root;
+	msharefs_sb = sb;
+	return err;
+
+clean_root:
+	d_genocide(root);
+	shrink_dcache_parent(root);
+	dput(root);
+out:
 	return err;
 }
 
 static int
 msharefs_get_tree(struct fs_context *fc)
 {
-	return get_tree_single(fc, msharefs_fill_super);
+	return get_tree_nodev(fc, msharefs_fill_super);
 }
 
 static const struct fs_context_operations msharefs_context_ops = {
-- 
2.32.0



* [PATCH v2 4/9] mm/mshare: Add a read operation for msharefs files
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (2 preceding siblings ...)
  2022-06-29 22:53 ` [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-06-30 21:27   ` Darrick J. Wong
  2022-06-29 22:53 ` [PATCH v2 5/9] mm/mshare: Add vm flag for shared PTE Khalid Aziz
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

When a new file is created under msharefs, allocate a new mm_struct
that will hold the VMAs for the mshare region. Also allocate a
structure that defines the mshare region, and add a read operation
to the file that returns this information about the mshare region.
Currently this information is returned as a struct:

struct mshare_info {
	unsigned long start;
	unsigned long size;
};

This gives the start address of the mshare region and its size.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 include/uapi/linux/mman.h |  5 +++
 mm/mshare.c               | 64 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..56fe446e24b1 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -41,4 +41,9 @@
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+struct mshare_info {
+	unsigned long start;
+	unsigned long size;
+};
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/mm/mshare.c b/mm/mshare.c
index 2d5924d39221..d238b68b0576 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -22,8 +22,14 @@
 #include <uapi/linux/magic.h>
 #include <uapi/linux/limits.h>
 #include <uapi/linux/mman.h>
+#include <linux/sched/mm.h>
 
 static struct super_block *msharefs_sb;
+struct mshare_data {
+	struct mm_struct *mm;
+	refcount_t refcnt;
+	struct mshare_info *minfo;
+};
 
 static const struct inode_operations msharefs_dir_inode_ops;
 static const struct inode_operations msharefs_file_inode_ops;
@@ -34,8 +40,29 @@ msharefs_open(struct inode *inode, struct file *file)
 	return simple_open(inode, file);
 }
 
+static ssize_t
+msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
+{
+	struct mshare_data *info = iocb->ki_filp->private_data;
+	size_t ret;
+	struct mshare_info m_info;
+
+	if (info->minfo != NULL) {
+		m_info.start = info->minfo->start;
+		m_info.size = info->minfo->size;
+	} else {
+		m_info.start = 0;
+		m_info.size = 0;
+	}
+	ret = copy_to_iter(&m_info, sizeof(m_info), iov);
+	if (!ret)
+		return -EFAULT;
+	return ret;
+}
+
 static const struct file_operations msharefs_file_operations = {
 	.open		= msharefs_open,
+	.read_iter	= msharefs_read,
 	.llseek		= no_llseek,
 };
 
@@ -73,12 +100,43 @@ static struct dentry
 	return ERR_PTR(-ENOMEM);
 }
 
+static int
+msharefs_fill_mm(struct inode *inode)
+{
+	struct mm_struct *mm;
+	struct mshare_data *info = NULL;
+	int retval = 0;
+
+	mm = mm_alloc();
+	if (!mm) {
+		retval = -ENOMEM;
+		goto err_free;
+	}
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		retval = -ENOMEM;
+		goto err_free;
+	}
+	info->mm = mm;
+	info->minfo = NULL;
+	refcount_set(&info->refcnt, 1);
+	inode->i_private = info;
+
+	return 0;
+
+err_free:
+	if (mm)
+		mmput(mm);
+	kfree(info);
+	return retval;
+}
+
 static struct inode
 *msharefs_get_inode(struct super_block *sb, const struct inode *dir,
 			umode_t mode)
 {
 	struct inode *inode = new_inode(sb);
-
 	if (inode) {
 		inode->i_ino = get_next_ino();
 		inode_init_owner(&init_user_ns, inode, dir, mode);
@@ -89,6 +147,10 @@ static struct inode
 		case S_IFREG:
 			inode->i_op = &msharefs_file_inode_ops;
 			inode->i_fop = &msharefs_file_operations;
+			if (msharefs_fill_mm(inode) != 0) {
+				discard_new_inode(inode);
+				inode = ERR_PTR(-ENOMEM);
+			}
 			break;
 		case S_IFDIR:
 			inode->i_op = &msharefs_dir_inode_ops;
-- 
2.32.0



* [PATCH v2 5/9] mm/mshare: Add vm flag for shared PTE
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (3 preceding siblings ...)
  2022-06-29 22:53 ` [PATCH v2 4/9] mm/mshare: Add a read operation for msharefs files Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-06-30 14:59   ` Mark Hemment
  2022-06-29 22:53 ` [PATCH v2 6/9] mm/mshare: Add mmap operation Khalid Aziz
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Add a bit to vm_flags to indicate that a vma shares PTEs with
others. Add a function to determine whether a vma shares PTEs by
checking this flag. This will be used on page fault to find the
shared page table entries for vmas sharing PTEs.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 3 ++-
 mm/internal.h                  | 5 +++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..0ddc3057f73b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -310,11 +310,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -356,6 +358,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MTE_ALLOWED	VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_SHARED_PT	VM_HIGH_ARCH_5
+#else
+#define VM_SHARED_PT	0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index e87cb2b80ed3..30e56cbac99b 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -194,7 +194,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
-	{VM_MERGEABLE,			"mergeable"	}		\
+	{VM_MERGEABLE,			"mergeable"	},		\
+	{VM_SHARED_PT,			"sharedpt"	}		\
 
 #define show_vma_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..3f2790aea918 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -861,4 +861,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+static inline bool vma_is_shared(const struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_SHARED_PT;
+}
+
 #endif	/* __MM_INTERNAL_H */
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 6/9] mm/mshare: Add mmap operation
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (4 preceding siblings ...)
  2022-06-29 22:53 ` [PATCH v2 5/9] mm/mshare: Add vm flag for shared PTE Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-06-30 21:44   ` Darrick J. Wong
  2022-06-29 22:53 ` [PATCH v2 7/9] mm/mshare: Add unlink and munmap support Khalid Aziz
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

mmap is used to establish the address range for an mshare region and
map the region into the process's address space. Add a basic mmap
operation that supports setting the address range. Also fix the code
so it does not allocate a new mm_struct for msharefs files that exist
only to provide information and do not define a new mshare region.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/mshare.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index d238b68b0576..088a6cab1e93 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -9,7 +9,8 @@
  *
  *
  * Copyright (C) 2022 Oracle Corp. All rights reserved.
- * Author:	Khalid Aziz <khalid.aziz@oracle.com>
+ * Authors:	Khalid Aziz <khalid.aziz@oracle.com>
+ *		Matthew Wilcox <willy@infradead.org>
  *
  */
 
@@ -60,9 +61,36 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
 	return ret;
 }
 
+static int
+msharefs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct mshare_data *info = file->private_data;
+	struct mm_struct *mm = info->mm;
+
+	/*
+	 * If this mshare region has been set up once already, bail out
+	 */
+	if (mm->mmap_base != 0)
+		return -EINVAL;
+
+	if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
+		return -EINVAL;
+
+	mm->mmap_base = vma->vm_start;
+	mm->task_size = vma->vm_end - vma->vm_start;
+	if (!mm->task_size)
+		mm->task_size--;
+	info->minfo->start = mm->mmap_base;
+	info->minfo->size = mm->task_size;
+	vma->vm_flags |= VM_SHARED_PT;
+	vma->vm_private_data = info;
+	return 0;
+}
+
 static const struct file_operations msharefs_file_operations = {
 	.open		= msharefs_open,
 	.read_iter	= msharefs_read,
+	.mmap		= msharefs_mmap,
 	.llseek		= no_llseek,
 };
 
@@ -119,7 +147,12 @@ msharefs_fill_mm(struct inode *inode)
 		goto err_free;
 	}
 	info->mm = mm;
-	info->minfo = NULL;
+	info->minfo = kzalloc(sizeof(struct mshare_info), GFP_KERNEL);
+	if (info->minfo == NULL) {
+		retval = -ENOMEM;
+		goto err_free;
+	}
+
 	refcount_set(&info->refcnt, 1);
 	inode->i_private = info;
 
@@ -128,13 +161,14 @@ msharefs_fill_mm(struct inode *inode)
 err_free:
 	if (mm)
 		mmput(mm);
+	kfree(info->minfo);
 	kfree(info);
 	return retval;
 }
 
 static struct inode
 *msharefs_get_inode(struct super_block *sb, const struct inode *dir,
-			umode_t mode)
+			umode_t mode, bool newmm)
 {
 	struct inode *inode = new_inode(sb);
 	if (inode) {
@@ -147,7 +181,7 @@ static struct inode
 		case S_IFREG:
 			inode->i_op = &msharefs_file_inode_ops;
 			inode->i_fop = &msharefs_file_operations;
-			if (msharefs_fill_mm(inode) != 0) {
+			if (newmm && msharefs_fill_mm(inode) != 0) {
 				discard_new_inode(inode);
 				inode = ERR_PTR(-ENOMEM);
 			}
@@ -177,7 +211,7 @@ msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
 	struct inode *inode;
 	int err = 0;
 
-	inode = msharefs_get_inode(dir->i_sb, dir, mode);
+	inode = msharefs_get_inode(dir->i_sb, dir, mode, true);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
@@ -267,7 +301,7 @@ prepopulate_files(struct super_block *s, struct inode *dir,
 		if (!dentry)
 			return -ENOMEM;
 
-		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode);
+		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode, false);
 		if (!inode) {
 			dput(dentry);
 			return -ENOMEM;
@@ -301,7 +335,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_d_op		= &msharefs_d_ops;
 	sb->s_time_gran		= 1;
 
-	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777);
+	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777, false);
 	if (!inode) {
 		err = -ENOMEM;
 		goto out;
-- 
2.32.0



* [PATCH v2 7/9] mm/mshare: Add unlink and munmap support
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (5 preceding siblings ...)
  2022-06-29 22:53 ` [PATCH v2 6/9] mm/mshare: Add mmap operation Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-06-30 21:50   ` Darrick J. Wong
  2022-06-29 22:53 ` [PATCH v2 8/9] mm/mshare: Add basic page table sharing support Khalid Aziz
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

The number of mappings of an mshare region should be tracked so the
region can be removed when there are no more references to it and the
associated file has been deleted. This adds code to support the unlink
operation for the associated file, removes the mshare region on file
deletion if its refcount drops to zero, and adds a munmap operation
that maintains the refcount on the mshare region and removes the
region on the last munmap if the file has been deleted.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 088a6cab1e93..90ce0564a138 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -29,6 +29,7 @@ static struct super_block *msharefs_sb;
 struct mshare_data {
 	struct mm_struct *mm;
 	refcount_t refcnt;
+	int deleted;
 	struct mshare_info *minfo;
 };
 
@@ -48,6 +49,7 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
 	size_t ret;
 	struct mshare_info m_info;
 
+	mmap_read_lock(info->mm);
 	if (info->minfo != NULL) {
 		m_info.start = info->minfo->start;
 		m_info.size = info->minfo->size;
@@ -55,18 +57,42 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
 		m_info.start = 0;
 		m_info.size = 0;
 	}
+	mmap_read_unlock(info->mm);
 	ret = copy_to_iter(&m_info, sizeof(m_info), iov);
 	if (!ret)
 		return -EFAULT;
 	return ret;
 }
 
+static void
+msharefs_close(struct vm_area_struct *vma)
+{
+	struct mshare_data *info = vma->vm_private_data;
+
+	if (refcount_dec_and_test(&info->refcnt)) {
+		mmap_read_lock(info->mm);
+		if (info->deleted) {
+			mmap_read_unlock(info->mm);
+			mmput(info->mm);
+			kfree(info->minfo);
+			kfree(info);
+		} else {
+			mmap_read_unlock(info->mm);
+		}
+	}
+}
+
+static const struct vm_operations_struct msharefs_vm_ops = {
+	.close	= msharefs_close,
+};
+
 static int
 msharefs_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct mshare_data *info = file->private_data;
 	struct mm_struct *mm = info->mm;
 
+	mmap_write_lock(mm);
 	/*
 	 * If this mshare region has been set up once already, bail out
 	 */
@@ -80,10 +106,14 @@ msharefs_mmap(struct file *file, struct vm_area_struct *vma)
 	mm->task_size = vma->vm_end - vma->vm_start;
 	if (!mm->task_size)
 		mm->task_size--;
+	mmap_write_unlock(mm);
 	info->minfo->start = mm->mmap_base;
 	info->minfo->size = mm->task_size;
+	info->deleted = 0;
+	refcount_inc(&info->refcnt);
 	vma->vm_flags |= VM_SHARED_PT;
 	vma->vm_private_data = info;
+	vma->vm_ops = &msharefs_vm_ops;
 	return 0;
 }
 
@@ -240,6 +270,38 @@ msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	return ret;
 }
 
+static int
+msharefs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	struct mshare_data *info = inode->i_private;
+
+	/*
+	 * Unmap the mshare region if it is still mapped in
+	 */
+	vm_munmap(info->minfo->start, info->minfo->size);
+
+	/*
+	 * Mark msharefs file for deletion so it can not be opened
+	 * and used for mshare mappings any more
+	 */
+	simple_unlink(dir, dentry);
+	mmap_write_lock(info->mm);
+	info->deleted = 1;
+	mmap_write_unlock(info->mm);
+
+	/*
+	 * Is this the last reference? If so, delete mshare region and
+	 * remove the file
+	 */
+	if (!refcount_dec_and_test(&info->refcnt)) {
+		mmput(info->mm);
+		kfree(info->minfo);
+		kfree(info);
+	}
+	return 0;
+}
+
 static const struct inode_operations msharefs_file_inode_ops = {
 	.setattr	= simple_setattr,
 	.getattr	= simple_getattr,
@@ -248,7 +310,7 @@ static const struct inode_operations msharefs_dir_inode_ops = {
 	.create		= msharefs_create,
 	.lookup		= simple_lookup,
 	.link		= simple_link,
-	.unlink		= simple_unlink,
+	.unlink		= msharefs_unlink,
 	.mkdir		= msharefs_mkdir,
 	.rmdir		= simple_rmdir,
 	.mknod		= msharefs_mknod,
-- 
2.32.0



* [PATCH v2 8/9] mm/mshare: Add basic page table sharing support
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (6 preceding siblings ...)
  2022-06-29 22:53 ` [PATCH v2 7/9] mm/mshare: Add unlink and munmap support Khalid Aziz
@ 2022-06-29 22:53 ` Khalid Aziz
  2022-07-07  9:13   ` Xin Hao
  2022-06-29 22:54 ` [PATCH v2 9/9] mm/mshare: Enable mshare region mapping across processes Khalid Aziz
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:53 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Add support for creating a new set of shared page tables in a new
mm_struct upon mmap of an mshare region. Add page fault handling for
this now mshare'd region. Modify the exit_mmap path to make sure page
tables in mshare'd regions are kept intact when a process using an
mshare'd region exits. Clean up the mshare mm_struct when the mshare
region is deleted. This support is for the process creating the
mshare region only. Subsequent patches will add support for other
processes to be able to map the mshare region.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   2 +
 mm/memory.c        | 101 +++++++++++++++++++++++++++++-
 mm/mshare.c        | 149 ++++++++++++++++++++++++++++++++++++---------
 4 files changed, 222 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0ddc3057f73b..63887f06b37b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1859,6 +1859,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
+int
+mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 int follow_pte(struct mm_struct *mm, unsigned long address,
 	       pte_t **ptepp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/internal.h b/mm/internal.h
index 3f2790aea918..6ae7063ac10d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -861,6 +861,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
+					unsigned long *addrp);
 static inline bool vma_is_shared(const struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_SHARED_PT;
diff --git a/mm/memory.c b/mm/memory.c
index 7a089145cad4..2a8d5b8928f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -416,15 +416,20 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unlink_anon_vmas(vma);
 		unlink_file_vma(vma);
 
+		/*
+		 * There is no page table to be freed for vmas that
+		 * are mapped in mshare regions
+		 */
 		if (is_vm_hugetlb_page(vma)) {
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
-		} else {
+		} else if (!vma_is_shared(vma)) {
 			/*
 			 * Optimization: gather nearby vmas into one call down
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !is_vm_hugetlb_page(next)
+			       && !vma_is_shared(next)) {
 				vma = next;
 				next = vma->vm_next;
 				unlink_anon_vmas(vma);
@@ -1260,6 +1265,54 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	return false;
 }
 
+/*
+ * Copy PTEs for mshare'd pages.
+ * This code is based upon copy_page_range()
+ */
+int
+mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
+{
+	pgd_t *src_pgd, *dst_pgd;
+	unsigned long next;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
+	struct mmu_notifier_range range;
+	int ret = 0;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
+				0, src_vma, src_mm, addr, end);
+	mmu_notifier_invalidate_range_start(&range);
+	/*
+	 * Disabling preemption is not needed for the write side, as
+	 * the read side doesn't spin, but goes to the mmap_lock.
+	 *
+	 * Use the raw variant of the seqcount_t write API to avoid
+	 * lockdep complaining about preemptibility.
+	 */
+	mmap_assert_write_locked(src_mm);
+	raw_write_seqcount_begin(&src_mm->write_protect_seq);
+
+	dst_pgd = pgd_offset(dst_mm, addr);
+	src_pgd = pgd_offset(src_mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(src_pgd))
+			continue;
+		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
+					    addr, next))) {
+			ret = -ENOMEM;
+			break;
+		}
+	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	raw_write_seqcount_end(&src_mm->write_protect_seq);
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
@@ -1628,6 +1681,13 @@ void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
+	/*
+	 * No need to unmap vmas that share page table through
+	 * mshare region
+	 */
+	if (vma_is_shared(vma))
+		return;
+
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -5113,6 +5173,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
+	bool shared = false;
+	struct mm_struct *orig_mm;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -5122,6 +5184,16 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
 
+	orig_mm = vma->vm_mm;
+	if (unlikely(vma_is_shared(vma))) {
+		ret = find_shared_vma(&vma, &address);
+		if (ret)
+			return ret;
+		if (!vma)
+			return VM_FAULT_SIGSEGV;
+		shared = true;
+	}
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
@@ -5139,6 +5211,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	/*
+	 * Release the read lock on shared VMA's parent mm unless
+	 * __handle_mm_fault released the lock already.
+	 * __handle_mm_fault sets VM_FAULT_RETRY in return value if
+	 * it released mmap lock. If lock was released, that implies
+	 * the lock would have been released on task's original mm if
+	 * this were not a shared PTE vma. To keep lock state consistent,
+	 * make sure to release the lock on task's original mm
+	 */
+	if (shared) {
+		int release_mmlock = 1;
+
+		if (!(ret & VM_FAULT_RETRY)) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
+			(flags & FAULT_FLAG_RETRY_NOWAIT)) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		}
+
+		if (release_mmlock)
+			mmap_read_unlock(orig_mm);
+	}
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mshare.c b/mm/mshare.c
index 90ce0564a138..2ec0e56ffd69 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -15,7 +15,7 @@
  */
 
 #include <linux/fs.h>
-#include <linux/mount.h>
+#include <linux/mm.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/pseudo_fs.h>
@@ -24,6 +24,7 @@
 #include <uapi/linux/limits.h>
 #include <uapi/linux/mman.h>
 #include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
 
 static struct super_block *msharefs_sb;
 struct mshare_data {
@@ -33,6 +34,43 @@ struct mshare_data {
 	struct mshare_info *minfo;
 };
 
+/* Returns holding the host mm's lock for read.  Caller must release. */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
+{
+	struct vm_area_struct *vma, *guest = *vmap;
+	struct mshare_data *info = guest->vm_private_data;
+	struct mm_struct *host_mm = info->mm;
+	unsigned long host_addr;
+	pgd_t *pgd, *guest_pgd;
+
+	mmap_read_lock(host_mm);
+	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
+	pgd = pgd_offset(host_mm, host_addr);
+	guest_pgd = pgd_offset(guest->vm_mm, *addrp);
+	if (!pgd_same(*guest_pgd, *pgd)) {
+		set_pgd(guest_pgd, *pgd);
+		mmap_read_unlock(host_mm);
+		return VM_FAULT_NOPAGE;
+	}
+
+	*addrp = host_addr;
+	vma = find_vma(host_mm, host_addr);
+
+	/* XXX: expand stack? */
+	if (vma && vma->vm_start > host_addr)
+		vma = NULL;
+
+	*vmap = vma;
+
+	/*
+	 * release host mm lock unless a matching vma is found
+	 */
+	if (!vma)
+		mmap_read_unlock(host_mm);
+	return 0;
+}
+
 static const struct inode_operations msharefs_dir_inode_ops;
 static const struct inode_operations msharefs_file_inode_ops;
 
@@ -64,6 +102,14 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
 	return ret;
 }
 
+static void
+msharefs_delmm(struct mshare_data *info)
+{
+	mmput(info->mm);
+	kfree(info->minfo);
+	kfree(info);
+}
+
 static void
 msharefs_close(struct vm_area_struct *vma)
 {
@@ -73,9 +119,7 @@ msharefs_close(struct vm_area_struct *vma)
 		mmap_read_lock(info->mm);
 		if (info->deleted) {
 			mmap_read_unlock(info->mm);
-			mmput(info->mm);
-			kfree(info->minfo);
-			kfree(info);
+			msharefs_delmm(info);
 		} else {
 			mmap_read_unlock(info->mm);
 		}
@@ -90,31 +134,80 @@ static int
 msharefs_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct mshare_data *info = file->private_data;
-	struct mm_struct *mm = info->mm;
+	struct mm_struct *new_mm = info->mm;
+	int err = 0;
 
-	mmap_write_lock(mm);
+	mmap_write_lock(new_mm);
 	/*
-	 * If this mshare region has been set up once already, bail out
+	 * If this mshare region has not been set up, set up the
+	 * applicable address range for the region and prepare for
+	 * page table sharing
 	 */
-	if (mm->mmap_base != 0)
+	if (new_mm->mmap_base != 0) {
 		return -EINVAL;
+	} else {
+		struct mm_struct *old_mm;
+		struct vm_area_struct *new_vma;
+
+		if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
+			return -EINVAL;
+
+		old_mm = current->mm;
+		mmap_assert_write_locked(old_mm);
+		new_mm->mmap_base = vma->vm_start;
+		new_mm->task_size = vma->vm_end - vma->vm_start;
+		if (!new_mm->task_size)
+			new_mm->task_size--;
+		info->minfo->start = new_mm->mmap_base;
+		info->minfo->size = new_mm->task_size;
+		info->deleted = 0;
+		refcount_inc(&info->refcnt);
+
+		/*
+		 * Mark current VMA as shared and copy over to mshare
+		 * mm_struct
+		 */
+		vma->vm_private_data = info;
+		new_vma = vm_area_dup(vma);
+		if (!new_vma) {
+			vma->vm_private_data = NULL;
+			mmap_write_unlock(new_mm);
+			err = -ENOMEM;
+			goto err_out;
+		}
+		vma->vm_flags |= (VM_SHARED_PT|VM_SHARED);
+		vma->vm_ops = &msharefs_vm_ops;
+
+		/*
+		 * Newly created mshare mapping is anonymous mapping
+		 */
+		new_vma->vm_mm = new_mm;
+		vma_set_anonymous(new_vma);
+		new_vma->vm_file = NULL;
+		new_vma->vm_flags &= ~VM_SHARED;
+
+		/*
+		 * Do not use THP for mshare region
+		 */
+		new_vma->vm_flags |= VM_NOHUGEPAGE;
+		err = insert_vm_struct(new_mm, new_vma);
+		if (err) {
+			mmap_write_unlock(new_mm);
+			err = -ENOMEM;
+			goto err_out;
+		}
 
-	if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
-		return -EINVAL;
+		/*
+		 * Copy over current PTEs
+		 */
+		err = mshare_copy_ptes(new_vma, vma);
+	}
 
-	mm->mmap_base = vma->vm_start;
-	mm->task_size = vma->vm_end - vma->vm_start;
-	if (!mm->task_size)
-		mm->task_size--;
-	mmap_write_unlock(mm);
-	info->minfo->start = mm->mmap_base;
-	info->minfo->size = mm->task_size;
-	info->deleted = 0;
-	refcount_inc(&info->refcnt);
-	vma->vm_flags |= VM_SHARED_PT;
-	vma->vm_private_data = info;
-	vma->vm_ops = &msharefs_vm_ops;
-	return 0;
+	mmap_write_unlock(new_mm);
+	return err;
+
+err_out:
+	return err;
 }
 
 static const struct file_operations msharefs_file_operations = {
@@ -291,14 +384,10 @@ msharefs_unlink(struct inode *dir, struct dentry *dentry)
 	mmap_write_unlock(info->mm);
 
 	/*
-	 * Is this the last reference? If so, delete mshare region and
-	 * remove the file
+	 * Is this the last reference? If so, delete mshare region
 	 */
-	if (!refcount_dec_and_test(&info->refcnt)) {
-		mmput(info->mm);
-		kfree(info->minfo);
-		kfree(info);
-	}
+	if (refcount_dec_and_test(&info->refcnt))
+		msharefs_delmm(info);
 	return 0;
 }
 
-- 
2.32.0



* [PATCH v2 9/9] mm/mshare: Enable mshare region mapping across processes
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (7 preceding siblings ...)
  2022-06-29 22:53 ` [PATCH v2 8/9] mm/mshare: Add basic page table sharing support Khalid Aziz
@ 2022-06-29 22:54 ` Khalid Aziz
  2022-06-30 11:57 ` [PATCH v2 0/9] Add support for shared PTEs " Mark Hemment
  2022-07-02  4:24 ` Andrew Morton
  10 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-29 22:54 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

This patch enables processes that did not create the mshare region
to map the region using mmap().

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 2ec0e56ffd69..455b10ca0cdf 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -144,7 +144,21 @@ msharefs_mmap(struct file *file, struct vm_area_struct *vma)
 	 * page table sharing
 	 */
 	if (new_mm->mmap_base != 0) {
-		return -EINVAL;
+		/*
+		 * Any mappings of mshare region must use exact same
+		 * virtual addresses
+		 */
+		if ((vma->vm_start != new_mm->mmap_base) ||
+			(new_mm->task_size != (vma->vm_end - vma->vm_start)))
+			return -EINVAL;
+
+		vma->vm_private_data = info;
+		/*
+		 * mshare pages are shared pages that also share page table
+		 */
+		vma->vm_flags |= (VM_SHARED_PT|VM_SHARED);
+		vma->vm_ops = &msharefs_vm_ops;
+		refcount_inc(&info->refcnt);
 	} else {
 		struct mm_struct *old_mm;
 		struct vm_area_struct *new_vma;
-- 
2.32.0



* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (8 preceding siblings ...)
  2022-06-29 22:54 ` [PATCH v2 9/9] mm/mshare: Enable mshare region mapping across processes Khalid Aziz
@ 2022-06-30 11:57 ` Mark Hemment
  2022-06-30 15:39   ` Khalid Aziz
  2022-07-02  4:24 ` Andrew Morton
  10 siblings, 1 reply; 43+ messages in thread
From: Mark Hemment @ 2022-06-30 11:57 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, Kees Cook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, Linux-MM, longpeng2, luto, pcc,
	rppt, sieberf, sjpark, Suren Baghdasaryan, tst, yzaikin

Hi Khalid,

On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
>
> Memory pages shared between processes require a page table entry
> (PTE) for each process. Each of these PTEs consumes some of
> the memory and as long as number of mappings being maintained is
> small enough, this space consumed by page tables is not
> objectionable. When very few memory pages are shared between
> processes, the number of page table entries (PTEs) to maintain is
> mostly constrained by the number of pages of memory on the system.
> As the number of shared pages and the number of times pages are
> shared goes up, amount of memory consumed by page tables starts to
> become significant. This issue does not apply to threads. Any number
> of threads can share the same pages inside a process while sharing
> the same PTEs. Extending this same model to sharing pages across
> processes can eliminate this issue for sharing across processes as
> well.
>
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
>
> This patch series implements a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. It adds a new
> in-memory filesystem - msharefs. A file created on msharefs creates
> a new shared region where all processes sharing that region will
> share the PTEs as well. A process can create a new file on msharefs
> and then mmap it which assigns a starting address and size to this
> mshare'd region. Another process that has the right permission to
> open the file on msharefs can then mmap this file in its address
> space at same virtual address and size and share this region through
> shared PTEs. An unlink() on the file marks the mshare'd region for
> deletion once there are no more users of the region. When the mshare
> region is deleted, all the pages used by the region are freed.

  Noting the flexibility of 'mshare' has been reduced from v1.  The
earlier version allowed msharing of named mappings, while this patch
is only for anonymous mappings.
  Any plans to support named mappings?  If not, I guess *someone* will
want it (eventually).  Minor, as the patch does not introduce new
syscalls, but having an API which is flexible for both named and anon
mappings would be good (this is a nit, not a strong suggestion).

  The cover letter details the problem being solved and the API, but
gives no details of the implementation.  A paragraph on the use of a
mm_struct per-msharefs file would be helpful.

  I've only quickly scanned the patchset; not in enough detail to
comment on each patch, but a few observations.

  o I was expecting to see mprotect() against a mshared vma to either
be disallowed or code to support the splitting of a mshared vma.  I
didn't see either.

  o For the case where the mshare file has been closed/unmmap but not
unlinked, a 'mshare_data' structure will leaked when the inode is
evicted.

  o The alignment requirement is PGDIR_SIZE, which is very large.
Should/could this be PMD_SIZE?

  o mshare should be a conditional feature (CONFIG_MSHARE ?).


  I might get a chance do a finer grain review later/tomorrow.

> API
> ===
>
> mshare does not introduce a new API. It instead uses existing APIs
> to implement page table sharing. The steps to use this feature are:
>
> 1. Mount msharefs on /sys/fs/mshare -
>         mount -t msharefs msharefs /sys/fs/mshare
>
> 2. mshare regions have alignment and size requirements. Start
>    address for the region must be aligned to an address boundary and
>    be a multiple of fixed size. This alignment and size requirement
>    can be obtained by reading the file /sys/fs/mshare/mshare_info
>    which returns a number in text format. mshare regions must be
>    aligned to this boundary and be a multiple of this size.
>
> 3. For the process creating mshare region:
>         a. Create a file on /sys/fs/mshare, for example -
>                 fd = open("/sys/fs/mshare/shareme",
>                                 O_RDWR|O_CREAT|O_EXCL, 0600);
>
>         b. mmap this file to establish starting address and size -
>                 mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>                         MAP_SHARED, fd, 0);
>
>         c. Write and read to mshared region normally.
>
> 4. For processes attaching to mshare'd region:
>         a. Open the file on msharefs, for example -
>                 fd = open("/sys/fs/mshare/shareme", O_RDWR);
>
>         b. Get information about mshare'd region from the file:
>                 struct mshare_info {
>                         unsigned long start;
>                         unsigned long size;
>                 } m_info;
>
>                 read(fd, &m_info, sizeof(m_info));
>
>         c. mmap the mshare'd region -
>                 mmap(m_info.start, m_info.size,
>                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>
> 5. To delete the mshare region -
>                 unlink("/sys/fs/mshare/shareme");
>
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
>         read(fd, req, 128);
>         alignsize = atoi(req);
>         close(fd);
>         fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
>         start = alignsize * 4;
>         size = alignsize * 2;
>         addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);

Typo, missing 'fd'; MAP_SHARED | MAP_ANONYMOUS, fd, 0)

>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
>
>         strncpy(addr, "Some random shared text",
>                         sizeof("Some random shared text"));
> -----------------
>
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         struct mshare_info {
>                 unsigned long start;
>                 unsigned long size;
>         } minfo;
>
>
>         fd = open("/sys/fs/mshare/shareme", O_RDONLY);
>
>         if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
>                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
>                                 minfo.size, minfo.start);
>
>         addr = mmap(minfo.start, minfo.size, PROT_READ | PROT_WRITE,
>                         MAP_SHARED, fd, 0);
>
>         printf("Guest mmap at %p:\n", addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
> -----------------
>
>
>
> v1 -> v2:
>         - Eliminated mshare and mshare_unlink system calls and
>           replaced API with standard mmap and unlink (Based upon
>           v1 patch discussions and LSF/MM discussions)
>         - All fd based API (based upon feedback and suggestions from
>           Andy Lutomirski, Eric Biederman, Kirill and others)
>         - Added a file /sys/fs/mshare/mshare_info to provide
>           alignment and size requirement info (based upon feedback
>           from Dave Hansen, Mark Hemment and discussions at LSF/MM)
>         - Addressed TODOs in v1
>         - Added support for directories in msharefs
>         - Added locks around any time vma is touched (Dave Hansen)
>         - Eliminated the need to point vm_mm in original vmas to the
>           newly synthesized mshare mm
>         - Ensured mmap_read_unlock is called for correct mm in
>           handle_mm_fault (Dave Hansen)
>
> Khalid Aziz (9):
>   mm: Add msharefs filesystem
>   mm/mshare: pre-populate msharefs with information file
>   mm/mshare: make msharefs writable and support directories
>   mm/mshare: Add a read operation for msharefs files
>   mm/mshare: Add vm flag for shared PTE
>   mm/mshare: Add mmap operation
>   mm/mshare: Add unlink and munmap support
>   mm/mshare: Add basic page table sharing support
>   mm/mshare: Enable mshare region mapping across processes
>
>  Documentation/filesystems/msharefs.rst |  19 +
>  include/linux/mm.h                     |  10 +
>  include/trace/events/mmflags.h         |   3 +-
>  include/uapi/linux/magic.h             |   1 +
>  include/uapi/linux/mman.h              |   5 +
>  mm/Makefile                            |   2 +-
>  mm/internal.h                          |   7 +
>  mm/memory.c                            | 101 ++++-
>  mm/mshare.c                            | 575 +++++++++++++++++++++++++
>  9 files changed, 719 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/filesystems/msharefs.rst
>  create mode 100644 mm/mshare.c
>
> --
> 2.32.0

Cheers,
Mark

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 5/9] mm/mshare: Add vm flag for shared PTE
  2022-06-29 22:53 ` [PATCH v2 5/9] mm/mshare: Add vm flag for shared PTE Khalid Aziz
@ 2022-06-30 14:59   ` Mark Hemment
  2022-06-30 15:46     ` Khalid Aziz
  0 siblings, 1 reply; 43+ messages in thread
From: Mark Hemment @ 2022-06-30 14:59 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, Kees Cook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, Linux-MM, longpeng2, luto, pcc,
	rppt, sieberf, sjpark, Suren Baghdasaryan, tst, yzaikin

On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
> a function to determine if a vma shares PTE by checking this flag.
> This is to be used to find the shared page table entries on page fault
> for vmas sharing PTE.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm.h             | 8 ++++++++
>  include/trace/events/mmflags.h | 3 ++-
>  mm/internal.h                  | 5 +++++
>  3 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..0ddc3057f73b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -310,11 +310,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_HIGH_ARCH_BIT_2     34      /* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_3     35      /* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_4     36      /* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5     37      /* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
>  #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
>  #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>
>  #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -356,6 +358,12 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_MTE_ALLOWED        VM_NONE
>  #endif
>
> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> +#define VM_SHARED_PT   VM_HIGH_ARCH_5
> +#else
> +#define VM_SHARED_PT   0
> +#endif
> +

I'm not clear why mshare is using high-vma flags for VM_SHARED_PT.
CONFIG_ARCH_USES_HIGH_VMA_FLAGS might not be defined, making mshare
unsupported (or, rather, broken).
Is this being done as there is a shortage of non-high flags?
0x00000800 is available, although it appears to be the last one (quick
check).
(When using the last 'normal' flag bit, good idea to highlight this in
the cover letter.)

>  #ifndef VM_GROWSUP
>  # define VM_GROWSUP    VM_NONE
>  #endif
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index e87cb2b80ed3..30e56cbac99b 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -194,7 +194,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty"     )               \
>         {VM_MIXEDMAP,                   "mixedmap"      },              \
>         {VM_HUGEPAGE,                   "hugepage"      },              \
>         {VM_NOHUGEPAGE,                 "nohugepage"    },              \
> -       {VM_MERGEABLE,                  "mergeable"     }               \
> +       {VM_MERGEABLE,                  "mergeable"     },              \
> +       {VM_SHARED_PT,                  "sharedpt"      }               \
>
>  #define show_vma_flags(flags)                                          \
>         (flags) ? __print_flags(flags, "|",                             \
> diff --git a/mm/internal.h b/mm/internal.h
> index c0f8fbe0445b..3f2790aea918 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -861,4 +861,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>
>  DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>
> +static inline bool vma_is_shared(const struct vm_area_struct *vma)
> +{
> +       return vma->vm_flags & VM_SHARED_PT;
> +}
> +
>  #endif /* __MM_INTERNAL_H */
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-06-30 11:57 ` [PATCH v2 0/9] Add support for shared PTEs " Mark Hemment
@ 2022-06-30 15:39   ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-30 15:39 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, Kees Cook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, Linux-MM, longpeng2, luto, pcc,
	rppt, sieberf, sjpark, Suren Baghdasaryan, tst, yzaikin

On 6/30/22 05:57, Mark Hemment wrote:
> Hi Khalid,
> 
> On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>>
>> Memory pages shared between processes require a page table entry
>> (PTE) for each process. Each of these PTEs consumes some of
>> the memory and as long as number of mappings being maintained is
>> small enough, this space consumed by page tables is not
>> objectionable. When very few memory pages are shared between
>> processes, the number of page table entries (PTEs) to maintain is
>> mostly constrained by the number of pages of memory on the system.
>> As the number of shared pages and the number of times pages are
>> shared goes up, amount of memory consumed by page tables starts to
>> become significant. This issue does not apply to threads. Any number
>> of threads can share the same pages inside a process while sharing
>> the same PTEs. Extending this same model to sharing pages across
>> processes can eliminate this issue for sharing across processes as
>> well.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system crash was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
>>
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs. A file created on msharefs creates
>> a new shared region where all processes sharing that region will
>> share the PTEs as well. A process can create a new file on msharefs
>> and then mmap it which assigns a starting address and size to this
>> mshare'd region. Another process that has the right permission to
>> open the file on msharefs can then mmap this file in its address
>> space at same virtual address and size and share this region through
>> shared PTEs. An unlink() on the file marks the mshare'd region for
>> deletion once there are no more users of the region. When the mshare
>> region is deleted, all the pages used by the region are freed.
> 
>    Noting the flexibility of 'mshare' has been reduced from v1.  The
> earlier version allowed msharing of named mappings, while this patch
> is only for anonymous mappings.
>    Any plans to support named mappings?  If not, I guess *someone* will
> want it (eventually).  Minor, as the patch does not introduce new
> syscalls, but having an API which is flexible for both named and anon
> mappings would be good (this is a nit, not a strong suggestion).

I apologize for not clarifying this. The initial mmap() call looks like an anonymous mapping, but one could easily call 
mremap() later and map any other objects into the same address space, which remains shared until the mshare region is 
torn down. It is my intent to support mapping any objects in an mshare region.

> 
>    The cover letter details the problem being solved and the API, but
> gives no details of the implementation.  A paragraph on the use of a
> mm_struct per-msharefs file would be helpful.

Good point. I will do that next time.

> 
>    I've only quickly scanned the patchset; not in enough detail to
> comment on each patch, but a few observations.
> 
>    o I was expecting to see mprotect() against a mshared vma to either
> be disallowed or code to support the splitting of a mshared vma.  I
> didn't see either.

Since the mshare region is intended to support multiple objects being mapped into it with different protections on 
different parts of the region, mprotect should be supported and should handle splitting the mshare'd vmas. Until the 
basic code is solid, it would make sense to prevent splitting vmas and add that later. I will add this code.

> 
>    o For the case where the mshare file has been closed/unmmap but not
> unlinked, a 'mshare_data' structure will leaked when the inode is
> evicted.

You are right. mshare_evict_inode() needs to call msharefs_delmm() to clean up.

> 
>    o The alignment requirement is PGDIR_SIZE, which is very large.
> Should/could this be PMD_SIZE?

Yes, PGDIR_SIZE is large. It works for the database folks who requested this feature, but PMD might be more versatile. 
I have been thinking about switching to PMD since that will make it easier to move the hugetlbfs page table sharing 
code over to this code.

> 
>    o mshare should be a conditional feature (CONFIG_MSHARE ?).

I can do that. I was reluctant to add yet another CONFIG option. Since this feature is activated explicitly by userspace 
code, is it necessary to make it a config option?

> 
> 
>    I might get a chance do a finer grain review later/tomorrow.
> 
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>>          mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>>     address for the region must be aligned to an address boundary and
>>     be a multiple of fixed size. This alignment and size requirement
>>     can be obtained by reading the file /sys/fs/mshare/mshare_info
>>     which returns a number in text format. mshare regions must be
>>     aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating mshare region:
>>          a. Create a file on /sys/fs/mshare, for example -
>>                  fd = open("/sys/fs/mshare/shareme",
>>                                  O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>>          b. mmap this file to establish starting address and size -
>>                  mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>>                          MAP_SHARED, fd, 0);
>>
>>          c. Write and read to mshared region normally.
>>
>> 4. For processes attaching to mshare'd region:
>>          a. Open the file on msharefs, for example -
>>                  fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>
>>          b. Get information about mshare'd region from the file:
>>                  struct mshare_info {
>>                          unsigned long start;
>>                          unsigned long size;
>>                  } m_info;
>>
>>                  read(fd, &m_info, sizeof(m_info));
>>
>>          c. mmap the mshare'd region -
>>                  mmap(m_info.start, m_info.size,
>>                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> 5. To delete the mshare region -
>>                  unlink("/sys/fs/mshare/shareme");
>>
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>>          fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
>>          read(fd, req, 128);
>>          alignsize = atoi(req);
>>          close(fd);
>>          fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
>>          start = alignsize * 4;
>>          size = alignsize * 2;
>>          addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> 
> Typo, missing 'fd'; MAP_SHARED | MAP_ANONYMOUS, fd, 0)

Yes, you are right. I will fix that.

Thanks, Mark! I really appreciate your taking time to review this code.

--
Khalid

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 5/9] mm/mshare: Add vm flag for shared PTE
  2022-06-30 14:59   ` Mark Hemment
@ 2022-06-30 15:46     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-30 15:46 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, Kees Cook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, Linux-MM, longpeng2, luto, pcc,
	rppt, sieberf, sjpark, Suren Baghdasaryan, tst, yzaikin

On 6/30/22 08:59, Mark Hemment wrote:
> On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
>> a function to determine if a vma shares PTE by checking this flag.
>> This is to be used to find the shared page table entries on page fault
>> for vmas sharing PTE.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>   include/linux/mm.h             | 8 ++++++++
>>   include/trace/events/mmflags.h | 3 ++-
>>   mm/internal.h                  | 5 +++++
>>   3 files changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index bc8f326be0ce..0ddc3057f73b 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -310,11 +310,13 @@ extern unsigned int kobjsize(const void *objp);
>>   #define VM_HIGH_ARCH_BIT_2     34      /* bit only usable on 64-bit architectures */
>>   #define VM_HIGH_ARCH_BIT_3     35      /* bit only usable on 64-bit architectures */
>>   #define VM_HIGH_ARCH_BIT_4     36      /* bit only usable on 64-bit architectures */
>> +#define VM_HIGH_ARCH_BIT_5     37      /* bit only usable on 64-bit architectures */
>>   #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
>>   #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
>>   #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
>>   #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
>>   #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
>> +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
>>   #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>>
>>   #ifdef CONFIG_ARCH_HAS_PKEYS
>> @@ -356,6 +358,12 @@ extern unsigned int kobjsize(const void *objp);
>>   # define VM_MTE_ALLOWED        VM_NONE
>>   #endif
>>
>> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
>> +#define VM_SHARED_PT   VM_HIGH_ARCH_5
>> +#else
>> +#define VM_SHARED_PT   0
>> +#endif
>> +
> 
> I'm not clear why mshare is using high-vma flags for VM_SHARED_PT.
> CONFIG_ARCH_USES_HIGH_VMA_FLAGS might not be defined, making mshare
> unsupported (or, rather, broken).
> Is this being done as there is a shortage of non-high flags?
> 0x00000800 is available, although it appears to be the last one (quick
> check).
> (When using the last 'normal' flag bit, good idea to highlight this in
> the cover letter.)

It indeed is because of a shortage of non-high flags. 0x00000800 is the only non-high flag available and I am inclined 
to leave that last flag for more fundamental features. Then again, if we want to move the hugetlbfs page table sharing 
code over to the mshare base code, it might be necessary to consume that last flag. Nevertheless, the mshare code 
should not break if high-vma flags are not available. I definitely need to fix that.

Thanks,
Khalid

> 
>>   #ifndef VM_GROWSUP
>>   # define VM_GROWSUP    VM_NONE
>>   #endif
>> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
>> index e87cb2b80ed3..30e56cbac99b 100644
>> --- a/include/trace/events/mmflags.h
>> +++ b/include/trace/events/mmflags.h
>> @@ -194,7 +194,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty"     )               \
>>          {VM_MIXEDMAP,                   "mixedmap"      },              \
>>          {VM_HUGEPAGE,                   "hugepage"      },              \
>>          {VM_NOHUGEPAGE,                 "nohugepage"    },              \
>> -       {VM_MERGEABLE,                  "mergeable"     }               \
>> +       {VM_MERGEABLE,                  "mergeable"     },              \
>> +       {VM_SHARED_PT,                  "sharedpt"      }               \
>>
>>   #define show_vma_flags(flags)                                          \
>>          (flags) ? __print_flags(flags, "|",                             \
>> diff --git a/mm/internal.h b/mm/internal.h
>> index c0f8fbe0445b..3f2790aea918 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -861,4 +861,9 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>>
>>   DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>>
>> +static inline bool vma_is_shared(const struct vm_area_struct *vma)
>> +{
>> +       return vma->vm_flags & VM_SHARED_PT;
>> +}
>> +
>>   #endif /* __MM_INTERNAL_H */
>> --
>> 2.32.0
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 4/9] mm/mshare: Add a read operation for msharefs files
  2022-06-29 22:53 ` [PATCH v2 4/9] mm/mshare: Add a read operation for msharefs files Khalid Aziz
@ 2022-06-30 21:27   ` Darrick J. Wong
  2022-06-30 22:27     ` Khalid Aziz
  0 siblings, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2022-06-30 21:27 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:55PM -0600, Khalid Aziz wrote:
> When a new file is created under msharefs, allocate a new mm_struct
> that will hold the VMAs for mshare region. Also allocate structure
> to defines the mshare region and add a read operation to the file
> that returns this information about the mshare region. Currently
> this information is returned as a struct:
> 
> struct mshare_info {
> 	unsigned long start;
> 	unsigned long size;
> };
> 
> This gives the start address for mshare region and its size.
> 
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> ---
>  include/uapi/linux/mman.h |  5 +++
>  mm/mshare.c               | 64 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index f55bc680b5b0..56fe446e24b1 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -41,4 +41,9 @@
>  #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>  #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>  
> +struct mshare_info {
> +	unsigned long start;
> +	unsigned long size;

You might want to make these explicitly u64, since this is userspace
ABI and you never know when someone will want to do something crazy like
run 32-bit programs with mshare files.

Also you might want to add some padding fields for flags, future
expansion, etc.

> +};
> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> diff --git a/mm/mshare.c b/mm/mshare.c
> index 2d5924d39221..d238b68b0576 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -22,8 +22,14 @@
>  #include <uapi/linux/magic.h>
>  #include <uapi/linux/limits.h>
>  #include <uapi/linux/mman.h>
> +#include <linux/sched/mm.h>
>  
>  static struct super_block *msharefs_sb;
> +struct mshare_data {
> +	struct mm_struct *mm;
> +	refcount_t refcnt;
> +	struct mshare_info *minfo;
> +};
>  
>  static const struct inode_operations msharefs_dir_inode_ops;
>  static const struct inode_operations msharefs_file_inode_ops;
> @@ -34,8 +40,29 @@ msharefs_open(struct inode *inode, struct file *file)
>  	return simple_open(inode, file);
>  }
>  
> +static ssize_t
> +msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
> +{
> +	struct mshare_data *info = iocb->ki_filp->private_data;
> +	size_t ret;
> +	struct mshare_info m_info;
> +
> +	if (info->minfo != NULL) {
> +		m_info.start = info->minfo->start;
> +		m_info.size = info->minfo->size;
> +	} else {
> +		m_info.start = 0;
> +		m_info.size = 0;

Hmmm, read()ing out the shared mapping information.  Heh.

When does this case happen?  Is it before anybody mmaps this file into
an address space?

> +	}
> +	ret = copy_to_iter(&m_info, sizeof(m_info), iov);
> +	if (!ret)
> +		return -EFAULT;
> +	return ret;
> +}
> +
>  static const struct file_operations msharefs_file_operations = {
>  	.open		= msharefs_open,
> +	.read_iter	= msharefs_read,
>  	.llseek		= no_llseek,
>  };
>  
> @@ -73,12 +100,43 @@ static struct dentry
>  	return ERR_PTR(-ENOMEM);
>  }
>  
> +static int
> +msharefs_fill_mm(struct inode *inode)
> +{
> +	struct mm_struct *mm;
> +	struct mshare_data *info = NULL;
> +	int retval = 0;
> +
> +	mm = mm_alloc();
> +	if (!mm) {
> +		retval = -ENOMEM;
> +		goto err_free;
> +	}
> +
> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> +	if (!info) {
> +		retval = -ENOMEM;
> +		goto err_free;
> +	}
> +	info->mm = mm;
> +	info->minfo = NULL;
> +	refcount_set(&info->refcnt, 1);
> +	inode->i_private = info;
> +
> +	return 0;
> +
> +err_free:
> +	if (mm)
> +		mmput(mm);
> +	kfree(info);
> +	return retval;
> +}
> +
>  static struct inode
>  *msharefs_get_inode(struct super_block *sb, const struct inode *dir,
>  			umode_t mode)
>  {
>  	struct inode *inode = new_inode(sb);
> -
>  	if (inode) {
>  		inode->i_ino = get_next_ino();
>  		inode_init_owner(&init_user_ns, inode, dir, mode);
> @@ -89,6 +147,10 @@ static struct inode
>  		case S_IFREG:
>  			inode->i_op = &msharefs_file_inode_ops;
>  			inode->i_fop = &msharefs_file_operations;
> +			if (msharefs_fill_mm(inode) != 0) {
> +				discard_new_inode(inode);
> +				inode = ERR_PTR(-ENOMEM);

Is it intentional to clobber the msharefs_fill_mm return value and
replace it with ENOMEM?

--D

> +			}
>  			break;
>  		case S_IFDIR:
>  			inode->i_op = &msharefs_dir_inode_ops;
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories
  2022-06-29 22:53 ` [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories Khalid Aziz
@ 2022-06-30 21:34   ` Darrick J. Wong
  2022-06-30 22:49     ` Khalid Aziz
  2022-06-30 23:09   ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2022-06-30 21:34 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:54PM -0600, Khalid Aziz wrote:
> Make msharefs filesystem writable and allow creating directories
> to support better access control to mshare'd regions defined in
> msharefs.
> 
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> ---
>  mm/mshare.c | 195 +++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 186 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/mshare.c b/mm/mshare.c
> index 3e448e11c742..2d5924d39221 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -21,11 +21,21 @@
>  #include <linux/fileattr.h>
>  #include <uapi/linux/magic.h>
>  #include <uapi/linux/limits.h>
> +#include <uapi/linux/mman.h>
>  
>  static struct super_block *msharefs_sb;
>  
> +static const struct inode_operations msharefs_dir_inode_ops;
> +static const struct inode_operations msharefs_file_inode_ops;
> +
> +static int
> +msharefs_open(struct inode *inode, struct file *file)
> +{
> +	return simple_open(inode, file);
> +}
> +
>  static const struct file_operations msharefs_file_operations = {
> -	.open		= simple_open,
> +	.open		= msharefs_open,
>  	.llseek		= no_llseek,
>  };
>  
> @@ -42,6 +52,113 @@ msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
>  	return 0;
>  }
>  
> +static struct dentry
> +*msharefs_alloc_dentry(struct dentry *parent, const char *name)
> +{
> +	struct dentry *d;
> +	struct qstr q;
> +	int err;
> +
> +	q.name = name;
> +	q.len = strlen(name);
> +
> +	err = msharefs_d_hash(parent, &q);
> +	if (err)
> +		return ERR_PTR(err);
> +
> +	d = d_alloc(parent, &q);
> +	if (d)
> +		return d;
> +
> +	return ERR_PTR(-ENOMEM);
> +}
> +
> +static struct inode
> +*msharefs_get_inode(struct super_block *sb, const struct inode *dir,
> +			umode_t mode)
> +{
> +	struct inode *inode = new_inode(sb);
> +
> +	if (inode) {

Not sure why you wouldn't go with the less-indently version:

	if (!inode)
		return ERR_PTR(-ENOMEM);

	inode->i_ino = get_next_ino();
	<etc>

> +		inode->i_ino = get_next_ino();
> +		inode_init_owner(&init_user_ns, inode, dir, mode);
> +
> +		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
> +
> +		switch (mode & S_IFMT) {

Shouldn't we set the mode somewhere?

> +		case S_IFREG:
> +			inode->i_op = &msharefs_file_inode_ops;
> +			inode->i_fop = &msharefs_file_operations;
> +			break;
> +		case S_IFDIR:
> +			inode->i_op = &msharefs_dir_inode_ops;
> +			inode->i_fop = &simple_dir_operations;
> +			inc_nlink(inode);
> +			break;
> +		case S_IFLNK:
> +			inode->i_op = &page_symlink_inode_operations;
> +			break;
> +		default:
> +			discard_new_inode(inode);
> +			inode = NULL;
> +			break;
> +		}
> +	}
> +
> +	return inode;
> +}
> +
> +static int
> +msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
> +		struct dentry *dentry, umode_t mode, dev_t dev)
> +{
> +	struct inode *inode;
> +	int err = 0;
> +
> +	inode = msharefs_get_inode(dir->i_sb, dir, mode);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);

...and if @inode is NULL?

> +
> +	d_instantiate(dentry, inode);
> +	dget(dentry);
> +	dir->i_mtime = dir->i_ctime = current_time(dir);
> +
> +	return err;
> +}
> +
> +static int
> +msharefs_create(struct user_namespace *mnt_userns, struct inode *dir,
> +		struct dentry *dentry, umode_t mode, bool excl)
> +{
> +	return msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFREG, 0);
> +}
> +
> +static int
> +msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
> +		struct dentry *dentry, umode_t mode)
> +{
> +	int ret = msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFDIR, 0);
> +
> +	if (!ret)
> +		inc_nlink(dir);
> +	return ret;
> +}
> +
> +static const struct inode_operations msharefs_file_inode_ops = {
> +	.setattr	= simple_setattr,
> +	.getattr	= simple_getattr,
> +};
> +static const struct inode_operations msharefs_dir_inode_ops = {
> +	.create		= msharefs_create,
> +	.lookup		= simple_lookup,
> +	.link		= simple_link,
> +	.unlink		= simple_unlink,
> +	.mkdir		= msharefs_mkdir,
> +	.rmdir		= simple_rmdir,
> +	.mknod		= msharefs_mknod,
> +	.rename		= simple_rename,
> +};
> +
>  static void
>  mshare_evict_inode(struct inode *inode)
>  {
> @@ -58,7 +175,7 @@ mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
>  {
>  	char s[80];
>  
> -	sprintf(s, "%ld", PGDIR_SIZE);
> +	sprintf(s, "%ld\n", PGDIR_SIZE);

Changing this already?

>  	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>  }
>  
> @@ -72,6 +189,38 @@ static const struct super_operations mshare_s_ops = {
>  	.evict_inode = mshare_evict_inode,
>  };
>  
> +static int
> +prepopulate_files(struct super_block *s, struct inode *dir,
> +			struct dentry *root, const struct tree_descr *files)
> +{
> +	int i;
> +	struct inode *inode;
> +	struct dentry *dentry;
> +
> +	for (i = 0; !files->name || files->name[0]; i++, files++) {
> +		if (!files->name)
> +			continue;

What ends the array?  NULL name or empty name?
Do we have to erase all of these when the fs gets unmounted?

--D

> +
> +		dentry = msharefs_alloc_dentry(root, files->name);
> +		if (!dentry)
> +			return -ENOMEM;
> +
> +		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode);
> +		if (!inode) {
> +			dput(dentry);
> +			return -ENOMEM;
> +		}
> +		inode->i_mode = S_IFREG | files->mode;
> +		inode->i_atime = inode->i_mtime = inode->i_ctime
> +			= current_time(inode);
> +		inode->i_fop = files->ops;
> +		inode->i_ino = i;
> +		d_add(dentry, inode);
> +	}
> +
> +	return 0;
> +}
> +
>  static int
>  msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
>  {
> @@ -79,21 +228,49 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
>  		[2] = { "mshare_info", &mshare_info_ops, 0444},
>  		{""},
>  	};
> -	int err;
> +	struct inode *inode;
> +	struct dentry *root;
> +	int err = 0;
>  
> -	err = simple_fill_super(sb, MSHARE_MAGIC, mshare_files);
> -	if (!err) {
> -		msharefs_sb = sb;
> -		sb->s_d_op = &msharefs_d_ops;
> -		sb->s_op = &mshare_s_ops;
> +	sb->s_blocksize		= PAGE_SIZE;
> +	sb->s_blocksize_bits	= PAGE_SHIFT;
> +	sb->s_magic		= MSHARE_MAGIC;
> +	sb->s_op		= &mshare_s_ops;
> +	sb->s_d_op		= &msharefs_d_ops;
> +	sb->s_time_gran		= 1;
> +
> +	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777);
> +	if (!inode) {
> +		err = -ENOMEM;
> +		goto out;
>  	}
> +	inode->i_ino = 1;
> +	root = d_make_root(inode);
> +	if (!root) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	err = prepopulate_files(sb, inode, root, mshare_files);
> +	if (err < 0)
> +		goto clean_root;
> +
> +	sb->s_root = root;
> +	msharefs_sb = sb;
> +	return err;
> +
> +clean_root:
> +	d_genocide(root);
> +	shrink_dcache_parent(root);
> +	dput(root);
> +out:
>  	return err;
>  }
>  
>  static int
>  msharefs_get_tree(struct fs_context *fc)
>  {
> -	return get_tree_single(fc, msharefs_fill_super);
> +	return get_tree_nodev(fc, msharefs_fill_super);
>  }
>  
>  static const struct fs_context_operations msharefs_context_ops = {
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file
  2022-06-29 22:53 ` [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file Khalid Aziz
@ 2022-06-30 21:37   ` Darrick J. Wong
  2022-06-30 22:54     ` Khalid Aziz
  2022-06-30 23:01   ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2022-06-30 21:37 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:53PM -0600, Khalid Aziz wrote:
> Users of the mshare feature for sharing page tables need to know
> the size and alignment requirements for shared regions. Pre-populate
> msharefs with a file, mshare_info, that provides this information.
> 
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> ---
>  mm/mshare.c | 62 +++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 48 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/mshare.c b/mm/mshare.c
> index c8fab3869bab..3e448e11c742 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -25,8 +25,8 @@
>  static struct super_block *msharefs_sb;
>  
>  static const struct file_operations msharefs_file_operations = {
> -	.open	= simple_open,
> -	.llseek	= no_llseek,
> +	.open		= simple_open,
> +	.llseek		= no_llseek,

I feel like there's a lot of churn between the previous patch and this
one that could have been in the previous patch.

>  };
>  
>  static int
> @@ -42,23 +42,52 @@ msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
>  	return 0;
>  }
>  
> +static void
> +mshare_evict_inode(struct inode *inode)
> +{
> +	clear_inode(inode);
> +}
> +
>  static const struct dentry_operations msharefs_d_ops = {
>  	.d_hash = msharefs_d_hash,
>  };
>  
> +static ssize_t
> +mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
> +		loff_t *ppos)
> +{
> +	char s[80];
> +
> +	sprintf(s, "%ld", PGDIR_SIZE);

SO what is this "mshare_info" file supposed to reveal?  Hugepage size?
I wonder why this isn't exported in struct mshare_info?

> +	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
> +}
> +
> +static const struct file_operations mshare_info_ops = {
> +	.read   = mshare_info_read,
> +	.llseek	= noop_llseek,
> +};
> +
> +static const struct super_operations mshare_s_ops = {
> +	.statfs	     = simple_statfs,
> +	.evict_inode = mshare_evict_inode,
> +};
> +
>  static int
>  msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
>  {
> -	static const struct tree_descr empty_descr = {""};
> +	static const struct tree_descr mshare_files[] = {
> +		[2] = { "mshare_info", &mshare_info_ops, 0444},
> +		{""},
> +	};
>  	int err;
>  
> -	sb->s_d_op = &msharefs_d_ops;
> -	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
> -	if (err)
> -		return err;
> -
> -	msharefs_sb = sb;
> -	return 0;
> +	err = simple_fill_super(sb, MSHARE_MAGIC, mshare_files);
> +	if (!err) {
> +		msharefs_sb = sb;
> +		sb->s_d_op = &msharefs_d_ops;
> +		sb->s_op = &mshare_s_ops;
> +	}
> +	return err;
>  }
>  
>  static int
> @@ -84,20 +113,25 @@ static struct file_system_type mshare_fs = {
>  	.kill_sb		= kill_litter_super,
>  };
>  
> -static int
> +static int __init
>  mshare_init(void)
>  {
>  	int ret = 0;
>  
>  	ret = sysfs_create_mount_point(fs_kobj, "mshare");
>  	if (ret)
> -		return ret;
> +		goto out;
>  
>  	ret = register_filesystem(&mshare_fs);
> -	if (ret)
> +	if (ret) {
>  		sysfs_remove_mount_point(fs_kobj, "mshare");
> +		goto out;
> +	}
> +
> +	return 0;
>  
> +out:
>  	return ret;
>  }
>  
> -fs_initcall(mshare_init);
> +core_initcall(mshare_init);
> -- 
> 2.32.0
> 


* Re: [PATCH v2 6/9] mm/mshare: Add mmap operation
  2022-06-29 22:53 ` [PATCH v2 6/9] mm/mshare: Add mmap operation Khalid Aziz
@ 2022-06-30 21:44   ` Darrick J. Wong
  2022-06-30 23:30     ` Khalid Aziz
  0 siblings, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2022-06-30 21:44 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:57PM -0600, Khalid Aziz wrote:
> mmap is used to establish the address range for an mshare region and
> map the region into the process's address space. Add a basic mmap
> operation that supports setting the address range. Also fix the code
> to not allocate a new mm_struct for msharefs files that exist only to
> provide information, not to define a new mshare region.
> 
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/mshare.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 41 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mshare.c b/mm/mshare.c
> index d238b68b0576..088a6cab1e93 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -9,7 +9,8 @@
>   *
>   *
>   * Copyright (C) 2022 Oracle Corp. All rights reserved.
> - * Author:	Khalid Aziz <khalid.aziz@oracle.com>
> + * Authors:	Khalid Aziz <khalid.aziz@oracle.com>
> + *		Matthew Wilcox <willy@infradead.org>
>   *
>   */
>  
> @@ -60,9 +61,36 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>  	return ret;
>  }
>  
> +static int
> +msharefs_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct mshare_data *info = file->private_data;
> +	struct mm_struct *mm = info->mm;
> +
> +	/*
> +	 * If this mshare region has been set up once already, bail out
> +	 */
> +	if (mm->mmap_base != 0)
> +		return -EINVAL;
> +
> +	if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
> +		return -EINVAL;
> +
> +	mm->mmap_base = vma->vm_start;
> +	mm->task_size = vma->vm_end - vma->vm_start;
> +	if (!mm->task_size)
> +		mm->task_size--;
> +	info->minfo->start = mm->mmap_base;
> +	info->minfo->size = mm->task_size;

So, uh, if the second mmap() caller decides to ignore the mshare_info,
should they get an -EINVAL here since the memory mappings won't be at
the same process virtual address?

> +	vma->vm_flags |= VM_SHARED_PT;
> +	vma->vm_private_data = info;
> +	return 0;
> +}
> +
>  static const struct file_operations msharefs_file_operations = {
>  	.open		= msharefs_open,
>  	.read_iter	= msharefs_read,
> +	.mmap		= msharefs_mmap,
>  	.llseek		= no_llseek,
>  };
>  
> @@ -119,7 +147,12 @@ msharefs_fill_mm(struct inode *inode)
>  		goto err_free;
>  	}
>  	info->mm = mm;
> -	info->minfo = NULL;
> +	info->minfo = kzalloc(sizeof(struct mshare_info), GFP_KERNEL);
> +	if (info->minfo == NULL) {
> +		retval = -ENOMEM;
> +		goto err_free;
> +	}
> +
>  	refcount_set(&info->refcnt, 1);
>  	inode->i_private = info;
>  
> @@ -128,13 +161,14 @@ msharefs_fill_mm(struct inode *inode)
>  err_free:
>  	if (mm)
>  		mmput(mm);
> +	kfree(info->minfo);
>  	kfree(info);
>  	return retval;
>  }
>  
>  static struct inode
>  *msharefs_get_inode(struct super_block *sb, const struct inode *dir,
> -			umode_t mode)
> +			umode_t mode, bool newmm)
>  {
>  	struct inode *inode = new_inode(sb);
>  	if (inode) {
> @@ -147,7 +181,7 @@ static struct inode
>  		case S_IFREG:
>  			inode->i_op = &msharefs_file_inode_ops;
>  			inode->i_fop = &msharefs_file_operations;
> -			if (msharefs_fill_mm(inode) != 0) {
> +			if (newmm && msharefs_fill_mm(inode) != 0) {
>  				discard_new_inode(inode);
>  				inode = ERR_PTR(-ENOMEM);
>  			}
> @@ -177,7 +211,7 @@ msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
>  	struct inode *inode;
>  	int err = 0;
>  
> -	inode = msharefs_get_inode(dir->i_sb, dir, mode);
> +	inode = msharefs_get_inode(dir->i_sb, dir, mode, true);
>  	if (IS_ERR(inode))
>  		return PTR_ERR(inode);
>  
> @@ -267,7 +301,7 @@ prepopulate_files(struct super_block *s, struct inode *dir,
>  		if (!dentry)
>  			return -ENOMEM;
>  
> -		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode);
> +		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode, false);

I was wondering why the information files were getting their own
mshare_data.

TBH I'm not really sure what the difference is between mshare_data and
mshare_info, since those names are not especially distinct.

>  		if (!inode) {
>  			dput(dentry);
>  			return -ENOMEM;
> @@ -301,7 +335,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
>  	sb->s_d_op		= &msharefs_d_ops;
>  	sb->s_time_gran		= 1;
>  
> -	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777);
> +	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777, false);

Is it wise to default to world-writable?  Surely whatever userspace
software wraps an msharefs can relax permissions as needed.

--D

>  	if (!inode) {
>  		err = -ENOMEM;
>  		goto out;
> -- 
> 2.32.0
> 


* Re: [PATCH v2 7/9] mm/mshare: Add unlink and munmap support
  2022-06-29 22:53 ` [PATCH v2 7/9] mm/mshare: Add unlink and munmap support Khalid Aziz
@ 2022-06-30 21:50   ` Darrick J. Wong
  2022-07-01 15:58     ` Khalid Aziz
  0 siblings, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2022-06-30 21:50 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:58PM -0600, Khalid Aziz wrote:
> The number of mappings of an mshare region should be tracked so the
> region can be removed when there are no more references to it and the
> associated file has been deleted. This adds code to support the
> unlink operation for the associated file, removes the mshare region
> on file deletion if its refcount drops to zero, and adds a munmap
> operation that maintains the refcount on the mshare region and
> removes the region on the last munmap if the file has been deleted.
> 
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> ---
>  mm/mshare.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 63 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mshare.c b/mm/mshare.c
> index 088a6cab1e93..90ce0564a138 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -29,6 +29,7 @@ static struct super_block *msharefs_sb;
>  struct mshare_data {
>  	struct mm_struct *mm;
>  	refcount_t refcnt;
> +	int deleted;
>  	struct mshare_info *minfo;
>  };
>  
> @@ -48,6 +49,7 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>  	size_t ret;
>  	struct mshare_info m_info;
>  
> +	mmap_read_lock(info->mm);
>  	if (info->minfo != NULL) {
>  		m_info.start = info->minfo->start;
>  		m_info.size = info->minfo->size;
> @@ -55,18 +57,42 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>  		m_info.start = 0;
>  		m_info.size = 0;
>  	}
> +	mmap_read_unlock(info->mm);
>  	ret = copy_to_iter(&m_info, sizeof(m_info), iov);
>  	if (!ret)
>  		return -EFAULT;
>  	return ret;
>  }
>  
> +static void
> +msharefs_close(struct vm_area_struct *vma)
> +{
> +	struct mshare_data *info = vma->vm_private_data;
> +
> +	if (refcount_dec_and_test(&info->refcnt)) {
> +		mmap_read_lock(info->mm);
> +		if (info->deleted) {
> +			mmap_read_unlock(info->mm);
> +			mmput(info->mm);
> +			kfree(info->minfo);
> +			kfree(info);

Aren't filesystems supposed to take care of disposing of the file data
in destroy_inode?  IIRC struct inode doesn't go away until all fds are
closed, mappings are torn down, and there are no more references from
dentries.  I could be misremembering since it's been a few months since
I went looking at the (VFS) inode lifecycle.

> +		} else {
> +			mmap_read_unlock(info->mm);
> +		}
> +	}
> +}
> +
> +static const struct vm_operations_struct msharefs_vm_ops = {
> +	.close	= msharefs_close,
> +};
> +
>  static int
>  msharefs_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct mshare_data *info = file->private_data;
>  	struct mm_struct *mm = info->mm;
>  
> +	mmap_write_lock(mm);
>  	/*
>  	 * If this mshare region has been set up once already, bail out
>  	 */
> @@ -80,10 +106,14 @@ msharefs_mmap(struct file *file, struct vm_area_struct *vma)
>  	mm->task_size = vma->vm_end - vma->vm_start;
>  	if (!mm->task_size)
>  		mm->task_size--;
> +	mmap_write_unlock(mm);
>  	info->minfo->start = mm->mmap_base;
>  	info->minfo->size = mm->task_size;
> +	info->deleted = 0;
> +	refcount_inc(&info->refcnt);
>  	vma->vm_flags |= VM_SHARED_PT;
>  	vma->vm_private_data = info;
> +	vma->vm_ops = &msharefs_vm_ops;
>  	return 0;
>  }
>  
> @@ -240,6 +270,38 @@ msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
>  	return ret;
>  }
>  
> +static int
> +msharefs_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +	struct inode *inode = d_inode(dentry);
> +	struct mshare_data *info = inode->i_private;
> +
> +	/*
> +	 * Unmap the mshare region if it is still mapped in
> +	 */
> +	vm_munmap(info->minfo->start, info->minfo->size);
> +
> +	/*
> +	 * Mark msharefs file for deletion so it can not be opened
> +	 * and used for mshare mappings any more
> +	 */
> +	simple_unlink(dir, dentry);
> +	mmap_write_lock(info->mm);
> +	info->deleted = 1;
> +	mmap_write_unlock(info->mm);

What if the file is hardlinked?

--D

> +
> +	/*
> +	 * Is this the last reference? If so, delete mshare region and
> +	 * remove the file
> +	 */
> +	if (!refcount_dec_and_test(&info->refcnt)) {
> +		mmput(info->mm);
> +		kfree(info->minfo);
> +		kfree(info);
> +	}
> +	return 0;
> +}
> +
>  static const struct inode_operations msharefs_file_inode_ops = {
>  	.setattr	= simple_setattr,
>  	.getattr	= simple_getattr,
> @@ -248,7 +310,7 @@ static const struct inode_operations msharefs_dir_inode_ops = {
>  	.create		= msharefs_create,
>  	.lookup		= simple_lookup,
>  	.link		= simple_link,
> -	.unlink		= simple_unlink,
> +	.unlink		= msharefs_unlink,
>  	.mkdir		= msharefs_mkdir,
>  	.rmdir		= simple_rmdir,
>  	.mknod		= msharefs_mknod,
> -- 
> 2.32.0
> 


* Re: [PATCH v2 1/9] mm: Add msharefs filesystem
  2022-06-29 22:53 ` [PATCH v2 1/9] mm: Add msharefs filesystem Khalid Aziz
@ 2022-06-30 21:53   ` Darrick J. Wong
  2022-07-01 16:05     ` Khalid Aziz
  2022-06-30 22:57   ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2022-06-30 21:53 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:52PM -0600, Khalid Aziz wrote:
> Add a ram-based filesystem that contains page table sharing
> information and files that enable processes to share page tables.
> This patch adds the basic filesystem that can be mounted.
> 
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> ---
>  Documentation/filesystems/msharefs.rst |  19 +++++
>  include/uapi/linux/magic.h             |   1 +
>  mm/Makefile                            |   2 +-
>  mm/mshare.c                            | 103 +++++++++++++++++++++++++
>  4 files changed, 124 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/filesystems/msharefs.rst
>  create mode 100644 mm/mshare.c
> 
> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
> new file mode 100644
> index 000000000000..fd161f67045d
> --- /dev/null
> +++ b/Documentation/filesystems/msharefs.rst
> @@ -0,0 +1,19 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================================
> +msharefs - a filesystem to support shared page tables
> +=====================================================
> +
> +msharefs is a ram-based filesystem that allows multiple processes to
> +share page table entries for shared pages.
> +
> +msharefs is typically mounted like this::
> +
> +	mount -t msharefs none /sys/fs/mshare
> +
> +When a process calls the mshare syscall with a name for the shared
> +address range,

You mean creat()?

> a file with the same name is created under msharefs with that
> +name. This file can be opened by another process, if permissions
> +allow, to query the addresses shared under this range. These files are
> +removed by the mshare_unlink syscall and cannot be deleted directly.

Oh?

> +Hence these files are created as immutable files.
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index f724129c0425..2a57a6ec6f3e 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -105,5 +105,6 @@
>  #define Z3FOLD_MAGIC		0x33
>  #define PPC_CMM_MAGIC		0xc7571590
>  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define MSHARE_MAGIC		0x4d534852	/* "MSHR" */
>  
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..51a2ab9080d9 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -37,7 +37,7 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
>  CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
>  
>  mmu-y			:= nommu.o
> -mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
> +mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o mshare.o \
>  			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>  			   msync.o page_vma_mapped.o pagewalk.o \
>  			   pgtable-generic.o rmap.o vmalloc.o
> diff --git a/mm/mshare.c b/mm/mshare.c
> new file mode 100644
> index 000000000000..c8fab3869bab
> --- /dev/null
> +++ b/mm/mshare.c

Filesystems are usually supposed to live under fs/; is there some reason
to put it in mm/?

I guess shmfs is in mm so maybe this isn't much of an objection.

Also, should this fs be selectable via a Kconfig option?

--D

> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Enable cooperating processes to share page tables between
> + * them to reduce the extra memory consumed by multiple copies
> + * of page tables.
> + *
> + * This code adds an in-memory filesystem - msharefs.
> + * msharefs is used to manage page table sharing
> + *
> + *
> + * Copyright (C) 2022 Oracle Corp. All rights reserved.
> + * Author:	Khalid Aziz <khalid.aziz@oracle.com>
> + *
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/syscalls.h>
> +#include <linux/uaccess.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/fileattr.h>
> +#include <uapi/linux/magic.h>
> +#include <uapi/linux/limits.h>
> +
> +static struct super_block *msharefs_sb;
> +
> +static const struct file_operations msharefs_file_operations = {
> +	.open	= simple_open,
> +	.llseek	= no_llseek,
> +};
> +
> +static int
> +msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
> +{
> +	unsigned long hash = init_name_hash(dentry);
> +	const unsigned char *s = qstr->name;
> +	unsigned int len = qstr->len;
> +
> +	while (len--)
> +		hash = partial_name_hash(*s++, hash);
> +	qstr->hash = end_name_hash(hash);
> +	return 0;
> +}
> +
> +static const struct dentry_operations msharefs_d_ops = {
> +	.d_hash = msharefs_d_hash,
> +};
> +
> +static int
> +msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
> +{
> +	static const struct tree_descr empty_descr = {""};
> +	int err;
> +
> +	sb->s_d_op = &msharefs_d_ops;
> +	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
> +	if (err)
> +		return err;
> +
> +	msharefs_sb = sb;
> +	return 0;
> +}
> +
> +static int
> +msharefs_get_tree(struct fs_context *fc)
> +{
> +	return get_tree_single(fc, msharefs_fill_super);
> +}
> +
> +static const struct fs_context_operations msharefs_context_ops = {
> +	.get_tree	= msharefs_get_tree,
> +};
> +
> +static int
> +mshare_init_fs_context(struct fs_context *fc)
> +{
> +	fc->ops = &msharefs_context_ops;
> +	return 0;
> +}
> +
> +static struct file_system_type mshare_fs = {
> +	.name			= "msharefs",
> +	.init_fs_context	= mshare_init_fs_context,
> +	.kill_sb		= kill_litter_super,
> +};
> +
> +static int
> +mshare_init(void)
> +{
> +	int ret = 0;
> +
> +	ret = sysfs_create_mount_point(fs_kobj, "mshare");
> +	if (ret)
> +		return ret;
> +
> +	ret = register_filesystem(&mshare_fs);
> +	if (ret)
> +		sysfs_remove_mount_point(fs_kobj, "mshare");
> +
> +	return ret;
> +}
> +
> +fs_initcall(mshare_init);
> -- 
> 2.32.0
> 


* Re: [PATCH v2 4/9] mm/mshare: Add a read operation for msharefs files
  2022-06-30 21:27   ` Darrick J. Wong
@ 2022-06-30 22:27     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-30 22:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 15:27, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 04:53:55PM -0600, Khalid Aziz wrote:
>> When a new file is created under msharefs, allocate a new mm_struct
>> that will hold the VMAs for the mshare region. Also allocate a
>> structure that defines the mshare region and add a read operation
>> that returns this information about the mshare region. Currently
>> this information is returned as a struct:
>>
>> struct mshare_info {
>> 	unsigned long start;
>> 	unsigned long size;
>> };
>>
>> This gives the start address for mshare region and its size.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> ---
>>   include/uapi/linux/mman.h |  5 +++
>>   mm/mshare.c               | 64 ++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 68 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
>> index f55bc680b5b0..56fe446e24b1 100644
>> --- a/include/uapi/linux/mman.h
>> +++ b/include/uapi/linux/mman.h
>> @@ -41,4 +41,9 @@
>>   #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>>   #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>>   
>> +struct mshare_info {
>> +	unsigned long start;
>> +	unsigned long size;
> 
> You might want to make these explicitly u64, since this is userspace
> ABI and you never know when someone will want to do something crazy like
> run 32-bit programs with mshare files.
> 
> Also you might want to add some padding fields for flags, future
> expansion, etc.

That sounds like a good idea. I will queue it up for the next version of the patch series.

> 
>> +};
>> +
>>   #endif /* _UAPI_LINUX_MMAN_H */
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index 2d5924d39221..d238b68b0576 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -22,8 +22,14 @@
>>   #include <uapi/linux/magic.h>
>>   #include <uapi/linux/limits.h>
>>   #include <uapi/linux/mman.h>
>> +#include <linux/sched/mm.h>
>>   
>>   static struct super_block *msharefs_sb;
>> +struct mshare_data {
>> +	struct mm_struct *mm;
>> +	refcount_t refcnt;
>> +	struct mshare_info *minfo;
>> +};
>>   
>>   static const struct inode_operations msharefs_dir_inode_ops;
>>   static const struct inode_operations msharefs_file_inode_ops;
>> @@ -34,8 +40,29 @@ msharefs_open(struct inode *inode, struct file *file)
>>   	return simple_open(inode, file);
>>   }
>>   
>> +static ssize_t
>> +msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>> +{
>> +	struct mshare_data *info = iocb->ki_filp->private_data;
>> +	size_t ret;
>> +	struct mshare_info m_info;
>> +
>> +	if (info->minfo != NULL) {
>> +		m_info.start = info->minfo->start;
>> +		m_info.size = info->minfo->size;
>> +	} else {
>> +		m_info.start = 0;
>> +		m_info.size = 0;
> 
> Hmmm, read()ing out the shared mapping information.  Heh.
> 
> When does this case happen?  Is it before anybody mmaps this file into
> an address space?
> 

It can happen before or after the first mmap which will establish the start address and size. Hence I have to account 
for both cases.

>> +	}
>> +	ret = copy_to_iter(&m_info, sizeof(m_info), iov);
>> +	if (!ret)
>> +		return -EFAULT;
>> +	return ret;
>> +}
>> +
>>   static const struct file_operations msharefs_file_operations = {
>>   	.open		= msharefs_open,
>> +	.read_iter	= msharefs_read,
>>   	.llseek		= no_llseek,
>>   };
>>   
>> @@ -73,12 +100,43 @@ static struct dentry
>>   	return ERR_PTR(-ENOMEM);
>>   }
>>   
>> +static int
>> +msharefs_fill_mm(struct inode *inode)
>> +{
>> +	struct mm_struct *mm;
>> +	struct mshare_data *info = NULL;
>> +	int retval = 0;
>> +
>> +	mm = mm_alloc();
>> +	if (!mm) {
>> +		retval = -ENOMEM;
>> +		goto err_free;
>> +	}
>> +
>> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
>> +	if (!info) {
>> +		retval = -ENOMEM;
>> +		goto err_free;
>> +	}
>> +	info->mm = mm;
>> +	info->minfo = NULL;
>> +	refcount_set(&info->refcnt, 1);
>> +	inode->i_private = info;
>> +
>> +	return 0;
>> +
>> +err_free:
>> +	if (mm)
>> +		mmput(mm);
>> +	kfree(info);
>> +	return retval;
>> +}
>> +
>>   static struct inode
>>   *msharefs_get_inode(struct super_block *sb, const struct inode *dir,
>>   			umode_t mode)
>>   {
>>   	struct inode *inode = new_inode(sb);
>> -
>>   	if (inode) {
>>   		inode->i_ino = get_next_ino();
>>   		inode_init_owner(&init_user_ns, inode, dir, mode);
>> @@ -89,6 +147,10 @@ static struct inode
>>   		case S_IFREG:
>>   			inode->i_op = &msharefs_file_inode_ops;
>>   			inode->i_fop = &msharefs_file_operations;
>> +			if (msharefs_fill_mm(inode) != 0) {
>> +				discard_new_inode(inode);
>> +				inode = ERR_PTR(-ENOMEM);
> 
> Is it intentional to clobber the msharefs_fill_mm return value and
> replace it with ENOMEM?

ENOMEM sounded like the right value to return from msharefs_get_inode() in case of failure. On the other hand, there 
isn't much of a reason to not just return the return value from msharefs_fill_mm(). I can change that.

Thanks for the review.

--
Khalid



* Re: [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories
  2022-06-30 21:34   ` Darrick J. Wong
@ 2022-06-30 22:49     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-30 22:49 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 15:34, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 04:53:54PM -0600, Khalid Aziz wrote:
>> Make msharefs filesystem writable and allow creating directories
>> to support better access control to mshare'd regions defined in
>> msharefs.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> ---
>>   mm/mshare.c | 195 +++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 186 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index 3e448e11c742..2d5924d39221 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -21,11 +21,21 @@
>>   #include <linux/fileattr.h>
>>   #include <uapi/linux/magic.h>
>>   #include <uapi/linux/limits.h>
>> +#include <uapi/linux/mman.h>
>>   
>>   static struct super_block *msharefs_sb;
>>   
>> +static const struct inode_operations msharefs_dir_inode_ops;
>> +static const struct inode_operations msharefs_file_inode_ops;
>> +
>> +static int
>> +msharefs_open(struct inode *inode, struct file *file)
>> +{
>> +	return simple_open(inode, file);
>> +}
>> +
>>   static const struct file_operations msharefs_file_operations = {
>> -	.open		= simple_open,
>> +	.open		= msharefs_open,
>>   	.llseek		= no_llseek,
>>   };
>>   
>> @@ -42,6 +52,113 @@ msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
>>   	return 0;
>>   }
>>   
>> +static struct dentry
>> +*msharefs_alloc_dentry(struct dentry *parent, const char *name)
>> +{
>> +	struct dentry *d;
>> +	struct qstr q;
>> +	int err;
>> +
>> +	q.name = name;
>> +	q.len = strlen(name);
>> +
>> +	err = msharefs_d_hash(parent, &q);
>> +	if (err)
>> +		return ERR_PTR(err);
>> +
>> +	d = d_alloc(parent, &q);
>> +	if (d)
>> +		return d;
>> +
>> +	return ERR_PTR(-ENOMEM);
>> +}
>> +
>> +static struct inode
>> +*msharefs_get_inode(struct super_block *sb, const struct inode *dir,
>> +			umode_t mode)
>> +{
>> +	struct inode *inode = new_inode(sb);
>> +
>> +	if (inode) {
> 
> Not sure why you wouldn't go with the less-indently version:
> 
> 	if (!inode)
> 		return ERR_PTR(-ENOMEM);
> 
> 	inode->i_ino = get_next_ino();
> 	<etc>
> 

Yeah, good idea. I will change it.

>> +		inode->i_ino = get_next_ino();
>> +		inode_init_owner(&init_user_ns, inode, dir, mode);
>> +
>> +		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
>> +
>> +		switch (mode & S_IFMT) {
> 
> Shouldn't we set the mode somewhere?

mode is passed in as a parameter to msharefs_get_inode(), which uses this value to determine its actions.

> 
>> +		case S_IFREG:
>> +			inode->i_op = &msharefs_file_inode_ops;
>> +			inode->i_fop = &msharefs_file_operations;
>> +			break;
>> +		case S_IFDIR:
>> +			inode->i_op = &msharefs_dir_inode_ops;
>> +			inode->i_fop = &simple_dir_operations;
>> +			inc_nlink(inode);
>> +			break;
>> +		case S_IFLNK:
>> +			inode->i_op = &page_symlink_inode_operations;
>> +			break;
>> +		default:
>> +			discard_new_inode(inode);
>> +			inode = NULL;
>> +			break;
>> +		}
>> +	}
>> +
>> +	return inode;
>> +}
>> +
>> +static int
>> +msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
>> +		struct dentry *dentry, umode_t mode, dev_t dev)
>> +{
>> +	struct inode *inode;
>> +	int err = 0;
>> +
>> +	inode = msharefs_get_inode(dir->i_sb, dir, mode);
>> +	if (IS_ERR(inode))
>> +		return PTR_ERR(inode);
> 
> ...and if @inode is NULL?

Oh right, IS_ERR() does not check for a NULL value. I will add a check for that and return ENOMEM.

> 
>> +
>> +	d_instantiate(dentry, inode);
>> +	dget(dentry);
>> +	dir->i_mtime = dir->i_ctime = current_time(dir);
>> +
>> +	return err;
>> +}
>> +
>> +static int
>> +msharefs_create(struct user_namespace *mnt_userns, struct inode *dir,
>> +		struct dentry *dentry, umode_t mode, bool excl)
>> +{
>> +	return msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFREG, 0);
>> +}
>> +
>> +static int
>> +msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
>> +		struct dentry *dentry, umode_t mode)
>> +{
>> +	int ret = msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFDIR, 0);
>> +
>> +	if (!ret)
>> +		inc_nlink(dir);
>> +	return ret;
>> +}
>> +
>> +static const struct inode_operations msharefs_file_inode_ops = {
>> +	.setattr	= simple_setattr,
>> +	.getattr	= simple_getattr,
>> +};
>> +static const struct inode_operations msharefs_dir_inode_ops = {
>> +	.create		= msharefs_create,
>> +	.lookup		= simple_lookup,
>> +	.link		= simple_link,
>> +	.unlink		= simple_unlink,
>> +	.mkdir		= msharefs_mkdir,
>> +	.rmdir		= simple_rmdir,
>> +	.mknod		= msharefs_mknod,
>> +	.rename		= simple_rename,
>> +};
>> +
>>   static void
>>   mshare_evict_inode(struct inode *inode)
>>   {
>> @@ -58,7 +175,7 @@ mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
>>   {
>>   	char s[80];
>>   
>> -	sprintf(s, "%ld", PGDIR_SIZE);
>> +	sprintf(s, "%ld\n", PGDIR_SIZE);
> 
> Changing this already?

Possibly. There is a suggestion to change it to PMD_SIZE, and that might be a better choice.

> 
>>   	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>>   }
>>   
>> @@ -72,6 +189,38 @@ static const struct super_operations mshare_s_ops = {
>>   	.evict_inode = mshare_evict_inode,
>>   };
>>   
>> +static int
>> +prepopulate_files(struct super_block *s, struct inode *dir,
>> +			struct dentry *root, const struct tree_descr *files)
>> +{
>> +	int i;
>> +	struct inode *inode;
>> +	struct dentry *dentry;
>> +
>> +	for (i = 0; !files->name || files->name[0]; i++, files++) {
>> +		if (!files->name)
>> +			continue;
> 
> What ends the array?  NULL name or empty name?
> Do we have to erase all of these when the fs gets unmounted?

This code is very similar to simple_fill_super() and I reused the code from there. The inodes and dentries will need to be erased on unmount through evict_inode.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file
  2022-06-30 21:37   ` Darrick J. Wong
@ 2022-06-30 22:54     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-30 22:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 15:37, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 04:53:53PM -0600, Khalid Aziz wrote:
>> Users of mshare feature to share page tables need to know the size
>> and alignment requirement for shared regions. Pre-populate msharefs
>> with a file, mshare_info, that provides this information.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> ---
>>   mm/mshare.c | 62 +++++++++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 48 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index c8fab3869bab..3e448e11c742 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -25,8 +25,8 @@
>>   static struct super_block *msharefs_sb;
>>   
>>   static const struct file_operations msharefs_file_operations = {
>> -	.open	= simple_open,
>> -	.llseek	= no_llseek,
>> +	.open		= simple_open,
>> +	.llseek		= no_llseek,
> 
> I feel like there's a lot of churn between the previous patch and this
> one that could have been in the previous patch.

I will look at what changes can be consolidated between the two patches, including this change. Thanks for catching this.

> 
>>   };
>>   
>>   static int
>> @@ -42,23 +42,52 @@ msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
>>   	return 0;
>>   }
>>   
>> +static void
>> +mshare_evict_inode(struct inode *inode)
>> +{
>> +	clear_inode(inode);
>> +}
>> +
>>   static const struct dentry_operations msharefs_d_ops = {
>>   	.d_hash = msharefs_d_hash,
>>   };
>>   
>> +static ssize_t
>> +mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
>> +		loff_t *ppos)
>> +{
>> +	char s[80];
>> +
>> +	sprintf(s, "%ld", PGDIR_SIZE);
> 
> SO what is this "mshare_info" file supposed to reveal?  Hugepage size?
> I wonder why this isn't exported in struct mshare_info?

This file gives the alignment and size requirement for mshare regions, which is the same for every mshare region. struct mshare_info is a per-region structure that provides the starting address and size for just that region.

--
Khalid

> 
>> +	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>> +}
>> +
>> +static const struct file_operations mshare_info_ops = {
>> +	.read   = mshare_info_read,
>> +	.llseek	= noop_llseek,
>> +};
>> +
>> +static const struct super_operations mshare_s_ops = {
>> +	.statfs	     = simple_statfs,
>> +	.evict_inode = mshare_evict_inode,
>> +};
>> +
>>   static int
>>   msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
>>   {
>> -	static const struct tree_descr empty_descr = {""};
>> +	static const struct tree_descr mshare_files[] = {
>> +		[2] = { "mshare_info", &mshare_info_ops, 0444},
>> +		{""},
>> +	};
>>   	int err;
>>   
>> -	sb->s_d_op = &msharefs_d_ops;
>> -	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
>> -	if (err)
>> -		return err;
>> -
>> -	msharefs_sb = sb;
>> -	return 0;
>> +	err = simple_fill_super(sb, MSHARE_MAGIC, mshare_files);
>> +	if (!err) {
>> +		msharefs_sb = sb;
>> +		sb->s_d_op = &msharefs_d_ops;
>> +		sb->s_op = &mshare_s_ops;
>> +	}
>> +	return err;
>>   }
>>   
>>   static int
>> @@ -84,20 +113,25 @@ static struct file_system_type mshare_fs = {
>>   	.kill_sb		= kill_litter_super,
>>   };
>>   
>> -static int
>> +static int __init
>>   mshare_init(void)
>>   {
>>   	int ret = 0;
>>   
>>   	ret = sysfs_create_mount_point(fs_kobj, "mshare");
>>   	if (ret)
>> -		return ret;
>> +		goto out;
>>   
>>   	ret = register_filesystem(&mshare_fs);
>> -	if (ret)
>> +	if (ret) {
>>   		sysfs_remove_mount_point(fs_kobj, "mshare");
>> +		goto out;
>> +	}
>> +
>> +	return 0;
>>   
>> +out:
>>   	return ret;
>>   }
>>   
>> -fs_initcall(mshare_init);
>> +core_initcall(mshare_init);
>> -- 
>> 2.32.0
>>



* Re: [PATCH v2 1/9] mm: Add msharefs filesystem
  2022-06-29 22:53 ` [PATCH v2 1/9] mm: Add msharefs filesystem Khalid Aziz
  2022-06-30 21:53   ` Darrick J. Wong
@ 2022-06-30 22:57   ` Al Viro
  2022-07-01 16:08     ` Khalid Aziz
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2022-06-30 22:57 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:52PM -0600, Khalid Aziz wrote:
> +static int
> +msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
> +{
> +	unsigned long hash = init_name_hash(dentry);
> +	const unsigned char *s = qstr->name;
> +	unsigned int len = qstr->len;
> +
> +	while (len--)
> +		hash = partial_name_hash(*s++, hash);
> +	qstr->hash = end_name_hash(hash);
> +	return 0;
> +}

What do you need that for and how is it different from letting it
use full_name_hash() (which is what it will do if you leave
dentry_operations->d_hash equal to NULL)?


* Re: [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file
  2022-06-29 22:53 ` [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file Khalid Aziz
  2022-06-30 21:37   ` Darrick J. Wong
@ 2022-06-30 23:01   ` Al Viro
  2022-07-01 16:11     ` Khalid Aziz
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2022-06-30 23:01 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:53PM -0600, Khalid Aziz wrote:

> +static void
> +mshare_evict_inode(struct inode *inode)
> +{
> +	clear_inode(inode);
> +}

Again, what for?  And while we are at it, shouldn't you evict the
pages when inode gets freed and ->i_data along with it?
IOW, aren't you missing
                truncate_inode_pages_final(&inode->i_data);
That, or just leave ->evict_inode NULL...


* Re: [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories
  2022-06-29 22:53 ` [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories Khalid Aziz
  2022-06-30 21:34   ` Darrick J. Wong
@ 2022-06-30 23:09   ` Al Viro
  2022-07-02  0:22     ` Khalid Aziz
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2022-06-30 23:09 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, Jun 29, 2022 at 04:53:54PM -0600, Khalid Aziz wrote:

> +static int
> +msharefs_open(struct inode *inode, struct file *file)
> +{
> +	return simple_open(inode, file);
> +}

Again, whatever for?

> +static struct dentry
> +*msharefs_alloc_dentry(struct dentry *parent, const char *name)
> +{
> +	struct dentry *d;
> +	struct qstr q;
> +	int err;
> +
> +	q.name = name;
> +	q.len = strlen(name);
> +
> +	err = msharefs_d_hash(parent, &q);
> +	if (err)
> +		return ERR_PTR(err);
> +
> +	d = d_alloc(parent, &q);
> +	if (d)
> +		return d;
> +
> +	return ERR_PTR(-ENOMEM);
> +}

And it's different from d_alloc_name() how, exactly?

> +		case S_IFLNK:
> +			inode->i_op = &page_symlink_inode_operations;
> +			break;

Really?  You've got symlinks here?

> +		default:
> +			discard_new_inode(inode);
> +			inode = NULL;

That's an odd way to spell BUG()...

> +static int
> +msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
> +		struct dentry *dentry, umode_t mode, dev_t dev)
> +{
> +	struct inode *inode;
> +	int err = 0;
> +
> +	inode = msharefs_get_inode(dir->i_sb, dir, mode);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	d_instantiate(dentry, inode);
> +	dget(dentry);
> +	dir->i_mtime = dir->i_ctime = current_time(dir);
> +
> +	return err;
> +}

BTW, what's the point of having device nodes on that thing?

> +static int
> +msharefs_create(struct user_namespace *mnt_userns, struct inode *dir,
> +		struct dentry *dentry, umode_t mode, bool excl)
> +{
> +	return msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFREG, 0);
> +}
> +
> +static int
> +msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
> +		struct dentry *dentry, umode_t mode)
> +{
> +	int ret = msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFDIR, 0);
> +
> +	if (!ret)
> +		inc_nlink(dir);
> +	return ret;
> +}
> +
> +static const struct inode_operations msharefs_file_inode_ops = {
> +	.setattr	= simple_setattr,
> +	.getattr	= simple_getattr,
> +};
> +static const struct inode_operations msharefs_dir_inode_ops = {
> +	.create		= msharefs_create,
> +	.lookup		= simple_lookup,
> +	.link		= simple_link,
> +	.unlink		= simple_unlink,
> +	.mkdir		= msharefs_mkdir,
> +	.rmdir		= simple_rmdir,
> +	.mknod		= msharefs_mknod,
> +	.rename		= simple_rename,
> +};
> +
>  static void
>  mshare_evict_inode(struct inode *inode)
>  {
> @@ -58,7 +175,7 @@ mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
>  {
>  	char s[80];
>  
> -	sprintf(s, "%ld", PGDIR_SIZE);
> +	sprintf(s, "%ld\n", PGDIR_SIZE);
>  	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>  }
>  
> @@ -72,6 +189,38 @@ static const struct super_operations mshare_s_ops = {
>  	.evict_inode = mshare_evict_inode,
>  };
>  
> +static int
> +prepopulate_files(struct super_block *s, struct inode *dir,
> +			struct dentry *root, const struct tree_descr *files)
> +{
> +	int i;
> +	struct inode *inode;
> +	struct dentry *dentry;
> +
> +	for (i = 0; !files->name || files->name[0]; i++, files++) {
> +		if (!files->name)
> +			continue;
> +
> +		dentry = msharefs_alloc_dentry(root, files->name);
> +		if (!dentry)
> +			return -ENOMEM;
> +
> +		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode);
> +		if (!inode) {
> +			dput(dentry);
> +			return -ENOMEM;
> +		}
> +		inode->i_mode = S_IFREG | files->mode;
> +		inode->i_atime = inode->i_mtime = inode->i_ctime
> +			= current_time(inode);
> +		inode->i_fop = files->ops;
> +		inode->i_ino = i;
> +		d_add(dentry, inode);
> +	}
> +
> +	return 0;
> +}

Looks remarkably similar to something I've seen somewhere... fs/libfs.c,
if I'm not mistaken...

Sarcasm aside, what's wrong with using simple_fill_super()?


* Re: [PATCH v2 6/9] mm/mshare: Add mmap operation
  2022-06-30 21:44   ` Darrick J. Wong
@ 2022-06-30 23:30     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-06-30 23:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 15:44, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 04:53:57PM -0600, Khalid Aziz wrote:
>> mmap is used to establish address range for mshare region and map the
>> region into process's address space. Add basic mmap operation that
>> supports setting address range. Also fix code to not allocate new
>> mm_struct for files in msharefs that exist for information and not
>> for defining a new mshare region.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>   mm/mshare.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 41 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index d238b68b0576..088a6cab1e93 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -9,7 +9,8 @@
>>    *
>>    *
>>    * Copyright (C) 2022 Oracle Corp. All rights reserved.
>> - * Author:	Khalid Aziz <khalid.aziz@oracle.com>
>> + * Authors:	Khalid Aziz <khalid.aziz@oracle.com>
>> + *		Matthew Wilcox <willy@infradead.org>
>>    *
>>    */
>>   
>> @@ -60,9 +61,36 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>>   	return ret;
>>   }
>>   
>> +static int
>> +msharefs_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	struct mshare_data *info = file->private_data;
>> +	struct mm_struct *mm = info->mm;
>> +
>> +	/*
>> +	 * If this mshare region has been set up once already, bail out
>> +	 */
>> +	if (mm->mmap_base != 0)
>> +		return -EINVAL;
>> +
>> +	if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
>> +		return -EINVAL;
>> +
>> +	mm->mmap_base = vma->vm_start;
>> +	mm->task_size = vma->vm_end - vma->vm_start;
>> +	if (!mm->task_size)
>> +		mm->task_size--;
>> +	info->minfo->start = mm->mmap_base;
>> +	info->minfo->size = mm->task_size;
> 
> So, uh, if the second mmap() caller decides to ignore the mshare_info,
> should they get an -EINVAL here since the memory mappings won't be at
> the same process virtual address?

Yes, that is in patch 9. A second mmap will result in -EINVAL until patch 9, irrespective of the address and size passed to mmap.

> 
>> +	vma->vm_flags |= VM_SHARED_PT;
>> +	vma->vm_private_data = info;
>> +	return 0;
>> +}
>> +
>>   static const struct file_operations msharefs_file_operations = {
>>   	.open		= msharefs_open,
>>   	.read_iter	= msharefs_read,
>> +	.mmap		= msharefs_mmap,
>>   	.llseek		= no_llseek,
>>   };
>>   
>> @@ -119,7 +147,12 @@ msharefs_fill_mm(struct inode *inode)
>>   		goto err_free;
>>   	}
>>   	info->mm = mm;
>> -	info->minfo = NULL;
>> +	info->minfo = kzalloc(sizeof(struct mshare_info), GFP_KERNEL);
>> +	if (info->minfo == NULL) {
>> +		retval = -ENOMEM;
>> +		goto err_free;
>> +	}
>> +
>>   	refcount_set(&info->refcnt, 1);
>>   	inode->i_private = info;
>>   
>> @@ -128,13 +161,14 @@ msharefs_fill_mm(struct inode *inode)
>>   err_free:
>>   	if (mm)
>>   		mmput(mm);
>> +	kfree(info->minfo);
>>   	kfree(info);
>>   	return retval;
>>   }
>>   
>>   static struct inode
>>   *msharefs_get_inode(struct super_block *sb, const struct inode *dir,
>> -			umode_t mode)
>> +			umode_t mode, bool newmm)
>>   {
>>   	struct inode *inode = new_inode(sb);
>>   	if (inode) {
>> @@ -147,7 +181,7 @@ static struct inode
>>   		case S_IFREG:
>>   			inode->i_op = &msharefs_file_inode_ops;
>>   			inode->i_fop = &msharefs_file_operations;
>> -			if (msharefs_fill_mm(inode) != 0) {
>> +			if (newmm && msharefs_fill_mm(inode) != 0) {
>>   				discard_new_inode(inode);
>>   				inode = ERR_PTR(-ENOMEM);
>>   			}
>> @@ -177,7 +211,7 @@ msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
>>   	struct inode *inode;
>>   	int err = 0;
>>   
>> -	inode = msharefs_get_inode(dir->i_sb, dir, mode);
>> +	inode = msharefs_get_inode(dir->i_sb, dir, mode, true);
>>   	if (IS_ERR(inode))
>>   		return PTR_ERR(inode);
>>   
>> @@ -267,7 +301,7 @@ prepopulate_files(struct super_block *s, struct inode *dir,
>>   		if (!dentry)
>>   			return -ENOMEM;
>>   
>> -		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode);
>> +		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode, false);
> 
> I was wondering why the information files were getting their own
> mshare_data.
> 
> TBH I'm not really sure what the difference is between mshare_data and
> mshare_info, since those names are not especially distinct.

mshare_data is a superset and internal, while mshare_info is what is sent back to userspace when it reads a file representing an mshare region.

> 
>>   		if (!inode) {
>>   			dput(dentry);
>>   			return -ENOMEM;
>> @@ -301,7 +335,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
>>   	sb->s_d_op		= &msharefs_d_ops;
>>   	sb->s_time_gran		= 1;
>>   
>> -	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777);
>> +	inode = msharefs_get_inode(sb, NULL, S_IFDIR | 0777, false);
> 
> Is it wise to default to world-writable?  Surely whatever userspace
> software wraps an msharefs can relax permissions as needed.
> 

Since this is for the root inode, the default allows any process to create an mshare region in msharefs, which I think is most flexible. Individual userspace apps can create mshare regions with any permissions they deem fit using open(). Does that make sense?

Thanks,
Khalid


* Re: [PATCH v2 7/9] mm/mshare: Add unlink and munmap support
  2022-06-30 21:50   ` Darrick J. Wong
@ 2022-07-01 15:58     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-07-01 15:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 15:50, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 04:53:58PM -0600, Khalid Aziz wrote:
>> Number of mappings of an mshare region should be tracked so it can
>> be removed when there are no more references to it and associated
>> file has been deleted. This add code to support the unlink operation
>> for associated file, remove the mshare region on file deletion if
>> refcount goes to zero, add munmap operation to maintain refcount
>> to mshare region and remove it on last munmap if file has been
>> deleted.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> ---
>>   mm/mshare.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 63 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index 088a6cab1e93..90ce0564a138 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -29,6 +29,7 @@ static struct super_block *msharefs_sb;
>>   struct mshare_data {
>>   	struct mm_struct *mm;
>>   	refcount_t refcnt;
>> +	int deleted;
>>   	struct mshare_info *minfo;
>>   };
>>   
>> @@ -48,6 +49,7 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>>   	size_t ret;
>>   	struct mshare_info m_info;
>>   
>> +	mmap_read_lock(info->mm);
>>   	if (info->minfo != NULL) {
>>   		m_info.start = info->minfo->start;
>>   		m_info.size = info->minfo->size;
>> @@ -55,18 +57,42 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>>   		m_info.start = 0;
>>   		m_info.size = 0;
>>   	}
>> +	mmap_read_unlock(info->mm);
>>   	ret = copy_to_iter(&m_info, sizeof(m_info), iov);
>>   	if (!ret)
>>   		return -EFAULT;
>>   	return ret;
>>   }
>>   
>> +static void
>> +msharefs_close(struct vm_area_struct *vma)
>> +{
>> +	struct mshare_data *info = vma->vm_private_data;
>> +
>> +	if (refcount_dec_and_test(&info->refcnt)) {
>> +		mmap_read_lock(info->mm);
>> +		if (info->deleted) {
>> +			mmap_read_unlock(info->mm);
>> +			mmput(info->mm);
>> +			kfree(info->minfo);
>> +			kfree(info);
> 
> Aren't filesystems supposed to take care of disposing of the file data
> in destroy_inode?  IIRC struct inode doesn't go away until all fds are
> closed, mappings are torn down, and there are no more references from
> dentries.  I could be misremembering since it's been a few months since
> I went looking at the (VFS) inode lifecycle.

Documentation (vfs.rst) says: "this method is called by destroy_inode() to release resources allocated for struct inode. It is only required if ->alloc_inode was defined and simply undoes anything done by ->alloc_inode." I am not defining alloc_inode, so I assumed I do not need to define destroy_inode, and the standard destroy_inode will do the right thing since the standard alloc_inode is being used.

Are you suggesting per-region mshare_data should be freed in destroy_inode instead of in close?

> 
>> +		} else {
>> +			mmap_read_unlock(info->mm);
>> +		}
>> +	}
>> +}
>> +
>> +static const struct vm_operations_struct msharefs_vm_ops = {
>> +	.close	= msharefs_close,
>> +};
>> +
>>   static int
>>   msharefs_mmap(struct file *file, struct vm_area_struct *vma)
>>   {
>>   	struct mshare_data *info = file->private_data;
>>   	struct mm_struct *mm = info->mm;
>>   
>> +	mmap_write_lock(mm);
>>   	/*
>>   	 * If this mshare region has been set up once already, bail out
>>   	 */
>> @@ -80,10 +106,14 @@ msharefs_mmap(struct file *file, struct vm_area_struct *vma)
>>   	mm->task_size = vma->vm_end - vma->vm_start;
>>   	if (!mm->task_size)
>>   		mm->task_size--;
>> +	mmap_write_unlock(mm);
>>   	info->minfo->start = mm->mmap_base;
>>   	info->minfo->size = mm->task_size;
>> +	info->deleted = 0;
>> +	refcount_inc(&info->refcnt);
>>   	vma->vm_flags |= VM_SHARED_PT;
>>   	vma->vm_private_data = info;
>> +	vma->vm_ops = &msharefs_vm_ops;
>>   	return 0;
>>   }
>>   
>> @@ -240,6 +270,38 @@ msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
>>   	return ret;
>>   }
>>   
>> +static int
>> +msharefs_unlink(struct inode *dir, struct dentry *dentry)
>> +{
>> +	struct inode *inode = d_inode(dentry);
>> +	struct mshare_data *info = inode->i_private;
>> +
>> +	/*
>> +	 * Unmap the mshare region if it is still mapped in
>> +	 */
>> +	vm_munmap(info->minfo->start, info->minfo->size);
>> +
>> +	/*
>> +	 * Mark msharefs file for deletion so it can not be opened
>> +	 * and used for mshare mappings any more
>> +	 */
>> +	simple_unlink(dir, dentry);
>> +	mmap_write_lock(info->mm);
>> +	info->deleted = 1;
>> +	mmap_write_unlock(info->mm);
> 
> What if the file is hardlinked?

It looks like that is a bug currently. I need to account for that. Thanks!

--
Khalid



* Re: [PATCH v2 1/9] mm: Add msharefs filesystem
  2022-06-30 21:53   ` Darrick J. Wong
@ 2022-07-01 16:05     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-07-01 16:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 15:53, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 04:53:52PM -0600, Khalid Aziz wrote:
>> Add a ram-based filesystem that contains page table sharing
>> information and files that enables processes to share page tables.
>> This patch adds the basic filesystem that can be mounted.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> ---
>>   Documentation/filesystems/msharefs.rst |  19 +++++
>>   include/uapi/linux/magic.h             |   1 +
>>   mm/Makefile                            |   2 +-
>>   mm/mshare.c                            | 103 +++++++++++++++++++++++++
>>   4 files changed, 124 insertions(+), 1 deletion(-)
>>   create mode 100644 Documentation/filesystems/msharefs.rst
>>   create mode 100644 mm/mshare.c
>>
>> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
>> new file mode 100644
>> index 000000000000..fd161f67045d
>> --- /dev/null
>> +++ b/Documentation/filesystems/msharefs.rst
>> @@ -0,0 +1,19 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================================================
>> +msharefs - a filesystem to support shared page tables
>> +=====================================================
>> +
>> +msharefs is a ram-based filesystem that allows multiple processes to
>> +share page table entries for shared pages.
>> +
>> +msharefs is typically mounted like this::
>> +
>> +	mount -t msharefs none /sys/fs/mshare
>> +
>> +When a process calls mshare syscall with a name for the shared address
>> +range,
> 
> You mean creat()?
> 
>> a file with the same name is created under msharefs with that
>> +name. This file can be opened by another process, if permissions
>> +allow, to query the addresses shared under this range. These files are
>> +removed by mshare_unlink syscall and can not be deleted directly.
> 
> Oh?
> 

msharefs.rst needs to be updated.

>> +Hence these files are created as immutable files.
>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>> index f724129c0425..2a57a6ec6f3e 100644
>> --- a/include/uapi/linux/magic.h
>> +++ b/include/uapi/linux/magic.h
>> @@ -105,5 +105,6 @@
>>   #define Z3FOLD_MAGIC		0x33
>>   #define PPC_CMM_MAGIC		0xc7571590
>>   #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
>> +#define MSHARE_MAGIC		0x4d534852	/* "MSHR" */
>>   
>>   #endif /* __LINUX_MAGIC_H__ */
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 6f9ffa968a1a..51a2ab9080d9 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -37,7 +37,7 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
>>   CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
>>   
>>   mmu-y			:= nommu.o
>> -mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
>> +mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o mshare.o \
>>   			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>>   			   msync.o page_vma_mapped.o pagewalk.o \
>>   			   pgtable-generic.o rmap.o vmalloc.o
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> new file mode 100644
>> index 000000000000..c8fab3869bab
>> --- /dev/null
>> +++ b/mm/mshare.c
> 
> Filesystems are usually supposed to live under fs/; is there some reason
> to put it in mm/?
> 
> I guess shmfs is in mm so maybe this isn't much of an objection.
> 
> Also, should this fs be selectable via a Kconfig option?

Since this filesystem is meant to support an mm feature, I felt it was more appropriate for it to reside under mm/, similar to shmfs.

I could add a Kconfig option. The option would be to enable the mshare feature. msharefs would automatically be enabled when mshare is enabled, i.e. msharefs shouldn't be a visible Kconfig option. Do we see a reason to make mshare an optional feature? If we can base hugetlbfs page table sharing on mshare in the future, this will not be an optional feature at that time and the mshare Kconfig option will have to be removed.

Thanks,
Khalid




* Re: [PATCH v2 1/9] mm: Add msharefs filesystem
  2022-06-30 22:57   ` Al Viro
@ 2022-07-01 16:08     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-07-01 16:08 UTC (permalink / raw)
  To: Al Viro
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 16:57, Al Viro wrote:
> On Wed, Jun 29, 2022 at 04:53:52PM -0600, Khalid Aziz wrote:
>> +static int
>> +msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
>> +{
>> +	unsigned long hash = init_name_hash(dentry);
>> +	const unsigned char *s = qstr->name;
>> +	unsigned int len = qstr->len;
>> +
>> +	while (len--)
>> +		hash = partial_name_hash(*s++, hash);
>> +	qstr->hash = end_name_hash(hash);
>> +	return 0;
>> +}
> 
> What do you need that for and how is it different from letting it
> use full_name_hash() (which is what it will do if you leave
> dentry_operations->d_hash equal to NULL)?

I don't have a specific reason to use msharefs_d_hash(). If full_name_hash() can work, I don't mind reducing the amount of code in my patch. I will take a look at it.

Thanks,
Khalid


* Re: [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file
  2022-06-30 23:01   ` Al Viro
@ 2022-07-01 16:11     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-07-01 16:11 UTC (permalink / raw)
  To: Al Viro
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 17:01, Al Viro wrote:
> On Wed, Jun 29, 2022 at 04:53:53PM -0600, Khalid Aziz wrote:
> 
>> +static void
>> +mshare_evict_inode(struct inode *inode)
>> +{
>> +	clear_inode(inode);
>> +}
> 
> Again, what for?  And while we are at it, shouldn't you evict the
> pages when inode gets freed and ->i_data along with it?
> IOW, aren't you missing
>                  truncate_inode_pages_final(&inode->i_data);
> That, or just leave ->evict_inode NULL...

Good suggestion. I thought I would need to do more in evict_inode(), but it turned out I didn't. It is time to eject this routine from my code.

Thanks,
Khalid


* Re: [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories
  2022-06-30 23:09   ` Al Viro
@ 2022-07-02  0:22     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-07-02  0:22 UTC (permalink / raw)
  To: Al Viro
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 6/30/22 17:09, Al Viro wrote:
> On Wed, Jun 29, 2022 at 04:53:54PM -0600, Khalid Aziz wrote:
> 
>> +static int
>> +msharefs_open(struct inode *inode, struct file *file)
>> +{
>> +	return simple_open(inode, file);
>> +}
> 
> Again, whatever for? >
>> +static struct dentry
>> +*msharefs_alloc_dentry(struct dentry *parent, const char *name)
>> +{
>> +	struct dentry *d;
>> +	struct qstr q;
>> +	int err;
>> +
>> +	q.name = name;
>> +	q.len = strlen(name);
>> +
>> +	err = msharefs_d_hash(parent, &q);
>> +	if (err)
>> +		return ERR_PTR(err);
>> +
>> +	d = d_alloc(parent, &q);
>> +	if (d)
>> +		return d;
>> +
>> +	return ERR_PTR(-ENOMEM);
>> +}
> 
> And it's different from d_alloc_name() how, exactly?

By making minor changes to my other code, I was able to use all of the standard functions you pointed out. That 
simplified my patch quite a bit. Thank you!

> 
>> +		case S_IFLNK:
>> +			inode->i_op = &page_symlink_inode_operations;
>> +			break;
> 
> Really?  You've got symlinks here?

I intended to support symlinks on msharefs, but I am not sure I see a use case at this time. I can drop support for
symlinks and add it in the future if a use case emerges.

> 
>> +		default:
>> +			discard_new_inode(inode);
>> +			inode = NULL;
> 
> That's an odd way to spell BUG()...

I think what you are saying is that this default case represents a bug and I should report it as such. Is that right,
or should I not have a default case at all (which is what I am seeing in some of the other places)?

> 
>> +static int
>> +msharefs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
>> +		struct dentry *dentry, umode_t mode, dev_t dev)
>> +{
>> +	struct inode *inode;
>> +	int err = 0;
>> +
>> +	inode = msharefs_get_inode(dir->i_sb, dir, mode);
>> +	if (IS_ERR(inode))
>> +		return PTR_ERR(inode);
>> +
>> +	d_instantiate(dentry, inode);
>> +	dget(dentry);
>> +	dir->i_mtime = dir->i_ctime = current_time(dir);
>> +
>> +	return err;
>> +}
> 
> BTW, what's the point of having device nodes on that thing?

There will be no device nodes on msharefs. Are you referring to the dev_t parameter in the msharefs_mknod()
declaration? If so, I am following the prototype for that function from fs.h:

         int (*mknod) (struct user_namespace *, struct inode *, struct dentry *,
                       umode_t, dev_t);

If I am misunderstanding, please correct me.

> 
>> +static int
>> +msharefs_create(struct user_namespace *mnt_userns, struct inode *dir,
>> +		struct dentry *dentry, umode_t mode, bool excl)
>> +{
>> +	return msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFREG, 0);
>> +}
>> +
>> +static int
>> +msharefs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
>> +		struct dentry *dentry, umode_t mode)
>> +{
>> +	int ret = msharefs_mknod(&init_user_ns, dir, dentry, mode | S_IFDIR, 0);
>> +
>> +	if (!ret)
>> +		inc_nlink(dir);
>> +	return ret;
>> +}
>> +
>> +static const struct inode_operations msharefs_file_inode_ops = {
>> +	.setattr	= simple_setattr,
>> +	.getattr	= simple_getattr,
>> +};
>> +static const struct inode_operations msharefs_dir_inode_ops = {
>> +	.create		= msharefs_create,
>> +	.lookup		= simple_lookup,
>> +	.link		= simple_link,
>> +	.unlink		= simple_unlink,
>> +	.mkdir		= msharefs_mkdir,
>> +	.rmdir		= simple_rmdir,
>> +	.mknod		= msharefs_mknod,
>> +	.rename		= simple_rename,
>> +};
>> +
>>   static void
>>   mshare_evict_inode(struct inode *inode)
>>   {
>> @@ -58,7 +175,7 @@ mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
>>   {
>>   	char s[80];
>>   
>> -	sprintf(s, "%ld", PGDIR_SIZE);
>> +	sprintf(s, "%ld\n", PGDIR_SIZE);
>>   	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>>   }
>>   
>> @@ -72,6 +189,38 @@ static const struct super_operations mshare_s_ops = {
>>   	.evict_inode = mshare_evict_inode,
>>   };
>>   
>> +static int
>> +prepopulate_files(struct super_block *s, struct inode *dir,
>> +			struct dentry *root, const struct tree_descr *files)
>> +{
>> +	int i;
>> +	struct inode *inode;
>> +	struct dentry *dentry;
>> +
>> +	for (i = 0; !files->name || files->name[0]; i++, files++) {
>> +		if (!files->name)
>> +			continue;
>> +
>> +		dentry = msharefs_alloc_dentry(root, files->name);
>> +		if (!dentry)
>> +			return -ENOMEM;
>> +
>> +		inode = msharefs_get_inode(s, dir, S_IFREG | files->mode);
>> +		if (!inode) {
>> +			dput(dentry);
>> +			return -ENOMEM;
>> +		}
>> +		inode->i_mode = S_IFREG | files->mode;
>> +		inode->i_atime = inode->i_mtime = inode->i_ctime
>> +			= current_time(inode);
>> +		inode->i_fop = files->ops;
>> +		inode->i_ino = i;
>> +		d_add(dentry, inode);
>> +	}
>> +
>> +	return 0;
>> +}
> 
> Looks remarkably similar to something I've seen somewhere... fs/libfs.c,
> if I'm not mistaken...
> 
> Sarcasm aside, what's wrong with using simple_fill_super()?
I started out using simple_fill_super() in patch 1. I found that when I use simple_fill_super(), I end up with a
filesystem that userspace cannot create a file in. I looked at other code like shmfs and efivarfs and wrote similar
code, which got me a writable filesystem. I might be missing something basic; if there is a way to use
simple_fill_super() and still support file creation from userspace, I would much rather use simple_fill_super().

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
                   ` (9 preceding siblings ...)
  2022-06-30 11:57 ` [PATCH v2 0/9] Add support for shared PTEs " Mark Hemment
@ 2022-07-02  4:24 ` Andrew Morton
  2022-07-06 19:26   ` Khalid Aziz
  2022-07-08 11:47   ` David Hildenbrand
  10 siblings, 2 replies; 43+ messages in thread
From: Andrew Morton @ 2022-07-02  4:24 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:

> This patch series implements a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. It adds a new
> in-memory filesystem - msharefs. 

Dumb question: why do we need a new filesystem for this?  Is it not
feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-02  4:24 ` Andrew Morton
@ 2022-07-06 19:26   ` Khalid Aziz
  2022-07-08 11:47   ` David Hildenbrand
  1 sibling, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-07-06 19:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 7/1/22 22:24, Andrew Morton wrote:
> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs.
> 
> Dumb question: why do we need a new filesystem for this?  Is it not
> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?

Hi Andrew,

The new filesystem is meant to provide only the control files for sharing PTEs. It contains a file that provides the
alignment/size requirement. Other files are created as named objects to represent shared regions, and when read, these
files provide the size and virtual address of each shared region. Actual shared data is not hosted on msharefs.
Actual data is mmap'ed using anonymous pages, ext4/xfs/btrfs/etc files.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 8/9] mm/mshare: Add basic page table sharing support
  2022-06-29 22:53 ` [PATCH v2 8/9] mm/mshare: Add basic page table sharing support Khalid Aziz
@ 2022-07-07  9:13   ` Xin Hao
  2022-07-07 15:33     ` Khalid Aziz
  0 siblings, 1 reply; 43+ messages in thread
From: Xin Hao @ 2022-07-07  9:13 UTC (permalink / raw)
  To: Khalid Aziz, akpm, willy
  Cc: aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin


On 6/30/22 6:53 AM, Khalid Aziz wrote:
> Add support for creating a new set of shared page tables in a new
> mm_struct upon mmap of an mshare region. Add page fault handling in
> this now mshare'd region. Modify exit_mmap path to make sure page
> tables in the mshare'd regions are kept intact when a process using
> mshare'd region exits. Clean up mshare mm_struct when the mshare
> region is deleted. This support is for the process creating mshare
> region only. Subsequent patches will add support for other processes
> to be able to map the mshare region.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   include/linux/mm.h |   2 +
>   mm/internal.h      |   2 +
>   mm/memory.c        | 101 +++++++++++++++++++++++++++++-
>   mm/mshare.c        | 149 ++++++++++++++++++++++++++++++++++++---------
>   4 files changed, 222 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0ddc3057f73b..63887f06b37b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1859,6 +1859,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>   		unsigned long end, unsigned long floor, unsigned long ceiling);
>   int
>   copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
> +int
> +mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
>   int follow_pte(struct mm_struct *mm, unsigned long address,
>   	       pte_t **ptepp, spinlock_t **ptlp);
>   int follow_pfn(struct vm_area_struct *vma, unsigned long address,
> diff --git a/mm/internal.h b/mm/internal.h
> index 3f2790aea918..6ae7063ac10d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -861,6 +861,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>   
>   DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>   
> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
> +					unsigned long *addrp);
>   static inline bool vma_is_shared(const struct vm_area_struct *vma)
>   {
>   	return vma->vm_flags & VM_SHARED_PT;
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a089145cad4..2a8d5b8928f5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -416,15 +416,20 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   		unlink_anon_vmas(vma);
>   		unlink_file_vma(vma);
>   
> +		/*
> +		 * There is no page table to be freed for vmas that
> +		 * are mapped in mshare regions
> +		 */
>   		if (is_vm_hugetlb_page(vma)) {
>   			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>   				floor, next ? next->vm_start : ceiling);
> -		} else {
> +		} else if (!vma_is_shared(vma)) {
>   			/*
>   			 * Optimization: gather nearby vmas into one call down
>   			 */
>   			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> -			       && !is_vm_hugetlb_page(next)) {
> +			       && !is_vm_hugetlb_page(next)
> +			       && !vma_is_shared(next)) {
>   				vma = next;
>   				next = vma->vm_next;
>   				unlink_anon_vmas(vma);
> @@ -1260,6 +1265,54 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>   	return false;
>   }
>   
> +/*
> + * Copy PTEs for mshare'd pages.
> + * This code is based upon copy_page_range()
> + */
> +int
> +mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> +{
> +	pgd_t *src_pgd, *dst_pgd;
> +	unsigned long next;
> +	unsigned long addr = src_vma->vm_start;
> +	unsigned long end = src_vma->vm_end;
> +	struct mm_struct *dst_mm = dst_vma->vm_mm;
> +	struct mm_struct *src_mm = src_vma->vm_mm;
> +	struct mmu_notifier_range range;
> +	int ret = 0;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> +				0, src_vma, src_mm, addr, end);
> +	mmu_notifier_invalidate_range_start(&range);
> +	/*
> +	 * Disabling preemption is not needed for the write side, as
> +	 * the read side doesn't spin, but goes to the mmap_lock.
> +	 *
> +	 * Use the raw variant of the seqcount_t write API to avoid
> +	 * lockdep complaining about preemptibility.
> +	 */
> +	mmap_assert_write_locked(src_mm);
> +	raw_write_seqcount_begin(&src_mm->write_protect_seq);
> +
> +	dst_pgd = pgd_offset(dst_mm, addr);
> +	src_pgd = pgd_offset(src_mm, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none_or_clear_bad(src_pgd))
> +			continue;
> +		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
> +					    addr, next))) {
> +			ret = -ENOMEM;
> +			break;
> +		}
> +	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
> +
> +	raw_write_seqcount_end(&src_mm->write_protect_seq);
> +	mmu_notifier_invalidate_range_end(&range);
> +
> +	return ret;
> +}
> +
>   int
>   copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>   {
> @@ -1628,6 +1681,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>   	pgd_t *pgd;
>   	unsigned long next;
>   
> +	/*
> +	 * No need to unmap vmas that share page table through
> +	 * mshare region
> +	 */
> +	if (vma_is_shared(vma))
> +		return;
> +
>   	BUG_ON(addr >= end);
>   	tlb_start_vma(tlb, vma);
>   	pgd = pgd_offset(vma->vm_mm, addr);
> @@ -5113,6 +5173,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>   			   unsigned int flags, struct pt_regs *regs)
>   {
>   	vm_fault_t ret;
> +	bool shared = false;
> +	struct mm_struct *orig_mm;
>   
>   	__set_current_state(TASK_RUNNING);
>   
> @@ -5122,6 +5184,16 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>   	/* do counter updates before entering really critical section. */
>   	check_sync_rss_stat(current);
>   
> +	orig_mm = vma->vm_mm;
> +	if (unlikely(vma_is_shared(vma))) {
> +		ret = find_shared_vma(&vma, &address);
> +		if (ret)
> +			return ret;
> +		if (!vma)
> +			return VM_FAULT_SIGSEGV;
> +		shared = true;
If shared is true, does that mean the original vma has been replaced? But the code does not free the
original vma?
> +	}
> +
>   	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>   					    flags & FAULT_FLAG_INSTRUCTION,
>   					    flags & FAULT_FLAG_REMOTE))
> @@ -5139,6 +5211,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>   	else
>   		ret = __handle_mm_fault(vma, address, flags);
>   
> +	/*
> +	 * Release the read lock on shared VMA's parent mm unless
> +	 * __handle_mm_fault released the lock already.
> +	 * __handle_mm_fault sets VM_FAULT_RETRY in return value if
> +	 * it released mmap lock. If lock was released, that implies
> +	 * the lock would have been released on task's original mm if
> +	 * this were not a shared PTE vma. To keep lock state consistent,
> +	 * make sure to release the lock on task's original mm
> +	 */
> +	if (shared) {
> +		int release_mmlock = 1;
> +
> +		if (!(ret & VM_FAULT_RETRY)) {
> +			mmap_read_unlock(vma->vm_mm);
> +			release_mmlock = 0;
> +		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
> +			(flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +			mmap_read_unlock(vma->vm_mm);
> +			release_mmlock = 0;
> +		}
> +
> +		if (release_mmlock)
> +			mmap_read_unlock(orig_mm);
> +	}
> +
>   	if (flags & FAULT_FLAG_USER) {
>   		mem_cgroup_exit_user_fault();
>   		/*
> diff --git a/mm/mshare.c b/mm/mshare.c
> index 90ce0564a138..2ec0e56ffd69 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -15,7 +15,7 @@
>    */
>   
>   #include <linux/fs.h>
> -#include <linux/mount.h>
> +#include <linux/mm.h>
>   #include <linux/syscalls.h>
>   #include <linux/uaccess.h>
>   #include <linux/pseudo_fs.h>
> @@ -24,6 +24,7 @@
>   #include <uapi/linux/limits.h>
>   #include <uapi/linux/mman.h>
>   #include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
>   
>   static struct super_block *msharefs_sb;
>   struct mshare_data {
> @@ -33,6 +34,43 @@ struct mshare_data {
>   	struct mshare_info *minfo;
>   };
>   
> +/* Returns holding the host mm's lock for read.  Caller must release. */
> +vm_fault_t
> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
> +{
> +	struct vm_area_struct *vma, *guest = *vmap;
> +	struct mshare_data *info = guest->vm_private_data;
> +	struct mm_struct *host_mm = info->mm;
> +	unsigned long host_addr;
> +	pgd_t *pgd, *guest_pgd;
> +
> +	mmap_read_lock(host_mm);
> +	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
> +	pgd = pgd_offset(host_mm, host_addr);
> +	guest_pgd = pgd_offset(guest->vm_mm, *addrp);
> +	if (!pgd_same(*guest_pgd, *pgd)) {
> +		set_pgd(guest_pgd, *pgd);
> +		mmap_read_unlock(host_mm);
> +		return VM_FAULT_NOPAGE;
> +	}
> +
> +	*addrp = host_addr;
> +	vma = find_vma(host_mm, host_addr);
> +
> +	/* XXX: expand stack? */
> +	if (vma && vma->vm_start > host_addr)
> +		vma = NULL;
> +
> +	*vmap = vma;
> +
> +	/*
> +	 * release host mm lock unless a matching vma is found
> +	 */
> +	if (!vma)
> +		mmap_read_unlock(host_mm);
> +	return 0;
> +}
> +
>   static const struct inode_operations msharefs_dir_inode_ops;
>   static const struct inode_operations msharefs_file_inode_ops;
>   
> @@ -64,6 +102,14 @@ msharefs_read(struct kiocb *iocb, struct iov_iter *iov)
>   	return ret;
>   }
>   
> +static void
> +msharefs_delmm(struct mshare_data *info)
> +{
> +	mmput(info->mm);
> +	kfree(info->minfo);
> +	kfree(info);
> +}
> +
>   static void
>   msharefs_close(struct vm_area_struct *vma)
>   {
> @@ -73,9 +119,7 @@ msharefs_close(struct vm_area_struct *vma)
>   		mmap_read_lock(info->mm);
>   		if (info->deleted) {
>   			mmap_read_unlock(info->mm);
> -			mmput(info->mm);
> -			kfree(info->minfo);
> -			kfree(info);
> +			msharefs_delmm(info);
>   		} else {
>   			mmap_read_unlock(info->mm);
>   		}
> @@ -90,31 +134,80 @@ static int
>   msharefs_mmap(struct file *file, struct vm_area_struct *vma)
>   {
>   	struct mshare_data *info = file->private_data;
> -	struct mm_struct *mm = info->mm;
> +	struct mm_struct *new_mm = info->mm;
> +	int err = 0;
>   
> -	mmap_write_lock(mm);
> +	mmap_write_lock(new_mm);
>   	/*
> -	 * If this mshare region has been set up once already, bail out
> +	 * If this mshare region has not been set up, set up the
> +	 * applicable address range for the region and prepare for
> +	 * page table sharing
>   	 */
> -	if (mm->mmap_base != 0)
> +	if (new_mm->mmap_base != 0) {
>   		return -EINVAL;
> +	} else {
> +		struct mm_struct *old_mm;
> +		struct vm_area_struct *new_vma;
> +
> +		if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
> +			return -EINVAL;
> +
> +		old_mm = current->mm;
> +		mmap_assert_write_locked(old_mm);
> +		new_mm->mmap_base = vma->vm_start;
> +		new_mm->task_size = vma->vm_end - vma->vm_start;
> +		if (!new_mm->task_size)
> +			new_mm->task_size--;
> +		info->minfo->start = new_mm->mmap_base;
> +		info->minfo->size = new_mm->task_size;
> +		info->deleted = 0;
> +		refcount_inc(&info->refcnt);
> +
> +		/*
> +		 * Mark current VMA as shared and copy over to mshare
> +		 * mm_struct
> +		 */
> +		vma->vm_private_data = info;
> +		new_vma = vm_area_dup(vma);
> +		if (!new_vma) {
> +			vma->vm_private_data = NULL;
> +			mmap_write_unlock(new_mm);
> +			err = -ENOMEM;
> +			goto err_out;
> +		}
> +		vma->vm_flags |= (VM_SHARED_PT|VM_SHARED);
> +		vma->vm_ops = &msharefs_vm_ops;
> +
> +		/*
> +		 * Newly created mshare mapping is anonymous mapping
> +		 */
> +		new_vma->vm_mm = new_mm;
> +		vma_set_anonymous(new_vma);
> +		new_vma->vm_file = NULL;
> +		new_vma->vm_flags &= ~VM_SHARED;
> +
> +		/*
> +		 * Do not use THP for mshare region
> +		 */
> +		new_vma->vm_flags |= VM_NOHUGEPAGE;
> +		err = insert_vm_struct(new_mm, new_vma);
> +		if (err) {
> +			mmap_write_unlock(new_mm);
> +			err = -ENOMEM;
> +			goto err_out;
> +		}
>   
> -	if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
> -		return -EINVAL;
> +		/*
> +		 * Copy over current PTEs
> +		 */
> +		err = mshare_copy_ptes(new_vma, vma);
> +	}
>   
> -	mm->mmap_base = vma->vm_start;
> -	mm->task_size = vma->vm_end - vma->vm_start;
> -	if (!mm->task_size)
> -		mm->task_size--;
> -	mmap_write_unlock(mm);
> -	info->minfo->start = mm->mmap_base;
> -	info->minfo->size = mm->task_size;
> -	info->deleted = 0;
> -	refcount_inc(&info->refcnt);
> -	vma->vm_flags |= VM_SHARED_PT;
> -	vma->vm_private_data = info;
> -	vma->vm_ops = &msharefs_vm_ops;
> -	return 0;
> +	mmap_write_unlock(new_mm);
> +	return err;
> +
> +err_out:
> +	return err;
>   }
>   
>   static const struct file_operations msharefs_file_operations = {
> @@ -291,14 +384,10 @@ msharefs_unlink(struct inode *dir, struct dentry *dentry)
>   	mmap_write_unlock(info->mm);
>   
>   	/*
> -	 * Is this the last reference? If so, delete mshare region and
> -	 * remove the file
> +	 * Is this the last reference? If so, delete mshare region
>   	 */
> -	if (!refcount_dec_and_test(&info->refcnt)) {
> -		mmput(info->mm);
> -		kfree(info->minfo);
> -		kfree(info);
> -	}
> +	if (refcount_dec_and_test(&info->refcnt))
> +		msharefs_delmm(info);
>   	return 0;
>   }
>   

-- 
Best Regards!
Xin Hao


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 8/9] mm/mshare: Add basic page table sharing support
  2022-07-07  9:13   ` Xin Hao
@ 2022-07-07 15:33     ` Khalid Aziz
  0 siblings, 0 replies; 43+ messages in thread
From: Khalid Aziz @ 2022-07-07 15:33 UTC (permalink / raw)
  To: xhao, akpm, willy
  Cc: aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 7/7/22 03:13, Xin Hao wrote:
> 
> On 6/30/22 6:53 AM, Khalid Aziz wrote:
>> Add support for creating a new set of shared page tables in a new
>> mm_struct upon mmap of an mshare region. Add page fault handling in
>> this now mshare'd region. Modify exit_mmap path to make sure page
>> tables in the mshare'd regions are kept intact when a process using
>> mshare'd region exits. Clean up mshare mm_struct when the mshare
>> region is deleted. This support is for the process creating mshare
>> region only. Subsequent patches will add support for other processes
>> to be able to map the mshare region.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>   include/linux/mm.h |   2 +
>>   mm/internal.h      |   2 +
>>   mm/memory.c        | 101 +++++++++++++++++++++++++++++-
>>   mm/mshare.c        | 149 ++++++++++++++++++++++++++++++++++++---------
>>   4 files changed, 222 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 0ddc3057f73b..63887f06b37b 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1859,6 +1859,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>>           unsigned long end, unsigned long floor, unsigned long ceiling);
>>   int
>>   copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
>> +int
>> +mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
>>   int follow_pte(struct mm_struct *mm, unsigned long address,
>>              pte_t **ptepp, spinlock_t **ptlp);
>>   int follow_pfn(struct vm_area_struct *vma, unsigned long address,
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 3f2790aea918..6ae7063ac10d 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -861,6 +861,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>>   DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
>> +                    unsigned long *addrp);
>>   static inline bool vma_is_shared(const struct vm_area_struct *vma)
>>   {
>>       return vma->vm_flags & VM_SHARED_PT;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7a089145cad4..2a8d5b8928f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -416,15 +416,20 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>           unlink_anon_vmas(vma);
>>           unlink_file_vma(vma);
>> +        /*
>> +         * There is no page table to be freed for vmas that
>> +         * are mapped in mshare regions
>> +         */
>>           if (is_vm_hugetlb_page(vma)) {
>>               hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>>                   floor, next ? next->vm_start : ceiling);
>> -        } else {
>> +        } else if (!vma_is_shared(vma)) {
>>               /*
>>                * Optimization: gather nearby vmas into one call down
>>                */
>>               while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>> -                   && !is_vm_hugetlb_page(next)) {
>> +                   && !is_vm_hugetlb_page(next)
>> +                   && !vma_is_shared(next)) {
>>                   vma = next;
>>                   next = vma->vm_next;
>>                   unlink_anon_vmas(vma);
>> @@ -1260,6 +1265,54 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>       return false;
>>   }
>> +/*
>> + * Copy PTEs for mshare'd pages.
>> + * This code is based upon copy_page_range()
>> + */
>> +int
>> +mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> +{
>> +    pgd_t *src_pgd, *dst_pgd;
>> +    unsigned long next;
>> +    unsigned long addr = src_vma->vm_start;
>> +    unsigned long end = src_vma->vm_end;
>> +    struct mm_struct *dst_mm = dst_vma->vm_mm;
>> +    struct mm_struct *src_mm = src_vma->vm_mm;
>> +    struct mmu_notifier_range range;
>> +    int ret = 0;
>> +
>> +    mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
>> +                0, src_vma, src_mm, addr, end);
>> +    mmu_notifier_invalidate_range_start(&range);
>> +    /*
>> +     * Disabling preemption is not needed for the write side, as
>> +     * the read side doesn't spin, but goes to the mmap_lock.
>> +     *
>> +     * Use the raw variant of the seqcount_t write API to avoid
>> +     * lockdep complaining about preemptibility.
>> +     */
>> +    mmap_assert_write_locked(src_mm);
>> +    raw_write_seqcount_begin(&src_mm->write_protect_seq);
>> +
>> +    dst_pgd = pgd_offset(dst_mm, addr);
>> +    src_pgd = pgd_offset(src_mm, addr);
>> +    do {
>> +        next = pgd_addr_end(addr, end);
>> +        if (pgd_none_or_clear_bad(src_pgd))
>> +            continue;
>> +        if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
>> +                        addr, next))) {
>> +            ret = -ENOMEM;
>> +            break;
>> +        }
>> +    } while (dst_pgd++, src_pgd++, addr = next, addr != end);
>> +
>> +    raw_write_seqcount_end(&src_mm->write_protect_seq);
>> +    mmu_notifier_invalidate_range_end(&range);
>> +
>> +    return ret;
>> +}
>> +
>>   int
>>   copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   {
>> @@ -1628,6 +1681,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>       pgd_t *pgd;
>>       unsigned long next;
>> +    /*
>> +     * No need to unmap vmas that share page table through
>> +     * mshare region
>> +     */
>> +    if (vma_is_shared(vma))
>> +        return;
>> +
>>       BUG_ON(addr >= end);
>>       tlb_start_vma(tlb, vma);
>>       pgd = pgd_offset(vma->vm_mm, addr);
>> @@ -5113,6 +5173,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>                  unsigned int flags, struct pt_regs *regs)
>>   {
>>       vm_fault_t ret;
>> +    bool shared = false;
>> +    struct mm_struct *orig_mm;
>>       __set_current_state(TASK_RUNNING);
>> @@ -5122,6 +5184,16 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>       /* do counter updates before entering really critical section. */
>>       check_sync_rss_stat(current);
>> +    orig_mm = vma->vm_mm;
>> +    if (unlikely(vma_is_shared(vma))) {
>> +        ret = find_shared_vma(&vma, &address);
>> +        if (ret)
>> +            return ret;
>> +        if (!vma)
>> +            return VM_FAULT_SIGSEGV;
>> +        shared = true;
> If shared is true, does that mean the original vma has been replaced? But the code does not free the original vma?

The original vma is not replaced. Only the pointer to the vma used for processing the mm fault is updated. The original
vma continues to be part of the process's vma chain and should not be freed.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-02  4:24 ` Andrew Morton
  2022-07-06 19:26   ` Khalid Aziz
@ 2022-07-08 11:47   ` David Hildenbrand
  2022-07-08 19:36     ` Khalid Aziz
  1 sibling, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2022-07-08 11:47 UTC (permalink / raw)
  To: Andrew Morton, Khalid Aziz
  Cc: willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 02.07.22 06:24, Andrew Morton wrote:
> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs. 
> 
> Dumb question: why do we need a new filesystem for this?  Is it not
> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
> 

IIRC, the general opinion at LSF/MM was that the approach at hand makes
people nervous, and I at least am not convinced that we really want to
have this upstream.

What's *completely* missing from the cover letter are the dirty details:
"Actual data is mmap'ed using anonymous pages, ext4/xfs/btrfs/etc
files.". Gah.

As raised already, "anonymous pages" makes me shiver.


(From what I read, this looks like an RFC to me, yet I see "v2". I am
a little confused why most of the feedback at LSF/MM seems to have been
ignored and people are moving forward with this approach. But maybe my
memory is wrong.)

Please, let's look into more generic page table sharing just like
hugetlb already provides to some degree. And if we need additional
alignment requirements to share multiple page table levels, then let's
look into that as well for special users.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-08 11:47   ` David Hildenbrand
@ 2022-07-08 19:36     ` Khalid Aziz
  2022-07-13 14:00       ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Khalid Aziz @ 2022-07-08 19:36 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Mike Kravetz
  Cc: willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 7/8/22 05:47, David Hildenbrand wrote:
> On 02.07.22 06:24, Andrew Morton wrote:
>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>>> This patch series implements a mechanism in kernel to allow
>>> userspace processes to opt into sharing PTEs. It adds a new
>>> in-memory filesystem - msharefs.
>>
>> Dumb question: why do we need a new filesystem for this?  Is it not
>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>
> 
> IIRC, the general opinion at LSF/MM was that the approach at hand makes
> people nervous, and I at least am not convinced that we really want to
> have this upstream.

Hi David,

You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs,
and this just extends that concept to processes. A number of people have commented on the potential usefulness of this
concept and implementation. There were concerns raised about being able to make this safe and reliable. I had agreed to
send a second version of the patch incorporating feedback from the last review and LSF/MM, and that is what the v2
patch is about. The suggestion to extend hugetlb PMD sharing was discussed briefly. The conclusion from that discussion
and earlier discussion on the mailing list was that hugetlb PMD sharing is built with special-case code in too many
places in the kernel, and it is better to replace it with something more general purpose than to build even more on it.
Mike can correct me if I got that wrong.

> 
> What's *completely* missing from the cover letter are the dirty details:
> "Actual data is mmap'ed using anonymous pages, ext4/xfs/btrfs/etc
> files.". Gah.

Yes, the cover letter in the v2 patch was a little lacking. I will add more details next time.

> 
> As raised already, "anonymous pages" makes me shiver.
> 
> 
> (To me, what I read, this looks like an RFC to me, yet I see "v2". But I
> am a little confused why most of the feedback at LSF/MM seems to be
> ignored and people are moving forward with this approach. But maybe my
> memory is wrong.)
> 
> Please, let's look into more generic page table sharing just like
> hugetlb already provides to some degree. And if we need additional
> alignment requirements to share multiple page table levels, then let's
> look into that as well for special users.
> 

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-08 19:36     ` Khalid Aziz
@ 2022-07-13 14:00       ` David Hildenbrand
  2022-07-13 17:58         ` Mike Kravetz
  2022-07-14 22:02         ` Khalid Aziz
  0 siblings, 2 replies; 43+ messages in thread
From: David Hildenbrand @ 2022-07-13 14:00 UTC (permalink / raw)
  To: Khalid Aziz, Andrew Morton, Mike Kravetz
  Cc: willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 08.07.22 21:36, Khalid Aziz wrote:
> On 7/8/22 05:47, David Hildenbrand wrote:
>> On 02.07.22 06:24, Andrew Morton wrote:
>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>
>>>> This patch series implements a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. It adds a new
>>>> in-memory filesystem - msharefs.
>>>
>>> Dumb question: why do we need a new filesystem for this?  Is it not
>>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>>
>>
>> IIRC, the general opinion at LSF/MM was that this approach at hand
>> makes people nervous and I at least am not convinced that we really want
>> to have this upstream.
> 
> Hi David,

Hi Khalid,

> 
> You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and 
> this just extends that concept to processes.

They share a *mm* including a consistent virtual memory layout (VMA
list). Page table sharing is just a side product of that. You could even
call page tables just an implementation detail to produce that
consistent virtual memory layout -- described for that MM via a
different data structure.

> A number of people have commented on potential usefulness of this concept 
> and implementation.

... and a lot of people raised concerns. Yes, page table sharing to
reduce memory consumption/tlb misses/... is something reasonable to
have. But that doesn't require mshare, as hugetlb has proven.

The design might be useful for a handful of corner (!) cases, but as the
cover letter only talks about memory consumption of page tables, I'll
not care about those. Once these corner cases are explained and deemed
important, we might want to think of possible alternatives to explore
the solution space.

> There were concerns raised about being able to make this safe and reliable.
> I had agreed to send a 
> second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The 

Okay, most of the changes I saw are related to the user interface, not
to any of the actual dirty implementation-detail concerns. And the cover
letter is not really clear what's actually happening under the hood and
what the (IMHO) weird semantics of the design imply (as can be seen from
Andrew's reply).

> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion 
> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is 
> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that 
> wrong.

Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
and asked the honest question if we can just remove it and replace it by
something generic in the future. And as I learned, we most probably
cannot rip that out without affecting existing user space. Even
replacing it by mshare() would degrade existing user space.

So the natural thing to reduce page table consumption (again, what this
cover letter talks about) for user space (semi- ?)automatically for
MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
code to cache and reuse page tables (PTE and PMD tables should be
sufficient) where suitable.

For reasonably aligned mappings and mapping sizes, it shouldn't be too
hard (I know, locking ...), to cache and reuse page tables attached to
files -- similar to what hugetlb does, just in a generic way. We might
want a mechanism to enable/disable this for specific processes and/or
VMAs, but these are minor details.

And that could come for free for existing user space, because page
tables, and how they are handled, would just be an implementation detail.


I'd be really interested in what major roadblocks/downsides
file-based page table sharing has. Because I am not convinced that a
mechanism like mshare() -- that has to be explicitly implemented+used by
user space -- is required for that.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-13 14:00       ` David Hildenbrand
@ 2022-07-13 17:58         ` Mike Kravetz
  2022-07-13 18:03           ` David Hildenbrand
  2022-07-14 22:02         ` Khalid Aziz
  1 sibling, 1 reply; 43+ messages in thread
From: Mike Kravetz @ 2022-07-13 17:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Andrew Morton, willy, aneesh.kumar, arnd, 21cnbao,
	corbet, dave.hansen, ebiederm, hagen, jack, keescook, kirill,
	kucharsk, linkinjeon, linux-fsdevel, linux-kernel, linux-mm,
	longpeng2, luto, markhemm, pcc, rppt, sieberf, sjpark, surenb,
	tst, yzaikin

On 07/13/22 16:00, David Hildenbrand wrote:
> On 08.07.22 21:36, Khalid Aziz wrote:
> > On 7/8/22 05:47, David Hildenbrand wrote:
> >> On 02.07.22 06:24, Andrew Morton wrote:
> >>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
> > suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion 
> > on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is 
> > better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that 
> > wrong.
> 
> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
> and asked the honest question if we can just remove it and replace it by
> something generic in the future. And as I learned, we most probably
> cannot rip that out without affecting existing user space. Even
> replacing it by mshare() would degrade existing user space.
> 
> So the natural thing to reduce page table consumption (again, what this
> cover letter talks about) for user space (semi- ?)automatically for
> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
> code to cache and reuse page tables (PTE and PMD tables should be
> sufficient) where suitable.
> 
> For reasonably aligned mappings and mapping sizes, it shouldn't be too
> hard (I know, locking ...), to cache and reuse page tables attached to
> files -- similar to what hugetlb does, just in a generic way. We might
> want a mechanism to enable/disable this for specific processes and/or
> VMAs, but these are minor details.
> 
> And that could come for free for existing user space, because page
> tables, and how they are handled, would just be an implementation detail.
> 
> 
> I'd be really interested in what major roadblocks/downsides
> file-based page table sharing has. Because I am not convinced that a
> mechanism like mshare() -- that has to be explicitly implemented+used by
> user space -- is required for that.

Perhaps this is an 'opportunity' for me to write up in detail how
hugetlb pmd sharing works.  As you know, I have been struggling with
keeping that working AND safe AND performant.  Who knows, this may lead
to changes in the existing implementation.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-13 17:58         ` Mike Kravetz
@ 2022-07-13 18:03           ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2022-07-13 18:03 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Khalid Aziz, Andrew Morton, willy, aneesh.kumar, arnd, 21cnbao,
	corbet, dave.hansen, ebiederm, hagen, jack, keescook, kirill,
	kucharsk, linkinjeon, linux-fsdevel, linux-kernel, linux-mm,
	longpeng2, luto, markhemm, pcc, rppt, sieberf, sjpark, surenb,
	tst, yzaikin

On 13.07.22 19:58, Mike Kravetz wrote:
> On 07/13/22 16:00, David Hildenbrand wrote:
>> On 08.07.22 21:36, Khalid Aziz wrote:
>>> On 7/8/22 05:47, David Hildenbrand wrote:
>>>> On 02.07.22 06:24, Andrew Morton wrote:
>>>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion 
>>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is 
>>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that 
>>> wrong.
>>
>> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
>> and asked the honest question if we can just remove it and replace it by
>> something generic in the future. And as I learned, we most probably
>> cannot rip that out without affecting existing user space. Even
>> replacing it by mshare() would degrade existing user space.
>>
>> So the natural thing to reduce page table consumption (again, what this
>> cover letter talks about) for user space (semi- ?)automatically for
>> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
>> code to cache and reuse page tables (PTE and PMD tables should be
>> sufficient) where suitable.
>>
>> For reasonably aligned mappings and mapping sizes, it shouldn't be too
>> hard (I know, locking ...), to cache and reuse page tables attached to
>> files -- similar to what hugetlb does, just in a generic way. We might
>> want a mechanism to enable/disable this for specific processes and/or
>> VMAs, but these are minor details.
>>
>> And that could come for free for existing user space, because page
>> tables, and how they are handled, would just be an implementation detail.
>>
>>
>> I'd be really interested in what major roadblocks/downsides
>> file-based page table sharing has. Because I am not convinced that a
>> mechanism like mshare() -- that has to be explicitly implemented+used by
>> user space -- is required for that.
> 
> Perhaps this is an 'opportunity' for me to write up in detail how
> hugetlb pmd sharing works.  As you know, I have been struggling with
> keeping that working AND safe AND performant. 

Yes, and I have your locking-related changes in my inbox marked as "to
be reviewed" :D Shedding some light on that would be highly appreciated,
especially how hugetlb-specific it currently is and for what reason.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-13 14:00       ` David Hildenbrand
  2022-07-13 17:58         ` Mike Kravetz
@ 2022-07-14 22:02         ` Khalid Aziz
  2022-07-18 12:59           ` David Hildenbrand
  1 sibling, 1 reply; 43+ messages in thread
From: Khalid Aziz @ 2022-07-14 22:02 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Mike Kravetz
  Cc: willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 7/13/22 08:00, David Hildenbrand wrote:
> On 08.07.22 21:36, Khalid Aziz wrote:
>> On 7/8/22 05:47, David Hildenbrand wrote:
>>> On 02.07.22 06:24, Andrew Morton wrote:
>>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>>> This patch series implements a mechanism in kernel to allow
>>>>> userspace processes to opt into sharing PTEs. It adds a new
>>>>> in-memory filesystem - msharefs.
>>>>
>>>> Dumb question: why do we need a new filesystem for this?  Is it not
>>>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>>>
>>>
>>> IIRC, the general opinion at LSF/MM was that this approach at hand
>>> makes people nervous and I at least am not convinced that we really want
>>> to have this upstream.
>>
>> Hi David,
> 
> Hi Khalid,
> 
>>
>> You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and
>> this just extends that concept to processes.
> 
> They share a *mm* including a consistent virtual memory layout (VMA
> list). Page table sharing is just a side product of that. You could even
> call page tables just an implementation detail to produce that
> consistent virtual memory layout -- described for that MM via a
> different data structure.

Yes, sharing an mm and vma chain does make it different from an implementation point of view.

> 
>> A number of people have commented on potential usefulness of this concept
>> and implementation.
> 
> ... and a lot of people raised concerns. Yes, page table sharing to
> reduce memory consumption/tlb misses/... is something reasonable to
> have. But that doesn't require mshare, as hugetlb has proven.
> 
> The design might be useful for a handful of corner (!) cases, but as the
> cover letter only talks about memory consumption of page tables, I'll
> not care about those. Once these corner cases are explained and deemed
> important, we might want to think of possible alternatives to explore
> the solution space.

Memory consumption by page tables is turning out to be a significant issue. I mentioned one real-world example from a 
customer where a 300GB SGA on a 512GB server resulted in an out-of-memory condition when 1500+ processes tried to map 
parts of the SGA into their address space. Some customers are able to solve this issue by switching to hugetlbfs, but 
that is not feasible for everyone.

> 
>> There were concerns raised about being able to make this safe and reliable.
>> I had agreed to send a
>> second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The
> 
> Okay, most of the changes I saw are related to the user interface, not
> to any of the actual dirty implementation-detail concerns. And the cover
> letter is not really clear what's actually happening under the hood and
> what the (IMHO) weird semantics of the design imply (as can be seen from
> Andrews reply).

Sure, I will add more details to the cover letter next time. msharefs needs more explanation. I will highlight the 
creation of a new mm struct for mshare regions that is not owned by any process. There was another under-the-hood change 
that is listed in the changelog but could have been highlighted: "Eliminated the need to point vm_mm in original vmas to 
the newly synthesized mshare mm". How the fields in the new mm struct are used made this change possible, and that could 
use more detail in the cover letter.

> 
>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
>> wrong.
> 
> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
> and asked the honest question if we can just remove it and replace it by
> something generic in the future. And as I learned, we most probably
> cannot rip that out without affecting existing user space. Even
> replacing it by mshare() would degrade existing user space.
> 
> So the natural thing to reduce page table consumption (again, what this
> cover letter talks about) for user space (semi- ?)automatically for
> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
> code to cache and reuse page tables (PTE and PMD tables should be
> sufficient) where suitable.
> 
> For reasonably aligned mappings and mapping sizes, it shouldn't be too
> hard (I know, locking ...), to cache and reuse page tables attached to
> files -- similar to what hugetlb does, just in a generic way. We might
> want a mechanism to enable/disable this for specific processes and/or
> VMAs, but these are minor details.
> 
> And that could come for free for existing user space, because page
> tables, and how they are handled, would just be an implementation detail.
> 
> 
> I'd be really interested in what major roadblocks/downsides
> file-based page table sharing has. Because I am not convinced that a
> mechanism like mshare() -- that has to be explicitly implemented+used by
> user space -- is required for that.
> 

I see two parts to what you are suggesting (please correct me if I get this wrong):

1. Implement a generic page table sharing mechanism
2. Implement a way to use this mechanism from userspace

For 1, your suggestion seems to be to extract the page table sharing code from hugetlb and make it generic. My approach 
is to create a special mm struct to host the shared page tables and to make a minimal set of changes to simply get PTEs 
from this special mm struct whenever a shared VMA is accessed. There may be value in extracting the hugetlb page table 
sharing code and recasting it into this framework of a special mm struct. I will look some more into it.

As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be 
file-based like hugetlb? That is worth considering, but is limiting page table sharing to just file objects reasonable? A 
goal for the mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. The idea 
being: if you can map it, you can share it with shared page tables. Maybe that is too ambitious a goal, and I am open to 
course correction.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] Add support for shared PTEs across processes
  2022-07-14 22:02         ` Khalid Aziz
@ 2022-07-18 12:59           ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2022-07-18 12:59 UTC (permalink / raw)
  To: Khalid Aziz, Andrew Morton, Mike Kravetz
  Cc: willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

[sorry for not being as responsive as I usually am]

>>
>> They share a *mm* including a consistent virtual memory layout (VMA
>> list). Page table sharing is just a side product of that. You could even
>> call page tables just an implementation detail to produce that
>> consistent virtual memory layout -- described for that MM via a
>> different data structure.
> 
> Yes, sharing an mm and vma chain does make it different from implementation point of view.
> 
>>
>>> A number of people have commented on potential usefulness of this concept
>>> and implementation.
>>
>> ... and a lot of people raised concerns. Yes, page table sharing to
>> reduce memory consumption/tlb misses/... is something reasonable to
>> have. But that doesn't require mshare, as hugetlb has proven.
>>
>> The design might be useful for a handful of corner (!) cases, but as the
>> cover letter only talks about memory consumption of page tables, I'll
>> not care about those. Once these corner cases are explained and deemed
>> important, we might want to think of possible alternatives to explore
>> the solution space.
> 
> Memory consumption by page tables is turning out to be significant issue. I mentioned one real-world example from a 
> customer where a 300GB SGA on a 512GB server resulted in OOM when 1500+ processes tried to map parts of the SGA into 
> their address space. Some customers are able to solve this issue by switching to hugetlbfs but that is not feasible for 
> every one.

Yes. Another use case I am aware of is KVM-based virtual machines, where
VM memory (shmem, file-backed) is not only mapped into the emulator
process, but also into other processes used to carry out I/O (e.g.,
vhost-user).

In that case, it's tempting to simply share the page tables between all
processes for the shared mapping -- automatically, just like
shmem/hugetlb already does.

[...]

>>
>>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
>>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
>>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
>>> wrong.
>>
>> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
>> and asked the honest question if we can just remove it and replace it by
>> something generic in the future. And as I learned, we most probably
>> cannot rip that out without affecting existing user space. Even
>> replacing it by mshare() would degrade existing user space.
>>
>> So the natural thing to reduce page table consumption (again, what this
>> cover letter talks about) for user space (semi- ?)automatically for
>> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
>> code to cache and reuse page tables (PTE and PMD tables should be
>> sufficient) where suitable.
>>
>> For reasonably aligned mappings and mapping sizes, it shouldn't be too
>> hard (I know, locking ...), to cache and reuse page tables attached to
>> files -- similar to what hugetlb does, just in a generic way. We might
>> want a mechanism to enable/disable this for specific processes and/or
>> VMAs, but these are minor details.
>>
>> And that could come for free for existing user space, because page
>> tables, and how they are handled, would just be an implementation detail.
>>
>>
>> I'd be really interested in what major roadblocks/downsides
>> file-based page table sharing has. Because I am not convinced that a
>> mechanism like mshare() -- that has to be explicitly implemented+used by
>> user space -- is required for that.
>>
> 
> I see two parts to what you are suggesting (please correct me if I get this wrong):
> 
> 1. Implement a generic page table sharing mechanism
> 2. Implement a way to use this mechanism from userspace

Yes. Whereby 2) would usually just be some heuristic (e.g., file > X
MiB -> start sharing), with an additional way to explicitly disable it
or enable it. But yes, most of that stuff should just be automatic.

> 
> For 1, your suggestion seems to be extract the page table sharing code from hugetlb and make it generic. My approach is 
> to create a special mm struct to host the shared page tables and create a minimal set of changes to simply get PTEs from 
> this special mm struct whenever a shared VMA is accessed. There may be value to extracting hugetlb page table sharing 
> code and recasting it into this framework of special mm struct. I will look some more into it.

The basic idea would be that whenever a MAP_SHARED VMA has a reasonable
size, is aligned in a suitable way (including the MAP offset), and the
protections match, you can just share PTE tables and even PMD tables. As
page tables of shared mappings usually don't really store per-process
information (exceptions I am aware of are userfaultfd and softdirty
tracking), we can simply share/unshare page tables of shared mappings
fairly easily.

Then, you'd have e.g., 2 sets of page tables cached by the fd that can
be mapped into processes:

1) PROT_READ|PROT_WRITE
2) PROT_READ

On VMA protection changes, one would have to unshare (unmap the page
table) and either map another shared one, or map a private one. I don't
think there would be a need to optimize, e.g., for PROT_NONE, but of
course, other combinations could make sense to cache.


PROT_NONE and other corner cases (softdirty tracking) would simply not
use shared page tables.

Shared page tables would have to be refcounted, and one could, e.g.,
implement a shrinker that frees unused page tables in the fd cache when
memory reclaim kicks in.

With something like that in place, page table consumption could be
reduced and vmscan/rmap walks could turn out more efficient.

> 
> As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be file 
> based like hugetlb? That is worth considering but is limiting page table sharing to just file objects reasonable? A goal 
> for mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. Idea being if you 
> can map it, you can share it with shared page tables. Maybe that is too ambitious a goal and I am open to course correction.


We can glue it to the file or anything else that's shared, I think -- I
don't think we particularly care, as long as it's something shared between
processes to be mapped. And to be quite honest, whenever I read about
anonymous memory (i.e., MAP_PRIVATE) I hear my inner voice screaming:
just use *shared* memory when you want to *share* memory between
processes, and optimize that if anything is missing.


Having that said, I understood from previous discussions that there is a
use case of efficient read-only protection across many processes/VMAs. I
was wondering if that could be handled on the fs-level (pte_mkwrite). I
remember I raised the idea before: if one could have a
userfaultfd-wp-style (overlay?) file (system?), user-space could
protect/unprotect file pages via a different mechanism (ioctl) and get
notified about write access via something similar to userfaultfd
user-space handlers, not via signals. Instead of adjusting VMAs, one
would only adjust file page mappings to map the relevant pages R/O when
protecting -- if page tables are shared, that would be efficient.


Now, that is just a very vague brain dump to get it out of my
(overloaded) system. What I think the overall message is: let's try not
to design new features around page table sharing; let's use page table
sharing as an rmap performance optimization and as a mechanism to reduce
page table overhead. I hope what I said makes some sense; I might just
be wrong.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-07-18 13:00 UTC | newest]

Thread overview: 43+ messages
2022-06-29 22:53 [PATCH v2 0/9] Add support for shared PTEs across processes Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 1/9] mm: Add msharefs filesystem Khalid Aziz
2022-06-30 21:53   ` Darrick J. Wong
2022-07-01 16:05     ` Khalid Aziz
2022-06-30 22:57   ` Al Viro
2022-07-01 16:08     ` Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 2/9] mm/mshare: pre-populate msharefs with information file Khalid Aziz
2022-06-30 21:37   ` Darrick J. Wong
2022-06-30 22:54     ` Khalid Aziz
2022-06-30 23:01   ` Al Viro
2022-07-01 16:11     ` Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 3/9] mm/mshare: make msharefs writable and support directories Khalid Aziz
2022-06-30 21:34   ` Darrick J. Wong
2022-06-30 22:49     ` Khalid Aziz
2022-06-30 23:09   ` Al Viro
2022-07-02  0:22     ` Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 4/9] mm/mshare: Add a read operation for msharefs files Khalid Aziz
2022-06-30 21:27   ` Darrick J. Wong
2022-06-30 22:27     ` Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 5/9] mm/mshare: Add vm flag for shared PTE Khalid Aziz
2022-06-30 14:59   ` Mark Hemment
2022-06-30 15:46     ` Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 6/9] mm/mshare: Add mmap operation Khalid Aziz
2022-06-30 21:44   ` Darrick J. Wong
2022-06-30 23:30     ` Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 7/9] mm/mshare: Add unlink and munmap support Khalid Aziz
2022-06-30 21:50   ` Darrick J. Wong
2022-07-01 15:58     ` Khalid Aziz
2022-06-29 22:53 ` [PATCH v2 8/9] mm/mshare: Add basic page table sharing support Khalid Aziz
2022-07-07  9:13   ` Xin Hao
2022-07-07 15:33     ` Khalid Aziz
2022-06-29 22:54 ` [PATCH v2 9/9] mm/mshare: Enable mshare region mapping across processes Khalid Aziz
2022-06-30 11:57 ` [PATCH v2 0/9] Add support for shared PTEs " Mark Hemment
2022-06-30 15:39   ` Khalid Aziz
2022-07-02  4:24 ` Andrew Morton
2022-07-06 19:26   ` Khalid Aziz
2022-07-08 11:47   ` David Hildenbrand
2022-07-08 19:36     ` Khalid Aziz
2022-07-13 14:00       ` David Hildenbrand
2022-07-13 17:58         ` Mike Kravetz
2022-07-13 18:03           ` David Hildenbrand
2022-07-14 22:02         ` Khalid Aziz
2022-07-18 12:59           ` David Hildenbrand
