* [PATCH v1 00/14] Add support for shared PTEs across processes
@ 2022-04-11 16:05 Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 01/14] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
                   ` (17 more replies)
  0 siblings, 18 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Page tables in the kernel consume some memory, and as long as the
number of mappings being maintained is small, the space consumed by
page tables is not objectionable. When very few memory pages are
shared between processes, the number of page table entries (PTEs) to
maintain is mostly constrained by the number of pages of memory on the
system. As the number of shared pages and the number of times pages
are shared goes up, the amount of memory consumed by page tables
starts to become significant.

Some field deployments commonly see memory pages shared across 1000s
of processes. On x86_64, each page requires a PTE that is only 8 bytes
long, which is very small compared to the 4K page size. When 2000
processes map the same page into their address space, each one of them
requires 8 bytes for its PTE and together that adds up to 8K of memory
just to hold the PTEs for one 4K page. On a database server with a
300GB SGA, a system crash with an out-of-memory condition was seen
when 1500+ clients tried to share this SGA, even though the system had
512GB of memory. On this server, the worst case scenario of all 1500
processes mapping every page of the SGA would have required 878GB+ for
just the PTEs. If these PTEs could be shared, the amount of memory
saved would be very significant.

This patch series implements a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. It adds two new system
calls - (1) mshare(), which a process can use to create a region (we
will call it an mshare'd region) that other processes can then map
using shared PTEs, and (2) mshare_unlink(), which is used to detach
from an mshare'd region. Once an mshare'd region is created, other
processes, assuming they have the right permissions, can make the
mshare() system call to map the shared pages into their address
space using the shared PTEs. When a process is done using an
mshare'd region, it makes an mshare_unlink() system call to end its
access. When the last process accessing an mshare'd region calls
mshare_unlink(), the mshare'd region is torn down and the memory
used by it is freed.


API
===

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing, mshare'd
region that will be shared at the PTE level. "name" refers to the
shared object name that exists under /sys/fs/mshare. "addr" is the
starting address of this shared memory area and "length" is the size
of the area. oflags can include:

- O_RDONLY opens the shared memory area for read only access by everyone
- O_RDWR opens the shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL returns an error if O_CREAT was also specified and a shared
  memory area already exists with that name

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.

PTEs are shared at the pgdir level, which imposes the following
requirements on the address and size given to mshare():

- Starting address must be aligned to pgdir size (512GB on x86_64).
  This alignment value can be looked up in /proc/sys/vm/mshare_size
  (see the sketch after this list)
- Size must be a multiple of pgdir size
- Any mappings created in this address range at any time become
  shared automatically
- Shared address range can have unmapped addresses in it. Any access
  to an unmapped address will result in SIGBUS
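
A minimal sketch of checking these requirements from userspace,
assuming /proc/sys/vm/mshare_size holds a single decimal value (the
exact format of this file is an assumption here):

-----------------
        unsigned long mshare_size;
        FILE *fp = fopen("/proc/sys/vm/mshare_size", "r");

        if (!fp || fscanf(fp, "%lu", &mshare_size) != 1) {
                perror("ERROR: could not read mshare_size");
                exit(1);
        }
        fclose(fp);

        /* addr and len passed to mshare() must be multiples of this */
        if ((addr % mshare_size) || (len % mshare_size)) {
                fprintf(stderr, "addr/len not pgdir aligned\n");
                exit(1);
        }
-----------------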

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in
the shared area. All other processes that want mshare'd access to
this memory area can do so by calling mshare(). After this call, the
address range given to mshare() becomes a shared range in their
address space. Anonymous mappings will be shared and not COWed.

A file under /sys/fs/mshare can be opened and read from. A read from
this file returns a struct mshare_info containing two values - (1) the
starting address, and (2) the size of the mshare'd region.

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink(), which removes the shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

mshare_unlink() returns 0 on success or -1 on error.


Example Code
============
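
The new syscalls have no libc wrappers yet, so the snippets below
invoke them through syscall(2). A minimal preamble for the snippets
is sketched here; the syscall numbers are the x86_64 numbers assigned
by this series, while the TB()/GB() size helpers are hypothetical
conveniences:

-----------------
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mman.h>			/* for struct mshare_info */

#define MSHARE_SYSCALL		451	/* x86_64, from this series */
#define MSHARE_UNLINK_SYSCALL	452

/* hypothetical helpers to express sizes used in the snippets */
#define GB(x)	((unsigned long)(x) << 30)
#define TB(x)	((unsigned long)(x) << 40)
-----------------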

A snippet of the code that a donor process would run looks like this:

-----------------
        addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
                perror("ERROR: mmap failed");
                exit(1);
        }

        err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
			GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        strncpy(addr, "Some random shared text",
			sizeof("Some random shared text"));
-----------------

A snippet of the code that a consumer process would execute looks like
this:

-----------------
        struct mshare_info minfo;

        fd = open("testregion", O_RDONLY);
        if (fd < 0) {
                perror("open failed");
                exit(1);
        }

        if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
                printf("INFO: %ld bytes shared at addr 0x%lx \n",
				minfo.size, minfo.start);
        else
                perror("read failed");

        close(fd);

        addr = (void *)minfo.start;
        err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
			O_RDWR, 0600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        printf("Guest mmap at %px:\n", addr);
        printf("%s\n", addr);
	printf("\nDone\n");

        err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
        if (err < 0) {
                perror("mshare_unlink() failed");
                exit(1);
        }
-----------------


Patch series
============

This series of patches is an initial implementation of these two
system calls. The code implements basic working functionality.

The prototypes for the two syscalls are:

SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
		unsigned long, len, int, oflag, mode_t, mode)

SYSCALL_DEFINE1(mshare_unlink, const char *, name)

In order to facilitate page table sharing, this implementation adds a
new in-memory filesystem - msharefs, which is mounted at
/sys/fs/mshare. When a new mshare'd region is created, a file with
the name given by the initial mshare() call is created under this
filesystem. Permissions for this file are given by the "mode"
parameter to mshare(). The only operation supported on this file is
read. A read from this file returns a structure containing
information about the mshare'd region - (1) the starting virtual
address of the region, and (2) the size of the mshare'd region.
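
The structure returned by the read is the struct mshare_info that
patch 03 of this series adds to include/uapi/linux/mman.h:

struct mshare_info {
	unsigned long start;
	unsigned long size;
};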

A donor process that wants to create an mshare'd region from a
section of its mapped addresses calls mshare() with the O_CREAT oflag.
The mshare() syscall then creates a new mm_struct which will host the
page tables for the mshare'd region. vma->vm_private_data for the
vmas covering the address range of this region is updated to point to
a structure containing a pointer to this new mm_struct. Existing page
tables are copied over to the new mm_struct, one pgdir entry at a
time, as shown in the excerpt below.
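
The pgdir copy loop, excerpted from patch 08 of this series:

	myaddr = addr;
	while (myaddr < new_mm->task_size) {
		/* share page tables by copying the pgd entry itself */
		*pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
		myaddr += PGDIR_SIZE;
	}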

A consumer process that wants to map an mshare'd region opens
/sys/fs/mshare/<filename> and reads the starting address and size of
the mshare'd region. It then calls mshare() with these values to map
the entire region into its address space. The consumer process calls
mshare_unlink() to terminate its access.


Since RFC
=========

Since the original RFC, this patch series includes better error
handling, more robust locking and an improved implementation of
mshare. It also incorporates feedback on the original RFC. The
alignment and size requirement is PGDIR_SIZE, same as in the RFC, and
this is open to change based upon further feedback. More review of
this patch series is needed and much appreciated.



Khalid Aziz (14):
  mm: Add new system calls mshare, mshare_unlink
  mm: Add msharefs filesystem
  mm: Add read for msharefs
  mm: implement mshare_unlink syscall
  mm: Add locking to msharefs syscalls
  mm/msharefs: Check for mounted filesystem
  mm: Add vm flag for shared PTE
  mm/mshare: Add basic page table sharing using mshare
  mm: Do not free PTEs for mshare'd PTEs
  mm/mshare: Check for mapped vma when mshare'ing existing mshare'd
    range
  mm/mshare: unmap vmas in mshare_unlink
  mm/mshare: Add a proc file with mshare alignment/size information
  mm/mshare: Enforce mshare'd region permissions
  mm/mshare: Copy PTEs to host mm

 Documentation/filesystems/msharefs.rst |  19 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/mm.h                     |  11 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/asm-generic/unistd.h      |   7 +-
 include/uapi/linux/magic.h             |   1 +
 include/uapi/linux/mman.h              |   5 +
 kernel/sysctl.c                        |   7 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |   7 +
 mm/memory.c                            | 105 ++++-
 mm/mshare.c                            | 587 +++++++++++++++++++++++++
 12 files changed, 750 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

-- 
2.32.0



* [PATCH v1 01/14] mm: Add new system calls mshare, mshare_unlink
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 02/14] mm/mshare: Add msharefs filesystem Khalid Aziz
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Add two new system calls to support PTE sharing across processes
through explicit declarations of shared address space. There is
almost no implementation in this patch; it only wires up the system
calls, for x86_64 only. mshare() returns a file descriptor which does
not support any operations yet.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  2 +
 include/uapi/asm-generic/unistd.h      |  7 ++-
 mm/Makefile                            |  2 +-
 mm/mshare.c                            | 60 ++++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 2 deletions(-)
 create mode 100644 mm/mshare.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..e6e53b85fea6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,8 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	mshare			sys_mshare
+452	common	mshare_unlink		sys_mshare_unlink
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c48b0ae3ba3..d546086d0661 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_mshare 451
+__SYSCALL(__NR_mshare, sys_mshare)
+#define __NR_mshare_unlink 452
+__SYSCALL(__NR_mshare_unlink, sys_mshare_unlink)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 453
 
 /*
  * 32 bit systems traditionally used different
diff --git a/mm/Makefile b/mm/Makefile
index 70d4309c9ce3..70a470b5ebe3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,7 +37,7 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
 CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o mshare.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
 			   pgtable-generic.o rmap.o vmalloc.o
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..436195c0e74e
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * mm/mshare.c
+ *
+ * Page table sharing code
+ *
+ *
+ * Copyright (C) 2021 Oracle Corp. All rights reserved.
+ * Authors:	Khalid Aziz <khalid.aziz@oracle.com>
+ *		Matthew Wilcox <willy@infradead.org>
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/syscalls.h>
+
+static const struct file_operations mshare_fops = {
+};
+
+/*
+ * mshare syscall. Returns a file descriptor
+ */
+SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
+		unsigned long, len, int, oflag, mode_t, mode)
+{
+	int fd;
+
+	/*
+	 * Address range being shared must be aligned to pgdir
+	 * boundary and its size must be a multiple of pgdir size
+	 */
+	if ((addr | len) & (PGDIR_SIZE - 1))
+		return -EINVAL;
+
+	/*
+	 * Allocate a file descriptor to return
+	 *
+	 * TODO: This code ignores the object name completely. Add
+	 * support for that
+	 */
+	fd = anon_inode_getfd("mshare", &mshare_fops, NULL, O_RDWR);
+
+	return fd;
+}
+
+/*
+ * mshare_unlink syscall. Close and remove the named mshare'd object
+ */
+SYSCALL_DEFINE1(mshare_unlink, const char *, name)
+{
+	int fd;
+
+	/*
+	 * Delete the named object
+	 *
+	 * TODO: Mark mshare'd range for deletion
+	 *
+	 */
+	return 0;
+}
-- 
2.32.0



* [PATCH v1 02/14] mm/mshare: Add msharefs filesystem
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 01/14] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 03/14] mm/mshare: Add read for msharefs Khalid Aziz
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Add a ram-based filesystem that contains the files created for each
shared address range. This adds just the filesystem and the creation
of files. Page table entries for the shared ranges created by the
mshare syscall are not yet shared.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 Documentation/filesystems/msharefs.rst |  19 +++
 include/uapi/linux/magic.h             |   1 +
 mm/mshare.c                            | 191 +++++++++++++++++++++++--
 3 files changed, 197 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst

diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..fd161f67045d
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+msharefs - a filesystem to support shared page tables
+=====================================================
+
+msharefs is a ram-based filesystem that allows multiple processes to
+share page table entries for shared pages.
+
+msharefs is typically mounted like this::
+
+	mount -t msharefs none /sys/fs/mshare
+
+When a process calls the mshare syscall with a name for the shared
+address range, a file with that name is created under msharefs. This
+file can be opened by another process, if permissions allow, to query
+the addresses shared under this range. These files are removed by the
+mshare_unlink syscall and cannot be deleted directly. Hence these
+files are created as immutable files.
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f724129c0425..2a57a6ec6f3e 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -105,5 +105,6 @@
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define MSHARE_MAGIC		0x4d534852	/* "MSHR" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/mshare.c b/mm/mshare.c
index 436195c0e74e..ad695288d4bb 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -10,20 +10,117 @@
  *		Matthew Wilcox <willy@infradead.org>
  */
 
-#include <linux/anon_inodes.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <linux/pseudo_fs.h>
+#include <linux/fileattr.h>
+#include <uapi/linux/magic.h>
+#include <uapi/linux/limits.h>
 
-static const struct file_operations mshare_fops = {
+static struct super_block *msharefs_sb;
+
+static const struct file_operations msharefs_file_operations = {
+	.open	= simple_open,
+	.llseek	= no_llseek,
 };
 
+static int
+msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
+{
+	unsigned long hash = init_name_hash(dentry);
+	const unsigned char *s = qstr->name;
+	unsigned int len = qstr->len;
+
+	while (len--)
+		hash = partial_name_hash(*s++, hash);
+	qstr->hash = end_name_hash(hash);
+	return 0;
+}
+
+static struct dentry
+*msharefs_alloc_dentry(struct dentry *parent, const char *name)
+{
+	struct dentry *d;
+	struct qstr q;
+	int err;
+
+	q.name = name;
+	q.len = strlen(name);
+
+	err = msharefs_d_hash(parent, &q);
+	if (err)
+		return ERR_PTR(err);
+
+	d = d_alloc(parent, &q);
+	if (d)
+		return d;
+
+	return ERR_PTR(-ENOMEM);
+}
+
+static struct inode
+*msharefs_get_inode(struct super_block *sb, int mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (inode) {
+		inode->i_ino = get_next_ino();
+		inode->i_mode = mode;
+
+		/*
+		 * msharefs files are not meant to be manipulated from
+		 * userspace. Reading from the file is the only allowed operation
+		 */
+		inode->i_flags = S_IMMUTABLE;
+
+		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+		inode->i_fop = &msharefs_file_operations;
+		inode->i_size = 0;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
+	}
+
+	return inode;
+}
+
+static int
+mshare_file_create(const char *name, unsigned long flags)
+{
+	struct inode *inode;
+	struct dentry *root, *dentry;
+	int err = 0;
+
+	root = msharefs_sb->s_root;
+
+	inode = msharefs_get_inode(msharefs_sb, S_IFREG | 0400);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	dentry = msharefs_alloc_dentry(root, name);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto fail_inode;
+	}
+
+	d_add(dentry, inode);
+
+	return err;
+
+fail_inode:
+	iput(inode);
+	return err;
+}
+
 /*
- * mshare syscall. Returns a file descriptor
+ * mshare syscall
  */
-SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
+SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
-	int fd;
+	char mshare_name[NAME_MAX];
+	int err;
 
 	/*
 	 * Address range being shared must be aligned to pgdir
@@ -32,15 +129,14 @@ SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
 	if ((addr | len) & (PGDIR_SIZE - 1))
 		return -EINVAL;
 
-	/*
-	 * Allocate a file descriptor to return
-	 *
-	 * TODO: This code ignores the object name completely. Add
-	 * support for that
-	 */
-	fd = anon_inode_getfd("mshare", &mshare_fops, NULL, O_RDWR);
+	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (err)
+		goto err_out;
 
-	return fd;
+	err = mshare_file_create(mshare_name, oflag);
+
+err_out:
+	return err;
 }
 
 /*
@@ -48,7 +144,8 @@ SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
  */
 SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 {
-	int fd;
+	char mshare_name[NAME_MAX];
+	int err;
 
 	/*
 	 * Delete the named object
@@ -56,5 +153,71 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 	 * TODO: Mark mshare'd range for deletion
 	 *
 	 */
+	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (err)
+		goto err_out;
+	return 0;
+
+err_out:
+	return err;
+}
+
+static const struct dentry_operations msharefs_d_ops = {
+	.d_hash = msharefs_d_hash,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	static const struct tree_descr empty_descr = {""};
+	int err;
+
+	sb->s_d_op = &msharefs_d_ops;
+	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
+	if (err)
+		return err;
+
+	msharefs_sb = sb;
+	return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+	.get_tree	= msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &msharefs_context_ops;
 	return 0;
 }
+
+static struct file_system_type mshare_fs = {
+	.name			= "msharefs",
+	.init_fs_context	= mshare_init_fs_context,
+	.kill_sb		= kill_litter_super,
+};
+
+static int
+mshare_init(void)
+{
+	int ret = 0;
+
+	ret = sysfs_create_mount_point(fs_kobj, "mshare");
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&mshare_fs);
+	if (ret)
+		sysfs_remove_mount_point(fs_kobj, "mshare");
+
+	return ret;
+}
+
+fs_initcall(mshare_init);
-- 
2.32.0



* [PATCH v1 03/14] mm/mshare: Add read for msharefs
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 01/14] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 02/14] mm/mshare: Add msharefs filesystem Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 04/14] mm/mshare: implement mshare_unlink syscall Khalid Aziz
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Allocate a new mm to store the shared page table. Add a read
operation to the file created by the mshare syscall to return the
starting address and size of the region shared by the corresponding
range. This information is returned as struct mshare_info, defined in
include/uapi/linux/mman.h.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
v1:
	- Read returns struct instead of two unsigned long (suggested
	  by Mike Rapoport)

 include/uapi/linux/mman.h |  5 +++
 mm/mshare.c               | 68 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..56fe446e24b1 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -41,4 +41,9 @@
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+struct mshare_info {
+	unsigned long start;
+	unsigned long size;
+};
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/mm/mshare.c b/mm/mshare.c
index ad695288d4bb..59e5d294e562 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,14 +16,39 @@
 #include <linux/uaccess.h>
 #include <linux/pseudo_fs.h>
 #include <linux/fileattr.h>
+#include <linux/refcount.h>
+#include <linux/sched/mm.h>
 #include <uapi/linux/magic.h>
 #include <uapi/linux/limits.h>
+#include <uapi/linux/mman.h>
+
+struct mshare_data {
+	struct mm_struct *mm;
+	refcount_t refcnt;
+};
 
 static struct super_block *msharefs_sb;
 
+static ssize_t
+mshare_read(struct kiocb *iocb, struct iov_iter *iov)
+{
+	struct mshare_data *info = iocb->ki_filp->private_data;
+	struct mm_struct *mm = info->mm;
+	size_t ret;
+	struct mshare_info m_info;
+
+	m_info.start = mm->mmap_base;
+	m_info.size = mm->task_size - mm->mmap_base;
+	ret = copy_to_iter(&m_info, sizeof(m_info), iov);
+	if (!ret)
+		return -EFAULT;
+	return ret;
+}
+
 static const struct file_operations msharefs_file_operations = {
-	.open	= simple_open,
-	.llseek	= no_llseek,
+	.open		= simple_open,
+	.read_iter	= mshare_read,
+	.llseek		= no_llseek,
 };
 
 static int
@@ -77,7 +102,12 @@ static struct inode
 
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
 		inode->i_fop = &msharefs_file_operations;
-		inode->i_size = 0;
+
+		/*
+		 * A read from this file will return two unsigned longs
+		 */
+		inode->i_size = 2 * sizeof(unsigned long);
+
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
 	}
@@ -86,7 +116,8 @@ static struct inode
 }
 
 static int
-mshare_file_create(const char *name, unsigned long flags)
+mshare_file_create(const char *name, unsigned long flags,
+			struct mshare_data *info)
 {
 	struct inode *inode;
 	struct dentry *root, *dentry;
@@ -98,6 +129,8 @@ mshare_file_create(const char *name, unsigned long flags)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
+	inode->i_private = info;
+
 	dentry = msharefs_alloc_dentry(root, name);
 	if (IS_ERR(dentry)) {
 		err = PTR_ERR(dentry);
@@ -120,6 +153,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
 	char mshare_name[NAME_MAX];
+	struct mshare_data *info;
+	struct mm_struct *mm;
 	int err;
 
 	/*
@@ -133,8 +168,31 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	if (err)
 		goto err_out;
 
-	err = mshare_file_create(mshare_name, oflag);
+	mm = mm_alloc();
+	if (!mm)
+		return -ENOMEM;
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		err = -ENOMEM;
+		goto err_relmm;
+	}
+	mm->mmap_base = addr;
+	mm->task_size = addr + len;
+	if (!mm->task_size)
+		mm->task_size--;
+	info->mm = mm;
+	refcount_set(&info->refcnt, 1);
+
+	err = mshare_file_create(mshare_name, oflag, info);
+	if (err)
+		goto err_relinfo;
+
+	return 0;
 
+err_relinfo:
+	kfree(info);
+err_relmm:
+	mmput(mm);
 err_out:
 	return err;
 }
-- 
2.32.0



* [PATCH v1 04/14] mm/mshare: implement mshare_unlink syscall
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (2 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 03/14] mm/mshare: Add read for msharefs Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 05/14] mm/mshare: Add locking to msharefs syscalls Khalid Aziz
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Add code to allow the mshare syscall to be made for an existing
mshare'd region. Complete the implementation of the mshare_unlink
syscall. Make reading the mshare resource name from userspace safer.
Fix code to allow msharefs to be unmounted cleanly.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 144 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 113 insertions(+), 31 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 0247275aac50..b9d7836f9bd1 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -23,11 +23,24 @@
 
 struct mshare_data {
 	struct mm_struct *mm;
+	mode_t mode;
 	refcount_t refcnt;
 };
 
 static struct super_block *msharefs_sb;
 
+static void
+msharefs_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
+static const struct super_operations msharefs_ops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= msharefs_evict_inode,
+};
+
 static ssize_t
 mshare_read(struct kiocb *iocb, struct iov_iter *iov)
 {
@@ -115,7 +128,7 @@ static struct inode
 }
 
 static int
-mshare_file_create(const char *name, unsigned long flags,
+mshare_file_create(struct filename *fname, int flags,
 			struct mshare_data *info)
 {
 	struct inode *inode;
@@ -124,13 +137,16 @@ mshare_file_create(const char *name, unsigned long flags,
 
 	root = msharefs_sb->s_root;
 
+	/*
+	 * This is a read only file.
+	 */
 	inode = msharefs_get_inode(msharefs_sb, S_IFREG | 0400);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
 	inode->i_private = info;
 
-	dentry = msharefs_alloc_dentry(root, name);
+	dentry = msharefs_alloc_dentry(root, fname->name);
 	if (IS_ERR(dentry)) {
 		err = PTR_ERR(dentry);
 		goto fail_inode;
@@ -138,6 +154,7 @@ mshare_file_create(const char *name, unsigned long flags,
 
 	d_add(dentry, inode);
 
+	dput(dentry);
 	return err;
 
 fail_inode:
@@ -151,10 +168,13 @@ mshare_file_create(const char *name, unsigned long flags,
 SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
-	char mshare_name[NAME_MAX];
 	struct mshare_data *info;
 	struct mm_struct *mm;
-	int err;
+	struct filename *fname = getname(name);
+	struct dentry *dentry;
+	struct inode *inode;
+	struct qstr namestr;
+	int err = PTR_ERR(fname);
 
 	/*
 	 * Address range being shared must be aligned to pgdir
@@ -163,29 +183,56 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	if ((addr | len) & (PGDIR_SIZE - 1))
 		return -EINVAL;
 
-	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (IS_ERR(fname))
+		goto err_out;
+
+	/*
+	 * Does this mshare entry exist already? If it does, calling
+	 * mshare with O_EXCL|O_CREAT is an error
+	 */
+	namestr.name = fname->name;
+	namestr.len = strlen(fname->name);
+	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	dentry = d_lookup(msharefs_sb->s_root, &namestr);
+	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
+		err = -EEXIST;
+		dput(dentry);
+		goto err_out;
+	}
 
-	mm = mm_alloc();
-	if (!mm)
-		return -ENOMEM;
-	info = kzalloc(sizeof(*info), GFP_KERNEL);
-	if (!info) {
-		err = -ENOMEM;
-		goto err_relmm;
+	if (dentry) {
+		inode = d_inode(dentry);
+		if (inode == NULL) {
+			err = -EINVAL;
+			goto err_out;
+		}
+		info = inode->i_private;
+		refcount_inc(&info->refcnt);
+		dput(dentry);
+	} else {
+		mm = mm_alloc();
+		if (!mm)
+			return -ENOMEM;
+		info = kzalloc(sizeof(*info), GFP_KERNEL);
+		if (!info) {
+			err = -ENOMEM;
+			goto err_relmm;
+		}
+		mm->mmap_base = addr;
+		mm->task_size = addr + len;
+		if (!mm->task_size)
+			mm->task_size--;
+		info->mm = mm;
+		info->mode = mode;
+		refcount_set(&info->refcnt, 1);
+		err = mshare_file_create(fname, oflag, info);
+		if (err)
+			goto err_relinfo;
 	}
-	mm->mmap_base = addr;
-	mm->task_size = addr + len;
-	if (!mm->task_size)
-		mm->task_size--;
-	info->mm = mm;
-	refcount_set(&info->refcnt, 1);
-
-	err = mshare_file_create(mshare_name, oflag, info);
-	if (err)
-		goto err_relinfo;
 
+	putname(fname);
 	return 0;
 
 err_relinfo:
@@ -193,6 +240,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 err_relmm:
 	mmput(mm);
 err_out:
+	putname(fname);
 	return err;
 }
 
@@ -201,21 +249,54 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
  */
 SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 {
-	char mshare_name[NAME_MAX];
-	int err;
+	struct filename *fname = getname(name);
+	int err = PTR_ERR(fname);
+	struct dentry *dentry;
+	struct inode *inode;
+	struct mshare_data *info;
+	struct qstr namestr;
 
-	/*
-	 * Delete the named object
-	 *
-	 * TODO: Mark mshare'd range for deletion
-	 *
-	 */
-	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (IS_ERR(fname))
+		goto err_out;
+
+	namestr.name = fname->name;
+	namestr.len = strlen(fname->name);
+	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	dentry = d_lookup(msharefs_sb->s_root, &namestr);
+	if (dentry == NULL) {
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	inode = d_inode(dentry);
+	if (inode == NULL) {
+		err = -EINVAL;
+		goto err_dput;
+	}
+	info = inode->i_private;
+
+	/*
+	 * Is this the last reference?
+	 */
+	if (refcount_dec_and_test(&info->refcnt)) {
+		simple_unlink(d_inode(msharefs_sb->s_root), dentry);
+		d_drop(dentry);
+		d_delete(dentry);
+		mmput(info->mm);
+		kfree(info);
+	} else {
+		dput(dentry);
+	}
+
+	putname(fname);
 	return 0;
 
+err_dput:
+	dput(dentry);
 err_out:
+	putname(fname);
 	return err;
 }
 
@@ -229,6 +310,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	static const struct tree_descr empty_descr = {""};
 	int err;
 
+	sb->s_op = &msharefs_ops;
 	sb->s_d_op = &msharefs_d_ops;
 	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
 	if (err)
-- 
2.32.0



* [PATCH v1 05/14] mm/mshare: Add locking to msharefs syscalls
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (3 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 04/14] mm/mshare: implement mshare_unlink syscall Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 06/14] mm/mshare: Check for mounted filesystem Khalid Aziz
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Lock the root inode for msharefs when creating a new file or
deleting an existing one to avoid races. mshare syscalls are low
frequency operations, so locking the root inode is reasonable.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index b9d7836f9bd1..85ccb7f02f33 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -195,11 +195,12 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	inode_lock(d_inode(msharefs_sb->s_root));
 	dentry = d_lookup(msharefs_sb->s_root, &namestr);
 	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
 		err = -EEXIST;
 		dput(dentry);
-		goto err_out;
+		goto err_unlock_inode;
 	}
 
 	if (dentry) {
@@ -232,6 +233,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 			goto err_relinfo;
 	}
 
+	inode_unlock(d_inode(msharefs_sb->s_root));
 	putname(fname);
 	return 0;
 
@@ -239,6 +241,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	kfree(info);
 err_relmm:
 	mmput(mm);
+err_unlock_inode:
+	inode_unlock(d_inode(msharefs_sb->s_root));
 err_out:
 	putname(fname);
 	return err;
@@ -264,10 +268,11 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	inode_lock(d_inode(msharefs_sb->s_root));
 	dentry = d_lookup(msharefs_sb->s_root, &namestr);
 	if (dentry == NULL) {
 		err = -EINVAL;
-		goto err_out;
+		goto err_unlock_inode;
 	}
 
 	inode = d_inode(dentry);
@@ -290,11 +295,14 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 		dput(dentry);
 	}
 
+	inode_unlock(d_inode(msharefs_sb->s_root));
 	putname(fname);
 	return 0;
 
 err_dput:
 	dput(dentry);
+err_unlock_inode:
+	inode_unlock(d_inode(msharefs_sb->s_root));
 err_out:
 	putname(fname);
 	return err;
-- 
2.32.0



* [PATCH v1 06/14] mm/mshare: Check for mounted filesystem
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (4 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 05/14] mm/mshare: Add locking to msharefs syscalls Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 07/14] mm/mshare: Add vm flag for shared PTE Khalid Aziz
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Check if msharefs is mounted before performing any msharefs
operations to avoid inadvertent NULL pointer dereferences.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index 85ccb7f02f33..cd2f7ad24d9d 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -176,6 +176,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	struct qstr namestr;
 	int err = PTR_ERR(fname);
 
+	/*
+	 * Is msharefs mounted? TODO: If not mounted, return error
+	 * or automount?
+	 */
+	if (msharefs_sb == NULL)
+		return -ENOENT;
+
 	/*
 	 * Address range being shared must be aligned to pgdir
 	 * boundary and its size must be a multiple of pgdir size
@@ -260,6 +267,9 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 	struct mshare_data *info;
 	struct qstr namestr;
 
+	if (msharefs_sb == NULL)
+		return -ENOENT;
+
 	if (IS_ERR(fname))
 		goto err_out;
 
-- 
2.32.0



* [PATCH v1 07/14] mm/mshare: Add vm flag for shared PTE
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (5 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 06/14] mm/mshare: Check for mounted filesystem Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare Khalid Aziz
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Add a bit to vm_flags to indicate that a vma shares PTEs with others.
Add a function to determine if a vma shares PTEs by checking this
flag. This will be used to find the shared page table entries on page
fault for vmas sharing PTEs.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 3 ++-
 mm/internal.h                  | 5 +++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5744a3fc4716..821ed7ee7b41 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -308,11 +308,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -354,6 +356,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MTE_ALLOWED	VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_SHARED_PT	VM_HIGH_ARCH_5
+#else
+#define VM_SHARED_PT	0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..002dbf2711c5 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -184,7 +184,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
-	{VM_MERGEABLE,			"mergeable"	}		\
+	{VM_MERGEABLE,			"mergeable"	},		\
+	{VM_SHARED_PT,			"sharedpt"	}		\
 
 #define show_vma_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..cf50a471384e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -718,4 +718,9 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+static inline bool vma_is_shared(const struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_SHARED_PT;
+}
+
 #endif	/* __MM_INTERNAL_H */
-- 
2.32.0



* [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (6 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 07/14] mm/mshare: Add vm flag for shared PTE Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 18:48   ` Dave Hansen
                     ` (2 more replies)
  2022-04-11 16:05 ` [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs Khalid Aziz
                   ` (9 subsequent siblings)
  17 siblings, 3 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

This patch adds basic page table sharing across tasks through the
mshare syscall. It does this by creating a new mm_struct which hosts
the shared vmas and page tables. This mm_struct is maintained as long
as there is at least one task using the mshare'd range. It is cleaned
up by the last mshare_unlink syscall.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h |   2 +
 mm/memory.c   |  35 ++++++++++
 mm/mshare.c   | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 214 insertions(+), 13 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index cf50a471384e..68f82f0f8b66 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
+			unsigned long *addrp);
 static inline bool vma_is_shared(const struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_SHARED_PT;
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..c77c0d643ea8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
+	bool shared = false;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
 
+	if (unlikely(vma_is_shared(vma))) {
+		ret = find_shared_vma(&vma, &address);
+		if (ret)
+			return ret;
+		if (!vma)
+			return VM_FAULT_SIGSEGV;
+		shared = true;
+	}
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
@@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	/*
+	 * Release the read lock on shared VMA's parent mm unless
+	 * __handle_mm_fault released the lock already.
+	 * __handle_mm_fault sets VM_FAULT_RETRY in return value if
+	 * it released mmap lock. If lock was released, that implies
+	 * the lock would have been released on task's original mm if
+	 * this were not a shared PTE vma. To keep lock state consistent,
+	 * make sure to release the lock on task's original mm
+	 */
+	if (shared) {
+		int release_mmlock = 1;
+
+		if (!(ret & VM_FAULT_RETRY)) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
+			(flags & FAULT_FLAG_RETRY_NOWAIT)) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		}
+
+		if (release_mmlock)
+			mmap_read_unlock(current->mm);
+	}
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mshare.c b/mm/mshare.c
index cd2f7ad24d9d..d1896adcb00f 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -17,18 +17,49 @@
 #include <linux/pseudo_fs.h>
 #include <linux/fileattr.h>
 #include <linux/refcount.h>
+#include <linux/mman.h>
 #include <linux/sched/mm.h>
 #include <uapi/linux/magic.h>
 #include <uapi/linux/limits.h>
 
 struct mshare_data {
-	struct mm_struct *mm;
+	struct mm_struct *mm, *host_mm;
 	mode_t mode;
 	refcount_t refcnt;
 };
 
 static struct super_block *msharefs_sb;
 
+/* Returns holding the host mm's lock for read.  Caller must release. */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
+{
+	struct vm_area_struct *vma, *guest = *vmap;
+	struct mshare_data *info = guest->vm_private_data;
+	struct mm_struct *host_mm = info->mm;
+	unsigned long host_addr;
+	pgd_t *pgd, *guest_pgd;
+
+	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
+	pgd = pgd_offset(host_mm, host_addr);
+	guest_pgd = pgd_offset(current->mm, *addrp);
+	if (!pgd_same(*guest_pgd, *pgd)) {
+		set_pgd(guest_pgd, *pgd);
+		return VM_FAULT_NOPAGE;
+	}
+
+	*addrp = host_addr;
+	mmap_read_lock(host_mm);
+	vma = find_vma(host_mm, host_addr);
+
+	/* XXX: expand stack? */
+	if (vma && vma->vm_start > host_addr)
+		vma = NULL;
+
+	*vmap = vma;
+	return 0;
+}
+
 static void
 msharefs_evict_inode(struct inode *inode)
 {
@@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
 	struct mshare_data *info;
-	struct mm_struct *mm;
 	struct filename *fname = getname(name);
 	struct dentry *dentry;
 	struct inode *inode;
 	struct qstr namestr;
+	struct vm_area_struct *vma, *next, *new_vma;
+	struct mm_struct *new_mm;
+	unsigned long end;
 	int err = PTR_ERR(fname);
 
 	/*
@@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	if (IS_ERR(fname))
 		goto err_out;
 
+	end = addr + len;
+
 	/*
 	 * Does this mshare entry exist already? If it does, calling
 	 * mshare with O_EXCL|O_CREAT is an error
@@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	inode_lock(d_inode(msharefs_sb->s_root));
 	dentry = d_lookup(msharefs_sb->s_root, &namestr);
 	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
+		inode = d_inode(dentry);
 		err = -EEXIST;
 		dput(dentry);
 		goto err_unlock_inode;
 	}
 
 	if (dentry) {
+		unsigned long mapaddr, prot = PROT_NONE;
+
 		inode = d_inode(dentry);
 		if (inode == NULL) {
+			mmap_write_unlock(current->mm);
 			err = -EINVAL;
 			goto err_out;
 		}
 		info = inode->i_private;
-		refcount_inc(&info->refcnt);
 		dput(dentry);
+
+		/*
+		 * Map in the address range as anonymous mappings
+		 */
+		oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
+		if (oflag & O_RDONLY)
+			prot |= PROT_READ;
+		else if (oflag & O_WRONLY)
+			prot |= PROT_WRITE;
+		else if (oflag & O_RDWR)
+			prot |= (PROT_READ | PROT_WRITE);
+		mapaddr = vm_mmap(NULL, addr, len, prot,
+				MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
+		if (IS_ERR((void *)mapaddr)) {
+			err = -EINVAL;
+			goto err_out;
+		}
+
+		refcount_inc(&info->refcnt);
+
+		/*
+		 * Now that we have mmap'd the mshare'd range, update vma
+		 * flags and vm_mm pointer for this mshare'd range.
+		 */
+		mmap_write_lock(current->mm);
+		vma = find_vma(current->mm, addr);
+		if (vma && vma->vm_start < addr) {
+			mmap_write_unlock(current->mm);
+			err = -EINVAL;
+			goto err_out;
+		}
+
+		while (vma && vma->vm_start < (addr + len)) {
+			vma->vm_private_data = info;
+			vma->vm_mm = info->mm;
+			vma->vm_flags |= VM_SHARED_PT;
+			next = vma->vm_next;
+			vma = next;
+		}
 	} else {
-		mm = mm_alloc();
-		if (!mm)
+		unsigned long myaddr;
+		struct mm_struct *old_mm;
+
+		old_mm = current->mm;
+		new_mm = mm_alloc();
+		if (!new_mm)
 			return -ENOMEM;
 		info = kzalloc(sizeof(*info), GFP_KERNEL);
 		if (!info) {
 			err = -ENOMEM;
 			goto err_relmm;
 		}
-		mm->mmap_base = addr;
-		mm->task_size = addr + len;
-		if (!mm->task_size)
-			mm->task_size--;
-		info->mm = mm;
+		new_mm->mmap_base = addr;
+		new_mm->task_size = addr + len;
+		if (!new_mm->task_size)
+			new_mm->task_size--;
+		info->mm = new_mm;
+		info->host_mm = old_mm;
 		info->mode = mode;
 		refcount_set(&info->refcnt, 1);
+
+		/*
+		 * VMAs for this address range may or may not exist.
+		 * If VMAs exist, they should be marked as shared at
+		 * this point and page table info should be copied
+		 * over to newly created mm_struct. TODO: If VMAs do not
+		 * exist, create them and mark them as shared.
+		 */
+		mmap_write_lock(old_mm);
+		vma = find_vma_intersection(old_mm, addr, end);
+		if (!vma) {
+			err = -EINVAL;
+			goto unlock;
+		}
+		/*
+		 * TODO: If the currently allocated VMA goes beyond the
+		 * mshare'd range, this VMA needs to be split.
+		 *
+		 * Double check that source VMAs do not extend outside
+		 * the range
+		 */
+		vma = find_vma(old_mm, addr + len);
+		if (vma && vma->vm_start < (addr + len)) {
+			err = -EINVAL;
+			goto unlock;
+		}
+
+		vma = find_vma(old_mm, addr);
+		if (vma && vma->vm_start < addr) {
+			err = -EINVAL;
+			goto unlock;
+		}
+
+		mmap_write_lock(new_mm);
+		while (vma && vma->vm_start < (addr + len)) {
+			/*
+			 * Copy this vma over to host mm
+			 */
+			vma->vm_private_data = info;
+			vma->vm_mm = new_mm;
+			vma->vm_flags |= VM_SHARED_PT;
+			new_vma = vm_area_dup(vma);
+			if (!new_vma) {
+				err = -ENOMEM;
+				goto unlock;
+			}
+			err = insert_vm_struct(new_mm, new_vma);
+			if (err)
+				goto unlock;
+
+			vma = vma->vm_next;
+		}
+		mmap_write_unlock(new_mm);
+
 		err = mshare_file_create(fname, oflag, info);
 		if (err)
-			goto err_relinfo;
+			goto unlock;
+
+		/*
+		 * Copy over current PTEs
+		 */
+		myaddr = addr;
+		while (myaddr < new_mm->task_size) {
+			*pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
+			myaddr += PGDIR_SIZE;
+		}
+		/*
+		 * TODO: Free the corresponding page table in calling
+		 * process
+		 */
 	}
 
+	mmap_write_unlock(current->mm);
 	inode_unlock(d_inode(msharefs_sb->s_root));
 	putname(fname);
 	return 0;
 
-err_relinfo:
+unlock:
+	mmap_write_unlock(current->mm);
 	kfree(info);
 err_relmm:
-	mmput(mm);
+	mmput(new_mm);
 err_unlock_inode:
 	inode_unlock(d_inode(msharefs_sb->s_root));
 err_out:
@@ -294,11 +445,24 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 
 	/*
 	 * Is this the last reference?
+	 * TODO: permission checks are needed before proceeding
 	 */
 	if (refcount_dec_and_test(&info->refcnt)) {
 		simple_unlink(d_inode(msharefs_sb->s_root), dentry);
 		d_drop(dentry);
 		d_delete(dentry);
+		/*
+		 * TODO: Release all physical pages allocated for this
+		 * mshare range and release associated page table. If
+		 * the final unlink happens from the process that created
+		 * mshare'd range, do we return page tables and pages to
+		 * that process so the creating process can continue using
+		 * the address range it had chosen to mshare at some
+		 * point?
+		 *
+		 * TODO: unmap shared vmas from every task that is using
+		 * this mshare'd range.
+		 */
 		mmput(info->mm);
 		kfree(info);
 	} else {
-- 
2.32.0



* [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (7 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-05-31  4:24   ` Barry Song
  2022-04-11 16:05 ` [PATCH v1 10/14] mm/mshare: Check for mapped vma when mshare'ing existing mshare'd range Khalid Aziz
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

mshare'd PTEs should not be removed when a task exits. These PTEs
are removed when the last task sharing the PTEs exits. Add a check
for shared PTEs and skip them.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/memory.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c77c0d643ea8..e7c5bc6f8836 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
+			 * as long as they all belong to the same mm (that
+			 * may not be the case if a vma is part of an mshare'd
+			 * range)
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !is_vm_hugetlb_page(next)
+			       && vma->vm_mm == tlb->mm) {
 				vma = next;
 				next = vma->vm_next;
 				unlink_anon_vmas(vma);
 				unlink_file_vma(vma);
 			}
-			free_pgd_range(tlb, addr, vma->vm_end,
-				floor, next ? next->vm_start : ceiling);
+			/*
+			 * Free pgd only if pgd is not allocated for an
+			 * mshare'd range
+			 */
+			if (vma->vm_mm == tlb->mm)
+				free_pgd_range(tlb, addr, vma->vm_end,
+					floor, next ? next->vm_start : ceiling);
 		}
 		vma = next;
 	}
@@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
+	/*
+	 * If this is an mshare'd page, do not unmap it since it might
+	 * still be in use.
+	 */
+	if (vma->vm_mm != tlb->mm)
+		return;
+
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
-- 
2.32.0



* [PATCH v1 10/14] mm/mshare: Check for mapped vma when mshare'ing existing mshare'd range
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (8 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 11/14] mm/mshare: unmap vmas in mshare_unlink Khalid Aziz
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

When a task calls mshare() to map in an existing mshare'd region,
make sure this mapping does not overlap any existing mappings in
the calling task. Also ensure mmap locks are taken and released in
the correct order and with the correct read/write mode.
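
The resulting lock ordering, sketched from the diff below (error
handling trimmed; this is illustrative, not a verbatim excerpt):

	/* Attach path: a read lock is enough to check for overlap. */
	mmap_read_lock(current->mm);
	vma = find_vma_intersection(current->mm, addr, addr + len);
	mmap_read_unlock(current->mm);
	if (vma)
		return -EINVAL;	/* range already mapped in caller */

	/* Create path: validate the donor range under a read lock
	 * first, then take both write locks, host mm before donor. */
	mmap_write_lock(new_mm);
	mmap_write_lock(old_mm);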

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 62 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index d1896adcb00f..40c495ffc0ca 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -249,11 +249,24 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	if (dentry) {
 		unsigned long mapaddr, prot = PROT_NONE;
 
+		/*
+		 * If a task is trying to map in an existing mshare'd
+		 * range, make sure there are no overlapping mappings
+		 * in calling process already
+		 */
+		mmap_read_lock(current->mm);
+		vma = find_vma_intersection(current->mm, addr, end);
+		if (vma) {
+			mmap_read_unlock(current->mm);
+			err = -EINVAL;
+			goto err_unlock_inode;
+		}
+		mmap_read_unlock(current->mm);
+
 		inode = d_inode(dentry);
 		if (inode == NULL) {
-			mmap_write_unlock(current->mm);
 			err = -EINVAL;
-			goto err_out;
+			goto err_unlock_inode;
 		}
 		info = inode->i_private;
 		dput(dentry);
@@ -272,7 +285,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 				MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
 		if (IS_ERR((void *)mapaddr)) {
 			err = -EINVAL;
-			goto err_out;
+			goto err_unlock_inode;
 		}
 
 		refcount_inc(&info->refcnt);
@@ -286,7 +299,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		if (vma && vma->vm_start < addr) {
 			mmap_write_unlock(current->mm);
 			err = -EINVAL;
-			goto err_out;
+			goto err_unlock_inode;
 		}
 
 		while (vma && vma->vm_start < (addr + len)) {
@@ -296,6 +309,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 			next = vma->vm_next;
 			vma = next;
 		}
+		mmap_write_unlock(current->mm);
 	} else {
 		unsigned long myaddr;
 		struct mm_struct *old_mm;
@@ -325,11 +339,12 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		 * over to newly created mm_struct. TODO: If VMAs do not
 		 * exist, create them and mark them as shared.
 		 */
-		mmap_write_lock(old_mm);
+		mmap_read_lock(old_mm);
 		vma = find_vma_intersection(old_mm, addr, end);
 		if (!vma) {
+			mmap_read_unlock(old_mm);
 			err = -EINVAL;
-			goto unlock;
+			goto free_info;
 		}
 		/*
 		 * TODO: If the currently allocated VMA goes beyond the
@@ -340,17 +355,21 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		 */
 		vma = find_vma(old_mm, addr + len);
 		if (vma && vma->vm_start < (addr + len)) {
+			mmap_read_unlock(old_mm);
 			err = -EINVAL;
-			goto unlock;
+			goto free_info;
 		}
 
 		vma = find_vma(old_mm, addr);
 		if (vma && vma->vm_start < addr) {
+			mmap_read_unlock(old_mm);
 			err = -EINVAL;
-			goto unlock;
+			goto free_info;
 		}
+		mmap_read_unlock(old_mm);
 
 		mmap_write_lock(new_mm);
+		mmap_write_lock(old_mm);
 		while (vma && vma->vm_start < (addr + len)) {
 			/*
 			 * Copy this vma over to host mm
@@ -360,20 +379,21 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 			vma->vm_flags |= VM_SHARED_PT;
 			new_vma = vm_area_dup(vma);
 			if (!new_vma) {
+				mmap_write_unlock(new_mm);
+				mmap_write_unlock(old_mm);
 				err = -ENOMEM;
-				goto unlock;
+				goto free_info;
 			}
 			err = insert_vm_struct(new_mm, new_vma);
-			if (err)
-				goto unlock;
+			if (err) {
+				mmap_write_unlock(new_mm);
+				mmap_write_unlock(old_mm);
+				err = -ENOMEM;
+				goto free_info;
+			}
 
 			vma = vma->vm_next;
 		}
-		mmap_write_unlock(new_mm);
-
-		err = mshare_file_create(fname, oflag, info);
-		if (err)
-			goto unlock;
 
 		/*
 		 * Copy over current PTEs
@@ -387,15 +407,19 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		 * TODO: Free the corresponding page table in calling
 		 * process
 		 */
+		mmap_write_unlock(old_mm);
+		mmap_write_unlock(new_mm);
+
+		err = mshare_file_create(fname, oflag, info);
+		if (err)
+			goto free_info;
 	}
 
-	mmap_write_unlock(current->mm);
 	inode_unlock(d_inode(msharefs_sb->s_root));
 	putname(fname);
 	return 0;
 
-unlock:
-	mmap_write_unlock(current->mm);
+free_info:
 	kfree(info);
 err_relmm:
 	mmput(new_mm);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 11/14] mm/mshare: unmap vmas in mshare_unlink
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (9 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 10/14] mm/mshare: Check for mapped vma when mshare'ing existing mshare'd range Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 12/14] mm/mshare: Add a proc file with mshare alignment/size information Khalid Aziz
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

mshare() maps in vmas for the calling task. These vmas should be
unmapped when the task calls mshare_unlink(). Add minimal code to
unmap them.
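
Since the host mm was set up in the earlier patch with
mmap_base = addr and task_size = addr + len, the length of the
region can be recovered from the mshare_data alone; a sketch:

	/* (addr + len) - addr == length of the whole mshare'd range */
	unsigned long len = info->mm->task_size - info->mm->mmap_base;

	if (info->host_mm != current->mm)	/* not the creator */
		vm_munmap(info->mm->mmap_base, len);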

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index 40c495ffc0ca..ec23d1db79b2 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -490,6 +490,17 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 		mmput(info->mm);
 		kfree(info);
 	} else {
+		/*
+		 * TODO: If mshare'd range is still mapped in the process,
+		 * it should be unmapped. Following is minimal code and
+		 * might need fix up
+		 */
+		unsigned long tmp;
+
+		tmp = info->mm->task_size - info->mm->mmap_base;
+		if (info->host_mm != current->mm)
+			vm_munmap(info->mm->mmap_base, tmp);
+
 		dput(dentry);
 	}
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 12/14] mm/mshare: Add a proc file with mshare alignment/size information
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (10 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 11/14] mm/mshare: unmap vmas in mshare_unlink Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 13/14] mm/mshare: Enforce mshare'd region permissions Khalid Aziz
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

mshare regions are subject to size and alignment requirements. The
alignment boundary can differ between architectures, and userspace
needs a way to find out what the requirement is. Add a file
/proc/sys/vm/mshare_size that can be read by userspace to get the
alignment and size requirement.
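
A minimal userspace sketch of reading this file (the rounding
helper is illustrative, not part of the series):

	#include <stdio.h>

	int main(void)
	{
		unsigned long align, len = 1UL << 30;	/* e.g. 1GB */
		FILE *f = fopen("/proc/sys/vm/mshare_size", "r");

		if (!f || fscanf(f, "%lu", &align) != 1)
			return 1;
		fclose(f);
		/* Round the desired length up to the mshare boundary. */
		len = (len + align - 1) & ~(align - 1);
		printf("alignment %lu, rounded length %lu\n", align, len);
		return 0;
	}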

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
v1:
	- Provide a way for userspace to determine alignment and
	  size restriction (based upon feedback from Dave Hansen)

 include/linux/mm.h | 1 +
 kernel/sysctl.c    | 7 +++++++
 mm/mshare.c        | 3 +++
 3 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 821ed7ee7b41..d9456d424202 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3339,6 +3339,7 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 #endif
 
 extern int sysctl_nr_trim_pages;
+extern ulong sysctl_mshare_size;
 
 #ifdef CONFIG_PRINTK
 void mem_dump_obj(void *object);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 730ab56d9e92..66697ba5da88 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2810,6 +2810,13 @@ static struct ctl_table vm_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname	= "mshare_size",
+		.data		= &sysctl_mshare_size,
+		.maxlen		= sizeof(sysctl_mshare_size),
+		.mode		= 0444,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 	{ }
 };
 
diff --git a/mm/mshare.c b/mm/mshare.c
index ec23d1db79b2..88c7cefc933d 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -28,6 +28,8 @@ struct mshare_data {
 	refcount_t refcnt;
 };
 
+ulong sysctl_mshare_size;
+
 static struct super_block *msharefs_sb;
 
 /* Returns holding the host mm's lock for read.  Caller must release. */
@@ -573,6 +575,7 @@ mshare_init(void)
 	if (ret)
 		sysfs_remove_mount_point(fs_kobj, "mshare");
 
+	sysctl_mshare_size = PGDIR_SIZE;
 	return ret;
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 13/14] mm/mshare: Enforce mshare'd region permissions
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (11 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 12/14] mm/mshare: Add a proc file with mshare alignment/size information Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 16:05 ` [PATCH v1 14/14] mm/mshare: Copy PTEs to host mm Khalid Aziz
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

When a process creates an mshare region, it specifies the
permissions other processes are allowed when accessing this region,
as well as permissions for the file created in msharefs that allows
other processes to get information on the mshare'd region. Ensure
those permissions are enforced for other tasks accessing this
region.
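
The enforcement reduces to a subset check against the access mode
recorded at creation time; in sketch form (mirroring the diff below):

	/* Keep only the access-mode bits the attacher asked for. */
	oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
	/* They must be a subset of what the creator allowed. */
	if (oflag != (oflag & info->flags))
		return -EPERM;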

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 88c7cefc933d..fba31f3c190f 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -24,7 +24,7 @@
 
 struct mshare_data {
 	struct mm_struct *mm, *host_mm;
-	mode_t mode;
+	int flags;
 	refcount_t refcnt;
 };
 
@@ -131,7 +131,7 @@ static struct dentry
 }
 
 static struct inode
-*msharefs_get_inode(struct super_block *sb, int mode)
+*msharefs_get_inode(struct super_block *sb, mode_t mode)
 {
 	struct inode *inode = new_inode(sb);
 
@@ -161,19 +161,22 @@ static struct inode
 }
 
 static int
-mshare_file_create(struct filename *fname, int flags,
+mshare_file_create(struct filename *fname, mode_t mode,
 			struct mshare_data *info)
 {
 	struct inode *inode;
 	struct dentry *root, *dentry;
 	int err = 0;
+	mode_t fmode;
 
 	root = msharefs_sb->s_root;
 
 	/*
-	 * This is a read only file.
+	 * This is a read only file so mask out all other bits. Make sure
+	 * it is readable by owner at least.
 	 */
-	inode = msharefs_get_inode(msharefs_sb, S_IFREG | 0400);
+	fmode = (mode & 0444) | S_IFREG | 0400;
+	inode = msharefs_get_inode(msharefs_sb, fmode);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
@@ -247,6 +250,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		dput(dentry);
 		goto err_unlock_inode;
 	}
+	oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
 
 	if (dentry) {
 		unsigned long mapaddr, prot = PROT_NONE;
@@ -276,7 +280,11 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		/*
 		 * Map in the address range as anonymous mappings
 		 */
-		oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
+		if (oflag != (oflag & info->flags)) {
+			err = -EPERM;
+			goto err_unlock_inode;
+		}
+
 		if (oflag & O_RDONLY)
 			prot |= PROT_READ;
 		else if (oflag & O_WRONLY)
@@ -331,7 +339,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 			new_mm->task_size--;
 		info->mm = new_mm;
 		info->host_mm = old_mm;
-		info->mode = mode;
+		info->flags = oflag;
 		refcount_set(&info->refcnt, 1);
 
 		/*
@@ -412,7 +420,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		mmap_write_unlock(old_mm);
 		mmap_write_unlock(new_mm);
 
-		err = mshare_file_create(fname, oflag, info);
+		err = mshare_file_create(fname, mode, info);
 		if (err)
 			goto free_info;
 	}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v1 14/14] mm/mshare: Copy PTEs to host mm
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (12 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 13/14] mm/mshare: Enforce mshare'd region permissions Khalid Aziz
@ 2022-04-11 16:05 ` Khalid Aziz
  2022-04-11 17:37 ` [PATCH v1 00/14] Add support for shared PTEs across processes Matthew Wilcox
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 16:05 UTC (permalink / raw)
  To: akpm, willy
  Cc: Khalid Aziz, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, ebiederm, hagen, jack, keescook, kirill, kucharsk,
	linkinjeon, linux-fsdevel, linux-kernel, linux-mm, longpeng2,
	luto, markhemm, pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

VMAs for shared addresses are hosted by a separate host mm. Copy the
original PTEs from the donor process to the host mm so the PTEs are
maintained independently of the donor process.
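
Conceptually this replaces the earlier raw pgd copy with a per-VMA
walk; a sketch of the new flow (error handling omitted, names as in
the diff below):

	/* vma starts at the first donor VMA in [addr, addr + len) */
	while (vma && vma->vm_start < addr + len) {
		new_vma = vm_area_dup(vma);	/* copy donor VMA */
		new_vma->vm_mm = new_mm;	/* host mm owns the copy */
		insert_vm_struct(new_mm, new_vma);
		/* Fill the host mm's page tables from the donor's PTEs. */
		mshare_copy_ptes(new_vma, vma);
		vma = vma->vm_next;
	}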

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/mshare.c        | 14 +++++---------
 3 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d9456d424202..78c22891a792 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1845,6 +1845,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
+int
+mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 			  struct mmu_notifier_range *range, pte_t **ptepp,
 			  pmd_t **pmdpp, spinlock_t **ptlp);
diff --git a/mm/memory.c b/mm/memory.c
index e7c5bc6f8836..9010d68f053a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1234,6 +1234,54 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	return 0;
 }
 
+/*
+ * Copy PTEs for mshare'd pages.
+ * This code is based upon copy_page_range()
+ */
+int
+mshare_copy_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
+{
+	pgd_t *src_pgd, *dst_pgd;
+	unsigned long next;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
+	struct mmu_notifier_range range;
+	int ret = 0;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
+				0, src_vma, src_mm, addr, end);
+	mmu_notifier_invalidate_range_start(&range);
+	/*
+	 * Disabling preemption is not needed for the write side, as
+	 * the read side doesn't spin, but goes to the mmap_lock.
+	 *
+	 * Use the raw variant of the seqcount_t write API to avoid
+	 * lockdep complaining about preemptibility.
+	 */
+	mmap_assert_write_locked(src_mm);
+	raw_write_seqcount_begin(&src_mm->write_protect_seq);
+
+	dst_pgd = pgd_offset(dst_mm, addr);
+	src_pgd = pgd_offset(src_mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(src_pgd))
+			continue;
+		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
+					    addr, next))) {
+			ret = -ENOMEM;
+			break;
+		}
+	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	raw_write_seqcount_end(&src_mm->write_protect_seq);
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
diff --git a/mm/mshare.c b/mm/mshare.c
index fba31f3c190f..a399234bf106 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -385,7 +385,6 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 			 * Copy this vma over to host mm
 			 */
 			vma->vm_private_data = info;
-			vma->vm_mm = new_mm;
 			vma->vm_flags |= VM_SHARED_PT;
 			new_vma = vm_area_dup(vma);
 			if (!new_vma) {
@@ -394,6 +393,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 				err = -ENOMEM;
 				goto free_info;
 			}
+			new_vma->vm_mm = new_mm;
 			err = insert_vm_struct(new_mm, new_vma);
 			if (err) {
 				mmap_write_unlock(new_mm);
@@ -402,17 +402,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 				goto free_info;
 			}
 
+			/* Copy over current PTEs */
+			err = mshare_copy_ptes(new_vma, vma);
+			if (err != 0)
+				goto free_info;
 			vma = vma->vm_next;
 		}
 
-		/*
-		 * Copy over current PTEs
-		 */
-		myaddr = addr;
-		while (myaddr < new_mm->task_size) {
-			*pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
-			myaddr += PGDIR_SIZE;
-		}
 		/*
 		 * TODO: Free the corresponding page table in calling
 		 * process
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (13 preceding siblings ...)
  2022-04-11 16:05 ` [PATCH v1 14/14] mm/mshare: Copy PTEs to host mm Khalid Aziz
@ 2022-04-11 17:37 ` Matthew Wilcox
  2022-04-11 18:51   ` Dave Hansen
  2022-04-11 19:52   ` Khalid Aziz
  2022-04-11 18:47 ` Dave Hansen
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 37+ messages in thread
From: Matthew Wilcox @ 2022-04-11 17:37 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On Mon, Apr 11, 2022 at 10:05:44AM -0600, Khalid Aziz wrote:
> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.

All of this is true.  However, I've found a lot of people don't see this
as compelling.  I've had more success explaining this from a different
direction:

--- 8< ---

Linux supports processes which share all of their address space (threads)
and processes that share none of their address space (tasks).  We propose
a useful intermediate model where two or more cooperating processes
can choose to share portions of their address space with each other.
The shared portion is referred to by a file descriptor which processes
can choose to attach to their own address space.

Modifications to the shared region affect all processes sharing
that region, just as changes by one thread affect all threads in a
multithreaded program.  This implies a certain level of trust between
the different processes (ie malicious processes should not be allowed
access to the mshared region).

--- 8< ---

Another argument that MM developers find compelling is that we can reduce
some of the complexity in hugetlbfs where it has the ability to share
page tables between processes.

One objection that was raised is that the mechanism for starting the
shared region is a bit clunky.  Did you investigate the proposed approach
of creating an empty address space, attaching to it and using an fd-based
mmap to modify its contents?

> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
> 
> mshare_unlink() returns 0 on success or -1 on error.

Can you explain why this is a syscall instead of being a library
function which does

	int dirfd = open("/sys/fs/mshare", O_RDONLY);
	err = unlinkat(dirfd, name, 0);
	close(dirfd);
	return err;

Does msharefs support creating directories, so that we can use file
permissions to limit who can see the sharable files?  Or is it strictly
a single-level-deep hierarchy?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (14 preceding siblings ...)
  2022-04-11 17:37 ` [PATCH v1 00/14] Add support for shared PTEs across processes Matthew Wilcox
@ 2022-04-11 18:47 ` Dave Hansen
  2022-04-11 20:10 ` Eric W. Biederman
  2022-05-30 10:48 ` Barry Song
  17 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2022-04-11 18:47 UTC (permalink / raw)
  To: Khalid Aziz, akpm, willy
  Cc: aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 4/11/22 09:05, Khalid Aziz wrote:
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
> 
> - Starting address must be aligned to pgdir size (512GB on x86_64).
>   This alignment value can be looked up in /proc/sys/vm//mshare_size
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.

What does this mean in practice?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare
  2022-04-11 16:05 ` [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare Khalid Aziz
@ 2022-04-11 18:48   ` Dave Hansen
  2022-04-11 20:39     ` Khalid Aziz
  2022-05-30 11:11   ` Barry Song
  2022-05-31  3:46   ` Barry Song
  2 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2022-04-11 18:48 UTC (permalink / raw)
  To: Khalid Aziz, akpm, willy
  Cc: aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 4/11/22 09:05, Khalid Aziz wrote:
> This patch adds basic page table sharing across tasks by making
> mshare syscall. It does this by creating a new mm_struct which
> hosts the shared vmas and page tables. This mm_struct is
> maintained as long as there is at least one task using the mshare'd
> range. It is cleaned up by the last mshare_unlink syscall.

This sounds like a really good idea because it (in theory) totally
separates the lifetime of the *source* of the page tables from the
lifetime of the process that creates the mshare.

> diff --git a/mm/internal.h b/mm/internal.h
> index cf50a471384e..68f82f0f8b66 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>  int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>  		      unsigned long addr, int page_nid, int *flags);
>  
> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
> +			unsigned long *addrp);
>  static inline bool vma_is_shared(const struct vm_area_struct *vma)
>  {
>  	return vma->vm_flags & VM_SHARED_PT;
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..c77c0d643ea8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  			   unsigned int flags, struct pt_regs *regs)
>  {
>  	vm_fault_t ret;
> +	bool shared = false;
>  
>  	__set_current_state(TASK_RUNNING);
>  
> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	/* do counter updates before entering really critical section. */
>  	check_sync_rss_stat(current);
>  
> +	if (unlikely(vma_is_shared(vma))) {
> +		ret = find_shared_vma(&vma, &address);
> +		if (ret)
> +			return ret;
> +		if (!vma)
> +			return VM_FAULT_SIGSEGV;
> +		shared = true;
> +	}
> +
>  	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>  					    flags & FAULT_FLAG_INSTRUCTION,
>  					    flags & FAULT_FLAG_REMOTE))
> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	else
>  		ret = __handle_mm_fault(vma, address, flags);
>  
> +	/*
> +	 * Release the read lock on shared VMA's parent mm unless
> +	 * __handle_mm_fault released the lock already.
> +	 * __handle_mm_fault sets VM_FAULT_RETRY in return value if
> +	 * it released mmap lock. If lock was released, that implies
> +	 * the lock would have been released on task's original mm if
> +	 * this were not a shared PTE vma. To keep lock state consistent,
> +	 * make sure to release the lock on task's original mm
> +	 */
> +	if (shared) {
> +		int release_mmlock = 1;
> +
> +		if (!(ret & VM_FAULT_RETRY)) {
> +			mmap_read_unlock(vma->vm_mm);
> +			release_mmlock = 0;
> +		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
> +			(flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +			mmap_read_unlock(vma->vm_mm);
> +			release_mmlock = 0;
> +		}
> +
> +		if (release_mmlock)
> +			mmap_read_unlock(current->mm);
> +	}

Are we guaranteed that current->mm == the original vma->vm_mm?  Just a
quick scan of handle_mm_fault() users shows a few suspect ones like
hmm_range_fault() or iommu_v2.c::do_fault().

> diff --git a/mm/mshare.c b/mm/mshare.c
> index cd2f7ad24d9d..d1896adcb00f 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -17,18 +17,49 @@
>  #include <linux/pseudo_fs.h>
>  #include <linux/fileattr.h>
>  #include <linux/refcount.h>
> +#include <linux/mman.h>
>  #include <linux/sched/mm.h>
>  #include <uapi/linux/magic.h>
>  #include <uapi/linux/limits.h>
>  
>  struct mshare_data {
> -	struct mm_struct *mm;
> +	struct mm_struct *mm, *host_mm;
>  	mode_t mode;
>  	refcount_t refcnt;
>  };
>  
>  static struct super_block *msharefs_sb;
>  
> +/* Returns holding the host mm's lock for read.  Caller must release. */
> +vm_fault_t
> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
> +{
> +	struct vm_area_struct *vma, *guest = *vmap;
> +	struct mshare_data *info = guest->vm_private_data;
> +	struct mm_struct *host_mm = info->mm;
> +	unsigned long host_addr;
> +	pgd_t *pgd, *guest_pgd;
> +
> +	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
> +	pgd = pgd_offset(host_mm, host_addr);
> +	guest_pgd = pgd_offset(current->mm, *addrp);
> +	if (!pgd_same(*guest_pgd, *pgd)) {
> +		set_pgd(guest_pgd, *pgd);
> +		return VM_FAULT_NOPAGE;
> +	}

Is digging around in the other process's page tables OK without holding
any locks?

> +	*addrp = host_addr;
> +	mmap_read_lock(host_mm);
> +	vma = find_vma(host_mm, host_addr);
> +
> +	/* XXX: expand stack? */
> +	if (vma && vma->vm_start > host_addr)
> +		vma = NULL;
> +
> +	*vmap = vma;
> +	return 0;
> +}
> +
>  static void
>  msharefs_evict_inode(struct inode *inode)
>  {
> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>  		unsigned long, len, int, oflag, mode_t, mode)
>  {
>  	struct mshare_data *info;
> -	struct mm_struct *mm;
>  	struct filename *fname = getname(name);
>  	struct dentry *dentry;
>  	struct inode *inode;
>  	struct qstr namestr;
> +	struct vm_area_struct *vma, *next, *new_vma;
> +	struct mm_struct *new_mm;
> +	unsigned long end;
>  	int err = PTR_ERR(fname);
>  
>  	/*
> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>  	if (IS_ERR(fname))
>  		goto err_out;
>  
> +	end = addr + len;
> +
>  	/*
>  	 * Does this mshare entry exist already? If it does, calling
>  	 * mshare with O_EXCL|O_CREAT is an error
> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>  	inode_lock(d_inode(msharefs_sb->s_root));
>  	dentry = d_lookup(msharefs_sb->s_root, &namestr);
>  	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
> +		inode = d_inode(dentry);
>  		err = -EEXIST;
>  		dput(dentry);
>  		goto err_unlock_inode;
>  	}
>  
>  	if (dentry) {
> +		unsigned long mapaddr, prot = PROT_NONE;
> +
>  		inode = d_inode(dentry);
>  		if (inode == NULL) {
> +			mmap_write_unlock(current->mm);
>  			err = -EINVAL;
>  			goto err_out;
>  		}
>  		info = inode->i_private;
> -		refcount_inc(&info->refcnt);
>  		dput(dentry);
> +
> +		/*
> +		 * Map in the address range as anonymous mappings
> +		 */
> +		oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
> +		if (oflag & O_RDONLY)
> +			prot |= PROT_READ;
> +		else if (oflag & O_WRONLY)
> +			prot |= PROT_WRITE;
> +		else if (oflag & O_RDWR)
> +			prot |= (PROT_READ | PROT_WRITE);
> +		mapaddr = vm_mmap(NULL, addr, len, prot,
> +				MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
> +		if (IS_ERR((void *)mapaddr)) {
> +			err = -EINVAL;
> +			goto err_out;
> +		}
> +
> +		refcount_inc(&info->refcnt);
> +
> +		/*
> +		 * Now that we have mmap'd the mshare'd range, update vma
> +		 * flags and vm_mm pointer for this mshare'd range.
> +		 */
> +		mmap_write_lock(current->mm);
> +		vma = find_vma(current->mm, addr);
> +		if (vma && vma->vm_start < addr) {
> +			mmap_write_unlock(current->mm);
> +			err = -EINVAL;
> +			goto err_out;
> +		}

How do you know that this is the same anonymous VMA that you set up
above?  Couldn't it have been unmapped and remapped to be something
random before the mmap_write_lock() is reacquired?

> +		while (vma && vma->vm_start < (addr + len)) {
> +			vma->vm_private_data = info;
> +			vma->vm_mm = info->mm;
> +			vma->vm_flags |= VM_SHARED_PT;
> +			next = vma->vm_next;
> +			vma = next;
> +		}

This vma is still in the mm->mm_rb tree, right?  I'm kinda surprised
that it's OK to have a VMA in mm->mm_rb have vma->vm_mm != mm.

>  	} else {
> -		mm = mm_alloc();
> -		if (!mm)
> +		unsigned long myaddr;
> +		struct mm_struct *old_mm;
> +
> +		old_mm = current->mm;
> +		new_mm = mm_alloc();
> +		if (!new_mm)
>  			return -ENOMEM;
>  		info = kzalloc(sizeof(*info), GFP_KERNEL);
>  		if (!info) {
>  			err = -ENOMEM;
>  			goto err_relmm;
>  		}
> -		mm->mmap_base = addr;
> -		mm->task_size = addr + len;
> -		if (!mm->task_size)
> -			mm->task_size--;
> -		info->mm = mm;
> +		new_mm->mmap_base = addr;
> +		new_mm->task_size = addr + len;
> +		if (!new_mm->task_size)
> +			new_mm->task_size--;
> +		info->mm = new_mm;
> +		info->host_mm = old_mm;
>  		info->mode = mode;
>  		refcount_set(&info->refcnt, 1);
> +
> +		/*
> +		 * VMAs for this address range may or may not exist.
> +		 * If VMAs exist, they should be marked as shared at
> +		 * this point and page table info should be copied
> +		 * over to newly created mm_struct. TODO: If VMAs do not
> +		 * exist, create them and mark them as shared.
> +		 */

At this point, there are just too many TODO's in this series to look at
it seriously.  I think what you have here is an entertaining
proof-of-concept, but it's looking to me to be obviously still RFC
quality.  Do you seriously think anyone could even *think* about
applying this series at this point?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 17:37 ` [PATCH v1 00/14] Add support for shared PTEs across processes Matthew Wilcox
@ 2022-04-11 18:51   ` Dave Hansen
  2022-04-11 19:08     ` Matthew Wilcox
  2022-04-11 19:52   ` Khalid Aziz
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2022-04-11 18:51 UTC (permalink / raw)
  To: Matthew Wilcox, Khalid Aziz
  Cc: akpm, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 4/11/22 10:37, Matthew Wilcox wrote:
> Another argument that MM developers find compelling is that we can reduce
> some of the complexity in hugetlbfs where it has the ability to share
> page tables between processes.

When could this complexity reduction actually happen in practice?  Can
this mshare thingy be somehow dropped in underneath the existing
hugetlbfs implementation?  Or would userspace need to change?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 18:51   ` Dave Hansen
@ 2022-04-11 19:08     ` Matthew Wilcox
  0 siblings, 0 replies; 37+ messages in thread
From: Matthew Wilcox @ 2022-04-11 19:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Khalid Aziz, akpm, aneesh.kumar, arnd, 21cnbao, corbet,
	dave.hansen, david, ebiederm, hagen, jack, keescook, kirill,
	kucharsk, linkinjeon, linux-fsdevel, linux-kernel, linux-mm,
	longpeng2, luto, markhemm, pcc, rppt, sieberf, sjpark, surenb,
	tst, yzaikin

On Mon, Apr 11, 2022 at 11:51:46AM -0700, Dave Hansen wrote:
> On 4/11/22 10:37, Matthew Wilcox wrote:
> > Another argument that MM developers find compelling is that we can reduce
> > some of the complexity in hugetlbfs where it has the ability to share
> > page tables between processes.
> 
> When could this complexity reduction actually happen in practice?  Can
> this mshare thingy be somehow dropped in underneath the existing
> hugetlbfs implementation?  Or would userspace need to change?

Userspace needs to opt in to mshare, so there's going to be a transition
period where hugetlbfs still needs to support it, but I have the
impression that the users who need page table sharing are pretty
specialised and we'll be able to find them all before disabling it.

I don't think we can make it transparent to userspace, but I'll noodle
on that a bit.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 17:37 ` [PATCH v1 00/14] Add support for shared PTEs across processes Matthew Wilcox
  2022-04-11 18:51   ` Dave Hansen
@ 2022-04-11 19:52   ` Khalid Aziz
  1 sibling, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 19:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 4/11/22 11:37, Matthew Wilcox wrote:
> On Mon, Apr 11, 2022 at 10:05:44AM -0600, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
> 
> All of this is true.  However, I've found a lot of people don't see this
> as compelling.  I've had more success explaining this from a different
> direction:
> 
> --- 8< ---
> 
> Linux supports processes which share all of their address space (threads)
> and processes that share none of their address space (tasks).  We propose
> a useful intermediate model where two or more cooperating processes
> can choose to share portions of their address space with each other.
> The shared portion is referred to by a file descriptor which processes
> can choose to attach to their own address space.
> 
> Modifications to the shared region affect all processes sharing
> that region, just as changes by one thread affect all threads in a
> multithreaded program.  This implies a certain level of trust between
> the different processes (ie malicious processes should not be allowed
> access to the mshared region).
> 
> --- 8< ---
> 
> Another argument that MM developers find compelling is that we can reduce
> some of the complexity in hugetlbfs where it has the ability to share
> page tables between processes.

This all sounds reasonable.

> 
> One objection that was raised is that the mechanism for starting the
> shared region is a bit clunky.  Did you investigate the proposed approach
> of creating an empty address space, attaching to it and using an fd-based
> mmap to modify its contents?

I want to make sure I understand this correctly. In the example I gave,
the process creating the mshare'd region maps in the address space
first, possibly using mmap(). It then calls mshare() to share this
already-mapped region. Are you suggesting that the process be able to
call mshare() before mapping in the address range, and then map things
into that address range later? If yes, it is my intent to support that
after the initial implementation, as an expansion of the original
concept.

> 
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
> 
> Can you explain why this is a syscall instead of being a library
> function which does
> 
> 	int dirfd = open("/sys/fs/mshare", O_RDONLY);
> 	err = unlinkat(dirfd, name, 0);
> 	close(dirfd);
> 	return err;

mshare_unlink can be a simple unlink on the file in msharefs. The API
will be asymmetrical in that creating an mshare'd region is a syscall
while tearing it down is a file op. I don't mind saving a syscall slot.
Would you prefer it that way?

> 
> Does msharefs support creating directories, so that we can use file
> permissions to limit who can see the sharable files?  Or is it strictly
> a single-level-deep hierarchy?
> 

For now msharefs is single-level-deep. It can be expanded to support
directories to limit visibility of filenames. Would you prefer to see it
support directories from the beginning, or can that be a future
expansion of this feature?

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (15 preceding siblings ...)
  2022-04-11 18:47 ` Dave Hansen
@ 2022-04-11 20:10 ` Eric W. Biederman
  2022-04-11 22:21   ` Khalid Aziz
  2022-05-30 10:48 ` Barry Song
  17 siblings, 1 reply; 37+ messages in thread
From: Eric W. Biederman @ 2022-04-11 20:10 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

Khalid Aziz <khalid.aziz@oracle.com> writes:

> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.
>
> Some of the field deployments commonly see memory pages shared across
> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> bytes long which is very small compared to the 4K page size. When 2000
> processes map the same page in their address space, each one of them
> requires 8 bytes for its PTE and together that adds up to 8K of memory
> just to hold the PTEs for one 4K page. On a database server with 300GB
> SGA, a system carsh was seen with out-of-memory condition when 1500+
> clients tried to share this SGA even though the system had 512GB of
> memory. On this server, in the worst case scenario of all 1500
> processes mapping every page from SGA would have required 878GB+ for
> just the PTEs. If these PTEs could be shared, amount of memory saved
> is very significant.
>
> This patch series implements a mechanism in kernel to allow userspace
> processes to opt into sharing PTEs. It adds two new system calls - (1)
> mshare(), which can be used by a process to create a region (we will
> call it mshare'd region) which can be used by other processes to map
> same pages using shared PTEs, (2) mshare_unlink() which is used to
> detach from the mshare'd region. Once an mshare'd region is created,
> other process(es), assuming they have the right permissions, can make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>

>
> API
> ===
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>

Please don't add system calls that take names.
 
Please just open objects on the filesystem and allow multi-instances of
the filesystem.  Otherwise someone is going to have to come along later
and implement namespace magic to deal with your new system calls.  You
already have a filesystem all that is needed to avoid having to
introduce namespace magic is to simply allow multiple instances of the
filesystem to be mounted.

On that note.  Since you have a filesystem, introduce a well known
name for a directory and in that directory place all of the information
and possibly control files for your filesystem.  No need for proc
files and the like, and if at some point you have mount options that
allow the information to be changed you can have different mounts
with different values present.


This is just me.  But I find it weird that you don't use mmap
to place the shared area from the mshare fd into your address space.

I think I would do:
	// Establish the mshare region
	addr = mmap(NULL, PGDIR_SIZE, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_MSHARE, msharefd, 0);

	// Remove the mshare region
        addr2 = mmap(addr, PGDIR_SIZE, PROT_NONE,
        		MAP_FIXED | MAP_MUNSHARE, msharefd, 0);

I could see a point in using separate system calls instead of
adding MAP_MSHARE and MAP_MUNSHARE flags.

What are the locking implications of taking a page fault in the shared
region?

Is it a noop or is it going to make some of the nasty locking we have in
the kernel for things like truncates worse?

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare
  2022-04-11 18:48   ` Dave Hansen
@ 2022-04-11 20:39     ` Khalid Aziz
  0 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 20:39 UTC (permalink / raw)
  To: Dave Hansen, akpm, willy
  Cc: aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen, david,
	ebiederm, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 4/11/22 12:48, Dave Hansen wrote:
> On 4/11/22 09:05, Khalid Aziz wrote:
>> This patch adds basic page table sharing across tasks by making
>> mshare syscall. It does this by creating a new mm_struct which
>> hosts the shared vmas and page tables. This mm_struct is
>> maintained as long as there is at least one task using the mshare'd
>> range. It is cleaned up by the last mshare_unlink syscall.
> 
> This sounds like a really good idea because it (in theory) totally
> separates the lifetime of the *source* of the page tables from the
> lifetime of the process that creates the mshare.
> 
>> diff --git a/mm/internal.h b/mm/internal.h
>> index cf50a471384e..68f82f0f8b66 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>>   int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>>   		      unsigned long addr, int page_nid, int *flags);
>>   
>> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
>> +			unsigned long *addrp);
>>   static inline bool vma_is_shared(const struct vm_area_struct *vma)
>>   {
>>   	return vma->vm_flags & VM_SHARED_PT;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c125c4969913..c77c0d643ea8 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>   			   unsigned int flags, struct pt_regs *regs)
>>   {
>>   	vm_fault_t ret;
>> +	bool shared = false;
>>   
>>   	__set_current_state(TASK_RUNNING);
>>   
>> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>   	/* do counter updates before entering really critical section. */
>>   	check_sync_rss_stat(current);
>>   
>> +	if (unlikely(vma_is_shared(vma))) {
>> +		ret = find_shared_vma(&vma, &address);
>> +		if (ret)
>> +			return ret;
>> +		if (!vma)
>> +			return VM_FAULT_SIGSEGV;
>> +		shared = true;
>> +	}
>> +
>>   	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>   					    flags & FAULT_FLAG_INSTRUCTION,
>>   					    flags & FAULT_FLAG_REMOTE))
>> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>   	else
>>   		ret = __handle_mm_fault(vma, address, flags);
>>   
>> +	/*
>> +	 * Release the read lock on shared VMA's parent mm unless
>> +	 * __handle_mm_fault released the lock already.
>> +	 * __handle_mm_fault sets VM_FAULT_RETRY in return value if
>> +	 * it released mmap lock. If lock was released, that implies
>> +	 * the lock would have been released on task's original mm if
>> +	 * this were not a shared PTE vma. To keep lock state consistent,
>> +	 * make sure to release the lock on task's original mm
>> +	 */
>> +	if (shared) {
>> +		int release_mmlock = 1;
>> +
>> +		if (!(ret & VM_FAULT_RETRY)) {
>> +			mmap_read_unlock(vma->vm_mm);
>> +			release_mmlock = 0;
>> +		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
>> +			(flags & FAULT_FLAG_RETRY_NOWAIT)) {
>> +			mmap_read_unlock(vma->vm_mm);
>> +			release_mmlock = 0;
>> +		}
>> +
>> +		if (release_mmlock)
>> +			mmap_read_unlock(current->mm);
>> +	}
> 
> Are we guaranteed that current->mm == the original vma->vm_mm?  Just a
> quick scan of handle_mm_fault() users shows a few suspect ones like
> hmm_range_fault() or iommu_v2.c::do_fault().

You are probably right. The safe thing to do would be to save the
original vma->vm_mm before calling find_shared_vma() and use this saved
value to unlock later if needed. I will fix that.

> 
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index cd2f7ad24d9d..d1896adcb00f 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -17,18 +17,49 @@
>>   #include <linux/pseudo_fs.h>
>>   #include <linux/fileattr.h>
>>   #include <linux/refcount.h>
>> +#include <linux/mman.h>
>>   #include <linux/sched/mm.h>
>>   #include <uapi/linux/magic.h>
>>   #include <uapi/linux/limits.h>
>>   
>>   struct mshare_data {
>> -	struct mm_struct *mm;
>> +	struct mm_struct *mm, *host_mm;
>>   	mode_t mode;
>>   	refcount_t refcnt;
>>   };
>>   
>>   static struct super_block *msharefs_sb;
>>   
>> +/* Returns holding the host mm's lock for read.  Caller must release. */
>> +vm_fault_t
>> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
>> +{
>> +	struct vm_area_struct *vma, *guest = *vmap;
>> +	struct mshare_data *info = guest->vm_private_data;
>> +	struct mm_struct *host_mm = info->mm;
>> +	unsigned long host_addr;
>> +	pgd_t *pgd, *guest_pgd;
>> +
>> +	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
>> +	pgd = pgd_offset(host_mm, host_addr);
>> +	guest_pgd = pgd_offset(current->mm, *addrp);
>> +	if (!pgd_same(*guest_pgd, *pgd)) {
>> +		set_pgd(guest_pgd, *pgd);
>> +		return VM_FAULT_NOPAGE;
>> +	}
> 
> Is digging around in the other process's page tables OK without holding
> any locks?

current->mm should already be locked when handle_mm_fault() is called.
"mmap_read_lock(host_mm)" should be moved up to before the pgd_offset()
calls. I will fix that.

> 
>> +	*addrp = host_addr;
>> +	mmap_read_lock(host_mm);
>> +	vma = find_vma(host_mm, host_addr);
>> +
>> +	/* XXX: expand stack? */
>> +	if (vma && vma->vm_start > host_addr)
>> +		vma = NULL;
>> +
>> +	*vmap = vma;
>> +	return 0;
>> +}
>> +
>>   static void
>>   msharefs_evict_inode(struct inode *inode)
>>   {
>> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>   		unsigned long, len, int, oflag, mode_t, mode)
>>   {
>>   	struct mshare_data *info;
>> -	struct mm_struct *mm;
>>   	struct filename *fname = getname(name);
>>   	struct dentry *dentry;
>>   	struct inode *inode;
>>   	struct qstr namestr;
>> +	struct vm_area_struct *vma, *next, *new_vma;
>> +	struct mm_struct *new_mm;
>> +	unsigned long end;
>>   	int err = PTR_ERR(fname);
>>   
>>   	/*
>> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>   	if (IS_ERR(fname))
>>   		goto err_out;
>>   
>> +	end = addr + len;
>> +
>>   	/*
>>   	 * Does this mshare entry exist already? If it does, calling
>>   	 * mshare with O_EXCL|O_CREAT is an error
>> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>   	inode_lock(d_inode(msharefs_sb->s_root));
>>   	dentry = d_lookup(msharefs_sb->s_root, &namestr);
>>   	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
>> +		inode = d_inode(dentry);
>>   		err = -EEXIST;
>>   		dput(dentry);
>>   		goto err_unlock_inode;
>>   	}
>>   
>>   	if (dentry) {
>> +		unsigned long mapaddr, prot = PROT_NONE;
>> +
>>   		inode = d_inode(dentry);
>>   		if (inode == NULL) {
>> +			mmap_write_unlock(current->mm);
>>   			err = -EINVAL;
>>   			goto err_out;
>>   		}
>>   		info = inode->i_private;
>> -		refcount_inc(&info->refcnt);
>>   		dput(dentry);
>> +
>> +		/*
>> +		 * Map in the address range as anonymous mappings
>> +		 */
>> +		oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
>> +		if (oflag & O_RDONLY)
>> +			prot |= PROT_READ;
>> +		else if (oflag & O_WRONLY)
>> +			prot |= PROT_WRITE;
>> +		else if (oflag & O_RDWR)
>> +			prot |= (PROT_READ | PROT_WRITE);
>> +		mapaddr = vm_mmap(NULL, addr, len, prot,
>> +				MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
>> +		if (IS_ERR((void *)mapaddr)) {
>> +			err = -EINVAL;
>> +			goto err_out;
>> +		}
>> +
>> +		refcount_inc(&info->refcnt);
>> +
>> +		/*
>> +		 * Now that we have mmap'd the mshare'd range, update vma
>> +		 * flags and vm_mm pointer for this mshare'd range.
>> +		 */
>> +		mmap_write_lock(current->mm);
>> +		vma = find_vma(current->mm, addr);
>> +		if (vma && vma->vm_start < addr) {
>> +			mmap_write_unlock(current->mm);
>> +			err = -EINVAL;
>> +			goto err_out;
>> +		}
> 
> How do you know that this is the same anonymous VMA that you set up
> above?  Couldn't it have been unmapped and remapped to be something
> random before the mmap_write_lock() is reacquired?

Good point. The one check I have after find_vma() is not enough. I
need to add more checks to validate this vma.

> 
>> +		while (vma && vma->vm_start < (addr + len)) {
>> +			vma->vm_private_data = info;
>> +			vma->vm_mm = info->mm;
>> +			vma->vm_flags |= VM_SHARED_PT;
>> +			next = vma->vm_next;
>> +			vma = next;
>> +		}
> 
> This vma is still in the mm->mm_rb tree, right?  I'm kinda surprised
> that it's OK to have a VMA in mm->mm_rb have vma->vm_mm != mm.

I will look into what needs to be fixed up here. One possibility is to
not change vm_mm. I think I can work without changing vm_mm for donor
or client processes, as long as vm_mm of the mshare'd vmas in the host
mm points to the host mm.

> 
>>   	} else {
>> -		mm = mm_alloc();
>> -		if (!mm)
>> +		unsigned long myaddr;
>> +		struct mm_struct *old_mm;
>> +
>> +		old_mm = current->mm;
>> +		new_mm = mm_alloc();
>> +		if (!new_mm)
>>   			return -ENOMEM;
>>   		info = kzalloc(sizeof(*info), GFP_KERNEL);
>>   		if (!info) {
>>   			err = -ENOMEM;
>>   			goto err_relmm;
>>   		}
>> -		mm->mmap_base = addr;
>> -		mm->task_size = addr + len;
>> -		if (!mm->task_size)
>> -			mm->task_size--;
>> -		info->mm = mm;
>> +		new_mm->mmap_base = addr;
>> +		new_mm->task_size = addr + len;
>> +		if (!new_mm->task_size)
>> +			new_mm->task_size--;
>> +		info->mm = new_mm;
>> +		info->host_mm = old_mm;
>>   		info->mode = mode;
>>   		refcount_set(&info->refcnt, 1);
>> +
>> +		/*
>> +		 * VMAs for this address range may or may not exist.
>> +		 * If VMAs exist, they should be marked as shared at
>> +		 * this point and page table info should be copied
>> +		 * over to newly created mm_struct. TODO: If VMAs do not
>> +		 * exist, create them and mark them as shared.
>> +		 */
> 
> At this point, there are just too many TODO's in this series to look at
> it seriously.  I think what you have here is an entertaining
> proof-of-concept, but it's looking to me to be obviously still RFC
> quality.  Do you seriously think anyone could even *think* about
> applying this series at this point?
> 

Fair enough. Some of the TODOs are meant to be reminders for expansion of functionality (like this one is to support 
calling mshare without having to mmap the address range first), but I should clean these up and I will do that.

I appreciate your taking the time to provide useful feedback.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 20:10 ` Eric W. Biederman
@ 2022-04-11 22:21   ` Khalid Aziz
  0 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-04-11 22:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: akpm, willy, aneesh.kumar, arnd, 21cnbao, corbet, dave.hansen,
	david, hagen, jack, keescook, kirill, kucharsk, linkinjeon,
	linux-fsdevel, linux-kernel, linux-mm, longpeng2, luto, markhemm,
	pcc, rppt, sieberf, sjpark, surenb, tst, yzaikin

On 4/11/22 14:10, Eric W. Biederman wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
> 
>> [ ... cover letter introduction trimmed ... ]
>>
> 
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
> 
> Please don't add system calls that take names.
>   
> Please just open objects on the filesystem and allow multiple instances of
> the filesystem.  Otherwise someone is going to have to come along later
> and implement namespace magic to deal with your new system calls.  You
> already have a filesystem; all that is needed to avoid having to
> introduce namespace magic is to simply allow multiple instances of the
> filesystem to be mounted.

Hi Eric,

Thanks for taking the time to provide feedback.

That sounds reasonable. Is something like this more in line with what you are thinking:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
mshare(fd, start, end, O_RDWR);

This sequence creates the mshare'd region and assigns it an address range (which may or may not be populated with 
mappings yet). Then a client process would do something like:

fd = open("/sys/fs/mshare/testregion", O_RDONLY);
mshare(fd, start, end, O_RDWR);

which maps the mshare'd region into client process's address space.

> 
> On that note: since you have a filesystem, introduce a well-known
> name for a directory and in that directory place all of the information
> and possibly control files for your filesystem.  No need for proc
> files and the like, and if at some point you have mount options that
> allow the information to be changed, you can have different mounts
> with different values present.

So have the kernel mount msharefs at /sys/fs/mshare and create a file /sys/fs/mshare/info. A read from 
/sys/fs/mshare/info returns what /proc/sys/vm/mshare_size in patch 12 returns. Did I understand that correctly?

> 
> 
> This is just me.  But I find it weird that you don't use mmap
> to place the shared area from the mshare fd into your address space.
> 
> I think I would do:
> 	// Establish the mshare region
> 	addr = mmap(NULL, PGDIR_SIZE, PROT_READ | PROT_WRITE,
> 			MAP_SHARED | MAP_MSHARE, msharefd, 0);
> 
> 	// Remove the mshare region
>          addr2 = mmap(addr, PGDIR_SIZE, PROT_NONE,
>          		MAP_FIXED | MAP_MUNSHARE, msharefd, 0);
> 
> I could see a point in using separate system calls instead of
> adding MAP_MSHARE and MAP_MUNSHARE flags.

In that case, we can just do away with the syscalls completely. For instance, to create an mshare'd region, one would:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
addr = mmap(NULL, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
or
addr = mmap(myaddr, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0);

The first mmap using this fd sets the address range, which stays the same throughout the life of the region. To populate 
data in this address range, the process could use mmap/mremap and other mechanisms subsequently.

To map this mshare'd region into its address space, a process would:

struct mshare_info {
         unsigned long start;
         unsigned long size;
} minfo;
fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
read(fd, &minfo, sizeof(struct mshare_info));
addr = mmap(NULL, minfo.size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

When done with the mshare'd region, the process would unlink("/sys/fs/mshare/testregion").

This changes the API significantly, but if it results in a cleaner and more maintainable API, I will make the changes.
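Putting those pieces together, the whole flow might look like the sketch below (nothing here is implemented yet; opening 
with O_RDWR, the 0600 create mode, and the consumer mapping at minfo.start with MAP_FIXED are all my assumptions):

-----------------
        /* Donor: create the region and establish its address range */
        fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL|O_RDWR, 0600);
        addr = mmap(NULL, PGDIR_SIZE, PROT_READ|PROT_WRITE,
                        MAP_SHARED, fd, 0);

        /* Consumer: discover the range, then map it at the same address */
        fd = open("/sys/fs/mshare/testregion", O_RDWR);
        read(fd, &minfo, sizeof(minfo));
        addr = mmap((void *)minfo.start, minfo.size, PROT_READ|PROT_WRITE,
                        MAP_SHARED|MAP_FIXED, fd, 0);

        /* When every user is done, remove the region */
        unlink("/sys/fs/mshare/testregion");
-----------------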

> 
> What are the locking implications of taking a page fault in the shared
> region?
> 
> Is it a noop or is it going to make some of the nasty locking we have in
> the kernel for things like truncates worse?
> 
> Eric

Handling a page fault would require locking the mm of the faulting process as well as the host mm, which would 
synchronize with any other process taking a page fault in the shared region at the same time.
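To illustrate with the code from patch 08, the fault path would do roughly this (a sketch only; error paths and the 
VM_FAULT_RETRY handling are omitted):

        /* handle_mm_fault() is entered with mmap_read_lock(current->mm) held */
        if (unlikely(vma_is_shared(vma))) {
                /*
                 * find_shared_vma() translates the faulting address into
                 * the host mm and takes mmap_read_lock() on the host mm,
                 * so modifications of the shared range (munmap and the
                 * like) are excluded while any sharer is faulting
                 */
                ret = find_shared_vma(&vma, &address);
        }
        ret = __handle_mm_fault(vma, address, flags);
        /* drop the host mm lock, then the task's own mm lock */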

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
                   ` (16 preceding siblings ...)
  2022-04-11 20:10 ` Eric W. Biederman
@ 2022-05-30 10:48 ` Barry Song
  2022-05-30 11:18   ` David Hildenbrand
  2022-06-29 17:40   ` Khalid Aziz
  17 siblings, 2 replies; 37+ messages in thread
From: Barry Song @ 2022-05-30 10:48 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm, pcc,
	Mike Rapoport, sieberf, sjpark, Suren Baghdasaryan, tst,
	Iurii Zaikin

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> [ ... cover letter introduction trimmed ... ]
>
>
> API
> ===
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>
> PTEs are shared at the pgdir level and hence this imposes the following
> requirements on the address and size given to mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64).
>   This alignment value can be looked up in /proc/sys/vm/mshare_size
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to an unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
>
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
>
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         strncpy(addr, "Some random shared text",
>                         sizeof("Some random shared text"));
> -----------------
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         struct mshare_info minfo;
>
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
>
>         if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
>                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
>                                 minfo.size, minfo.start);
>         else
>                 perror("read failed");
>
>         close(fd);
>
>         addr = (void *)minfo.start;
>         err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>                         O_RDWR, 0600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------


Does that mean those shared pages will get page_mapcount()=1?

A big pain for a memory-limited system like a desktop/embedded system is
that reverse mapping takes tons of CPU in the memory reclamation path,
especially for pages mapped by multiple processes. Sometimes we see 100%
CPU utilization on the LRU, scanning to find out whether a page has been
accessed by reading the PTE young bit.

If this patchset results in only one PTE per shared page, it means we
get a significant performance improvement in the kernel LRU,
particularly when free memory approaches the watermarks.

But I don't see how a new system call like mshare() can be used by
those systems, as they might need a more automatic PTE-sharing
mechanism.

BTW, I suppose we are actually able to share PTEs as long as the address
is 2MB aligned?

>
>
> Patch series
> ============
>
> This series of patches is an initial implementation of these two
> system calls. This code implements working basic functionality.
>
> Prototype for the two syscalls is:
>
> SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
>                 unsigned long, len, int, oflag, mode_t, mode)
>
> SYSCALL_DEFINE1(mshare_unlink, const char *, name)
>
> In order to facilitate page table sharing, this implementation adds a
> new in-memory filesystem - msharefs which will be mounted at
> /sys/fs/mshare. When a new mshare'd region is created, a file with
> the name given by initial mshare() call is created under this
> filesystem.  Permissions for this file are given by the "mode"
> parameter to mshare(). The only operation supported on this file is
> read. A read from this file returns a structure containing
> information about mshare'd region - (1) starting virtual address for
> the region, and (2) size of mshare'd region.
>
> A donor process that wants to create an mshare'd region from a
> section of its mapped addresses calls mshare() with O_CREAT oflag.
> mshare() syscall then creates a new mm_struct which will host the
> page tables for the mshare'd region.  vma->vm_private_data for the
> vmas covering address range for this region are updated to point to
> a structure containing pointer to this new mm_struct.  Existing page
> tables are copied over to new mm struct.
>
> A consumer process that wants to map mshare'd region opens
> /sys/fs/mshare/<filename> and reads the starting address and size of
> mshare'd region. It then calls mshare() with these values to map the
> entire region in its address space. Consumer process calls
> mshare_unlink() to terminate its access.
>
>
> Since RFC
> =========
>
> This patch series includes better error handling and more robust
> locking, besides an improved implementation of mshare, since the original
> RFC. It also incorporates feedback on the original RFC. Alignment and
> size requirement are PGDIR_SIZE, the same as in the RFC, and this is open
> to change based upon further feedback. More review of this patch series
> is needed and much appreciated.
>
>
>
> Khalid Aziz (14):
>   mm: Add new system calls mshare, mshare_unlink
>   mm: Add msharefs filesystem
>   mm: Add read for msharefs
>   mm: implement mshare_unlink syscall
>   mm: Add locking to msharefs syscalls
>   mm/msharefs: Check for mounted filesystem
>   mm: Add vm flag for shared PTE
>   mm/mshare: Add basic page table sharing using mshare
>   mm: Do not free PTEs for mshare'd PTEs
>   mm/mshare: Check for mapped vma when mshare'ing existing mshare'd
>     range
>   mm/mshare: unmap vmas in mshare_unlink
>   mm/mshare: Add a proc file with mshare alignment/size information
>   mm/mshare: Enforce mshare'd region permissions
>   mm/mshare: Copy PTEs to host mm
>
>  Documentation/filesystems/msharefs.rst |  19 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   2 +
>  include/linux/mm.h                     |  11 +
>  include/trace/events/mmflags.h         |   3 +-
>  include/uapi/asm-generic/unistd.h      |   7 +-
>  include/uapi/linux/magic.h             |   1 +
>  include/uapi/linux/mman.h              |   5 +
>  kernel/sysctl.c                        |   7 +
>  mm/Makefile                            |   2 +-
>  mm/internal.h                          |   7 +
>  mm/memory.c                            | 105 ++++-
>  mm/mshare.c                            | 587 +++++++++++++++++++++++++
>  12 files changed, 750 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/filesystems/msharefs.rst
>  create mode 100644 mm/mshare.c
>
> --
> 2.32.0
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare
  2022-04-11 16:05 ` [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare Khalid Aziz
  2022-04-11 18:48   ` Dave Hansen
@ 2022-05-30 11:11   ` Barry Song
  2022-06-28 20:11     ` Khalid Aziz
  2022-05-31  3:46   ` Barry Song
  2 siblings, 1 reply; 37+ messages in thread
From: Barry Song @ 2022-05-30 11:11 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm, pcc,
	Mike Rapoport, sieberf, sjpark, Suren Baghdasaryan, tst,
	Iurii Zaikin

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> This patch adds basic page table sharing across tasks by making
> mshare syscall. It does this by creating a new mm_struct which
> hosts the shared vmas and page tables. This mm_struct is
> maintained as long as there is at least one task using the mshare'd
> range. It is cleaned up by the last mshare_unlink syscall.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/internal.h |   2 +
>  mm/memory.c   |  35 ++++++++++
>  mm/mshare.c   | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 214 insertions(+), 13 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index cf50a471384e..68f82f0f8b66 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>  int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>                       unsigned long addr, int page_nid, int *flags);
>
> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
> +                       unsigned long *addrp);
>  static inline bool vma_is_shared(const struct vm_area_struct *vma)
>  {
>         return vma->vm_flags & VM_SHARED_PT;
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..c77c0d643ea8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                            unsigned int flags, struct pt_regs *regs)
>  {
>         vm_fault_t ret;
> +       bool shared = false;
>
>         __set_current_state(TASK_RUNNING);
>
> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         /* do counter updates before entering really critical section. */
>         check_sync_rss_stat(current);
>
> +       if (unlikely(vma_is_shared(vma))) {
> +               ret = find_shared_vma(&vma, &address);
> +               if (ret)
> +                       return ret;
> +               if (!vma)
> +                       return VM_FAULT_SIGSEGV;
> +               shared = true;
> +       }
> +
>         if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>                                             flags & FAULT_FLAG_INSTRUCTION,
>                                             flags & FAULT_FLAG_REMOTE))
> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         else
>                 ret = __handle_mm_fault(vma, address, flags);
>
> +       /*
> +        * Release the read lock on shared VMA's parent mm unless
> +        * __handle_mm_fault released the lock already.
> +        * __handle_mm_fault sets VM_FAULT_RETRY in return value if
> +        * it released mmap lock. If lock was released, that implies
> +        * the lock would have been released on task's original mm if
> +        * this were not a shared PTE vma. To keep lock state consistent,
> +        * make sure to release the lock on task's original mm
> +        */
> +       if (shared) {
> +               int release_mmlock = 1;
> +
> +               if (!(ret & VM_FAULT_RETRY)) {
> +                       mmap_read_unlock(vma->vm_mm);
> +                       release_mmlock = 0;
> +               } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
> +                       (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +                       mmap_read_unlock(vma->vm_mm);
> +                       release_mmlock = 0;
> +               }
> +
> +               if (release_mmlock)
> +                       mmap_read_unlock(current->mm);
> +       }
> +
>         if (flags & FAULT_FLAG_USER) {
>                 mem_cgroup_exit_user_fault();
>                 /*
> diff --git a/mm/mshare.c b/mm/mshare.c
> index cd2f7ad24d9d..d1896adcb00f 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -17,18 +17,49 @@
>  #include <linux/pseudo_fs.h>
>  #include <linux/fileattr.h>
>  #include <linux/refcount.h>
> +#include <linux/mman.h>
>  #include <linux/sched/mm.h>
>  #include <uapi/linux/magic.h>
>  #include <uapi/linux/limits.h>
>
>  struct mshare_data {
> -       struct mm_struct *mm;
> +       struct mm_struct *mm, *host_mm;
>         mode_t mode;
>         refcount_t refcnt;
>  };
>
>  static struct super_block *msharefs_sb;
>
> +/* Returns holding the host mm's lock for read.  Caller must release. */
> +vm_fault_t
> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
> +{
> +       struct vm_area_struct *vma, *guest = *vmap;
> +       struct mshare_data *info = guest->vm_private_data;
> +       struct mm_struct *host_mm = info->mm;
> +       unsigned long host_addr;
> +       pgd_t *pgd, *guest_pgd;
> +
> +       host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
> +       pgd = pgd_offset(host_mm, host_addr);
> +       guest_pgd = pgd_offset(current->mm, *addrp);
> +       if (!pgd_same(*guest_pgd, *pgd)) {
> +               set_pgd(guest_pgd, *pgd);
> +               return VM_FAULT_NOPAGE;
> +       }
> +
> +       *addrp = host_addr;
> +       mmap_read_lock(host_mm);
> +       vma = find_vma(host_mm, host_addr);
> +
> +       /* XXX: expand stack? */
> +       if (vma && vma->vm_start > host_addr)
> +               vma = NULL;
> +
> +       *vmap = vma;
> +       return 0;
> +}
> +
>  static void
>  msharefs_evict_inode(struct inode *inode)
>  {
> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>                 unsigned long, len, int, oflag, mode_t, mode)
>  {
>         struct mshare_data *info;
> -       struct mm_struct *mm;
>         struct filename *fname = getname(name);
>         struct dentry *dentry;
>         struct inode *inode;
>         struct qstr namestr;
> +       struct vm_area_struct *vma, *next, *new_vma;
> +       struct mm_struct *new_mm;
> +       unsigned long end;
>         int err = PTR_ERR(fname);
>
>         /*
> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>         if (IS_ERR(fname))
>                 goto err_out;
>
> +       end = addr + len;
> +
>         /*
>          * Does this mshare entry exist already? If it does, calling
>          * mshare with O_EXCL|O_CREAT is an error
> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>         inode_lock(d_inode(msharefs_sb->s_root));
>         dentry = d_lookup(msharefs_sb->s_root, &namestr);
>         if (dentry && (oflag & (O_EXCL|O_CREAT))) {
> +               inode = d_inode(dentry);
>                 err = -EEXIST;
>                 dput(dentry);
>                 goto err_unlock_inode;
>         }
>
>         if (dentry) {
> +               unsigned long mapaddr, prot = PROT_NONE;
> +
>                 inode = d_inode(dentry);
>                 if (inode == NULL) {
> +                       mmap_write_unlock(current->mm);
>                         err = -EINVAL;
>                         goto err_out;
>                 }
>                 info = inode->i_private;
> -               refcount_inc(&info->refcnt);
>                 dput(dentry);
> +
> +               /*
> +                * Map in the address range as anonymous mappings
> +                */
> +               oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
> +               if (oflag & O_RDONLY)
> +                       prot |= PROT_READ;
> +               else if (oflag & O_WRONLY)
> +                       prot |= PROT_WRITE;
> +               else if (oflag & O_RDWR)
> +                       prot |= (PROT_READ | PROT_WRITE);
> +               mapaddr = vm_mmap(NULL, addr, len, prot,
> +                               MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);

From the perspective of hardware, do we have to use MAP_FIXED to make
sure the processes sharing PTEs use the same virtual address for the
shared area? Or do we not necessarily need it, as long as the
upper-level pgtable entries point to the same lower-level pgtable?


> +               if (IS_ERR((void *)mapaddr)) {
> +                       err = -EINVAL;
> +                       goto err_out;
> +               }
> +
> +               refcount_inc(&info->refcnt);
> +
> +               /*
> +                * Now that we have mmap'd the mshare'd range, update vma
> +                * flags and vm_mm pointer for this mshare'd range.
> +                */
> +               mmap_write_lock(current->mm);
> +               vma = find_vma(current->mm, addr);
> +               if (vma && vma->vm_start < addr) {
> +                       mmap_write_unlock(current->mm);
> +                       err = -EINVAL;
> +                       goto err_out;
> +               }
> +
> +               while (vma && vma->vm_start < (addr + len)) {
> +                       vma->vm_private_data = info;
> +                       vma->vm_mm = info->mm;
> +                       vma->vm_flags |= VM_SHARED_PT;
> +                       next = vma->vm_next;
> +                       vma = next;
> +               }
>         } else {
> -               mm = mm_alloc();
> -               if (!mm)
> +               unsigned long myaddr;
> +               struct mm_struct *old_mm;
> +
> +               old_mm = current->mm;
> +               new_mm = mm_alloc();
> +               if (!new_mm)
>                         return -ENOMEM;
>                 info = kzalloc(sizeof(*info), GFP_KERNEL);
>                 if (!info) {
>                         err = -ENOMEM;
>                         goto err_relmm;
>                 }
> -               mm->mmap_base = addr;
> -               mm->task_size = addr + len;
> -               if (!mm->task_size)
> -                       mm->task_size--;
> -               info->mm = mm;
> +               new_mm->mmap_base = addr;
> +               new_mm->task_size = addr + len;
> +               if (!new_mm->task_size)
> +                       new_mm->task_size--;
> +               info->mm = new_mm;
> +               info->host_mm = old_mm;
>                 info->mode = mode;
>                 refcount_set(&info->refcnt, 1);
> +
> +               /*
> +                * VMAs for this address range may or may not exist.
> +                * If VMAs exist, they should be marked as shared at
> +                * this point and page table info should be copied
> +                * over to newly created mm_struct. TODO: If VMAs do not
> +                * exist, create them and mark them as shared.
> +                */
> +               mmap_write_lock(old_mm);
> +               vma = find_vma_intersection(old_mm, addr, end);
> +               if (!vma) {
> +                       err = -EINVAL;
> +                       goto unlock;
> +               }
> +               /*
> +                * TODO: If the currently allocated VMA goes beyond the
> +                * mshare'd range, this VMA needs to be split.
> +                *
> +                * Double check that source VMAs do not extend outside
> +                * the range
> +                */
> +               vma = find_vma(old_mm, addr + len);
> +               if (vma && vma->vm_start < (addr + len)) {
> +                       err = -EINVAL;
> +                       goto unlock;
> +               }
> +
> +               vma = find_vma(old_mm, addr);
> +               if (vma && vma->vm_start < addr) {
> +                       err = -EINVAL;
> +                       goto unlock;
> +               }
> +
> +               mmap_write_lock(new_mm);
> +               while (vma && vma->vm_start < (addr + len)) {
> +                       /*
> +                        * Copy this vma over to host mm
> +                        */
> +                       vma->vm_private_data = info;
> +                       vma->vm_mm = new_mm;
> +                       vma->vm_flags |= VM_SHARED_PT;
> +                       new_vma = vm_area_dup(vma);
> +                       if (!new_vma) {
> +                               err = -ENOMEM;
> +                               goto unlock;
> +                       }
> +                       err = insert_vm_struct(new_mm, new_vma);
> +                       if (err)
> +                               goto unlock;
> +
> +                       vma = vma->vm_next;
> +               }
> +               mmap_write_unlock(new_mm);
> +
>                 err = mshare_file_create(fname, oflag, info);
>                 if (err)
> -                       goto err_relinfo;
> +                       goto unlock;
> +
> +               /*
> +                * Copy over current PTEs
> +                */
> +               myaddr = addr;
> +               while (myaddr < new_mm->task_size) {
> +                       *pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
> +                       myaddr += PGDIR_SIZE;
> +               }
> +               /*
> +                * TODO: Free the corresponding page table in calling
> +                * process
> +                */
>         }
>
> +       mmap_write_unlock(current->mm);
>         inode_unlock(d_inode(msharefs_sb->s_root));
>         putname(fname);
>         return 0;
>
> -err_relinfo:
> +unlock:
> +       mmap_write_unlock(current->mm);
>         kfree(info);
>  err_relmm:
> -       mmput(mm);
> +       mmput(new_mm);
>  err_unlock_inode:
>         inode_unlock(d_inode(msharefs_sb->s_root));
>  err_out:
> @@ -294,11 +445,24 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
>
>         /*
>          * Is this the last reference?
> +        * TODO: permission checks are needed before proceeding
>          */
>         if (refcount_dec_and_test(&info->refcnt)) {
>                 simple_unlink(d_inode(msharefs_sb->s_root), dentry);
>                 d_drop(dentry);
>                 d_delete(dentry);
> +               /*
> +                * TODO: Release all physical pages allocated for this
> +                * mshare range and release associated page table. If
> +                * the final unlink happens from the process that created
> +                * mshare'd range, do we return page tables and pages to
> +                * that process so the creating process can continue using
> +                * the address range it had chosen to mshare at some
> +                * point?
> +                *
> +                * TODO: unmap shared vmas from every task that is using
> +                * this mshare'd range.
> +                */
>                 mmput(info->mm);
>                 kfree(info);
>         } else {
> --
> 2.32.0
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-05-30 10:48 ` Barry Song
@ 2022-05-30 11:18   ` David Hildenbrand
  2022-05-30 11:49     ` Barry Song
  2022-06-29 17:48     ` Khalid Aziz
  2022-06-29 17:40   ` Khalid Aziz
  1 sibling, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2022-05-30 11:18 UTC (permalink / raw)
  To: Barry Song, Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, ebiederm, hagen, jack, Kees Cook,
	kirill, kucharsk, linkinjeon, linux-fsdevel, LKML, Linux-MM,
	longpeng2, Andy Lutomirski, markhemm, pcc, Mike Rapoport,
	sieberf, sjpark, Suren Baghdasaryan, tst, Iurii Zaikin

On 30.05.22 12:48, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>> [ ... full cover letter trimmed; see earlier in the thread ... ]
> 
> 
> > Does that mean those shared pages will get page_mapcount()=1?

AFAIU, for mshare() that is the case.

> 
> > A big pain for a memory-limited system like a desktop/embedded system is
> > that reverse mapping takes tons of CPU in the memory reclamation path,
> > especially for pages mapped by multiple processes. Sometimes we see 100%
> > CPU utilization on the LRU, scanning to find out whether a page has been
> > accessed by reading the PTE young bit.

Regarding PTE-table sharing:

Even if we'd account each logical mapping (independent of page table
sharing) in the page_mapcount(), we would benefit from page table
sharing. Simply when we unmap the page from the shared page table, we'd
have to adjust the mapcount accordingly. So unmapping from a single
(shared) pagetable could directly result in the mapcount dropping to zero.

What I am trying to say is: how the mapcount is handled might be an
implementation detail for PTE-sharing. Not sure how hugetlb handles that
with its PMD-table sharing.

We'd have to clarify what the mapcount actually expresses. Having the
mapcount express "is this page mapped by multiple processes or at
multiple VMAs" might be helpful in some cases. Right now it mostly
expresses exactly that.
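As a thought experiment only (no such helper exists in this series, and nr_sharers is a made-up count of the processes
attached to the shared table), tearing down the one shared PTE could then fold all of the logical unmaps into a single
operation:

        /* Hypothetical sketch, not from the patches under discussion */
        static void zap_shared_pte(struct mm_struct *host_mm, unsigned long addr,
                                   pte_t *ptep, struct page *page, int nr_sharers)
        {
                /* one pte_clear() on the shared page table ... */
                pte_clear(host_mm, addr, ptep);
                /* ... but the logical mapcount drops for every sharer at
                 * once, possibly all the way to zero */
                atomic_sub(nr_sharers, &page->_mapcount);
        }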

> 
> > If this patchset results in only one PTE per shared page, it means we
> > get a significant performance improvement in the kernel LRU,
> > particularly when free memory approaches the watermarks.
> >
> > But I don't see how a new system call like mshare() can be used by
> > those systems, as they might need a more automatic PTE-sharing
> > mechanism.

IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
mappings, similar to what hugetlb provides for PMD table sharing, which
leaves semantics unchanged for existing user space. Maybe there is a way
to factor that out and reuse it for PTE-table sharing.

I can understand that there are use cases for explicit sharing with new
(e.g., mprotect) semantics.

> 
> BTW, I suppose we are actually able to share PTEs as long as the address
> is 2MB aligned?

2MB is x86 specific, but yes. As long as we're aligned to PMDs and
* the VMA spans the complete PMD
* the VMA permissions match the page table
* no process-specific page table features are used (uffd, softdirty
  tracking)

... probably there are more corner cases. (e.g., referenced and dirty
bit don't reflect what the mapping process did, which might or might not
be relevant for current/future features)
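For the record, a first cut at such an eligibility check might look like the sketch below (hypothetical, not part of
this series; hugetlb's PMD-table sharing performs a similar test on its VMAs):

        static bool pte_table_shareable(struct vm_area_struct *vma,
                                        unsigned long addr)
        {
                /* must cover one whole, aligned PMD-sized range */
                if (!IS_ALIGNED(addr, PMD_SIZE) ||
                    vma->vm_start > addr || vma->vm_end < addr + PMD_SIZE)
                        return false;
                /* MAP_SHARED only, and permissions must match the table */
                if (!(vma->vm_flags & VM_SHARED))
                        return false;
                /* no process-private page table state may be in use */
                if (userfaultfd_armed(vma) || (vma->vm_flags & VM_SOFTDIRTY))
                        return false;
                return true;
        }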

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-05-30 11:18   ` David Hildenbrand
@ 2022-05-30 11:49     ` Barry Song
  2022-06-29 17:48     ` Khalid Aziz
  1 sibling, 0 replies; 37+ messages in thread
From: Barry Song @ 2022-05-30 11:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Andrew Morton, Matthew Wilcox, Aneesh Kumar,
	Arnd Bergmann, Jonathan Corbet, Dave Hansen, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm, pcc,
	Mike Rapoport, sieberf, sjpark, Suren Baghdasaryan, tst,
	Iurii Zaikin

On Mon, May 30, 2022 at 11:18 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.05.22 12:48, Barry Song wrote:
> > On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> Page tables in kernel consume some of the memory and as long as number
> >> of mappings being maintained is small enough, this space consumed by
> >> page tables is not objectionable. When very few memory pages are
> >> shared between processes, the number of page table entries (PTEs) to
> >> maintain is mostly constrained by the number of pages of memory on the
> >> system. As the number of shared pages and the number of times pages
> >> are shared goes up, amount of memory consumed by page tables starts to
> >> become significant.
> >>
> >> Some of the field deployments commonly see memory pages shared across
> >> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> >> bytes long which is very small compared to the 4K page size. When 2000
> >> processes map the same page in their address space, each one of them
> >> requires 8 bytes for its PTE and together that adds up to 8K of memory
> >> just to hold the PTEs for one 4K page. On a database server with 300GB
> >> SGA, a system carsh was seen with out-of-memory condition when 1500+
> >> clients tried to share this SGA even though the system had 512GB of
> >> memory. On this server, in the worst case scenario of all 1500
> >> processes mapping every page from SGA would have required 878GB+ for
> >> just the PTEs. If these PTEs could be shared, amount of memory saved
> >> is very significant.
> >>
> >> This patch series implements a mechanism in kernel to allow userspace
> >> processes to opt into sharing PTEs. It adds two new system calls - (1)
> >> mshare(), which can be used by a process to create a region (we will
> >> call it mshare'd region) which can be used by other processes to map
> >> same pages using shared PTEs, (2) mshare_unlink() which is used to
> >> detach from the mshare'd region. Once an mshare'd region is created,
> >> other process(es), assuming they have the right permissions, can make
> >> the mashare() system call to map the shared pages into their address
> >> space using the shared PTEs.  When a process is done using this
> >> mshare'd region, it makes a mshare_unlink() system call to end its
> >> access. When the last process accessing mshare'd region calls
> >> mshare_unlink(), the mshare'd region is torn down and memory used by
> >> it is freed.
> >>
> >>
> >> API
> >> ===
> >>
> >> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>
> >> --
> >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>
> >> mshare() creates and opens a new, or opens an existing mshare'd
> >> region that will be shared at PTE level. "name" refers to shared object
> >> name that exists under /sys/fs/mshare. "addr" is the starting address
> >> of this shared memory area and length is the size of this area.
> >> oflags can be one of:
> >>
> >> - O_RDONLY opens shared memory area for read only access by everyone
> >> - O_RDWR opens shared memory area for read and write access
> >> - O_CREAT creates the named shared memory area if it does not exist
> >> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>   exists with that name, return an error.
> >>
> >> mode represents the creation mode for the shared object under
> >> /sys/fs/mshare.
> >>
> >> mshare() returns an error code if it fails, otherwise it returns 0.
> >>
> >> PTEs are shared at pgdir level and hence it imposes following
> >> requirements on the address and size given to the mshare():
> >>
> >> - Starting address must be aligned to pgdir size (512GB on x86_64).
> >>   This alignment value can be looked up in /proc/sys/vm//mshare_size
> >> - Size must be a multiple of pgdir size
> >> - Any mappings created in this address range at any time become
> >>   shared automatically
> >> - Shared address range can have unmapped addresses in it. Any access
> >>   to unmapped address will result in SIGBUS
> >>
> >> Mappings within this address range behave as if they were shared
> >> between threads, so a write to a MAP_PRIVATE mapping will create a
> >> page which is shared between all the sharers. The first process that
> >> declares an address range mshare'd can continue to map objects in
> >> the shared area. All other processes that want mshare'd access to
> >> this memory area can do so by calling mshare(). After this call, the
> >> address range given by mshare becomes a shared range in its address
> >> space. Anonymous mappings will be shared and not COWed.
> >>
> >> A file under /sys/fs/mshare can be opened and read from. A read from
> >> this file returns two long values - (1) starting address, and (2)
> >> size of the mshare'd region.
> >>
> >> --
> >> int mshare_unlink(char *name)
> >>
> >> A shared address range created by mshare() can be destroyed using
> >> mshare_unlink() which removes the  shared named object. Once all
> >> processes have unmapped the shared object, the shared address range
> >> references are de-allocated and destroyed.
> >>
> >> mshare_unlink() returns 0 on success or -1 on error.
> >>
> >>
> >> Example Code
> >> ============
> >>
> >> Snippet of the code that a donor process would run looks like below:
> >>
> >> -----------------
> >>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> >>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> >>         if (addr == MAP_FAILED)
> >>                 perror("ERROR: mmap failed");
> >>
> >>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> >>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
> >>         if (err < 0) {
> >>                 perror("mshare() syscall failed");
> >>                 exit(1);
> >>         }
> >>
> >>         strncpy(addr, "Some random shared text",
> >>                         sizeof("Some random shared text"));
> >> -----------------
> >>
> >> Snippet of code that a consumer process would execute looks like:
> >>
> >> -----------------
> >>         struct mshare_info minfo;
> >>
> >>         fd = open("testregion", O_RDONLY);
> >>         if (fd < 0) {
> >>                 perror("open failed");
> >>                 exit(1);
> >>         }
> >>
> >>         if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
> >>                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
> >>                                 minfo.size, minfo.start);
> >>         else
> >>                 perror("read failed");
> >>
> >>         close(fd);
> >>
> >>         addr = (void *)minfo.start;
> >>         err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
> >>                         O_RDWR, 600);
> >>         if (err < 0) {
> >>                 perror("mshare() syscall failed");
> >>                 exit(1);
> >>         }
> >>
> >>         printf("Guest mmap at %px:\n", addr);
> >>         printf("%s\n", addr);
> >>         printf("\nDone\n");
> >>
> >>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> >>         if (err < 0) {
> >>                 perror("mshare_unlink() failed");
> >>                 exit(1);
> >>         }
> >> -----------------
> >
> >
> > Does that mean those shared pages will get page_mapcount()=1?
>
> AFAIU, for mshare() that is the case.
>
> >
> > A big pain for a memory-limited system like a desktop/embedded system is
> > that reverse mapping takes tons of CPU in the memory reclamation path,
> > especially for pages mapped by multiple processes. Sometimes we see 100%
> > CPU utilization on the LRU, scanning to find out whether a page has been
> > accessed by reading the PTE young bit.
>
> Regarding PTE-table sharing:
>
> Even if we'd account each logical mapping (independent of page table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing. Simply when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly. So unmapping from a single
> (shared) pagetable could directly result in the mapcount dropping to zero.
>
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
>
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.

Right, I guess the mapcount as a number by itself isn't so important. The
only important thing is that we only need to read one PTE rather than
1000 PTEs to figure out whether one page is young.

So is this patchset already able to do the reverse-mapping walk only
once for a page shared by 1000 processes, rather than reading 1000
PTEs?

I mean, I actually haven't found that you touched any files in
mm/vmscan.c etc.

>
> >
> > If this patchset results in only one PTE per shared page, it means we
> > get a significant performance improvement in the kernel LRU,
> > particularly when free memory approaches the watermarks.
> >
> > But I don't see how a new system call like mshare() can be used by
> > those systems, as they might need a more automatic PTE-sharing
> > mechanism.
>
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
>
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.
>
> >
> > BTW, I suppose we are actually able to share PTEs as long as the address
> > is 2MB aligned?
>
> 2MB is x86 specific, but yes. As long as we're aligned to PMDs and
> * the VMA spans the complete PMD
> * the VMA permissions match the page table
> * no process-specific page table features are used (uffd, softdirty
>   tracking)
>
> ... probably there are more corner cases. (e.g., referenced and dirty
> bit don't reflect what the mapping process did, which might or might not
> be relevant for current/future features)
>
> --
> Thanks,
>
> David / dhildenb
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare
  2022-04-11 16:05 ` [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare Khalid Aziz
  2022-04-11 18:48   ` Dave Hansen
  2022-05-30 11:11   ` Barry Song
@ 2022-05-31  3:46   ` Barry Song
  2022-06-28 20:16     ` Khalid Aziz
  2 siblings, 1 reply; 37+ messages in thread
From: Barry Song @ 2022-05-31  3:46 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm,
	Peter Collingbourne, Mike Rapoport, sieberf, sjpark,
	Suren Baghdasaryan, tst, Iurii Zaikin

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> This patch adds basic page table sharing across tasks by making
> mshare syscall. It does this by creating a new mm_struct which
> hosts the shared vmas and page tables. This mm_struct is
> maintained as long as there is at least one task using the mshare'd
> range. It is cleaned up by the last mshare_unlink syscall.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/internal.h |   2 +
>  mm/memory.c   |  35 ++++++++++
>  mm/mshare.c   | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 214 insertions(+), 13 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index cf50a471384e..68f82f0f8b66 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>  int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>                       unsigned long addr, int page_nid, int *flags);
>
> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
> +                       unsigned long *addrp);
>  static inline bool vma_is_shared(const struct vm_area_struct *vma)
>  {
>         return vma->vm_flags & VM_SHARED_PT;
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..c77c0d643ea8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>                            unsigned int flags, struct pt_regs *regs)
>  {
>         vm_fault_t ret;
> +       bool shared = false;
>
>         __set_current_state(TASK_RUNNING);
>
> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         /* do counter updates before entering really critical section. */
>         check_sync_rss_stat(current);
>
> +       if (unlikely(vma_is_shared(vma))) {
> +               ret = find_shared_vma(&vma, &address);
> +               if (ret)
> +                       return ret;
> +               if (!vma)
> +                       return VM_FAULT_SIGSEGV;
> +               shared = true;
> +       }
> +
>         if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>                                             flags & FAULT_FLAG_INSTRUCTION,
>                                             flags & FAULT_FLAG_REMOTE))
> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         else
>                 ret = __handle_mm_fault(vma, address, flags);
>
> +       /*
> +        * Release the read lock on shared VMA's parent mm unless
> +        * __handle_mm_fault released the lock already.
> +        * __handle_mm_fault sets VM_FAULT_RETRY in return value if
> +        * it released mmap lock. If lock was released, that implies
> +        * the lock would have been released on task's original mm if
> +        * this were not a shared PTE vma. To keep lock state consistent,
> +        * make sure to release the lock on task's original mm
> +        */
> +       if (shared) {
> +               int release_mmlock = 1;
> +
> +               if (!(ret & VM_FAULT_RETRY)) {
> +                       mmap_read_unlock(vma->vm_mm);
> +                       release_mmlock = 0;
> +               } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
> +                       (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +                       mmap_read_unlock(vma->vm_mm);
> +                       release_mmlock = 0;
> +               }
> +
> +               if (release_mmlock)
> +                       mmap_read_unlock(current->mm);
> +       }
> +
>         if (flags & FAULT_FLAG_USER) {
>                 mem_cgroup_exit_user_fault();
>                 /*
> diff --git a/mm/mshare.c b/mm/mshare.c
> index cd2f7ad24d9d..d1896adcb00f 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -17,18 +17,49 @@
>  #include <linux/pseudo_fs.h>
>  #include <linux/fileattr.h>
>  #include <linux/refcount.h>
> +#include <linux/mman.h>
>  #include <linux/sched/mm.h>
>  #include <uapi/linux/magic.h>
>  #include <uapi/linux/limits.h>
>
>  struct mshare_data {
> -       struct mm_struct *mm;
> +       struct mm_struct *mm, *host_mm;
>         mode_t mode;
>         refcount_t refcnt;
>  };
>
>  static struct super_block *msharefs_sb;
>
> +/* Returns holding the host mm's lock for read.  Caller must release. */
> +vm_fault_t
> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
> +{
> +       struct vm_area_struct *vma, *guest = *vmap;
> +       struct mshare_data *info = guest->vm_private_data;
> +       struct mm_struct *host_mm = info->mm;
> +       unsigned long host_addr;
> +       pgd_t *pgd, *guest_pgd;
> +
> +       host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
> +       pgd = pgd_offset(host_mm, host_addr);
> +       guest_pgd = pgd_offset(current->mm, *addrp);
> +       if (!pgd_same(*guest_pgd, *pgd)) {
> +               set_pgd(guest_pgd, *pgd);
> +               return VM_FAULT_NOPAGE;
> +       }
> +
> +       *addrp = host_addr;
> +       mmap_read_lock(host_mm);
> +       vma = find_vma(host_mm, host_addr);
> +
> +       /* XXX: expand stack? */
> +       if (vma && vma->vm_start > host_addr)
> +               vma = NULL;
> +
> +       *vmap = vma;
> +       return 0;
> +}
> +
>  static void
>  msharefs_evict_inode(struct inode *inode)
>  {
> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>                 unsigned long, len, int, oflag, mode_t, mode)
>  {
>         struct mshare_data *info;
> -       struct mm_struct *mm;
>         struct filename *fname = getname(name);
>         struct dentry *dentry;
>         struct inode *inode;
>         struct qstr namestr;
> +       struct vm_area_struct *vma, *next, *new_vma;
> +       struct mm_struct *new_mm;
> +       unsigned long end;
>         int err = PTR_ERR(fname);
>
>         /*
> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>         if (IS_ERR(fname))
>                 goto err_out;
>
> +       end = addr + len;
> +
>         /*
>          * Does this mshare entry exist already? If it does, calling
>          * mshare with O_EXCL|O_CREAT is an error
> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>         inode_lock(d_inode(msharefs_sb->s_root));
>         dentry = d_lookup(msharefs_sb->s_root, &namestr);
>         if (dentry && (oflag & (O_EXCL|O_CREAT))) {
> +               inode = d_inode(dentry);
>                 err = -EEXIST;
>                 dput(dentry);
>                 goto err_unlock_inode;
>         }
>
>         if (dentry) {
> +               unsigned long mapaddr, prot = PROT_NONE;
> +
>                 inode = d_inode(dentry);
>                 if (inode == NULL) {
> +                       mmap_write_unlock(current->mm);
>                         err = -EINVAL;
>                         goto err_out;
>                 }
>                 info = inode->i_private;
> -               refcount_inc(&info->refcnt);
>                 dput(dentry);
> +
> +               /*
> +                * Map in the address range as anonymous mappings
> +                */
> +               oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
> +               if (oflag & O_RDONLY)
> +                       prot |= PROT_READ;
> +               else if (oflag & O_WRONLY)
> +                       prot |= PROT_WRITE;
> +               else if (oflag & O_RDWR)
> +                       prot |= (PROT_READ | PROT_WRITE);
> +               mapaddr = vm_mmap(NULL, addr, len, prot,
> +                               MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
> +               if (IS_ERR((void *)mapaddr)) {
> +                       err = -EINVAL;
> +                       goto err_out;
> +               }
> +
> +               refcount_inc(&info->refcnt);
> +
> +               /*
> +                * Now that we have mmap'd the mshare'd range, update vma
> +                * flags and vm_mm pointer for this mshare'd range.
> +                */
> +               mmap_write_lock(current->mm);
> +               vma = find_vma(current->mm, addr);
> +               if (vma && vma->vm_start < addr) {

And I don't understand why "vma->vm_start < addr" can happen.
Does it mean we have given mshare() an address which is not
aligned? Then we should have returned -EINVAL earlier.
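
Something like this hypothetical early check (illustrative only, not
code from this patch) is what I mean, given that the cover letter says
the address and size must be pgdir-aligned:

	if (!IS_ALIGNED(addr, PGDIR_SIZE) || !IS_ALIGNED(len, PGDIR_SIZE))
		return -EINVAL;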



> +                       mmap_write_unlock(current->mm);
> +                       err = -EINVAL;
> +                       goto err_out;
> +               }
> +
> +               while (vma && vma->vm_start < (addr + len)) {
> +                       vma->vm_private_data = info;
> +                       vma->vm_mm = info->mm;
> +                       vma->vm_flags |= VM_SHARED_PT;
> +                       next = vma->vm_next;
> +                       vma = next;
> +               }
>         } else {
> -               mm = mm_alloc();
> -               if (!mm)
> +               unsigned long myaddr;
> +               struct mm_struct *old_mm;
> +
> +               old_mm = current->mm;
> +               new_mm = mm_alloc();
> +               if (!new_mm)
>                         return -ENOMEM;
>                 info = kzalloc(sizeof(*info), GFP_KERNEL);
>                 if (!info) {
>                         err = -ENOMEM;
>                         goto err_relmm;
>                 }
> -               mm->mmap_base = addr;
> -               mm->task_size = addr + len;
> -               if (!mm->task_size)
> -                       mm->task_size--;
> -               info->mm = mm;
> +               new_mm->mmap_base = addr;
> +               new_mm->task_size = addr + len;
> +               if (!new_mm->task_size)
> +                       new_mm->task_size--;
> +               info->mm = new_mm;
> +               info->host_mm = old_mm;
>                 info->mode = mode;
>                 refcount_set(&info->refcnt, 1);
> +
> +               /*
> +                * VMAs for this address range may or may not exist.
> +                * If VMAs exist, they should be marked as shared at
> +                * this point and page table info should be copied
> +                * over to newly created mm_struct. TODO: If VMAs do not
> +                * exist, create them and mark them as shared.
> +                */
> +               mmap_write_lock(old_mm);
> +               vma = find_vma_intersection(old_mm, addr, end);
> +               if (!vma) {
> +                       err = -EINVAL;
> +                       goto unlock;
> +               }
> +               /*
> +                * TODO: If the currently allocated VMA goes beyond the
> +                * mshare'd range, this VMA needs to be split.
> +                *
> +                * Double check that source VMAs do not extend outside
> +                * the range
> +                */
> +               vma = find_vma(old_mm, addr + len);
> +               if (vma && vma->vm_start < (addr + len)) {
> +                       err = -EINVAL;
> +                       goto unlock;
> +               }
> +
> +               vma = find_vma(old_mm, addr);
> +               if (vma && vma->vm_start < addr) {
> +                       err = -EINVAL;
> +                       goto unlock;
> +               }
> +
> +               mmap_write_lock(new_mm);
> +               while (vma && vma->vm_start < (addr + len)) {
> +                       /*
> +                        * Copy this vma over to host mm
> +                        */
> +                       vma->vm_private_data = info;
> +                       vma->vm_mm = new_mm;
> +                       vma->vm_flags |= VM_SHARED_PT;
> +                       new_vma = vm_area_dup(vma);
> +                       if (!new_vma) {
> +                               err = -ENOMEM;
> +                               goto unlock;
> +                       }
> +                       err = insert_vm_struct(new_mm, new_vma);
> +                       if (err)
> +                               goto unlock;
> +
> +                       vma = vma->vm_next;
> +               }
> +               mmap_write_unlock(new_mm);
> +
>                 err = mshare_file_create(fname, oflag, info);
>                 if (err)
> -                       goto err_relinfo;
> +                       goto unlock;
> +
> +               /*
> +                * Copy over current PTEs
> +                */
> +               myaddr = addr;
> +               while (myaddr < new_mm->task_size) {
> +                       *pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
> +                       myaddr += PGDIR_SIZE;
> +               }
> +               /*
> +                * TODO: Free the corresponding page table in calling
> +                * process
> +                */
>         }
>
> +       mmap_write_unlock(current->mm);
>         inode_unlock(d_inode(msharefs_sb->s_root));
>         putname(fname);
>         return 0;
>
> -err_relinfo:
> +unlock:
> +       mmap_write_unlock(current->mm);
>         kfree(info);
>  err_relmm:
> -       mmput(mm);
> +       mmput(new_mm);
>  err_unlock_inode:
>         inode_unlock(d_inode(msharefs_sb->s_root));
>  err_out:
> @@ -294,11 +445,24 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
>
>         /*
>          * Is this the last reference?
> +        * TODO: permission checks are needed before proceeding
>          */
>         if (refcount_dec_and_test(&info->refcnt)) {
>                 simple_unlink(d_inode(msharefs_sb->s_root), dentry);
>                 d_drop(dentry);
>                 d_delete(dentry);
> +               /*
> +                * TODO: Release all physical pages allocated for this
> +                * mshare range and release associated page table. If
> +                * the final unlink happens from the process that created
> +                * mshare'd range, do we return page tables and pages to
> +                * that process so the creating process can continue using
> +                * the address range it had chosen to mshare at some
> +                * point?
> +                *
> +                * TODO: unmap shared vmas from every task that is using
> +                * this mshare'd range.
> +                */
>                 mmput(info->mm);
>                 kfree(info);
>         } else {
> --
> 2.32.0
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
  2022-04-11 16:05 ` [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs Khalid Aziz
@ 2022-05-31  4:24   ` Barry Song
  2022-06-29 17:38     ` Khalid Aziz
  0 siblings, 1 reply; 37+ messages in thread
From: Barry Song @ 2022-05-31  4:24 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm,
	Peter Collingbourne, Mike Rapoport, sieberf, sjpark,
	Suren Baghdasaryan, tst, Iurii Zaikin

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> mshare'd PTEs should not be removed when a task exits. These PTEs
> are removed when the last task sharing the PTEs exits. Add a check
> for shared PTEs and skip them.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> ---
>  mm/memory.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c77c0d643ea8..e7c5bc6f8836 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                 } else {
>                         /*
>                          * Optimization: gather nearby vmas into one call down
> +                        * as long as they all belong to the same mm (that
> +                        * may not be the case if a vma is part of mshare'd
> +                        * range
>                          */
>                         while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> -                              && !is_vm_hugetlb_page(next)) {
> +                              && !is_vm_hugetlb_page(next)
> +                              && vma->vm_mm == tlb->mm) {
>                                 vma = next;
>                                 next = vma->vm_next;
>                                 unlink_anon_vmas(vma);
>                                 unlink_file_vma(vma);
>                         }
> -                       free_pgd_range(tlb, addr, vma->vm_end,
> -                               floor, next ? next->vm_start : ceiling);
> +                       /*
> +                        * Free pgd only if pgd is not allocated for an
> +                        * mshare'd range
> +                        */
> +                       if (vma->vm_mm == tlb->mm)
> +                               free_pgd_range(tlb, addr, vma->vm_end,
> +                                       floor, next ? next->vm_start : ceiling);
>                 }
>                 vma = next;
>         }
> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>         pgd_t *pgd;
>         unsigned long next;
>
> +       /*
> +        * If this is an mshare'd page, do not unmap it since it might
> +        * still be in use.
> +        */
> +       if (vma->vm_mm != tlb->mm)
> +               return;
> +

Except for unmap, have you ever tested reverse mapping in vmscan, especially
folio_referenced()? Are all the vmas in those processes sharing the page
table still in the rmap of the shared page?
Without shared PTEs, if 1000 processes share one page, we are reading 1000
PTEs; with them, are we reading just one, or are we reading the same PTE
1000 times? Have you tested it?

>         BUG_ON(addr >= end);
>         tlb_start_vma(tlb, vma);
>         pgd = pgd_offset(vma->vm_mm, addr);
> --
> 2.32.0
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare
  2022-05-30 11:11   ` Barry Song
@ 2022-06-28 20:11     ` Khalid Aziz
  0 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-06-28 20:11 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm, pcc,
	Mike Rapoport, sieberf, sjpark, Suren Baghdasaryan, tst,
	Iurii Zaikin

On 5/30/22 05:11, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>>
>> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>          if (IS_ERR(fname))
>>                  goto err_out;
>>
>> +       end = addr + len;
>> +
>>          /*
>>           * Does this mshare entry exist already? If it does, calling
>>           * mshare with O_EXCL|O_CREAT is an error
>> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>          inode_lock(d_inode(msharefs_sb->s_root));
>>          dentry = d_lookup(msharefs_sb->s_root, &namestr);
>>          if (dentry && (oflag & (O_EXCL|O_CREAT))) {
>> +               inode = d_inode(dentry);
>>                  err = -EEXIST;
>>                  dput(dentry);
>>                  goto err_unlock_inode;
>>          }
>>
>>          if (dentry) {
>> +               unsigned long mapaddr, prot = PROT_NONE;
>> +
>>                  inode = d_inode(dentry);
>>                  if (inode == NULL) {
>> +                       mmap_write_unlock(current->mm);
>>                          err = -EINVAL;
>>                          goto err_out;
>>                  }
>>                  info = inode->i_private;
>> -               refcount_inc(&info->refcnt);
>>                  dput(dentry);
>> +
>> +               /*
>> +                * Map in the address range as anonymous mappings
>> +                */
>> +               oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
>> +               if (oflag & O_RDONLY)
>> +                       prot |= PROT_READ;
>> +               else if (oflag & O_WRONLY)
>> +                       prot |= PROT_WRITE;
>> +               else if (oflag & O_RDWR)
>> +                       prot |= (PROT_READ | PROT_WRITE);
>> +               mapaddr = vm_mmap(NULL, addr, len, prot,
>> +                               MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
> 
> From the perspective of hardware, do we have to use MAP_FIXED to make
> sure those processes sharing PTEs use the same virtual address for the
> shared area? Or do we actually not need it, as long as the upper-level
> pgtable entries point to the same lower-level pgtable?

Hi Barry,

Sorry, I didn't mean to ignore this. I was out of commission for the last few weeks.

All processes sharing an mshare region must use the same virtual address, otherwise the page table entries for those 
processes won't be identical and hence cannot be shared. The upper bits of the virtual address provide the index into 
the various levels of page directories. It may be possible to manipulate the various page directories to allow for 
different virtual addresses across processes and still get the hardware page table walk to work correctly, but that 
would be complex and potentially error prone.
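
As a concrete illustration (a simplified sketch of x86_64 4-level paging; the real definitions live in the arch page 
table headers), the directory indices come straight from the upper virtual address bits:

	/* Illustrative only: x86_64 4-level paging index extraction */
	pgd_idx = (addr >> 39) & 0x1ff;		/* bits 47:39 */
	pud_idx = (addr >> 30) & 0x1ff;		/* bits 38:30 */
	pmd_idx = (addr >> 21) & 0x1ff;		/* bits 29:21 */
	pte_idx = (addr >> 12) & 0x1ff;		/* bits 20:12 */

If two processes used different virtual addresses for the same mshare'd range, these indices would differ, so the 
shared lower-level tables could not simply be linked into both address spaces.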

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare
  2022-05-31  3:46   ` Barry Song
@ 2022-06-28 20:16     ` Khalid Aziz
  0 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-06-28 20:16 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm,
	Peter Collingbourne, Mike Rapoport, sieberf, sjpark,
	Suren Baghdasaryan, tst, Iurii Zaikin

On 5/30/22 21:46, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> This patch adds basic page table sharing across tasks by making
>> mshare syscall. It does this by creating a new mm_struct which
>> hosts the shared vmas and page tables. This mm_struct is
>> maintained as long as there is at least one task using the mshare'd
>> range. It is cleaned up by the last mshare_unlink syscall.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>   mm/internal.h |   2 +
>>   mm/memory.c   |  35 ++++++++++
>>   mm/mshare.c   | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
>>   3 files changed, 214 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index cf50a471384e..68f82f0f8b66 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>>   int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>>                        unsigned long addr, int page_nid, int *flags);
>>
>> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
>> +                       unsigned long *addrp);
>>   static inline bool vma_is_shared(const struct vm_area_struct *vma)
>>   {
>>          return vma->vm_flags & VM_SHARED_PT;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c125c4969913..c77c0d643ea8 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>                             unsigned int flags, struct pt_regs *regs)
>>   {
>>          vm_fault_t ret;
>> +       bool shared = false;
>>
>>          __set_current_state(TASK_RUNNING);
>>
>> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>          /* do counter updates before entering really critical section. */
>>          check_sync_rss_stat(current);
>>
>> +       if (unlikely(vma_is_shared(vma))) {
>> +               ret = find_shared_vma(&vma, &address);
>> +               if (ret)
>> +                       return ret;
>> +               if (!vma)
>> +                       return VM_FAULT_SIGSEGV;
>> +               shared = true;
>> +       }
>> +
>>          if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>                                              flags & FAULT_FLAG_INSTRUCTION,
>>                                              flags & FAULT_FLAG_REMOTE))
>> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>          else
>>                  ret = __handle_mm_fault(vma, address, flags);
>>
>> +       /*
>> +        * Release the read lock on shared VMA's parent mm unless
>> +        * __handle_mm_fault released the lock already.
>> +        * __handle_mm_fault sets VM_FAULT_RETRY in return value if
>> +        * it released mmap lock. If lock was released, that implies
>> +        * the lock would have been released on task's original mm if
>> +        * this were not a shared PTE vma. To keep lock state consistent,
>> +        * make sure to release the lock on task's original mm
>> +        */
>> +       if (shared) {
>> +               int release_mmlock = 1;
>> +
>> +               if (!(ret & VM_FAULT_RETRY)) {
>> +                       mmap_read_unlock(vma->vm_mm);
>> +                       release_mmlock = 0;
>> +               } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
>> +                       (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>> +                       mmap_read_unlock(vma->vm_mm);
>> +                       release_mmlock = 0;
>> +               }
>> +
>> +               if (release_mmlock)
>> +                       mmap_read_unlock(current->mm);
>> +       }
>> +
>>          if (flags & FAULT_FLAG_USER) {
>>                  mem_cgroup_exit_user_fault();
>>                  /*
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index cd2f7ad24d9d..d1896adcb00f 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -17,18 +17,49 @@
>>   #include <linux/pseudo_fs.h>
>>   #include <linux/fileattr.h>
>>   #include <linux/refcount.h>
>> +#include <linux/mman.h>
>>   #include <linux/sched/mm.h>
>>   #include <uapi/linux/magic.h>
>>   #include <uapi/linux/limits.h>
>>
>>   struct mshare_data {
>> -       struct mm_struct *mm;
>> +       struct mm_struct *mm, *host_mm;
>>          mode_t mode;
>>          refcount_t refcnt;
>>   };
>>
>>   static struct super_block *msharefs_sb;
>>
>> +/* Returns holding the host mm's lock for read.  Caller must release. */
>> +vm_fault_t
>> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
>> +{
>> +       struct vm_area_struct *vma, *guest = *vmap;
>> +       struct mshare_data *info = guest->vm_private_data;
>> +       struct mm_struct *host_mm = info->mm;
>> +       unsigned long host_addr;
>> +       pgd_t *pgd, *guest_pgd;
>> +
>> +       host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
>> +       pgd = pgd_offset(host_mm, host_addr);
>> +       guest_pgd = pgd_offset(current->mm, *addrp);
>> +       if (!pgd_same(*guest_pgd, *pgd)) {
>> +               set_pgd(guest_pgd, *pgd);
>> +               return VM_FAULT_NOPAGE;
>> +       }
>> +
>> +       *addrp = host_addr;
>> +       mmap_read_lock(host_mm);
>> +       vma = find_vma(host_mm, host_addr);
>> +
>> +       /* XXX: expand stack? */
>> +       if (vma && vma->vm_start > host_addr)
>> +               vma = NULL;
>> +
>> +       *vmap = vma;
>> +       return 0;
>> +}
>> +
>>   static void
>>   msharefs_evict_inode(struct inode *inode)
>>   {
>> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>                  unsigned long, len, int, oflag, mode_t, mode)
>>   {
>>          struct mshare_data *info;
>> -       struct mm_struct *mm;
>>          struct filename *fname = getname(name);
>>          struct dentry *dentry;
>>          struct inode *inode;
>>          struct qstr namestr;
>> +       struct vm_area_struct *vma, *next, *new_vma;
>> +       struct mm_struct *new_mm;
>> +       unsigned long end;
>>          int err = PTR_ERR(fname);
>>
>>          /*
>> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>          if (IS_ERR(fname))
>>                  goto err_out;
>>
>> +       end = addr + len;
>> +
>>          /*
>>           * Does this mshare entry exist already? If it does, calling
>>           * mshare with O_EXCL|O_CREAT is an error
>> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>>          inode_lock(d_inode(msharefs_sb->s_root));
>>          dentry = d_lookup(msharefs_sb->s_root, &namestr);
>>          if (dentry && (oflag & (O_EXCL|O_CREAT))) {
>> +               inode = d_inode(dentry);
>>                  err = -EEXIST;
>>                  dput(dentry);
>>                  goto err_unlock_inode;
>>          }
>>
>>          if (dentry) {
>> +               unsigned long mapaddr, prot = PROT_NONE;
>> +
>>                  inode = d_inode(dentry);
>>                  if (inode == NULL) {
>> +                       mmap_write_unlock(current->mm);
>>                          err = -EINVAL;
>>                          goto err_out;
>>                  }
>>                  info = inode->i_private;
>> -               refcount_inc(&info->refcnt);
>>                  dput(dentry);
>> +
>> +               /*
>> +                * Map in the address range as anonymous mappings
>> +                */
>> +               oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
>> +               if (oflag & O_RDONLY)
>> +                       prot |= PROT_READ;
>> +               else if (oflag & O_WRONLY)
>> +                       prot |= PROT_WRITE;
>> +               else if (oflag & O_RDWR)
>> +                       prot |= (PROT_READ | PROT_WRITE);
>> +               mapaddr = vm_mmap(NULL, addr, len, prot,
>> +                               MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
>> +               if (IS_ERR((void *)mapaddr)) {
>> +                       err = -EINVAL;
>> +                       goto err_out;
>> +               }
>> +
>> +               refcount_inc(&info->refcnt);
>> +
>> +               /*
>> +                * Now that we have mmap'd the mshare'd range, update vma
>> +                * flags and vm_mm pointer for this mshare'd range.
>> +                */
>> +               mmap_write_lock(current->mm);
>> +               vma = find_vma(current->mm, addr);
>> +               if (vma && vma->vm_start < addr) {
> 
> and I don't understand why "vma->vm_start < addr" can happen.
> does it mean we have given mshare() an address which is not
> aligned? then we should have return -EINVAL earlier?
> 
Yes, this could potentially be caught earlier. I have rewritten the code around a new API based entirely upon mmap(), 
without adding new system calls. Much of the above code is no longer needed since mmap() takes care of many of these 
checks already. I am running some final tests on the v2 patch series and will send it out shortly.
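
Roughly, the consumer side would then look like this (the exact v2 interface may differ since the series hasn't been 
posted yet; the path and flags here are illustrative):

	fd = open("/sys/fs/mshare/testregion", O_RDWR);
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

so the usual mmap() argument validation and address space bookkeeping happen before any PTE sharing.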

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
  2022-05-31  4:24   ` Barry Song
@ 2022-06-29 17:38     ` Khalid Aziz
  2022-07-03 20:54       ` Andy Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Khalid Aziz @ 2022-06-29 17:38 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm,
	Peter Collingbourne, Mike Rapoport, sieberf, sjpark,
	Suren Baghdasaryan, tst, Iurii Zaikin

On 5/30/22 22:24, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> mshare'd PTEs should not be removed when a task exits. These PTEs
>> are removed when the last task sharing the PTEs exits. Add a check
>> for shared PTEs and skip them.
>>
>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>> ---
>>   mm/memory.c | 22 +++++++++++++++++++---
>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c77c0d643ea8..e7c5bc6f8836 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>                  } else {
>>                          /*
>>                           * Optimization: gather nearby vmas into one call down
>> +                        * as long as they all belong to the same mm (that
>> +                        * may not be the case if a vma is part of mshare'd
>> +                        * range
>>                           */
>>                          while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>> -                              && !is_vm_hugetlb_page(next)) {
>> +                              && !is_vm_hugetlb_page(next)
>> +                              && vma->vm_mm == tlb->mm) {
>>                                  vma = next;
>>                                  next = vma->vm_next;
>>                                  unlink_anon_vmas(vma);
>>                                  unlink_file_vma(vma);
>>                          }
>> -                       free_pgd_range(tlb, addr, vma->vm_end,
>> -                               floor, next ? next->vm_start : ceiling);
>> +                       /*
>> +                        * Free pgd only if pgd is not allocated for an
>> +                        * mshare'd range
>> +                        */
>> +                       if (vma->vm_mm == tlb->mm)
>> +                               free_pgd_range(tlb, addr, vma->vm_end,
>> +                                       floor, next ? next->vm_start : ceiling);
>>                  }
>>                  vma = next;
>>          }
>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>          pgd_t *pgd;
>>          unsigned long next;
>>
>> +       /*
>> +        * If this is an mshare'd page, do not unmap it since it might
>> +        * still be in use.
>> +        */
>> +       if (vma->vm_mm != tlb->mm)
>> +               return;
>> +
> 
> Except for unmap, have you ever tested reverse mapping in vmscan, especially
> folio_referenced()? Are all the vmas in those processes sharing the page
> table still in the rmap of the shared page?
> Without shared PTEs, if 1000 processes share one page, we are reading 1000
> PTEs; with them, are we reading just one, or are we reading the same PTE
> 1000 times? Have you tested it?
> 

We are treating an mshare'd region the same as threads sharing an address space. There is one PTE that is used by all 
processes, and the VMA maintained in the separate mshare mm_struct that also holds the shared PTE is the one that gets 
added to rmap. This is a different model in that mshare adds an mm_struct that is separate from the mm_structs of the 
processes that refer to the vma and pte in the mshare mm_struct. Do you see issues with rmap in this model?

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-05-30 10:48 ` Barry Song
  2022-05-30 11:18   ` David Hildenbrand
@ 2022-06-29 17:40   ` Khalid Aziz
  1 sibling, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-06-29 17:40 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, Andy Lutomirski, markhemm, pcc,
	Mike Rapoport, sieberf, sjpark, Suren Baghdasaryan, tst,
	Iurii Zaikin

On 5/30/22 04:48, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>>
>> Some of the field deployments commonly see memory pages shared across
>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>> bytes long which is very small compared to the 4K page size. When 2000
>> processes map the same page in their address space, each one of them
>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>> just to hold the PTEs for one 4K page. On a database server with 300GB
>> SGA, a system crash was seen with out-of-memory condition when 1500+
>> clients tried to share this SGA even though the system had 512GB of
>> memory. On this server, in the worst case scenario of all 1500
>> processes mapping every page from SGA would have required 878GB+ for
>> just the PTEs. If these PTEs could be shared, amount of memory saved
>> is very significant.
>>
>> This patch series implements a mechanism in kernel to allow userspace
>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>> mshare(), which can be used by a process to create a region (we will
>> call it mshare'd region) which can be used by other processes to map
>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>> detach from the mshare'd region. Once an mshare'd region is created,
>> other process(es), assuming they have the right permissions, can make
>> the mshare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>>    This alignment value can be looked up in /proc/sys/vm/mshare_size
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>>    shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>>    to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>          if (addr == MAP_FAILED)
>>                  perror("ERROR: mmap failed");
>>
>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          strncpy(addr, "Some random shared text",
>>                          sizeof("Some random shared text"));
>> -----------------
>>
>> Snippet of code that a consumer process would execute looks like:
>>
>> -----------------
>>          struct mshare_info minfo;
>>
>>          fd = open("testregion", O_RDONLY);
>>          if (fd < 0) {
>>                  perror("open failed");
>>                  exit(1);
>>          }
>>
>>          if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>>                  printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>                                  minfo.size, minfo.start);
>>          else
>>                  perror("read failed");
>>
>>          close(fd);
>>
>>          addr = (void *)minfo.start;
>>          err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>                          O_RDWR, 600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          printf("Guest mmap at %px:\n", addr);
>>          printf("%s\n", addr);
>>          printf("\nDone\n");
>>
>>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>          if (err < 0) {
>>                  perror("mshare_unlink() failed");
>>                  exit(1);
>>          }
>> -----------------
> 
> 
>> Does that mean those shared pages will get page_mapcount()=1?
> 
>> A big pain for a memory-limited system like a desktop/embedded system is
>> that reverse mapping will take tons of cpu in the memory reclamation path,
>> especially for those pages mapped by multiple processes. Sometimes we see
>> 100% cpu utilization on LRU scanning to find out if a page has been
>> accessed by reading the PTE young bit.
> 
>> If we end up with only one PTE thanks to this patchset, it means we are
>> getting a significant performance improvement in the kernel LRU,
>> particularly when free memory approaches the watermarks.
>>
>> But I don't see how a new system call like mshare() can be used
>> by those systems, as they might need a more automatic PTE-sharing
>> mechanism.
> 
> BTW, I suppose we are actually able to share PTEs as long as the address
> is 2MB aligned?
> 

The anonymous pages that are allocated to back the virtual pages in the vmas maintained in the mshare region get added 
to rmap once. The mshare() system call is going away and has been replaced by an fd-based mmap in the next version of 
this patch series.

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 00/14] Add support for shared PTEs across processes
  2022-05-30 11:18   ` David Hildenbrand
  2022-05-30 11:49     ` Barry Song
@ 2022-06-29 17:48     ` Khalid Aziz
  1 sibling, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-06-29 17:48 UTC (permalink / raw)
  To: David Hildenbrand, Barry Song
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, ebiederm, hagen, jack, Kees Cook,
	kirill, kucharsk, linkinjeon, linux-fsdevel, LKML, Linux-MM,
	longpeng2, Andy Lutomirski, markhemm, pcc, Mike Rapoport,
	sieberf, sjpark, Suren Baghdasaryan, tst, Iurii Zaikin

On 5/30/22 05:18, David Hildenbrand wrote:
> On 30.05.22 12:48, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>
>>> Page tables in kernel consume some of the memory and as long as number
>>> of mappings being maintained is small enough, this space consumed by
>>> page tables is not objectionable. When very few memory pages are
>>> shared between processes, the number of page table entries (PTEs) to
>>> maintain is mostly constrained by the number of pages of memory on the
>>> system. As the number of shared pages and the number of times pages
>>> are shared goes up, amount of memory consumed by page tables starts to
>>> become significant.
>>>
>>> Some of the field deployments commonly see memory pages shared across
>>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>>> bytes long which is very small compared to the 4K page size. When 2000
>>> processes map the same page in their address space, each one of them
>>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>>> just to hold the PTEs for one 4K page. On a database server with 300GB
>>> SGA, a system crash was seen with out-of-memory condition when 1500+
>>> clients tried to share this SGA even though the system had 512GB of
>>> memory. On this server, in the worst case scenario of all 1500
>>> processes mapping every page from SGA would have required 878GB+ for
>>> just the PTEs. If these PTEs could be shared, amount of memory saved
>>> is very significant.
>>>
>>> This patch series implements a mechanism in kernel to allow userspace
>>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>>> mshare(), which can be used by a process to create a region (we will
>>> call it mshare'd region) which can be used by other processes to map
>>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>>> detach from the mshare'd region. Once an mshare'd region is created,
>>> other process(es), assuming they have the right permissions, can make
>>> the mshare() system call to map the shared pages into their address
>>> space using the shared PTEs.  When a process is done using this
>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>> access. When the last process accessing mshare'd region calls
>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>> it is freed.
>>>
>>>
>>> API
>>> ===
>>>
>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>
>>> --
>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>
>>> mshare() creates and opens a new, or opens an existing mshare'd
>>> region that will be shared at PTE level. "name" refers to shared object
>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>> of this shared memory area and length is the size of this area.
>>> oflags can be one of:
>>>
>>> - O_RDONLY opens shared memory area for read only access by everyone
>>> - O_RDWR opens shared memory area for read and write access
>>> - O_CREAT creates the named shared memory area if it does not exist
>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>    exists with that name, return an error.
>>>
>>> mode represents the creation mode for the shared object under
>>> /sys/fs/mshare.
>>>
>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>
>>> PTEs are shared at pgdir level and hence it imposes following
>>> requirements on the address and size given to the mshare():
>>>
>>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>>>    This alignment value can be looked up in /proc/sys/vm/mshare_size
>>> - Size must be a multiple of pgdir size
>>> - Any mappings created in this address range at any time become
>>>    shared automatically
>>> - Shared address range can have unmapped addresses in it. Any access
>>>    to unmapped address will result in SIGBUS
>>>
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in
>>> the shared area. All other processes that want mshare'd access to
>>> this memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>> this file returns two long values - (1) starting address, and (2)
>>> size of the mshare'd region.
>>>
>>> --
>>> int mshare_unlink(char *name)
>>>
>>> A shared address range created by mshare() can be destroyed using
>>> mshare_unlink() which removes the  shared named object. Once all
>>> processes have unmapped the shared object, the shared address range
>>> references are de-allocated and destroyed.
>>>
>>> mshare_unlink() returns 0 on success or -1 on error.
>>>
>>>
>>> Example Code
>>> ============
>>>
>>> Snippet of the code that a donor process would run looks like below:
>>>
>>> -----------------
>>>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>>          if (addr == MAP_FAILED)
>>>                  perror("ERROR: mmap failed");
>>>
>>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>>          if (err < 0) {
>>>                  perror("mshare() syscall failed");
>>>                  exit(1);
>>>          }
>>>
>>>          strncpy(addr, "Some random shared text",
>>>                          sizeof("Some random shared text"));
>>> -----------------
>>>
>>> Snippet of code that a consumer process would execute looks like:
>>>
>>> -----------------
>>>          struct mshare_info minfo;
>>>
>>>          fd = open("testregion", O_RDONLY);
>>>          if (fd < 0) {
>>>                  perror("open failed");
>>>                  exit(1);
>>>          }
>>>
>>>          if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>>>                  printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>>                                  minfo.size, minfo.start);
>>>          else
>>>                  perror("read failed");
>>>
>>>          close(fd);
>>>
>>>          addr = (void *)minfo.start;
>>>          err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>>                          O_RDWR, 600);
>>>          if (err < 0) {
>>>                  perror("mshare() syscall failed");
>>>                  exit(1);
>>>          }
>>>
>>>          printf("Guest mmap at %px:\n", addr);
>>>          printf("%s\n", addr);
>>>          printf("\nDone\n");
>>>
>>>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>>          if (err < 0) {
>>>                  perror("mshare_unlink() failed");
>>>                  exit(1);
>>>          }
>>> -----------------
>>
>>
>> Does that mean those shared pages will get page_mapcount()=1?
> 
> AFAIU, for mshare() that is the case.
> 
>>
>> A big pain for a memory-limited system like a desktop/embedded system is
>> that reverse mapping will take tons of cpu in the memory reclamation path,
>> especially for those pages mapped by multiple processes. Sometimes we see
>> 100% cpu utilization on LRU scanning to find out if a page has been
>> accessed by reading the PTE young bit.
> 
> Regarding PTE-table sharing:
> 
> Even if we'd account each logical mapping (independent of page table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing. Simply when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly. So unmapping from a single
> (shared) pagetable could directly result in the mapcount dropping to zero.
> 
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
> 
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.

Right, that is the question - what does mapcount represent? I am interpreting it as: mapcount represents how many PTEs 
map the page. Since mshare uses one PTE for each shared page irrespective of how many processes share the page, a 
mapcount of 1 sounds reasonable to me.

> 
>>
>> If we end up with only one PTE thanks to this patchset, it means we are
>> getting a significant performance improvement in the kernel LRU,
>> particularly when free memory approaches the watermarks.
>>
>> But I don't see how a new system call like mshare() can be used
>> by those systems, as they might need a more automatic PTE-sharing
>> mechanism.
> 
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
> 
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.

It is tempting to make this sharing automatic and mshare may evolve to that. Since mshare assumes significant trust 
between the processes sharing pages (shared pages share attributes and possibly protection keys), it sounds dangerous 
to make that assumption automatically without processes explicitly declaring that level of trust.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
  2022-06-29 17:38     ` Khalid Aziz
@ 2022-07-03 20:54       ` Andy Lutomirski
  2022-07-06 20:33         ` Khalid Aziz
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2022-07-03 20:54 UTC (permalink / raw)
  To: Khalid Aziz, Barry Song
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, markhemm, Peter Collingbourne,
	Mike Rapoport, sieberf, sjpark, Suren Baghdasaryan, tst,
	Iurii Zaikin

On 6/29/22 10:38, Khalid Aziz wrote:
> On 5/30/22 22:24, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> 
>> wrote:
>>>
>>> mshare'd PTEs should not be removed when a task exits. These PTEs
>>> are removed when the last task sharing the PTEs exits. Add a check
>>> for shared PTEs and skip them.
>>>
>>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>>> ---
>>>   mm/memory.c | 22 +++++++++++++++++++---
>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index c77c0d643ea8..e7c5bc6f8836 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>                  } else {
>>>                          /*
>>>                           * Optimization: gather nearby vmas into one call down
>>> +                        * as long as they all belong to the same mm (that
>>> +                        * may not be the case if a vma is part of mshare'd
>>> +                        * range
>>>                           */
>>>                          while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>>> -                              && !is_vm_hugetlb_page(next)) {
>>> +                              && !is_vm_hugetlb_page(next)
>>> +                              && vma->vm_mm == tlb->mm) {
>>>                                  vma = next;
>>>                                  next = vma->vm_next;
>>>                                  unlink_anon_vmas(vma);
>>>                                  unlink_file_vma(vma);
>>>                          }
>>> -                       free_pgd_range(tlb, addr, vma->vm_end,
>>> -                               floor, next ? next->vm_start : ceiling);
>>> +                       /*
>>> +                        * Free pgd only if pgd is not allocated for an
>>> +                        * mshare'd range
>>> +                        */
>>> +                       if (vma->vm_mm == tlb->mm)
>>> +                               free_pgd_range(tlb, addr, vma->vm_end,
>>> +                                       floor, next ? next->vm_start : ceiling);
>>>                  }
>>>                  vma = next;
>>>          }
>>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>          pgd_t *pgd;
>>>          unsigned long next;
>>>
>>> +       /*
>>> +        * If this is an mshare'd page, do not unmap it since it might
>>> +        * still be in use.
>>> +        */
>>> +       if (vma->vm_mm != tlb->mm)
>>> +               return;
>>> +
>>
>> Apart from unmap, have you ever tested reverse mapping in vmscan, especially
>> folio_referenced()? Are all the vmas in the processes sharing the page table
>> still in the rmap of the shared page?
>> Without shared PTEs, if 1000 processes share one page, we read 1000
>> PTEs; with shared PTEs, are we reading just one, or are we reading the same
>> PTE 1000 times? Have you tested it?
>>
> 
> We are treating an mshare'd region the same as threads sharing an address
> space. There is one PTE that is used by all processes, and the VMA
> maintained in the separate mshare mm struct (which also holds the shared
> PTE) is the one that gets added to rmap. mshare's model differs in that it
> adds an mm struct that is separate from the mm structs of the processes
> that refer to the vma and PTE in the mshare mm struct. Do you see issues
> with rmap in this model?

I think this patch is actually the most interesting bit of the series by 
far.  Most of the rest is defining an API (which is important!) and 
figuring out semantics.  This patch changes something rather fundamental 
about how user address spaces work: what vmas live in them.  So let's 
figure out its effects.

I admit I'm rather puzzled about what vm_mm is for in the first place. 
In current kernels (without your patch), I think it's a pretty hard 
requirement for vm_mm to equal the mm for all vmas in an mm.  After a 
quick and incomplete survey, vm_mm seems to be mostly used as a somewhat 
lazy way to find the mm.  Let's see:

file_operations->mmap doesn't receive an mm_struct.  Instead it infers 
the mm from vm_mm.  (Why?  I don't know.)
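
Schematically (a made-up driver, not any real one):

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;	/* the only route to the mm here */

	/* ... set up the mapping using mm ... */
	return 0;
}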

Some walk_page_range users seem to dig the mm out of vm_mm instead of 
mm_walk.

Some manual address space walkers start with an mm, don't bother passing 
it around, and dig it back out of vm_mm.  For example, unuse_vma() 
and all its helpers.

The only real exception I've found so far is rmap: AFAICS (on quick 
inspection -- I could be wrong), rmap can map from a folio to a bunch of 
vmas, and the vmas' mms are not stored separately but instead determined 
by vm_mm.



Your patch makes me quite nervous.  You're potentially breaking any 
kernel code path that assumes that mms only contain vmas that have vm_mm 
== mm.  And you're potentially causing rmap to be quite confused.  I 
think that if you're going to take this approach, you need to clearly 
define the new semantics of vm_mm and audit or clean up every user of 
vm_mm in the tree.  This may be nontrivial (especially rmap), although a 
cleanup of everything else to stop using vm_mm might be valuable.

But I'm wondering if it would be better to attack this from a different 
direction.  Right now, there's a hardcoded assumption that an mm owns 
every page table it references.  That's really the thing you're 
changing.  To me, it seems that a magical vma that shares page tables 
should still be a vma that belongs to its mm_struct -- munmap() and 
potentially other m***() operations should all work on it, existing 
find_vma() users should work, etc.

So maybe instead there should be new behavior (by a VM_ flag or 
otherwise) that indicates that a vma owns its PTEs.  It could even be a 
vm_operation, although if anyone ever wants regular file mappings to 
share PTEs, then a vm_operation doesn't really make sense.
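
For instance (flag and helper names invented for the sake of argument):

/* A vma owns its page tables unless it is explicitly marked otherwise. */
static inline bool vma_owns_ptes(struct vm_area_struct *vma)
{
	return !(vma->vm_flags & VM_PTE_SHARED);	/* hypothetical flag */
}

free_pgtables() and unmap_page_range() would then test this flag rather 
than comparing vma->vm_mm against tlb->mm.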


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
  2022-07-03 20:54       ` Andy Lutomirski
@ 2022-07-06 20:33         ` Khalid Aziz
  0 siblings, 0 replies; 37+ messages in thread
From: Khalid Aziz @ 2022-07-06 20:33 UTC (permalink / raw)
  To: Andy Lutomirski, Barry Song
  Cc: Andrew Morton, Matthew Wilcox, Aneesh Kumar, Arnd Bergmann,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, ebiederm, hagen,
	jack, Kees Cook, kirill, kucharsk, linkinjeon, linux-fsdevel,
	LKML, Linux-MM, longpeng2, markhemm, Peter Collingbourne,
	Mike Rapoport, sieberf, sjpark, Suren Baghdasaryan, tst,
	Iurii Zaikin

On 7/3/22 14:54, Andy Lutomirski wrote:
> On 6/29/22 10:38, Khalid Aziz wrote:
>> On 5/30/22 22:24, Barry Song wrote:
>>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> mshare'd PTEs should not be removed when a task exits. These PTEs
>>>> are removed when the last task sharing the PTEs exits. Add a check
>>>> for shared PTEs and skip them.
>>>>
>>>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>>>> ---
>>>>   mm/memory.c | 22 +++++++++++++++++++---
>>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index c77c0d643ea8..e7c5bc6f8836 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>                  } else {
>>>>                          /*
>>>>                           * Optimization: gather nearby vmas into one call down
>>>> +                        * as long as they all belong to the same mm (that
>>>> +                        * may not be the case if a vma is part of mshare'd
>>>> +                        * range
>>>>                           */
>>>>                          while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>>>> -                              && !is_vm_hugetlb_page(next)) {
>>>> +                              && !is_vm_hugetlb_page(next)
>>>> +                              && vma->vm_mm == tlb->mm) {
>>>>                                  vma = next;
>>>>                                  next = vma->vm_next;
>>>>                                  unlink_anon_vmas(vma);
>>>>                                  unlink_file_vma(vma);
>>>>                          }
>>>> -                       free_pgd_range(tlb, addr, vma->vm_end,
>>>> -                               floor, next ? next->vm_start : ceiling);
>>>> +                       /*
>>>> +                        * Free pgd only if pgd is not allocated for an
>>>> +                        * mshare'd range
>>>> +                        */
>>>> +                       if (vma->vm_mm == tlb->mm)
>>>> +                               free_pgd_range(tlb, addr, vma->vm_end,
>>>> +                                       floor, next ? next->vm_start : ceiling);
>>>>                  }
>>>>                  vma = next;
>>>>          }
>>>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>>          pgd_t *pgd;
>>>>          unsigned long next;
>>>>
>>>> +       /*
>>>> +        * If this is an mshare'd page, do not unmap it since it might
>>>> +        * still be in use.
>>>> +        */
>>>> +       if (vma->vm_mm != tlb->mm)
>>>> +               return;
>>>> +
>>>
>>> Apart from unmap, have you ever tested reverse mapping in vmscan, especially
>>> folio_referenced()? Are all the vmas in the processes sharing the page table still
>>> in the rmap of the shared page?
>>> Without shared PTEs, if 1000 processes share one page, we read 1000
>>> PTEs; with shared PTEs, are we reading just one, or are we reading the same PTE
>>> 1000 times? Have you tested it?
>>>
>>
>> We are treating an mshare'd region the same as threads sharing an address space. There is one PTE that is used by all 
>> processes, and the VMA maintained in the separate mshare mm struct (which also holds the shared PTE) is the one that 
>> gets added to rmap. mshare's model differs in that it adds an mm struct that is separate from the mm structs of the 
>> processes that refer to the vma and PTE in the mshare mm struct. Do you see issues with rmap in this model?
> 
> I think this patch is actually the most interesting bit of the series by far.  Most of the rest is defining an API 
> (which is important!) and figuring out semantics.  This patch changes something rather fundamental about how user 
> address spaces work: what vmas live in them.  So let's figure out its effects.
> 
> I admit I'm rather puzzled about what vm_mm is for in the first place. In current kernels (without your patch), I think 
> it's a pretty hard requirement for vm_mm to equal the mm for all vmas in an mm.  After a quick and incomplete survey, 
> vm_mm seems to be mostly used as a somewhat lazy way to find the mm.  Let's see:
> 
> file_operations->mmap doesn't receive an mm_struct.  Instead it infers the mm from vm_mm.  (Why?  I don't know.)
> 
> Some walk_page_range users seem to dig the mm out of vm_mm instead of mm_walk.
> 
> Some manual address space walkers start with an mm, don't bother passing it around, and dig it back out of vm_mm.  
> For example, unuse_vma() and all its helpers.
> 
> The only real exception I've found so far is rmap: AFAICS (on quick inspection -- I could be wrong), rmap can map from a 
> folio to a bunch of vmas, and the vmas' mms are not stored separately but instead determined by vm_mm.
> 
> 
> 
> Your patch makes me quite nervous.  You're potentially breaking any kernel code path that assumes that mms only contain 
> vmas that have vm_mm == mm.  And you're potentially causing rmap to be quite confused.  I think that if you're going to 
> take this approach, you need to clearly define the new semantics of vm_mm and audit or clean up every user of vm_mm in 
> the tree.  This may be nontrivial (especially rmap), although a cleanup of everything else to stop using vm_mm might be 
> valuable.
> 
> But I'm wondering if it would be better to attack this from a different direction.  Right now, there's a hardcoded 
> assumption that an mm owns every page table it references.  That's really the thing you're changing.  To me, it seems 
> that a magical vma that shares page tables should still be a vma that belongs to its mm_struct -- munmap() and 
> potentially other m***() operations should all work on it, existing find_vma() users should work, etc.
> 
> So maybe instead there should be new behavior (by a VM_ flag or otherwise) that indicates that a vma owns its PTEs.  It 
> could even be a vm_operation, although if anyone ever wants regular file mappings to share PTEs, then a vm_operation 
> doesn't really make sense.
> 

Hi Andy,

You are absolutely right; dragons lie on the path to changing the sense of vm_mm. Dave H pointed out potential issues 
with the rb_tree as well. As I have looked at more code, it has become clear that breaking the assumption that vm_mm 
always points to the containing mm struct opens up the possibility of code breaking in strange ways in odd places. As 
a result, I have changed the code in the v2 patches to preserve this assumption about vm_mm; instead, the code now uses 
the vm flag VM_SHARED_PT and vm_private_data everywhere it needs to find the mshare mm struct. All the vmas belonging 
to the new mm struct for the mshare region also have their vm_mm pointing to the mshare mm_struct, which keeps all vma 
operations working normally.
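
As a rough sketch of that lookup (the mshare_data layout here is illustrative, not necessarily the exact 
v2 structure):

static struct mm_struct *mshare_host_mm(struct vm_area_struct *vma)
{
	struct mshare_data *info = vma->vm_private_data;

	if (!(vma->vm_flags & VM_SHARED_PT))
		return vma->vm_mm;

	return info->mm;	/* the separate mshare mm_struct */
}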

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-07-06 20:34 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 01/14] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 02/14] mm/mshare: Add msharefs filesystem Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 03/14] mm/mshare: Add read for msharefs Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 04/14] mm/mshare: implement mshare_unlink syscall Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 05/14] mm/mshare: Add locking to msharefs syscalls Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 06/14] mm/mshare: Check for mounted filesystem Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 07/14] mm/mshare: Add vm flag for shared PTE Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare Khalid Aziz
2022-04-11 18:48   ` Dave Hansen
2022-04-11 20:39     ` Khalid Aziz
2022-05-30 11:11   ` Barry Song
2022-06-28 20:11     ` Khalid Aziz
2022-05-31  3:46   ` Barry Song
2022-06-28 20:16     ` Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs Khalid Aziz
2022-05-31  4:24   ` Barry Song
2022-06-29 17:38     ` Khalid Aziz
2022-07-03 20:54       ` Andy Lutomirski
2022-07-06 20:33         ` Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 10/14] mm/mshare: Check for mapped vma when mshare'ing existing mshare'd range Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 11/14] mm/mshare: unmap vmas in mshare_unlink Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 12/14] mm/mshare: Add a proc file with mshare alignment/size information Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 13/14] mm/mshare: Enforce mshare'd region permissions Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 14/14] mm/mshare: Copy PTEs to host mm Khalid Aziz
2022-04-11 17:37 ` [PATCH v1 00/14] Add support for shared PTEs across processes Matthew Wilcox
2022-04-11 18:51   ` Dave Hansen
2022-04-11 19:08     ` Matthew Wilcox
2022-04-11 19:52   ` Khalid Aziz
2022-04-11 18:47 ` Dave Hansen
2022-04-11 20:10 ` Eric W. Biederman
2022-04-11 22:21   ` Khalid Aziz
2022-05-30 10:48 ` Barry Song
2022-05-30 11:18   ` David Hildenbrand
2022-05-30 11:49     ` Barry Song
2022-06-29 17:48     ` Khalid Aziz
2022-06-29 17:40   ` Khalid Aziz
