* [RFC PATCH 0/6] Add support for shared PTEs across processes
@ 2022-01-18 21:19 Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 1/6] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
                   ` (11 more replies)
  0 siblings, 12 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 21:19 UTC (permalink / raw)
  To: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm
  Cc: Khalid Aziz

Page tables in the kernel consume some memory, and as long as the
number of mappings being maintained is small enough, the space
consumed by page tables is not objectionable. When very few memory
pages are shared between processes, the number of page table entries
(PTEs) to maintain is mostly constrained by the number of pages of
memory on the system. As the number of shared pages and the number
of times pages are shared go up, the amount of memory consumed by
page tables starts to become significant.

Some field deployments commonly see memory pages shared across
thousands of processes. On x86_64, each page requires a PTE that is
only 8 bytes long, which is very small compared to the 4K page size.
When 2000 processes map the same page into their address space, each
one of them requires 8 bytes for its PTE and together that adds up
to 8K of memory just to hold the PTEs for one 4K page. On a database
server with a 300GB SGA, a system crash was seen due to an
out-of-memory condition when 1500+ clients tried to share this SGA,
even though the system had 512GB of memory. On this server, the
worst-case scenario of all 1500 processes mapping every page of the
SGA would have required 878GB+ for just the PTEs (300GB of 4K pages
is roughly 78.6 million PTEs, or about 600MB of page tables per
process, which across 1500 processes adds up to that figure). If
these PTEs could be shared, the amount of memory saved would be
very significant.

This is a proposal to implement a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. The proposal is to add
a new system call - mshare(), which can be used by a process to
create a region (we will call it an mshare'd region) which can be
used by other processes to map the same pages using shared PTEs.
Other process(es), assuming they have the right permissions, can
then make the mshare() system call to map the shared pages into
their address space using the shared PTEs.  When a process is done
using this mshare'd region, it makes an mshare_unlink() system call
to end its access. When the last process accessing an mshare'd
region calls mshare_unlink(), the mshare'd region is torn down and
the memory used by it is freed.


API Proposal
============

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing, mshare'd
region that will be shared at the PTE level. "name" refers to the
shared object name that exists under /sys/fs/mshare. "addr" is the
starting address of this shared memory area and "length" is the size
of the area. oflags can be a combination of:

- O_RDONLY opens shared memory area for read only access by everyone
- O_RDWR opens shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
  exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.

PTEs are shared at the pgdir level, which imposes the following
requirements on the address and size given to mshare() (see the
sketch after this list):

- Starting address must be aligned to pgdir size (512GB on x86_64)
- Size must be a multiple of pgdir size
- Any mappings created in this address range at any time become
  shared automatically
- Shared address range can have unmapped addresses in it. Any access
  to an unmapped address will result in SIGBUS
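
As a rough sketch of what the alignment requirement means in practice
(assuming PGDIR_SIZE is 512GB, i.e. x86_64 with 4-level paging, and
using a hypothetical helper name), a caller could validate its
arguments before calling mshare() like this:

-----------------
        /* Hypothetical userspace mirror of the kernel's pgdir check */
        #define PGDIR_SZ        (512UL << 30)   /* x86_64, 4-level paging */

        static int mshare_args_aligned(unsigned long addr, unsigned long len)
        {
                /* start address and length must both be pgdir aligned */
                return ((addr | len) & (PGDIR_SZ - 1)) == 0;
        }
-----------------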

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in
the shared area. All other processes that want mshare'd access to
this memory area can do so by calling mshare(). After this call, the
address range given to mshare() becomes a shared range in their
address space. Anonymous mappings will be shared and not COWed.

A file under /sys/fs/mshare can be opened and read from. A read from
this file returns two long values - (1) the starting address, and
(2) the size of the mshare'd region.

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink(), which removes the shared named object. Once all
processes have unmapped the shared object, the references to the
shared address range are de-allocated and the range is destroyed.

mshare_unlink() returns 0 on success or -1 on error.


Example Code
============
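
The snippets below rely on a few helpers that this cover letter does
not define. A minimal sketch of plausible definitions is shown here;
the syscall numbers come from the x86_64 syscall table added by
patch 1, while the GB()/TB() macros and the mshare_info buffer are
assumptions made purely for illustration:

-----------------
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Syscall numbers from the x86_64 syscall table in patch 1 */
        #define MSHARE_SYSCALL          450
        #define MSHARE_UNLINK_SYSCALL   451

        /* Size helpers used by the snippets (assumed definitions) */
        #define GB(x)   ((unsigned long)(x) << 30)
        #define TB(x)   ((unsigned long)(x) << 40)

        /* Buffer the consumer reads the (address, size) pair into */
        unsigned long mshare_info[2];
-----------------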

A snippet of the code that a donor process would run looks like this:

-----------------
        addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
                perror("ERROR: mmap failed");
                exit(1);
        }

        err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
                        GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        strncpy(addr, "Some random shared text",
                        sizeof("Some random shared text"));
-----------------

A snippet of the code that a consumer process would execute looks like this:

-----------------
        fd = open("testregion", O_RDONLY);
        if (fd < 0) {
                perror("open failed");
                exit(1);
        }

        if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0)
                printf("INFO: %ld bytes shared at addr %lx\n",
                                mshare_info[1], mshare_info[0]);
        else
                perror("read failed");

        close(fd);

        addr = (char *)mshare_info[0];
        err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
                        mshare_info[1], O_RDWR, 0600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        printf("Guest mmap at %px:\n", addr);
        printf("%s\n", addr);
	printf("\nDone\n");

        err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
        if (err < 0) {
                perror("mshare_unlink() failed");
                exit(1);
        }
-----------------


RFC Proposal
============

This series of RFC patches is a prototype implementation of these
two system calls. The code is incomplete, with many bugs and
limitations, but it works well enough to show how the concept would
work.

The prototypes of the two syscalls are:

SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
		unsigned long, len, int, oflag, mode_t, mode)

SYSCALL_DEFINE1(mshare_unlink, const char *, name)

In order to facilitate page table sharing, this implementation adds
a new in-memory filesystem - msharefs - which is mounted at
/sys/fs/mshare. When a new mshare'd region is created, a file with
the name given by the initial mshare() call is created under this
filesystem. Permissions for this file are given by the "mode"
parameter to mshare(). The only operation supported on this file is
read. A read from this file returns two long values - (1) the
starting virtual address of the region, and (2) the size of the
mshare'd region.
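
For reference, msharefs is mounted in the usual way, as described in
the msharefs documentation added by patch 2 of this series:

	mount -t msharefs none /sys/fs/mshare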

A donor process that wants to create an mshare'd region from a
section of its mapped addresses calls mshare() with the O_CREAT
oflag. The mshare() syscall then creates a new mm_struct which hosts
the page tables for the mshare'd region. vma->vm_mm for the vmas
covering the address range of this region is updated to point to
this new mm_struct instead of current->mm. Existing page tables are
copied over to the new mm_struct.

A consumer process that wants to map an mshare'd region opens
/sys/fs/mshare/<filename> and reads the starting address and size of
the mshare'd region. It then calls mshare() with these values to map
the entire region into its address space. The consumer process calls
mshare_unlink() to terminate its access.

I would very much appreciate feedback on - (1) the API and how it
defines the PTE sharing concept, and (2) the core of the initial
implementation. The file mm/mshare.c contains, at the top of the
file, a list of the current issues/bugs/holes in the code. If the
general direction of this work looks reasonable, I will continue
working on the missing pieces, including much of the error handling,
and on fixing bugs.


Khalid Aziz (6):
  mm: Add new system calls mshare, mshare_unlink
  mm: Add msharefs filesystem
  mm: Add read for msharefs
  mm: implement mshare_unlink syscall
  mm: Add locking to msharefs syscalls
  mm: Add basic page table sharing using mshare

 Documentation/filesystems/msharefs.rst |  19 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/mm.h                     |   8 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/asm-generic/unistd.h      |   7 +-
 include/uapi/linux/magic.h             |   1 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |   7 +
 mm/memory.c                            |  35 +-
 mm/mshare.c                            | 604 +++++++++++++++++++++++++
 10 files changed, 682 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

-- 
2.32.0


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH 1/6] mm: Add new system calls mshare, mshare_unlink
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
@ 2022-01-18 21:19 ` Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 2/6] mm: Add msharefs filesystem Khalid Aziz
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 21:19 UTC (permalink / raw)
  To: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm
  Cc: Khalid Aziz

Add two new system calls to support PTE sharing across processes
through explicit declaration of a shared address space. There is
almost no implementation in this patch; it only wires up the system
calls for x86_64. mshare() returns a file descriptor which does not
support any operations yet.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  2 +
 include/uapi/asm-generic/unistd.h      |  7 ++-
 mm/Makefile                            |  2 +-
 mm/mshare.c                            | 60 ++++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 2 deletions(-)
 create mode 100644 mm/mshare.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fe8f8dd157b4..bb403deca1ff 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,8 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	mshare			sys_mshare
+451	common	mshare_unlink		sys_mshare_unlink
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 4557a8b6086f..27349ad579ff 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 
+#define __NR_mshare 450
+__SYSCALL(__NR_mshare, sys_mshare)
+#define __NR_mshare_unlink 451
+__SYSCALL(__NR_mshare_unlink, sys_mshare_unlink)
+
 #undef __NR_syscalls
-#define __NR_syscalls 450
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/mm/Makefile b/mm/Makefile
index d6c0042e3aa0..fca44c0d5e74 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,7 +35,7 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
 CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o mshare.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
 			   pgtable-generic.o rmap.o vmalloc.o
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..c723f8369f06
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * mm/mshare.c
+ *
+ * Page table sharing code
+ *
+ *
+ * Copyright (C) 2021 Oracle Corp. All rights reserved.
+ * Authors:	Matthew Wilcox
+ *		Khalid Aziz
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/syscalls.h>
+
+static const struct file_operations mshare_fops = {
+};
+
+/*
+ * mshare syscall. Returns a file descriptor
+ */
+SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
+		unsigned long, len, int, oflag, mode_t, mode)
+{
+	int fd;
+
+	/*
+	 * Address range being shared must be aligned to pgdir
+	 * boundary and its size must be a multiple of pgdir size
+	 */
+	if ((addr | len) & (PGDIR_SIZE - 1))
+		return -EINVAL;
+
+	/*
+	 * Allocate a file descriptor to return
+	 *
+	 * TODO: This code ignores the object name completely. Add
+	 * support for that
+	 */
+	fd = anon_inode_getfd("mshare", &mshare_fops, NULL, O_RDWR);
+
+	return fd;
+}
+
+/*
+ * mshare_unlink syscall. Close and remove the named mshare'd object
+ */
+SYSCALL_DEFINE1(mshare_unlink, const char *, name)
+{
+	int fd;
+
+	/*
+	 * Delete the named object
+	 *
+	 * TODO: Mark mshare'd range for deletion
+	 *
+	 */
+	return 0;
+}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 2/6] mm: Add msharefs filesystem
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 1/6] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
@ 2022-01-18 21:19 ` Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 3/6] mm: Add read for msharefs Khalid Aziz
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 21:19 UTC (permalink / raw)
  To: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm
  Cc: Khalid Aziz

Add a RAM-based filesystem that contains the files created for each
shared address range. This patch adds just the filesystem and file
creation. Page table entries for the shared ranges created by the
mshare syscall are not yet shared.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 Documentation/filesystems/msharefs.rst |  19 +++
 include/uapi/linux/magic.h             |   1 +
 mm/mshare.c                            | 191 +++++++++++++++++++++++--
 3 files changed, 197 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst

diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..fd161f67045d
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+msharefs - a filesystem to support shared page tables
+=====================================================
+
+msharefs is a ram-based filesystem that allows multiple processes to
+share page table entries for shared pages.
+
+msharefs is typically mounted like this::
+
+	mount -t msharefs none /sys/fs/mshare
+
+When a process calls mshare syscall with a name for the shared address
+range, a file with the same name is created under msharefs with that
+name. This file can be opened by another process, if permissions
+allow, to query the addresses shared under this range. These files are
+removed by mshare_unlink syscall and can not be deleted directly.
+Hence these files are created as immutable files.
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 35687dcb1a42..26a12e33a3c1 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -98,5 +98,6 @@
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define MSHARE_MAGIC		0x4d534852	/* "MSHR" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/mshare.c b/mm/mshare.c
index c723f8369f06..e48d0f615f9f 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -10,20 +10,117 @@
  *		Khalid Aziz
  */
 
-#include <linux/anon_inodes.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <linux/pseudo_fs.h>
+#include <linux/fileattr.h>
+#include <uapi/linux/magic.h>
+#include <uapi/linux/limits.h>
 
-static const struct file_operations mshare_fops = {
+static struct super_block *msharefs_sb;
+
+static const struct file_operations msharefs_file_operations = {
+	.open	= simple_open,
+	.llseek	= no_llseek,
 };
 
+static int
+msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
+{
+	unsigned long hash = init_name_hash(dentry);
+	const unsigned char *s = qstr->name;
+	unsigned int len = qstr->len;
+
+	while (len--)
+		hash = partial_name_hash(*s++, hash);
+	qstr->hash = end_name_hash(hash);
+	return 0;
+}
+
+static struct dentry
+*msharefs_alloc_dentry(struct dentry *parent, const char *name)
+{
+	struct dentry *d;
+	struct qstr q;
+	int err;
+
+	q.name = name;
+	q.len = strlen(name);
+
+	err = msharefs_d_hash(parent, &q);
+	if (err)
+		return ERR_PTR(err);
+
+	d = d_alloc(parent, &q);
+	if (d)
+		return d;
+
+	return ERR_PTR(-ENOMEM);
+}
+
+static struct inode
+*msharefs_get_inode(struct super_block *sb, int mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (inode) {
+		inode->i_ino = get_next_ino();
+		inode->i_mode = mode;
+
+		/*
+		 * msharefs are not meant to be manipulated from userspace.
+		 * Reading from the file is the only allowed operation
+		 */
+		inode->i_flags = S_IMMUTABLE;
+
+		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+		inode->i_fop = &msharefs_file_operations;
+		inode->i_size = 0;
+		inode->i_uid = current_fsuid();
+		inode->i_gid = current_fsgid();
+	}
+
+	return inode;
+}
+
+static int
+mshare_file_create(const char *name, unsigned long flags)
+{
+	struct inode *inode;
+	struct dentry *root, *dentry;
+	int err = 0;
+
+	root = msharefs_sb->s_root;
+
+	inode = msharefs_get_inode(msharefs_sb, S_IFREG | 0400);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	dentry = msharefs_alloc_dentry(root, name);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto fail_inode;
+	}
+
+	d_add(dentry, inode);
+
+	return err;
+
+fail_inode:
+	iput(inode);
+	return err;
+}
+
 /*
- * mshare syscall. Returns a file descriptor
+ * mshare syscall
  */
-SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
+SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
-	int fd;
+	char mshare_name[NAME_MAX];
+	int err;
 
 	/*
 	 * Address range being shared must be aligned to pgdir
@@ -32,15 +129,14 @@ SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
 	if ((addr | len) & (PGDIR_SIZE - 1))
 		return -EINVAL;
 
-	/*
-	 * Allocate a file descriptor to return
-	 *
-	 * TODO: This code ignores the object name completely. Add
-	 * support for that
-	 */
-	fd = anon_inode_getfd("mshare", &mshare_fops, NULL, O_RDWR);
+	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (err)
+		goto err_out;
 
-	return fd;
+	err = mshare_file_create(mshare_name, oflag);
+
+err_out:
+	return err;
 }
 
 /*
@@ -48,7 +144,8 @@ SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
  */
 SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 {
-	int fd;
+	char mshare_name[NAME_MAX];
+	int err;
 
 	/*
 	 * Delete the named object
@@ -56,5 +153,71 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 	 * TODO: Mark mshare'd range for deletion
 	 *
 	 */
+	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (err)
+		goto err_out;
+	return 0;
+
+err_out:
+	return err;
+}
+
+static const struct dentry_operations msharefs_d_ops = {
+	.d_hash = msharefs_d_hash,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	static const struct tree_descr empty_descr = {""};
+	int err;
+
+	sb->s_d_op = &msharefs_d_ops;
+	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
+	if (err)
+		return err;
+
+	msharefs_sb = sb;
+	return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+	.get_tree	= msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &msharefs_context_ops;
 	return 0;
 }
+
+static struct file_system_type mshare_fs = {
+	.name			= "msharefs",
+	.init_fs_context	= mshare_init_fs_context,
+	.kill_sb		= kill_litter_super,
+};
+
+static int
+mshare_init(void)
+{
+	int ret = 0;
+
+	ret = sysfs_create_mount_point(fs_kobj, "mshare");
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&mshare_fs);
+	if (ret)
+		sysfs_remove_mount_point(fs_kobj, "mshare");
+
+	return ret;
+}
+
+fs_initcall(mshare_init);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 3/6] mm: Add read for msharefs
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 1/6] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 2/6] mm: Add msharefs filesystem Khalid Aziz
@ 2022-01-18 21:19 ` Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 4/6] mm: implement mshare_unlink syscall Khalid Aziz
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 21:19 UTC (permalink / raw)
  To: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm
  Cc: Khalid Aziz

Allocate a new mm to store the shared page tables. Add a read
operation to the file created by the mshare syscall to return the
starting address and size of the shared region.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 61 insertions(+), 5 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index e48d0f615f9f..d998b23c25ab 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,14 +16,37 @@
 #include <linux/uaccess.h>
 #include <linux/pseudo_fs.h>
 #include <linux/fileattr.h>
+#include <linux/refcount.h>
 #include <uapi/linux/magic.h>
 #include <uapi/linux/limits.h>
 
+struct mshare_data {
+	struct mm_struct *mm;
+	refcount_t refcnt;
+};
+
 static struct super_block *msharefs_sb;
 
+static ssize_t
+mshare_read(struct kiocb *iocb, struct iov_iter *iov)
+{
+	struct mshare_data *info = iocb->ki_filp->private_data;
+	struct mm_struct *mm = info->mm;
+	size_t ret;
+	unsigned long tmp[2];
+
+	tmp[0] = mm->mmap_base;
+	tmp[1] = mm->task_size - mm->mmap_base;
+	ret = copy_to_iter(&tmp, sizeof(tmp), iov);
+	if (!ret)
+		return -EFAULT;
+	return ret;
+}
+
 static const struct file_operations msharefs_file_operations = {
-	.open	= simple_open,
-	.llseek	= no_llseek,
+	.open		= simple_open,
+	.read_iter	= mshare_read,
+	.llseek		= no_llseek,
 };
 
 static int
@@ -77,7 +100,12 @@ static struct inode
 
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
 		inode->i_fop = &msharefs_file_operations;
-		inode->i_size = 0;
+
+		/*
+		 * A read from this file will return two unsigned long
+		 */
+		inode->i_size = 2 * sizeof(unsigned long);
+
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
 	}
@@ -86,7 +114,8 @@ static struct inode
 }
 
 static int
-mshare_file_create(const char *name, unsigned long flags)
+mshare_file_create(const char *name, unsigned long flags,
+			struct mshare_data *info)
 {
 	struct inode *inode;
 	struct dentry *root, *dentry;
@@ -98,6 +127,8 @@ mshare_file_create(const char *name, unsigned long flags)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
+	inode->i_private = info;
+
 	dentry = msharefs_alloc_dentry(root, name);
 	if (IS_ERR(dentry)) {
 		err = PTR_ERR(dentry);
@@ -120,6 +151,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
 	char mshare_name[NAME_MAX];
+	struct mshare_data *info;
+	struct mm_struct *mm;
 	int err;
 
 	/*
@@ -133,8 +166,31 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	if (err)
 		goto err_out;
 
-	err = mshare_file_create(mshare_name, oflag);
+	mm = mm_alloc();
+	if (!mm)
+		return -ENOMEM;
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		err = -ENOMEM;
+		goto err_relmm;
+	}
+	mm->mmap_base = addr;
+	mm->task_size = addr + len;
+	if (!mm->task_size)
+		mm->task_size--;
+	info->mm = mm;
+	refcount_set(&info->refcnt, 1);
+
+	err = mshare_file_create(mshare_name, oflag, info);
+	if (err)
+		goto err_relinfo;
+
+	return 0;
 
+err_relinfo:
+	kfree(info);
+err_relmm:
+	mmput(mm);
 err_out:
 	return err;
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 4/6] mm: implement mshare_unlink syscall
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (2 preceding siblings ...)
  2022-01-18 21:19 ` [RFC PATCH 3/6] mm: Add read for msharefs Khalid Aziz
@ 2022-01-18 21:19 ` Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 5/6] mm: Add locking to msharefs syscalls Khalid Aziz
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 21:19 UTC (permalink / raw)
  To: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm
  Cc: Khalid Aziz

Add code to allow the mshare syscall to be made for an existing
mshare'd region. Complete the implementation of the mshare_unlink
syscall. Make reading the mshare resource name from userspace safer.
Fix code to allow msharefs to be unmounted cleanly.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 144 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 113 insertions(+), 31 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index d998b23c25ab..8273136363cc 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -22,11 +22,24 @@
 
 struct mshare_data {
 	struct mm_struct *mm;
+	mode_t mode;
 	refcount_t refcnt;
 };
 
 static struct super_block *msharefs_sb;
 
+static void
+msharefs_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
+static const struct super_operations msharefs_ops = {
+	.statfs 	= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= msharefs_evict_inode,
+};
+
 static ssize_t
 mshare_read(struct kiocb *iocb, struct iov_iter *iov)
 {
@@ -114,7 +127,7 @@ static struct inode
 }
 
 static int
-mshare_file_create(const char *name, unsigned long flags,
+mshare_file_create(struct filename *fname, int flags,
 			struct mshare_data *info)
 {
 	struct inode *inode;
@@ -123,13 +136,16 @@ mshare_file_create(const char *name, unsigned long flags,
 
 	root = msharefs_sb->s_root;
 
+	/*
+	 * This is a read only file.
+	 */
 	inode = msharefs_get_inode(msharefs_sb, S_IFREG | 0400);
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
 	inode->i_private = info;
 
-	dentry = msharefs_alloc_dentry(root, name);
+	dentry = msharefs_alloc_dentry(root, fname->name);
 	if (IS_ERR(dentry)) {
 		err = PTR_ERR(dentry);
 		goto fail_inode;
@@ -137,6 +153,7 @@ mshare_file_create(const char *name, unsigned long flags,
 
 	d_add(dentry, inode);
 
+	dput(dentry);
 	return err;
 
 fail_inode:
@@ -150,10 +167,13 @@ mshare_file_create(const char *name, unsigned long flags,
 SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
-	char mshare_name[NAME_MAX];
 	struct mshare_data *info;
 	struct mm_struct *mm;
-	int err;
+	struct filename *fname = getname(name);
+	struct dentry *dentry;
+	struct inode *inode;
+	struct qstr namestr;
+	int err = PTR_ERR(fname);
 
 	/*
 	 * Address range being shared must be aligned to pgdir
@@ -162,29 +182,56 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	if ((addr | len) & (PGDIR_SIZE - 1))
 		return -EINVAL;
 
-	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (IS_ERR(fname))
+		goto err_out;
+
+	/*
+	 * Does this mshare entry exist already? If it does, calling
+	 * mshare with O_EXCL|O_CREAT is an error
+	 */
+	namestr.name = fname->name;
+	namestr.len = strlen(fname->name);
+	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	dentry = d_lookup(msharefs_sb->s_root, &namestr);
+	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
+		err = -EEXIST;
+		dput(dentry);
+		goto err_out;
+	}
 
-	mm = mm_alloc();
-	if (!mm)
-		return -ENOMEM;
-	info = kzalloc(sizeof(*info), GFP_KERNEL);
-	if (!info) {
-		err = -ENOMEM;
-		goto err_relmm;
+	if (dentry) {
+		inode = d_inode(dentry);
+		if (inode == NULL) {
+			err = -EINVAL;
+			goto err_out;
+		}
+		info = inode->i_private;
+		refcount_inc(&info->refcnt);
+		dput(dentry);
+	} else {
+		mm = mm_alloc();
+		if (!mm)
+			return -ENOMEM;
+		info = kzalloc(sizeof(*info), GFP_KERNEL);
+		if (!info) {
+			err = -ENOMEM;
+			goto err_relmm;
+		}
+		mm->mmap_base = addr;
+		mm->task_size = addr + len;
+		if (!mm->task_size)
+			mm->task_size--;
+		info->mm = mm;
+		info->mode = mode;
+		refcount_set(&info->refcnt, 1);
+		err = mshare_file_create(fname, oflag, info);
+		if (err)
+			goto err_relinfo;
 	}
-	mm->mmap_base = addr;
-	mm->task_size = addr + len;
-	if (!mm->task_size)
-		mm->task_size--;
-	info->mm = mm;
-	refcount_set(&info->refcnt, 1);
-
-	err = mshare_file_create(mshare_name, oflag, info);
-	if (err)
-		goto err_relinfo;
 
+	putname(fname);
 	return 0;
 
 err_relinfo:
@@ -192,6 +239,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 err_relmm:
 	mmput(mm);
 err_out:
+	putname(fname);
 	return err;
 }
 
@@ -200,21 +248,54 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
  */
 SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 {
-	char mshare_name[NAME_MAX];
-	int err;
+	struct filename *fname = getname(name);
+	int err = PTR_ERR(fname);
+	struct dentry *dentry;
+	struct inode *inode;
+	struct mshare_data *info;
+	struct qstr namestr;
 
-	/*
-	 * Delete the named object
-	 *
-	 * TODO: Mark mshare'd range for deletion
-	 *
-	 */
-	err = copy_from_user(mshare_name, name, NAME_MAX);
+	if (IS_ERR(fname))
+		goto err_out;
+
+	namestr.name = fname->name;
+	namestr.len = strlen(fname->name);
+	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	dentry = d_lookup(msharefs_sb->s_root, &namestr);
+	if (dentry == NULL) {
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	inode = d_inode(dentry);
+	if (inode == NULL) {
+		err = -EINVAL;
+		goto err_dput;
+	}
+	info = inode->i_private;
+
+	/*
+	 * Is this the last reference?
+	 */
+	if (refcount_dec_and_test(&info->refcnt)) {
+		simple_unlink(d_inode(msharefs_sb->s_root), dentry);
+		d_drop(dentry);
+		d_delete(dentry);
+		mmput(info->mm);
+		kfree(info);
+	} else {
+		dput(dentry);
+	}
+
+	putname(fname);
 	return 0;
 
+err_dput:
+	dput(dentry);
 err_out:
+	putname(fname);
 	return err;
 }
 
@@ -228,6 +309,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	static const struct tree_descr empty_descr = {""};
 	int err;
 
+	sb->s_op = &msharefs_ops;
 	sb->s_d_op = &msharefs_d_ops;
 	err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
 	if (err)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 5/6] mm: Add locking to msharefs syscalls
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (3 preceding siblings ...)
  2022-01-18 21:19 ` [RFC PATCH 4/6] mm: implement mshare_unlink syscall Khalid Aziz
@ 2022-01-18 21:19 ` Khalid Aziz
  2022-01-18 21:19 ` [RFC PATCH 6/6] mm: Add basic page table sharing using mshare Khalid Aziz
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 21:19 UTC (permalink / raw)
  To: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm
  Cc: Khalid Aziz

Lock the root inode for msharefs when creating a new file or deleting
an existing one to avoid races. mshare syscalls are low frequency
operations, so locking the root inode is reasonable.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 mm/mshare.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 8273136363cc..adfd5a280e5b 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -194,11 +194,12 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	inode_lock(d_inode(msharefs_sb->s_root));
 	dentry = d_lookup(msharefs_sb->s_root, &namestr);
 	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
 		err = -EEXIST;
 		dput(dentry);
-		goto err_out;
+		goto err_unlock_inode;
 	}
 
 	if (dentry) {
@@ -231,6 +232,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 			goto err_relinfo;
 	}
 
+	inode_unlock(d_inode(msharefs_sb->s_root));
 	putname(fname);
 	return 0;
 
@@ -238,6 +240,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	kfree(info);
 err_relmm:
 	mmput(mm);
+err_unlock_inode:
+	inode_unlock(d_inode(msharefs_sb->s_root));
 err_out:
 	putname(fname);
 	return err;
@@ -263,10 +267,11 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 	err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
 	if (err)
 		goto err_out;
+	inode_lock(d_inode(msharefs_sb->s_root));
 	dentry = d_lookup(msharefs_sb->s_root, &namestr);
 	if (dentry == NULL) {
 		err = -EINVAL;
-		goto err_out;
+		goto err_unlock_inode;
 	}
 
 	inode = d_inode(dentry);
@@ -289,11 +294,14 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 		dput(dentry);
 	}
 
+	inode_unlock(d_inode(msharefs_sb->s_root));
 	putname(fname);
 	return 0;
 
 err_dput:
 	dput(dentry);
+err_unlock_inode:
+	inode_unlock(d_inode(msharefs_sb->s_root));
 err_out:
 	putname(fname);
 	return err;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH 6/6] mm: Add basic page table sharing using mshare
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (4 preceding siblings ...)
  2022-01-18 21:19 ` [RFC PATCH 5/6] mm: Add locking to msharefs syscalls Khalid Aziz
@ 2022-01-18 21:19 ` Khalid Aziz
  2022-01-18 21:41 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Dave Hansen
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 21:19 UTC (permalink / raw)
  To: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm
  Cc: Khalid Aziz

This patch adds basic page table sharing across tasks via the
mshare syscall. It does this by creating a new mm_struct which
hosts the shared vmas and page tables. This mm_struct is
maintained as long as there is at least one task using the mshare'd
range. It is cleaned up by the last mshare_unlink syscall.

NOTE: WORK IN PROGRESS. This is only a working prototype and has
bugs and many missing pieces. mm/mshare.c documents these bugs and
issues that should be addressed in a more complete implementation.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h             |   8 +
 include/trace/events/mmflags.h |   3 +-
 mm/internal.h                  |   7 +
 mm/memory.c                    |  35 ++++-
 mm/mshare.c                    | 265 +++++++++++++++++++++++++++++++--
 5 files changed, 299 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..63128f6c83cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -308,11 +308,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -354,6 +356,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MTE_ALLOWED	VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_SHARED_PT	VM_HIGH_ARCH_5
+#else
+#define VM_SHARED_PT	0
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..002dbf2711c5 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -184,7 +184,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
-	{VM_MERGEABLE,			"mergeable"	}		\
+	{VM_MERGEABLE,			"mergeable"	},		\
+	{VM_SHARED_PT,			"sharedpt"	}		\
 
 #define show_vma_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..9bfc4dde7d70 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -713,4 +713,11 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
+			unsigned long *addrp);
+static inline bool vma_is_shared(const struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_SHARED_PT;
+}
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..b506bbbfae60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -418,16 +418,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
+			 * as long as they all belong to the same mm (that
+			 * may not be the case if a vma is part of mshare'd
+			 * range
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !is_vm_hugetlb_page(next)
+			       && vma->vm_mm == tlb->mm) {
 				vma = next;
 				next = vma->vm_next;
 				unlink_anon_vmas(vma);
 				unlink_file_vma(vma);
 			}
-			free_pgd_range(tlb, addr, vma->vm_end,
-				floor, next ? next->vm_start : ceiling);
+			/*
+			 * Free pgd only if pgd is not allocated for an
+			 * mshare'd range
+			 */
+			if (vma->vm_mm == tlb->mm)
+				free_pgd_range(tlb, addr, vma->vm_end,
+					floor, next ? next->vm_start : ceiling);
 		}
 		vma = next;
 	}
@@ -1528,6 +1537,13 @@ void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
+	/*
+	 * If this is an mshare'd page, do not unmap it since it might
+	 * still be in use.
+	 */
+	if (vma->vm_mm != tlb->mm)
+		return;
+
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -4757,6 +4773,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
+	bool shared = false;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -4766,6 +4783,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
 
+	if (unlikely(vma_is_shared(vma))) {
+		ret = find_shared_vma(&vma, &address);
+		if (ret)
+			return ret;
+		if (!vma)
+			return VM_FAULT_SIGSEGV;
+		shared = true;
+	}
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
@@ -4783,6 +4809,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	if (shared)
+		mmap_read_unlock(vma->vm_mm);
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mshare.c b/mm/mshare.c
index adfd5a280e5b..ffdc72963b6b 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -6,8 +6,37 @@
  *
  *
  * Copyright (C) 2021 Oracle Corp. All rights reserved.
- * Authors:	Matthew Wilcox
- *		Khalid Aziz
+ * Authors:	Matthew Wilcox <willy@infradead.org>
+ *		Khalid Aziz <khalid.aziz@oracle.com>
+ *
+ * Current issues/questions:
+ *	- mshare_unlink should unmap all shared VMAs for the calling task
+ *	- If a task that had called mshare dies, make sure its shared VMAs
+ *	  are cleaned up properly and refcount to shared region is
+ *	  updated correctly.
+ *	- Should mshare_unlink be allowed only for the tasks that called
+ *	  mshare() originally so the two calls are matched up? If yes,
+ *	  should there still be a way to clean up stale shared regions?
+ *	- Allow already mapped in VMAs to be mshare'd by the task creating
+ *	  mshare region. This will potentially require splitting VMAs
+ *	- What happens when a task tries to attach to an existing mshare
+ *	  region and it already has VMAs mapped into that address range?
+ *	  Possibilities - (1) mshare() fails, or (2) unmap current VMAs
+ *	  and create new ones with new mapping. If (2), what happens
+ *	  if exisiting VMA is larger than mshare region - split the
+ *	  VMA and leave partial original mapping intact, or unmap all
+ *	  overlapping VMAs
+ *	- Can the tasks using mshare region mmap/mremap things into sections
+ *	  of the range? If yes, that will require additional work. Which
+ *	  mmaps  should be supported - anonymous, files??
+ *	- How does this work with hugepages?
+ *	- How does this work with OOM killer?
+ *	- BUG: PTEs no longer work once the task that created mshare
+ *	  range dies.
+ *	- mmu_notifier uses vma->vm_mm. That very likely breaks with
+ *	  this code
+ *	- Should consumer processes be allowed to only map the entire
+ *	  mshare'd region or should they be allowed to map subset of it?
  */
 
 #include <linux/fs.h>
@@ -17,17 +46,50 @@
 #include <linux/pseudo_fs.h>
 #include <linux/fileattr.h>
 #include <linux/refcount.h>
+#include <linux/mman.h>
 #include <uapi/linux/magic.h>
 #include <uapi/linux/limits.h>
+#include "internal.h"
 
 struct mshare_data {
-	struct mm_struct *mm;
+	struct mm_struct *mm, *host_mm;
 	mode_t mode;
 	refcount_t refcnt;
 };
 
 static struct super_block *msharefs_sb;
 
+/* Returns holding the host mm's lock for read.  Caller must release. */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
+{
+	struct vm_area_struct *vma, *guest = *vmap;
+	struct mshare_data *info = guest->vm_private_data;
+	struct mm_struct *host_mm = info->mm;
+	unsigned long host_addr;
+	pgd_t *pgd, *guest_pgd;
+
+	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
+	pgd = pgd_offset(host_mm, host_addr);
+	guest_pgd = pgd_offset(current->mm, *addrp);
+	if (!pgd_same(*guest_pgd, *pgd)) {
+		set_pgd(guest_pgd, *pgd);
+		return VM_FAULT_NOPAGE;
+	}
+
+	*addrp = host_addr;
+	mmap_read_lock(host_mm);
+	vma = find_vma(host_mm, host_addr);
+
+	/* XXX: expand stack? */
+	if (vma && vma->vm_start > host_addr)
+		vma = NULL;
+
+	*vmap = vma;
+	vma->vm_mm = host_mm;
+	return 0;
+}
+
 static void
 msharefs_evict_inode(struct inode *inode)
 {
@@ -168,13 +230,23 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 		unsigned long, len, int, oflag, mode_t, mode)
 {
 	struct mshare_data *info;
-	struct mm_struct *mm;
 	struct filename *fname = getname(name);
 	struct dentry *dentry;
 	struct inode *inode;
 	struct qstr namestr;
+	struct vm_area_struct *vma, *next, *new_vma;
+	struct mm_struct *new_mm;
+	unsigned long end;
 	int err = PTR_ERR(fname);
 
+
+	/*
+	 * Is msharefs mounted? TODO: If not mounted, return error
+	 * or automount?
+	 */
+	if (msharefs_sb == NULL)
+		return -ENOENT;
+
 	/*
 	 * Address range being shared must be aligned to pgdir
 	 * boundary and its size must be a multiple of pgdir size
@@ -185,6 +257,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	if (IS_ERR(fname))
 		goto err_out;
 
+	end = addr + len;
+
 	/*
 	 * Does this mshare entry exist already? If it does, calling
 	 * mshare with O_EXCL|O_CREAT is an error
@@ -197,49 +271,183 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
 	inode_lock(d_inode(msharefs_sb->s_root));
 	dentry = d_lookup(msharefs_sb->s_root, &namestr);
 	if (dentry && (oflag & (O_EXCL|O_CREAT))) {
+		inode = d_inode(dentry);
 		err = -EEXIST;
 		dput(dentry);
 		goto err_unlock_inode;
 	}
 
 	if (dentry) {
+		unsigned long mapaddr, prot = PROT_NONE;
+
+		/*
+		 * TODO: Address the following comment
+		 *
+		 * For now, we do not allow mshare mapping an existing mshare
+		 * region if any overlapping mappings exist in calling
+		 * process already
+		 */
+		mmap_write_lock(current->mm);
+		vma = find_vma_intersection(current->mm, addr, end);
+		if (vma) {
+			mmap_write_unlock(current->mm);
+			err = -EINVAL;
+			goto err_out;
+		}
+
 		inode = d_inode(dentry);
 		if (inode == NULL) {
+			mmap_write_unlock(current->mm);
 			err = -EINVAL;
 			goto err_out;
 		}
 		info = inode->i_private;
-		refcount_inc(&info->refcnt);
 		dput(dentry);
+
+		/*
+		 * Map in the address range as anonymous mappings
+		 */
+		mmap_write_unlock(current->mm);
+		oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
+		if (oflag & O_RDONLY)
+			prot |= PROT_READ;
+		else if (oflag & O_WRONLY)
+			prot |= PROT_WRITE;
+		else if (oflag & O_RDWR)
+			prot |= (PROT_READ | PROT_WRITE);
+		mapaddr = vm_mmap(NULL, addr, len, prot,
+				MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
+		if (IS_ERR((void *)mapaddr)) {
+			err = -EINVAL;
+			goto err_out;
+		}
+
+		refcount_inc(&info->refcnt);
+
+		/*
+		 * Now that we have mmap'd the mshare'd range, update vma
+		 * flags and vm_mm pointer for this mshare'd range.
+		 */
+		mmap_write_lock(current->mm);
+		vma = find_vma(current->mm, addr);
+		if (vma && vma->vm_start < addr) {
+			mmap_write_unlock(current->mm);
+			err = -EINVAL;
+			goto err_out;
+		}
+
+		while (vma && vma->vm_start < (addr + len)) {
+			vma->vm_private_data = info;
+			vma->vm_mm = info->mm;
+			vma->vm_flags |= VM_SHARED_PT;
+			next = vma->vm_next;
+			vma = next;
+		}
 	} else {
-		mm = mm_alloc();
-		if (!mm)
+		unsigned long myaddr;
+		struct mm_struct *old_mm;
+
+		old_mm = current->mm;
+		new_mm = mm_alloc();
+		if (!new_mm)
 			return -ENOMEM;
 		info = kzalloc(sizeof(*info), GFP_KERNEL);
 		if (!info) {
 			err = -ENOMEM;
 			goto err_relmm;
 		}
-		mm->mmap_base = addr;
-		mm->task_size = addr + len;
-		if (!mm->task_size)
-			mm->task_size--;
-		info->mm = mm;
+		new_mm->mmap_base = addr;
+		new_mm->task_size = addr + len;
+		if (!new_mm->task_size)
+			new_mm->task_size--;
+		info->mm = new_mm;
+		info->host_mm = old_mm;
 		info->mode = mode;
 		refcount_set(&info->refcnt, 1);
+
+		/*
+		 * VMAs for this address range may or may not exist.
+		 * If VMAs exist, they should be marked as shared at
+		 * this point and page table info should be copied
+		 * over to newly created mm_struct. TODO: If VMAs do not
+		 * exist, create them and mark them as shared.
+		 */
+		mmap_write_lock(old_mm);
+		vma = find_vma_intersection(old_mm, addr, end);
+		if (!vma) {
+			err = -EINVAL;
+			goto unlock;
+		}
+		/*
+		 * TODO: If the currently allocated VMA goes beyond the
+		 * mshare'd range, this VMA needs to be split.
+		 *
+		 * Double check that source VMAs do not extend outside
+		 * the range
+		 */
+		vma = find_vma(old_mm, addr + len);
+		if (vma && vma->vm_start < (addr + len)) {
+			err = -EINVAL;
+			goto unlock;
+		}
+
+		vma = find_vma(old_mm, addr);
+		if (vma && vma->vm_start < addr) {
+			err = -EINVAL;
+			goto unlock;
+		}
+
+		mmap_write_lock(new_mm);
+		while (vma && vma->vm_start < (addr + len)) {
+			/*
+			 * Copy this vma over to host mm
+			 */
+			new_vma = vm_area_dup(vma);
+			if (!new_vma) {
+				err = -ENOMEM;
+				goto unlock;
+			}
+			vma->vm_mm = new_mm;
+			err = insert_vm_struct(new_mm, new_vma);
+			if (err)
+				goto unlock;
+
+			vma->vm_private_data = info;
+			vma->vm_mm = new_mm;
+			vma->vm_flags |= VM_SHARED_PT;
+			vma = vma->vm_next;
+		}
+		mmap_write_unlock(new_mm);
+
+
 		err = mshare_file_create(fname, oflag, info);
 		if (err)
-			goto err_relinfo;
+			goto unlock;
+
+		/*
+		 * Copy over current PTEs
+		 */
+		myaddr = addr;
+		while (myaddr < new_mm->task_size) {
+			*pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
+			myaddr += PGDIR_SIZE;
+		}
+		/*
+		 * TODO: Free the corresponding page table in calling
+		 * process
+		 */
 	}
 
+	mmap_write_unlock(current->mm);
 	inode_unlock(d_inode(msharefs_sb->s_root));
 	putname(fname);
 	return 0;
 
-err_relinfo:
+unlock:
+	mmap_write_unlock(current->mm);
 	kfree(info);
 err_relmm:
-	mmput(mm);
+	mmput(new_mm);
 err_unlock_inode:
 	inode_unlock(d_inode(msharefs_sb->s_root));
 err_out:
@@ -259,6 +467,9 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 	struct mshare_data *info;
 	struct qstr namestr;
 
+	if (msharefs_sb == NULL)
+		return -ENOENT;
+
 	if (IS_ERR(fname))
 		goto err_out;
 
@@ -283,14 +494,38 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
 
 	/*
 	 * Is this the last reference?
+	 * TODO: permission checks are needed before proceeding
 	 */
 	if (refcount_dec_and_test(&info->refcnt)) {
 		simple_unlink(d_inode(msharefs_sb->s_root), dentry);
 		d_drop(dentry);
 		d_delete(dentry);
+		/*
+		 * TODO: Release all physical pages allocated for this
+		 * mshare range and release associated page table. If
+		 * the final unlink happens from the process that created
+		 * mshare'd range, do we return page tables and pages to
+		 * that process so the creating process can continue using
+		 * the address range it had chosen to mshare at some
+		 * point?
+		 *
+		 * TODO: unmap shared vmas from every task that is using
+		 * this mshare'd range.
+		 */
 		mmput(info->mm);
 		kfree(info);
 	} else {
+		/*
+		 * TODO: If mshare'd range is still mapped in the process,
+		 * it should be unmapped. Following is minimal code and
+		 * might need fix up
+		 */
+		unsigned long tmp;
+
+		tmp = info->mm->task_size - info->mm->mmap_base;
+		if (info->host_mm != current->mm)
+			vm_munmap(info->mm->mmap_base, tmp);
+
 		dput(dentry);
 	}
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (5 preceding siblings ...)
  2022-01-18 21:19 ` [RFC PATCH 6/6] mm: Add basic page table sharing using mshare Khalid Aziz
@ 2022-01-18 21:41 ` Dave Hansen
  2022-01-18 21:46   ` Matthew Wilcox
  2022-01-18 22:06 ` Dave Hansen
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 54+ messages in thread
From: Dave Hansen @ 2022-01-18 21:41 UTC (permalink / raw)
  To: Khalid Aziz, akpm, willy, longpeng2, arnd, dave.hansen, david,
	rppt, surenb, linux-kernel, linux-mm

On 1/18/22 1:19 PM, Khalid Aziz wrote:
> - Starting address must be aligned to pgdir size (512GB on x86_64)

How does this work on systems with 5-level paging where a top-level page
table entry covers 256TB?  Is the alignment requirement 512GB or 256TB?
 How does userspace figure out which alignment requirement it is subject to?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:41 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Dave Hansen
@ 2022-01-18 21:46   ` Matthew Wilcox
  2022-01-18 22:47     ` Khalid Aziz
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-18 21:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Tue, Jan 18, 2022 at 01:41:40PM -0800, Dave Hansen wrote:
> On 1/18/22 1:19 PM, Khalid Aziz wrote:
> > - Starting address must be aligned to pgdir size (512GB on x86_64)
> 
> How does this work on systems with 5-level paging where a top-level page
> table entry covers 256TB?  Is the alignment requirement 512GB or 256TB?
>  How does userspace figure out which alignment requirement it is subject to?

The original idea was any power of two, naturally aligned, >= PAGE_SIZE,
but I suspect Khalid has simplified it for this first implementation.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (6 preceding siblings ...)
  2022-01-18 21:41 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Dave Hansen
@ 2022-01-18 22:06 ` Dave Hansen
  2022-01-18 22:52   ` Khalid Aziz
  2022-01-19 11:38 ` Mark Hemment
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 54+ messages in thread
From: Dave Hansen @ 2022-01-18 22:06 UTC (permalink / raw)
  To: Khalid Aziz, akpm, willy, longpeng2, arnd, dave.hansen, david,
	rppt, surenb, linux-kernel, linux-mm

On 1/18/22 1:19 PM, Khalid Aziz wrote:
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.

One thing that went over my head here was that this allowing sharing of
relatively arbitrary *EXISTING* regions.  The mshared'd region might be
anonymous or an plain mmap()'d file.  It can even be a filesystem or
device DAX mmap().

In other words, donors can (ideally) share anything.  Consumers have
must use msharefs to access the donated areas.

Right?

( btw... thanks to willy for the correction on IRC.)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:46   ` Matthew Wilcox
@ 2022-01-18 22:47     ` Khalid Aziz
  0 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 22:47 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Hansen
  Cc: akpm, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm

On 1/18/22 14:46, Matthew Wilcox wrote:
> On Tue, Jan 18, 2022 at 01:41:40PM -0800, Dave Hansen wrote:
>> On 1/18/22 1:19 PM, Khalid Aziz wrote:
>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>>
>> How does this work on systems with 5-level paging where a top-level page
>> table entry covers 256TB?  Is the alignment requirement 512GB or 256TB?
>>   How does userspace figure out which alignment requirement it is subject to?
> 
> The original idea was any power of two, naturally aligned, >= PAGE_SIZE,
> but I suspect Khalid has simplified it for this first implementation.
> 

Hi Dave,

Yes, this is mostly to keep the code somewhat simpler. Large regions
make it easier to manage the separate set of shared VMAs. Part of the
exploration here is to see what size regions work for other people.
This initial prototype is x86 only and for now I am using PGDIR_SIZE.
I see your point about how userspace would figure out the alignment,
since this should work across all architectures and
PGDIR_SIZE/PMD_SIZE/PUD_SIZE are not the same across architectures.
We can choose a fixed size and alignment. I would like to keep the
region size at 2^20 pages or larger to minimize having to manage a
large number of small shared regions if those regions are not
contiguous. Do you have any suggestion?

Thanks for your feedback.

--
Khalid

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 22:06 ` Dave Hansen
@ 2022-01-18 22:52   ` Khalid Aziz
  0 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-18 22:52 UTC (permalink / raw)
  To: Dave Hansen, akpm, willy, longpeng2, arnd, dave.hansen, david,
	rppt, surenb, linux-kernel, linux-mm

On 1/18/22 15:06, Dave Hansen wrote:
> On 1/18/22 1:19 PM, Khalid Aziz wrote:
>> This is a proposal to implement a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. The proposal is to add
>> a new system call - mshare(), which can be used by a process to
>> create a region (we will call it mshare'd region) which can be used
>> by other processes to map same pages using shared PTEs. Other
>> process(es), assuming they have the right permissions, can then make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.
> 
> One thing that went over my head here was that this allowing sharing of
> relatively arbitrary *EXISTING* regions.  The mshared'd region might be
> anonymous or an plain mmap()'d file.  It can even be a filesystem or
> device DAX mmap().
> 
> In other words, donors can (ideally) share anything.  Consumers have
> must use msharefs to access the donated areas.
> 
> Right?
> 
> ( btw... thanks to willy for the correction on IRC.)
> 

Hi Dave,

Consumers use msharefs only to get information on the address and size of the shared region. Access to the donated area 
does not go through msharefs. So the consumer opens the file in msharefs to read the starting address and size:

         fd = open("testregion", O_RDONLY);

         if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0)
                 printf("INFO: %ld bytes shared at addr %lx \n",
                                 mshare_info[1], mshare_info[0]);
         else
                 perror("read failed");

         close(fd);

It then uses that information to map in the donated region:

         addr = (char *)mshare_info[0];
         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
                         mshare_info[1], O_RDWR, 600);
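
For completeness, a self-contained version of that consumer flow might
look like the sketch below.  MSHARE_SYSCALL stands for the prototype's
syscall number as in the cover letter examples (it is not defined by
any released kernel headers), and the /sys/fs/mshare path and octal
0600 mode are my assumptions:

         #include <fcntl.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <unistd.h>
         #include <sys/syscall.h>

         int main(void)
         {
                 unsigned long mshare_info[2];   /* [0] = address, [1] = size */
                 char *addr;
                 long err;
                 int fd;

                 /* msharefs tells us where the donor placed the region */
                 fd = open("/sys/fs/mshare/testregion", O_RDONLY);
                 if (fd < 0) {
                         perror("open");
                         exit(1);
                 }
                 if (read(fd, mshare_info, sizeof(mshare_info)) !=
                     sizeof(mshare_info)) {
                         fprintf(stderr, "short read from msharefs\n");
                         exit(1);
                 }
                 close(fd);

                 /* attach to the mshare'd region at the advertised address */
                 addr = (char *)mshare_info[0];
                 err = syscall(MSHARE_SYSCALL, "testregion", addr,
                               mshare_info[1], O_RDWR, 0600);
                 if (err < 0) {
                         perror("mshare");
                         exit(1);
                 }
                 printf("%lu bytes shared at %p: %s\n",
                        mshare_info[1], (void *)addr, addr);
                 return 0;
         }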

Makes sense?

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (7 preceding siblings ...)
  2022-01-18 22:06 ` Dave Hansen
@ 2022-01-19 11:38 ` Mark Hemment
  2022-01-19 17:02   ` Khalid Aziz
  2022-01-21  1:08 ` Barry Song
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 54+ messages in thread
From: Mark Hemment @ 2022-01-19 11:38 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	longpeng2, arnd, dave.hansen, david, rppt, Suren Baghdasaryan,
	linux-kernel, linux-mm

On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> Page tables in kernel consume some of the memory and as long as
> number of mappings being maintained is small enough, this space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, amount of memory consumed by page
> tables starts to become significant.
>
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system carsh was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
>
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>
>
> API Proposal
> ============
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
>
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
>
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         strncpy(addr, "Some random shared text",
>                         sizeof("Some random shared text"));
> -----------------
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
>
>         if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>                 printf("INFO: %ld bytes shared at addr %lx \n",
>                                 mshare_info[1], mshare_info[0]);
>         else
>                 perror("read failed");
>
>         close(fd);
>
>         addr = (char *)mshare_info[0];
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>                         mshare_info[1], O_RDWR, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------
...
Hi Khalid,

The proposed mshare() appears to be similar to POSIX shared memory,
but with two extra (related) attributes;
a) Internally, uses shared page tables.
b) Shared memory is mapped at same address for all users.

Rather than introduce two new system calls, along with a /sys/ file to
communicate global addresses, could mshare() be built on top of the
shmem API?  Thinking of something like the below;
1) For shm_open(3), add a new oflag to indicate the properties needed
for mshare() (say, O_SHARED_PTE - better name?)
2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
in the sizes which can be set.
3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
objects.  On the first mmap()ing an appropriate address is assigned;
otherwise the current 'global' address is used.
4) shm_unlink(3) destroys the object when the last reference is dropped.

For 3), it might be possible to weaken the NULL requirement and
validate a given address on the first mapping to ensure it is
correctly aligned.  shm_open(3) sets FD_CLOEXEC on the file
descriptor, which might not be the default behaviour you require.
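
Roughly, a user of that hypothetical flavour might look like the
sketch below (O_SHARED_PTE does not exist, so it is commented out;
everything else is today's POSIX shm API):

        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/mman.h>

        static void *map_shared_pte_region(const char *name, size_t size)
        {
                void *addr = MAP_FAILED;
                /* O_SHARED_PTE is the hypothetical flag discussed above */
                int fd = shm_open(name, O_CREAT | O_RDWR /* | O_SHARED_PTE */,
                                  0600);

                if (fd < 0)
                        return MAP_FAILED;
                /* ftruncate() would be restricted to the allowed sizes */
                if (ftruncate(fd, size) == 0)
                        /* NULL: the first mapper gets an address assigned,
                         * later mappers get the same 'global' address back */
                        addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
                close(fd);
                return addr;
        }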

Internally, the handling of mshare()/O_SHARED_PTE memory might be
sufficiently different to shmem that there is not much code sharing
between the two (I haven't thought this through, but the object
naming/refcounting should be similar), but using shmem would be a
familiar API.

Any thoughts?

Cheers,
Mark

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-19 11:38 ` Mark Hemment
@ 2022-01-19 17:02   ` Khalid Aziz
  2022-01-20 12:49     ` Mark Hemment
  0 siblings, 1 reply; 54+ messages in thread
From: Khalid Aziz @ 2022-01-19 17:02 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	longpeng2, arnd, dave.hansen, david, rppt, Suren Baghdasaryan,
	linux-kernel, linux-mm

On 1/19/22 04:38, Mark Hemment wrote:
> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> Page tables in kernel consume some of the memory and as long as
>> number of mappings being maintained is small enough, this space
>> consumed by page tables is not objectionable. When very few memory
>> pages are shared between processes, the number of page table entries
>> (PTEs) to maintain is mostly constrained by the number of pages of
>> memory on the system. As the number of shared pages and the number
>> of times pages are shared goes up, amount of memory consumed by page
>> tables starts to become significant.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system carsh was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
>>
>> This is a proposal to implement a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. The proposal is to add
>> a new system call - mshare(), which can be used by a process to
>> create a region (we will call it mshare'd region) which can be used
>> by other processes to map same pages using shared PTEs. Other
>> process(es), assuming they have the right permissions, can then make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API Proposal
>> ============
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>>    shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>>    to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>          if (addr == MAP_FAILED)
>>                  perror("ERROR: mmap failed");
>>
>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          strncpy(addr, "Some random shared text",
>>                          sizeof("Some random shared text"));
>> -----------------
>>
>> Snippet of code that a consumer process would execute looks like:
>>
>> -----------------
>>          fd = open("testregion", O_RDONLY);
>>          if (fd < 0) {
>>                  perror("open failed");
>>                  exit(1);
>>          }
>>
>>          if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>                  printf("INFO: %ld bytes shared at addr %lx \n",
>>                                  mshare_info[1], mshare_info[0]);
>>          else
>>                  perror("read failed");
>>
>>          close(fd);
>>
>>          addr = (char *)mshare_info[0];
>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>                          mshare_info[1], O_RDWR, 600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          printf("Guest mmap at %px:\n", addr);
>>          printf("%s\n", addr);
>>          printf("\nDone\n");
>>
>>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>          if (err < 0) {
>>                  perror("mshare_unlink() failed");
>>                  exit(1);
>>          }
>> -----------------
> ...
> Hi Khalid,
> 
> The proposed mshare() appears to be similar to POSIX shared memory,
> but with two extra (related) attributes;
> a) Internally, uses shared page tables.
> b) Shared memory is mapped at same address for all users.

Hi Mark,

You are right, there are a few similarities with POSIX shm, but there is one key difference - unlike shm, shared region 
access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions. 
mshare is meant to allow sharing of any existing region, whether it maps a file, is anonymous, or maps any other object. 
Any consumer process can use the same PTEs to access whatever might be mapped in that region, which is quite different 
from what shm does. Because of the similarities between the two, I had started a prototype using the POSIX shm API to 
leverage that code, but I found myself special-casing mshare often enough in the shm code that it made sense to go with a 
separate implementation. I considered an API very much like POSIX shm, but a simple mshare() syscall that can be made at 
any time to share a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.

Does that rationale sound reasonable?

Thanks,
Khalid

> 
> Rather than introduce two new system calls, along with /sys/ file to
> communicate global addresses, could mshare() be built on top of shmem
> API?  Thinking of something like the below;
> 1) For shm_open(3), add a new oflag to indicate the properties needed
> for mshare() (say, O_SHARED_PTE - better name?)
> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
> in the sizes which can be set.
> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
> objects.  On first mmap()ing an appropiate address is assigned,
> otherwise the current 'global' address is used.
> 4) shm_unlink(3) destroys the object when last reference is dropped.
> 
> For 3), might be able to weaken the NULL requirement and validate a
> given address on first mapping to ensure it is correctly aligned.
> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
> the default behaviour you require.
> 
> Internally, the handling of mshare()/O_SHARED_PTE memory might be
> sufficiently different to shmem that there is not much code sharing
> between the two (I haven't thought this through, but the object
> naming/refcounting should be similiar), but using shmem would be a
> familiar API.
> 
> Any thoughts?
> 
> Cheers,
> Mark
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-19 17:02   ` Khalid Aziz
@ 2022-01-20 12:49     ` Mark Hemment
  2022-01-20 19:15       ` Khalid Aziz
  0 siblings, 1 reply; 54+ messages in thread
From: Mark Hemment @ 2022-01-20 12:49 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	longpeng2, arnd, dave.hansen, david, rppt, Suren Baghdasaryan,
	linux-kernel, linux-mm

On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/19/22 04:38, Mark Hemment wrote:
> > On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> Page tables in kernel consume some of the memory and as long as
> >> number of mappings being maintained is small enough, this space
> >> consumed by page tables is not objectionable. When very few memory
> >> pages are shared between processes, the number of page table entries
> >> (PTEs) to maintain is mostly constrained by the number of pages of
> >> memory on the system. As the number of shared pages and the number
> >> of times pages are shared goes up, amount of memory consumed by page
> >> tables starts to become significant.
> >>
> >> Some of the field deployments commonly see memory pages shared
> >> across 1000s of processes. On x86_64, each page requires a PTE that
> >> is only 8 bytes long which is very small compared to the 4K page
> >> size. When 2000 processes map the same page in their address space,
> >> each one of them requires 8 bytes for its PTE and together that adds
> >> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >> database server with 300GB SGA, a system carsh was seen with
> >> out-of-memory condition when 1500+ clients tried to share this SGA
> >> even though the system had 512GB of memory. On this server, in the
> >> worst case scenario of all 1500 processes mapping every page from
> >> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >> could be shared, amount of memory saved is very significant.
> >>
> >> This is a proposal to implement a mechanism in kernel to allow
> >> userspace processes to opt into sharing PTEs. The proposal is to add
> >> a new system call - mshare(), which can be used by a process to
> >> create a region (we will call it mshare'd region) which can be used
> >> by other processes to map same pages using shared PTEs. Other
> >> process(es), assuming they have the right permissions, can then make
> >> the mashare() system call to map the shared pages into their address
> >> space using the shared PTEs.  When a process is done using this
> >> mshare'd region, it makes a mshare_unlink() system call to end its
> >> access. When the last process accessing mshare'd region calls
> >> mshare_unlink(), the mshare'd region is torn down and memory used by
> >> it is freed.
> >>
> >>
> >> API Proposal
> >> ============
> >>
> >> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>
> >> --
> >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>
> >> mshare() creates and opens a new, or opens an existing mshare'd
> >> region that will be shared at PTE level. "name" refers to shared object
> >> name that exists under /sys/fs/mshare. "addr" is the starting address
> >> of this shared memory area and length is the size of this area.
> >> oflags can be one of:
> >>
> >> - O_RDONLY opens shared memory area for read only access by everyone
> >> - O_RDWR opens shared memory area for read and write access
> >> - O_CREAT creates the named shared memory area if it does not exist
> >> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>    exists with that name, return an error.
> >>
> >> mode represents the creation mode for the shared object under
> >> /sys/fs/mshare.
> >>
> >> mshare() returns an error code if it fails, otherwise it returns 0.
> >>
> >> PTEs are shared at pgdir level and hence it imposes following
> >> requirements on the address and size given to the mshare():
> >>
> >> - Starting address must be aligned to pgdir size (512GB on x86_64)
> >> - Size must be a multiple of pgdir size
> >> - Any mappings created in this address range at any time become
> >>    shared automatically
> >> - Shared address range can have unmapped addresses in it. Any access
> >>    to unmapped address will result in SIGBUS
> >>
> >> Mappings within this address range behave as if they were shared
> >> between threads, so a write to a MAP_PRIVATE mapping will create a
> >> page which is shared between all the sharers. The first process that
> >> declares an address range mshare'd can continue to map objects in
> >> the shared area. All other processes that want mshare'd access to
> >> this memory area can do so by calling mshare(). After this call, the
> >> address range given by mshare becomes a shared range in its address
> >> space. Anonymous mappings will be shared and not COWed.
> >>
> >> A file under /sys/fs/mshare can be opened and read from. A read from
> >> this file returns two long values - (1) starting address, and (2)
> >> size of the mshare'd region.
> >>
> >> --
> >> int mshare_unlink(char *name)
> >>
> >> A shared address range created by mshare() can be destroyed using
> >> mshare_unlink() which removes the  shared named object. Once all
> >> processes have unmapped the shared object, the shared address range
> >> references are de-allocated and destroyed.
> >>
> >> mshare_unlink() returns 0 on success or -1 on error.
> >>
> >>
> >> Example Code
> >> ============
> >>
> >> Snippet of the code that a donor process would run looks like below:
> >>
> >> -----------------
> >>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> >>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> >>          if (addr == MAP_FAILED)
> >>                  perror("ERROR: mmap failed");
> >>
> >>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> >>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
> >>          if (err < 0) {
> >>                  perror("mshare() syscall failed");
> >>                  exit(1);
> >>          }
> >>
> >>          strncpy(addr, "Some random shared text",
> >>                          sizeof("Some random shared text"));
> >> -----------------
> >>
> >> Snippet of code that a consumer process would execute looks like:
> >>
> >> -----------------
> >>          fd = open("testregion", O_RDONLY);
> >>          if (fd < 0) {
> >>                  perror("open failed");
> >>                  exit(1);
> >>          }
> >>
> >>          if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
> >>                  printf("INFO: %ld bytes shared at addr %lx \n",
> >>                                  mshare_info[1], mshare_info[0]);
> >>          else
> >>                  perror("read failed");
> >>
> >>          close(fd);
> >>
> >>          addr = (char *)mshare_info[0];
> >>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
> >>                          mshare_info[1], O_RDWR, 600);
> >>          if (err < 0) {
> >>                  perror("mshare() syscall failed");
> >>                  exit(1);
> >>          }
> >>
> >>          printf("Guest mmap at %px:\n", addr);
> >>          printf("%s\n", addr);
> >>          printf("\nDone\n");
> >>
> >>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> >>          if (err < 0) {
> >>                  perror("mshare_unlink() failed");
> >>                  exit(1);
> >>          }
> >> -----------------
> > ...
> > Hi Khalid,
> >
> > The proposed mshare() appears to be similar to POSIX shared memory,
> > but with two extra (related) attributes;
> > a) Internally, uses shared page tables.
> > b) Shared memory is mapped at same address for all users.
>
> Hi Mark,
>
> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
> separate implementation.

Ah, I jumped in assuming this was only for anon memory.

> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.

So possibly you have already considered the below, which does make
the API a little more POSIX shm-like.

The mshare() syscall does two operations;
1) create/open mshare object
2) export/import the given memory region

Would it be better if these were separate operations?  That is,
mshare_open() (say) creates/opens the object, returning a file
descriptor.  The fd is used as the identifier for the export/import
after mmap(2); eg.
addr = mshare_op(EXPORT, fd, addr, size);
addr = mshare_op(IMPORT, fd, NULL, 0);
(Not sure about export/import terms..)

The benefit of the separate ops is the file descriptor.  This could
be used for fstat(2) (and fchown(2)?), although I am not sure how much
value this would add.

The 'importer' would use the address/size of the memory region as
exported (and stored in msharefs), so there would be no need for the
/sys file (except for human-readable info).

If the set-up operations are split in two, then would it make sense to
also split the teardown as well?  Say, mshare_op(DROP, fd) and
mshare_unlink(fd)?
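
To make that concrete, the donor/consumer flow under such a split
might read as below; mshare_open()/mshare_op() are only the
hypothetical names used above, and TB()/GB() are as in the cover
letter examples:

        /* donor (all mshare_* names below are hypothetical) */
        fd = mshare_open("testregion", O_CREAT | O_RDWR | O_EXCL, 0600);
        addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        addr = mshare_op(EXPORT, fd, addr, GB(512));

        /* consumer - address/size come from what was recorded at export */
        fd = mshare_open("testregion", O_RDWR, 0);
        addr = mshare_op(IMPORT, fd, NULL, 0);
        ...
        mshare_op(DROP, fd);        /* drop this process's use of the region */
        mshare_unlink(fd);          /* drop the name once no longer needed */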

>
> Does that rationale sound reasonable?

It doesn't sound unreasonable.  As msharefs is providing a namespace
and perms, it doesn't need much flexibility.  Being able to modify the
perms after namespace creation (fchown(2)), before exporting the
memory region, might be useful in some cases - but as I don't have any
use cases I'm not claiming it is essential.

>
> Thanks,
> Khalid

Cheers,
Mark
>
> >
> > Rather than introduce two new system calls, along with /sys/ file to
> > communicate global addresses, could mshare() be built on top of shmem
> > API?  Thinking of something like the below;
> > 1) For shm_open(3), add a new oflag to indicate the properties needed
> > for mshare() (say, O_SHARED_PTE - better name?)
> > 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
> > in the sizes which can be set.
> > 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
> > objects.  On first mmap()ing an appropiate address is assigned,
> > otherwise the current 'global' address is used.
> > 4) shm_unlink(3) destroys the object when last reference is dropped.
> >
> > For 3), might be able to weaken the NULL requirement and validate a
> > given address on first mapping to ensure it is correctly aligned.
> > shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
> > the default behaviour you require.
> >
> > Internally, the handling of mshare()/O_SHARED_PTE memory might be
> > sufficiently different to shmem that there is not much code sharing
> > between the two (I haven't thought this through, but the object
> > naming/refcounting should be similiar), but using shmem would be a
> > familiar API.
> >
> > Any thoughts?
> >
> > Cheers,
> > Mark
> >
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-20 12:49     ` Mark Hemment
@ 2022-01-20 19:15       ` Khalid Aziz
  2022-01-24 15:15         ` Mark Hemment
  0 siblings, 1 reply; 54+ messages in thread
From: Khalid Aziz @ 2022-01-20 19:15 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	longpeng2, arnd, dave.hansen, david, rppt, Suren Baghdasaryan,
	linux-kernel, linux-mm

On 1/20/22 05:49, Mark Hemment wrote:
> On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/19/22 04:38, Mark Hemment wrote:
>>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> Page tables in kernel consume some of the memory and as long as
>>>> number of mappings being maintained is small enough, this space
>>>> consumed by page tables is not objectionable. When very few memory
>>>> pages are shared between processes, the number of page table entries
>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>> memory on the system. As the number of shared pages and the number
>>>> of times pages are shared goes up, amount of memory consumed by page
>>>> tables starts to become significant.
>>>>
>>>> Some of the field deployments commonly see memory pages shared
>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>> is only 8 bytes long which is very small compared to the 4K page
>>>> size. When 2000 processes map the same page in their address space,
>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>> database server with 300GB SGA, a system carsh was seen with
>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>> even though the system had 512GB of memory. On this server, in the
>>>> worst case scenario of all 1500 processes mapping every page from
>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>> could be shared, amount of memory saved is very significant.
>>>>
>>>> This is a proposal to implement a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>> a new system call - mshare(), which can be used by a process to
>>>> create a region (we will call it mshare'd region) which can be used
>>>> by other processes to map same pages using shared PTEs. Other
>>>> process(es), assuming they have the right permissions, can then make
>>>> the mashare() system call to map the shared pages into their address
>>>> space using the shared PTEs.  When a process is done using this
>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>> access. When the last process accessing mshare'd region calls
>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>> it is freed.
>>>>
>>>>
>>>> API Proposal
>>>> ============
>>>>
>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>
>>>> --
>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>
>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>> region that will be shared at PTE level. "name" refers to shared object
>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>> of this shared memory area and length is the size of this area.
>>>> oflags can be one of:
>>>>
>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>> - O_RDWR opens shared memory area for read and write access
>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>     exists with that name, return an error.
>>>>
>>>> mode represents the creation mode for the shared object under
>>>> /sys/fs/mshare.
>>>>
>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>>
>>>> PTEs are shared at pgdir level and hence it imposes following
>>>> requirements on the address and size given to the mshare():
>>>>
>>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>>>> - Size must be a multiple of pgdir size
>>>> - Any mappings created in this address range at any time become
>>>>     shared automatically
>>>> - Shared address range can have unmapped addresses in it. Any access
>>>>     to unmapped address will result in SIGBUS
>>>>
>>>> Mappings within this address range behave as if they were shared
>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>> page which is shared between all the sharers. The first process that
>>>> declares an address range mshare'd can continue to map objects in
>>>> the shared area. All other processes that want mshare'd access to
>>>> this memory area can do so by calling mshare(). After this call, the
>>>> address range given by mshare becomes a shared range in its address
>>>> space. Anonymous mappings will be shared and not COWed.
>>>>
>>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>>> this file returns two long values - (1) starting address, and (2)
>>>> size of the mshare'd region.
>>>>
>>>> --
>>>> int mshare_unlink(char *name)
>>>>
>>>> A shared address range created by mshare() can be destroyed using
>>>> mshare_unlink() which removes the  shared named object. Once all
>>>> processes have unmapped the shared object, the shared address range
>>>> references are de-allocated and destroyed.
>>>>
>>>> mshare_unlink() returns 0 on success or -1 on error.
>>>>
>>>>
>>>> Example Code
>>>> ============
>>>>
>>>> Snippet of the code that a donor process would run looks like below:
>>>>
>>>> -----------------
>>>>           addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>>                           MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>>>           if (addr == MAP_FAILED)
>>>>                   perror("ERROR: mmap failed");
>>>>
>>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>>                           GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>>>           if (err < 0) {
>>>>                   perror("mshare() syscall failed");
>>>>                   exit(1);
>>>>           }
>>>>
>>>>           strncpy(addr, "Some random shared text",
>>>>                           sizeof("Some random shared text"));
>>>> -----------------
>>>>
>>>> Snippet of code that a consumer process would execute looks like:
>>>>
>>>> -----------------
>>>>           fd = open("testregion", O_RDONLY);
>>>>           if (fd < 0) {
>>>>                   perror("open failed");
>>>>                   exit(1);
>>>>           }
>>>>
>>>>           if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>>>                   printf("INFO: %ld bytes shared at addr %lx \n",
>>>>                                   mshare_info[1], mshare_info[0]);
>>>>           else
>>>>                   perror("read failed");
>>>>
>>>>           close(fd);
>>>>
>>>>           addr = (char *)mshare_info[0];
>>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>>>                           mshare_info[1], O_RDWR, 600);
>>>>           if (err < 0) {
>>>>                   perror("mshare() syscall failed");
>>>>                   exit(1);
>>>>           }
>>>>
>>>>           printf("Guest mmap at %px:\n", addr);
>>>>           printf("%s\n", addr);
>>>>           printf("\nDone\n");
>>>>
>>>>           err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>>>           if (err < 0) {
>>>>                   perror("mshare_unlink() failed");
>>>>                   exit(1);
>>>>           }
>>>> -----------------
>>> ...
>>> Hi Khalid,
>>>
>>> The proposed mshare() appears to be similar to POSIX shared memory,
>>> but with two extra (related) attributes;
>>> a) Internally, uses shared page tables.
>>> b) Shared memory is mapped at same address for all users.
>>
>> Hi Mark,
>>
>> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
>> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
>> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
>> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
>> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
>> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
>> separate implementation.
> 
> Ah, I jumped in assuming this was only for anon memory.
> 
>> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
>> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.
> 
> So possible you have already considered the below...which does make
> the API a little more POSIX shm like.
> 
> The mshare() syscall does two operations;
> 1) create/open mshare object
> 2) export/import the given memory region
> 
> Would it be better if these were seperate operations?  That is,
> mshare_open() (say) creates/opens the object returning a file
> descriptor.  The fd used as the identifier for the export/import after
> mmap(2); eg.
> addr = mshare_op(EXPORT, fd, addr, size);
> addr = mshare_op(IMPORT, fd, NULL, 0);
> (Not sure about export/import terms..)
> 
> The benefit of the the separate ops is the file descriptor.  This
> could be used for fstat(2) (and fchown(2)?), although not sure how
> much value this would add.

Hi Mark,

That is the question here - what would be the value of the fd to mshare_op? The file in msharefs can be opened like a 
regular file and supports fstat, fchown, etc., which can be used to query/set permissions for the mshare'd region.

> 
> The 'importer' would use the address/size of the memory region as
> exported (and stored in msharefs), so no need for /sys file (except
> for human readable info).

I think we still need the /sys/fs/mshare files, right? Since you said msharefs stores information about the address and 
size, I assume you are not proposing to eliminate msharefs.

> 
> If the set-up operations are split in two, then would it make sense to
> also split the teardown as well?  Say, mshare_op(DROP, fd) and
> mshare_unlink(fd)?

A single op is simpler. Every process can call mshare_unlink() and, when the last reference is dropped, the kernel should 
take care of cleaning up the mshare'd region by itself. One of my goals is for mshare to continue to work even if the 
process that created the mshare'd region dies. In a database context, such mshare'd regions can live for a very long 
time. As a result I would rather not make any process responsible for cleaning up the mshare'd region. It should be as 
simple as the mshare'd region disappearing on its own when all references to it are dropped.

Thanks,
Khalid

> 
>>
>> Does that rationale sound reasonable?
> 
> It doesn't sound unreasonable.  As msharefs is providing a namespace
> and perms, it doesn't need much flexibility.  Being able to modifying
> the perms post namespace creation (fchown(2)), before exporting the
> memory region, might be useful in some cases - but as I don't have any
> usecases I'm not claiming it is essential.
> 
>>
>> Thanks,
>> Khalid
> 
> Cheers,
> Mark
>>
>>>
>>> Rather than introduce two new system calls, along with /sys/ file to
>>> communicate global addresses, could mshare() be built on top of shmem
>>> API?  Thinking of something like the below;
>>> 1) For shm_open(3), add a new oflag to indicate the properties needed
>>> for mshare() (say, O_SHARED_PTE - better name?)
>>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
>>> in the sizes which can be set.
>>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
>>> objects.  On first mmap()ing an appropiate address is assigned,
>>> otherwise the current 'global' address is used.
>>> 4) shm_unlink(3) destroys the object when last reference is dropped.
>>>
>>> For 3), might be able to weaken the NULL requirement and validate a
>>> given address on first mapping to ensure it is correctly aligned.
>>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
>>> the default behaviour you require.
>>>
>>> Internally, the handling of mshare()/O_SHARED_PTE memory might be
>>> sufficiently different to shmem that there is not much code sharing
>>> between the two (I haven't thought this through, but the object
>>> naming/refcounting should be similiar), but using shmem would be a
>>> familiar API.
>>>
>>> Any thoughts?
>>>
>>> Cheers,
>>> Mark
>>>
>>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (8 preceding siblings ...)
  2022-01-19 11:38 ` Mark Hemment
@ 2022-01-21  1:08 ` Barry Song
  2022-01-21  2:13   ` Matthew Wilcox
  2022-01-22 11:31 ` Mike Rapoport
  2022-01-25 11:42 ` Kirill A. Shutemov
  11 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2022-01-21  1:08 UTC (permalink / raw)
  To: khalid.aziz
  Cc: akpm, arnd, dave.hansen, david, linux-kernel, linux-mm,
	longpeng2, rppt, surenb, willy

> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
> 
> --
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.

> mshare_unlink() returns 0 on success or -1 on error.

I am still struggling with the user scenarios of these new APIs. This patch
supposes multiple processes will have the same virtual address for the shared
area? How can this be guaranteed when different processes can map different
stacks, heaps, libraries and files?

BTW, it seems you have a different intention with the below?
Shared page tables during fork[1]
[1] https://lwn.net/Articles/861547/

Thanks
Barry


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-21  1:08 ` Barry Song
@ 2022-01-21  2:13   ` Matthew Wilcox
  2022-01-21  7:35     ` Barry Song
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-21  2:13 UTC (permalink / raw)
  To: Barry Song
  Cc: khalid.aziz, akpm, arnd, dave.hansen, david, linux-kernel,
	linux-mm, longpeng2, rppt, surenb

On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> > A file under /sys/fs/mshare can be opened and read from. A read from
> > this file returns two long values - (1) starting address, and (2)
> > size of the mshare'd region.
> > 
> > --
> > int mshare_unlink(char *name)
> > 
> > A shared address range created by mshare() can be destroyed using
> > mshare_unlink() which removes the  shared named object. Once all
> > processes have unmapped the shared object, the shared address range
> > references are de-allocated and destroyed.
> 
> > mshare_unlink() returns 0 on success or -1 on error.
> 
> I am still struggling with the user scenarios of these new APIs. This patch
> supposes multiple processes will have same virtual address for the shared
> area? How can this be guaranteed while different processes can map different
> stack, heap, libraries, files?

The two processes choose to share a chunk of their address space.
They can map anything they like in that shared area, and then also
anything they like in the areas that aren't shared.  They can choose
for that shared area to have the same address in both processes
or different locations in each process.

If two processes want to put a shared library in that shared address
space, that should work.  They probably would need to agree to use
the same virtual address for the shared page tables for that to work.

Processes should probably not put their stacks in the shared region.
I mean, it could work, I suppose ... threads manage it in a single
address space.  But I don't see why you'd want to do that.  For
heaps, if you want the other process to be able to access the memory,
I suppose you could put it in the shared region, but heaps aren't
going to be put in the shared region by default.

Think of this like hugetlbfs, only instead of sharing hugetlbfs
memory, you can share _anything_ that's mmapable.

> BTW, it seems you have different intention with the below?
> Shared page tables during fork[1]
> [1] https://lwn.net/Articles/861547/

Yes, that's completely different.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-21  2:13   ` Matthew Wilcox
@ 2022-01-21  7:35     ` Barry Song
  2022-01-21 14:47       ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Barry Song @ 2022-01-21  7:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: khalid.aziz, Andrew Morton, Arnd Bergmann, Dave Hansen,
	David Hildenbrand, LKML, Linux-MM, longpeng2, Mike Rapoport,
	Suren Baghdasaryan

On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> > > A file under /sys/fs/mshare can be opened and read from. A read from
> > > this file returns two long values - (1) starting address, and (2)
> > > size of the mshare'd region.
> > >
> > > --
> > > int mshare_unlink(char *name)
> > >
> > > A shared address range created by mshare() can be destroyed using
> > > mshare_unlink() which removes the  shared named object. Once all
> > > processes have unmapped the shared object, the shared address range
> > > references are de-allocated and destroyed.
> >
> > > mshare_unlink() returns 0 on success or -1 on error.
> >
> > I am still struggling with the user scenarios of these new APIs. This patch
> > supposes multiple processes will have same virtual address for the shared
> > area? How can this be guaranteed while different processes can map different
> > stack, heap, libraries, files?
>
> The two processes choose to share a chunk of their address space.
> They can map anything they like in that shared area, and then also
> anything they like in the areas that aren't shared.  They can choose
> for that shared area to have the same address in both processes
> or different locations in each process.
>
> If two processes want to put a shared library in that shared address
> space, that should work.  They probably would need to agree to use
> the same virtual address for the shared page tables for that to work.

We are depending on the ELF loader and ld.so to map libraries
dynamically, so we can hardly find a chance in users' code to call
mshare() to map libraries at the application level.

So are we supposed to modify some very low-level code to use this
feature?

>
> Processes should probably not put their stacks in the shared region.
> I mean, it could work, I suppose ... threads manage it in a single
> address space.  But I don't see why you'd want to do that.  For
> heaps, if you want the other process to be able to access the memory,
> I suppose you could put it in the shared region, but heaps aren't
> going to be put in the shared region by default.
>
> Think of this like hugetlbfs, only instead of sharing hugetlbfs
> memory, you can share _anything_ that's mmapable.

Yep, we can call mshare() on any kind of memory, for example if multiple
processes use SYSV shmem, POSIX shmem or mmap the same file. But it seems
it would be more sensible to let the kernel do this automatically rather
than depending on users calling mshare()? It is difficult for users to
decide which areas mshare() should be applied to. Users might want to
call mshare() for all shared areas to save the memory consumed by
duplicated PTEs? Unlike SYSV shmem and POSIX shmem, which are a feature
for inter-process communication, mshare() looks less like a feature for
applications and more like a feature for the whole system? Why would
applications have to call something which doesn't directly help them?
Without mshare(), those applications will still work without any
problem, right? Is there anything in mshare() which is a must-have for
applications, or is mshare() only a suggestion from applications, like
madvise()?

>
> > BTW, it seems you have different intention with the below?
> > Shared page tables during fork[1]
> > [1] https://lwn.net/Articles/861547/
>
> Yes, that's completely different.

Thanks for clarification.

Best Regards.
Barry

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-21  7:35     ` Barry Song
@ 2022-01-21 14:47       ` Matthew Wilcox
  2022-01-21 16:41         ` Khalid Aziz
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-21 14:47 UTC (permalink / raw)
  To: Barry Song
  Cc: khalid.aziz, Andrew Morton, Arnd Bergmann, Dave Hansen,
	David Hildenbrand, LKML, Linux-MM, longpeng2, Mike Rapoport,
	Suren Baghdasaryan

On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> > > > A file under /sys/fs/mshare can be opened and read from. A read from
> > > > this file returns two long values - (1) starting address, and (2)
> > > > size of the mshare'd region.
> > > >
> > > > --
> > > > int mshare_unlink(char *name)
> > > >
> > > > A shared address range created by mshare() can be destroyed using
> > > > mshare_unlink() which removes the  shared named object. Once all
> > > > processes have unmapped the shared object, the shared address range
> > > > references are de-allocated and destroyed.
> > >
> > > > mshare_unlink() returns 0 on success or -1 on error.
> > >
> > > I am still struggling with the user scenarios of these new APIs. This patch
> > > supposes multiple processes will have same virtual address for the shared
> > > area? How can this be guaranteed while different processes can map different
> > > stack, heap, libraries, files?
> >
> > The two processes choose to share a chunk of their address space.
> > They can map anything they like in that shared area, and then also
> > anything they like in the areas that aren't shared.  They can choose
> > for that shared area to have the same address in both processes
> > or different locations in each process.
> >
> > If two processes want to put a shared library in that shared address
> > space, that should work.  They probably would need to agree to use
> > the same virtual address for the shared page tables for that to work.
> 
> we are depending on an elf loader and ld to map the library
> dynamically , so hardly
> can we find a chance in users' code to call mshare() to map libraries
> in application
> level?

If somebody wants to modify ld.so to take advantage of mshare(), they
could.  That wasn't our primary motivation here, so if it turns out to
not work for that use case, well, that's a shame.

> > Think of this like hugetlbfs, only instead of sharing hugetlbfs
> > memory, you can share _anything_ that's mmapable.
> 
> yep, we can call mshare() on any kind of memory. for example, if multiple
> processes use SYSV shmem, posix shmem or mmap the same file. but
> it seems it is more sensible to let kernel do it automatically rather than
> depending on calling mshare() from users? It is difficult for users to
> decide which areas should be applied mshare(). users might want to call
> mshare() for all shared areas to save memory coming from duplicated PTEs?
> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
> communications,  mshare() looks not like a feature for applications,
> but like a feature
> for the whole system level? why would applications have to call something which
> doesn't directly help them? without mshare(), those applications
> will still work without any problem, right? is there anything in
> mshare() which is
> a must-have for applications? or mshare() is only a suggestion from applications
> like madvise()?

Our use case is that we have some very large files stored on persistent
memory which we want to mmap in thousands of processes.  So the first
one shares a chunk of its address space and mmaps all the files into
that chunk of address space.  Subsequent processes find that a suitable
address space already exists and use it, sharing the page tables and
avoiding the calls to mmap.

Sharing page tables is akin to running multiple threads in a single
address space; except that only part of the address space is the same.
There does need to be a certain amount of trust between the processes
sharing the address space.  You don't want to do it to an unsuspecting
process.
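
To make that flow concrete, here is a rough sketch modelled on the example
code in the cover letter.  It is only illustrative: MSHARE_SYSCALL is the
placeholder used in those examples, and the region name, file paths and
one-GB-per-file layout are invented for this sketch; error handling is trimmed.

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define GB(x)	((size_t)(x) << 30)
	#define TB(x)	((size_t)(x) << 40)
	#define NFILES	4

	int main(void)
	{
		char path[64], *base;
		int i, fd;

		/* Donor: donate a pgdir-aligned, pgdir-sized chunk of address space. */
		base = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (base == MAP_FAILED)
			return 1;
		if (syscall(MSHARE_SYSCALL, "pmem-region", base, GB(512),
			    O_CREAT | O_RDWR | O_EXCL, 0600) < 0)
			return 1;

		/* Map the large (persistent memory) files into the shared chunk. */
		for (i = 0; i < NFILES; i++) {
			snprintf(path, sizeof(path), "/mnt/pmem/file%d", i);
			fd = open(path, O_RDWR);
			if (fd < 0)
				return 1;
			mmap(base + i * GB(1), GB(1), PROT_READ | PROT_WRITE,
			     MAP_SHARED | MAP_FIXED, fd, 0);
			close(fd);
		}

		/*
		 * Subsequent processes only call mshare("pmem-region", ...) and
		 * see the same mappings through the shared page tables, without
		 * any mmap() calls of their own.
		 */
		return 0;
	}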

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-21 14:47       ` Matthew Wilcox
@ 2022-01-21 16:41         ` Khalid Aziz
  2022-01-22  1:39           ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  0 siblings, 1 reply; 54+ messages in thread
From: Khalid Aziz @ 2022-01-21 16:41 UTC (permalink / raw)
  To: Matthew Wilcox, Barry Song
  Cc: Andrew Morton, Arnd Bergmann, Dave Hansen, David Hildenbrand,
	LKML, Linux-MM, longpeng2, Mike Rapoport, Suren Baghdasaryan

On 1/21/22 07:47, Matthew Wilcox wrote:
> On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
>> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
>>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
>>>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>>>> this file returns two long values - (1) starting address, and (2)
>>>>> size of the mshare'd region.
>>>>>
>>>>> --
>>>>> int mshare_unlink(char *name)
>>>>>
>>>>> A shared address range created by mshare() can be destroyed using
>>>>> mshare_unlink() which removes the  shared named object. Once all
>>>>> processes have unmapped the shared object, the shared address range
>>>>> references are de-allocated and destroyed.
>>>>
>>>>> mshare_unlink() returns 0 on success or -1 on error.
>>>>
>>>> I am still struggling with the user scenarios of these new APIs. This patch
>>>> supposes multiple processes will have same virtual address for the shared
>>>> area? How can this be guaranteed while different processes can map different
>>>> stack, heap, libraries, files?
>>>
>>> The two processes choose to share a chunk of their address space.
>>> They can map anything they like in that shared area, and then also
>>> anything they like in the areas that aren't shared.  They can choose
>>> for that shared area to have the same address in both processes
>>> or different locations in each process.
>>>
>>> If two processes want to put a shared library in that shared address
>>> space, that should work.  They probably would need to agree to use
>>> the same virtual address for the shared page tables for that to work.
>>
>> we are depending on an elf loader and ld to map the library
>> dynamically , so hardly
>> can we find a chance in users' code to call mshare() to map libraries
>> in application
>> level?
> 
> If somebody wants to modify ld.so to take advantage of mshare(), they
> could.  That wasn't our primary motivation here, so if it turns out to
> not work for that usecase, well, that's a shame.
> 
>>> Think of this like hugetlbfs, only instead of sharing hugetlbfs
>>> memory, you can share _anything_ that's mmapable.
>>
>> yep, we can call mshare() on any kind of memory. for example, if multiple
>> processes use SYSV shmem, posix shmem or mmap the same file. but
>> it seems it is more sensible to let kernel do it automatically rather than
>> depending on calling mshare() from users? It is difficult for users to
>> decide which areas should be applied mshare(). users might want to call
>> mshare() for all shared areas to save memory coming from duplicated PTEs?
>> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
>> communications,  mshare() looks not like a feature for applications,
>> but like a feature
>> for the whole system level? why would applications have to call something which
>> doesn't directly help them? without mshare(), those applications
>> will still work without any problem, right? is there anything in
>> mshare() which is
>> a must-have for applications? or mshare() is only a suggestion from applications
>> like madvise()?
> 
> Our use case is that we have some very large files stored on persistent
> memory which we want to mmap in thousands of processes.  So the first
> one shares a chunk of its address space and mmaps all the files into
> that chunk of address space.  Subsequent processes find that a suitable
> address space already exists and use it, sharing the page tables and
> avoiding the calls to mmap.
> 
> Sharing page tables is akin to running multiple threads in a single
> address space; except that only part of the address space is the same.
> There does need to be a certain amount of trust between the processes
> sharing the address space.  You don't want to do it to an unsuspecting
> process.
> 

Hello Barry,

mshare() is really meant for sharing data across unrelated processes by sharing address space explicitly, and hence
opt-in is required. As Matthew said, the processes sharing this virtual address space need to have a level of trust.
Permissions on the msharefs files control who can access this shared address space (see the sketch after the quoted
example below). It is possible to adapt this mechanism to share stack, libraries etc., but that is not the intent. This
feature will be used by applications that normally share data with multiple processes using shared mappings, and it
helps them avoid the overhead of a large number of duplicated PTEs, which consume memory. This extra memory consumed by
PTEs reduces the amount of memory available to applications and can result in an out-of-memory condition. An example
from patch 0/6:

"On a database server with 300GB SGA, a system crash was seen with
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, in the
worst case scenario of all 1500 processes mapping every page from
SGA would have required 878GB+ for just the PTEs. If these PTEs
could be shared, amount of memory saved is very significant."
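
As a sketch of how that access control looks from userspace (the path and
region name are examples only, and this assumes msharefs files accept regular
stat()/chmod() calls, which is the intent of msharefs but should not be read
as guaranteed by this RFC):

	#include <stdio.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct stat st;
		const char *region = "/sys/fs/mshare/testregion";

		/* Only users passing these file permission checks can mshare() the region. */
		if (stat(region, &st) == 0)
			printf("owner uid %u, mode %o\n",
			       (unsigned)st.st_uid, (unsigned)(st.st_mode & 0777));

		/* The creator can tighten or relax access like on any other file. */
		chmod(region, 0600);
		return 0;
	}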

--
Khalid

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-21 16:41         ` Khalid Aziz
@ 2022-01-22  1:39           ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2022-01-22  1:41             ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2022-01-22  1:39 UTC (permalink / raw)
  To: Khalid Aziz, Matthew Wilcox, Barry Song
  Cc: Andrew Morton, Arnd Bergmann, Dave Hansen, David Hildenbrand,
	LKML, Linux-MM, Mike Rapoport, Suren Baghdasaryan



> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@oracle.com]
> Sent: Saturday, January 22, 2022 12:42 AM
> To: Matthew Wilcox <willy@infradead.org>; Barry Song <21cnbao@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>; Arnd Bergmann <arnd@arndb.de>;
> Dave Hansen <dave.hansen@linux.intel.com>; David Hildenbrand
> <david@redhat.com>; LKML <linux-kernel@vger.kernel.org>; Linux-MM
> <linux-mm@kvack.org>; Longpeng (Mike, Cloud Infrastructure Service Product
> Dept.) <longpeng2@huawei.com>; Mike Rapoport <rppt@kernel.org>; Suren
> Baghdasaryan <surenb@google.com>
> Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
> 
> On 1/21/22 07:47, Matthew Wilcox wrote:
> > On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
> >> On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>> On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
> >>>>> A file under /sys/fs/mshare can be opened and read from. A read from
> >>>>> this file returns two long values - (1) starting address, and (2)
> >>>>> size of the mshare'd region.
> >>>>>
> >>>>> --
> >>>>> int mshare_unlink(char *name)
> >>>>>
> >>>>> A shared address range created by mshare() can be destroyed using
> >>>>> mshare_unlink() which removes the  shared named object. Once all
> >>>>> processes have unmapped the shared object, the shared address range
> >>>>> references are de-allocated and destroyed.
> >>>>
> >>>>> mshare_unlink() returns 0 on success or -1 on error.
> >>>>
> >>>> I am still struggling with the user scenarios of these new APIs. This patch
> >>>> supposes multiple processes will have same virtual address for the shared
> >>>> area? How can this be guaranteed while different processes can map different
> >>>> stack, heap, libraries, files?
> >>>
> >>> The two processes choose to share a chunk of their address space.
> >>> They can map anything they like in that shared area, and then also
> >>> anything they like in the areas that aren't shared.  They can choose
> >>> for that shared area to have the same address in both processes
> >>> or different locations in each process.
> >>>
> >>> If two processes want to put a shared library in that shared address
> >>> space, that should work.  They probably would need to agree to use
> >>> the same virtual address for the shared page tables for that to work.
> >>
> >> we are depending on an elf loader and ld to map the library
> >> dynamically , so hardly
> >> can we find a chance in users' code to call mshare() to map libraries
> >> in application
> >> level?
> >
> > If somebody wants to modify ld.so to take advantage of mshare(), they
> > could.  That wasn't our primary motivation here, so if it turns out to
> > not work for that usecase, well, that's a shame.
> >
> >>> Think of this like hugetlbfs, only instead of sharing hugetlbfs
> >>> memory, you can share _anything_ that's mmapable.
> >>
> >> yep, we can call mshare() on any kind of memory. for example, if multiple
> >> processes use SYSV shmem, posix shmem or mmap the same file. but
> >> it seems it is more sensible to let kernel do it automatically rather than
> >> depending on calling mshare() from users? It is difficult for users to
> >> decide which areas should be applied mshare(). users might want to call
> >> mshare() for all shared areas to save memory coming from duplicated PTEs?
> >> unlike SYSV shmem and POSIX shmem which are a feature for inter-processes
> >> communications,  mshare() looks not like a feature for applications,
> >> but like a feature
> >> for the whole system level? why would applications have to call something
> which
> >> doesn't directly help them? without mshare(), those applications
> >> will still work without any problem, right? is there anything in
> >> mshare() which is
> >> a must-have for applications? or mshare() is only a suggestion from
> applications
> >> like madvise()?
> >
> > Our use case is that we have some very large files stored on persistent
> > memory which we want to mmap in thousands of processes.  So the first
> > one shares a chunk of its address space and mmaps all the files into
> > that chunk of address space.  Subsequent processes find that a suitable
> > address space already exists and use it, sharing the page tables and
> > avoiding the calls to mmap.
> >
> > Sharing page tables is akin to running multiple threads in a single
> > address space; except that only part of the address space is the same.
> > There does need to be a certain amount of trust between the processes
> > sharing the address space.  You don't want to do it to an unsuspecting
> > process.
> >
> 
> Hello Barry,
> 
> mshare() is really meant for sharing data across unrelated processes by sharing
> address space explicitly and hence
> opt-in is required. As Matthew said, the processes sharing this virtual address
> space need to have a level of trust.
> Permissions on the msharefs files control who can access this shared address
> space. It is possible to adapt this
> mechanism to share stack, libraries etc but that is not the intent. This feature
> will be used by applications that share
> data with multiple processes using shared mapping normally and it helps them
> avoid the overhead of large number of
> duplicated PTEs which consume memory. This extra memory consumed by PTEs reduces
> amount of memory available for
> applications and can result in out-of-memory condition. An example from the patch
> 0/6:
> 
> "On a database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant."
> 

The memory overhead of PTEs would be significantly reduced if we used
hugetlbfs in this case, so why not?

> --
> Khalid

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-22  1:39           ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2022-01-22  1:41             ` Matthew Wilcox
  2022-01-22 10:18               ` Thomas Schoebel-Theuer
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-22  1:41 UTC (permalink / raw)
  To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  Cc: Khalid Aziz, Barry Song, Andrew Morton, Arnd Bergmann,
	Dave Hansen, David Hildenbrand, LKML, Linux-MM, Mike Rapoport,
	Suren Baghdasaryan

On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > Our use case is that we have some very large files stored on persistent
> > > memory which we want to mmap in thousands of processes.  So the first
> 
> The memory overhead of PTEs would be significantly saved if we use
> hugetlbfs in this case, but why not?

Because we want the files to be persistent across reboots.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-22  1:41             ` Matthew Wilcox
@ 2022-01-22 10:18               ` Thomas Schoebel-Theuer
  2022-01-22 16:09                 ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Schoebel-Theuer @ 2022-01-22 10:18 UTC (permalink / raw)
  To: Matthew Wilcox, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.)
  Cc: Khalid Aziz, Barry Song, Andrew Morton, Arnd Bergmann,
	Dave Hansen, David Hildenbrand, LKML, Linux-MM, Mike Rapoport,
	Suren Baghdasaryan

On 1/22/22 2:41 AM, Matthew Wilcox wrote:
> On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>> Our use case is that we have some very large files stored on persistent
>>>> memory which we want to mmap in thousands of processes.  So the first
>> The memory overhead of PTEs would be significantly saved if we use
>> hugetlbfs in this case, but why not?
> Because we want the files to be persistent across reboots.

100% agree. There is another use case: geo-redundancy.

My view is publicly documented at 
https://github.com/schoebel/mars/tree/master/docu and click at 
architecture-guide-geo-redundancy.pdf

In some scenarios, migration of block devices from hardware
architecture A to hardware architecture B, or their (temporary)
co-existence across both, might become a future requirement for me.

The current implementation does not yet use hugetlbfs and/or its
proposed, lower-overhead, more fine-grained and less
hardware-architecture-specific (future) alternatives.

For me, all of these are future options, in particular when they are
(1) abstractable to reduce architectural dependencies and hopefully
(2) usable from both kernelspace and userspace.

It would be great if msharefs were not only low-footprint, but also
usable from kernelspace.

Reducing (or getting rid of) preallocation strategies would also be a
valuable feature for me.

Of course, I cannot yet decide what I will prefer for any future
requirements. But some kind of mutual awareness and future collaboration
would be great.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (9 preceding siblings ...)
  2022-01-21  1:08 ` Barry Song
@ 2022-01-22 11:31 ` Mike Rapoport
  2022-01-22 18:29   ` Andy Lutomirski
  2022-01-24 18:48   ` Khalid Aziz
  2022-01-25 11:42 ` Kirill A. Shutemov
  11 siblings, 2 replies; 54+ messages in thread
From: Mike Rapoport @ 2022-01-22 11:31 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, longpeng2, arnd, dave.hansen, david, surenb,
	linux-kernel, linux-mm, linux-api

(added linux-api)

On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> Page tables in kernel consume some of the memory and as long as
> number of mappings being maintained is small enough, this space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, amount of memory consumed by page
> tables starts to become significant.
> 
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system carsh was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
> 
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
> 
> 
> API Proposal
> ============
> 
> The mshare API consists of two system calls - mshare() and mshare_unlink()
> 
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> 
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
> 
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
> 
> mshare() returns an error code if it fails, otherwise it returns 0.

Did you consider returning a file descriptor from the mshare() system call?
Then there would be no need for mshare_unlink(), as close(fd) would work.
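
For comparison, a purely hypothetical sketch of how that fd-based variant
could look from userspace (this is only the suggestion above, not the posted
interface; TB()/GB() and the region name are placeholders from the example
code):

	/* mshare() here hypothetically returns an fd instead of 0 on success. */
	int fd = mshare("testregion", (void *)TB(2), GB(512),
			O_CREAT | O_RDWR | O_EXCL, 0600);
	if (fd < 0)
		perror("mshare");

	/* ... map objects into the shared range and use it ... */

	close(fd);	/* drops this process's reference; the last close
			   would tear the region down */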
 
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
> 
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
> 
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.

Maybe read should return a structure containing some data identifier and
the data itself, so that it could be extended in the future.
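
For example, something along these lines (a sketch only; the field names and
layout are invented for illustration):

	#include <linux/types.h>

	/* Record returned by read() on a msharefs file. */
	struct mshare_info {
		__u32	size;	/* size of this record, for future extension */
		__u32	flags;
		__u64	start;	/* starting address of the mshare'd region */
		__u64	len;	/* size of the mshare'd region */
		/* new fields can be appended; readers check 'size' first */
	};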
 
> --
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
> 
> mshare_unlink() returns 0 on success or -1 on error.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-22 10:18               ` Thomas Schoebel-Theuer
@ 2022-01-22 16:09                 ` Matthew Wilcox
  0 siblings, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-22 16:09 UTC (permalink / raw)
  To: Thomas Schoebel-Theuer
  Cc: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	Khalid Aziz, Barry Song, Andrew Morton, Arnd Bergmann,
	Dave Hansen, David Hildenbrand, LKML, Linux-MM, Mike Rapoport,
	Suren Baghdasaryan

On Sat, Jan 22, 2022 at 11:18:14AM +0100, Thomas Schoebel-Theuer wrote:
> On 1/22/22 2:41 AM, Matthew Wilcox wrote:
> > On Sat, Jan 22, 2022 at 01:39:46AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > > > Our use case is that we have some very large files stored on persistent
> > > > > memory which we want to mmap in thousands of processes.  So the first
> > > The memory overhead of PTEs would be significantly saved if we use
> > > hugetlbfs in this case, but why not?
> > Because we want the files to be persistent across reboots.
> 
> 100% agree. There is another use case: geo-redundancy.
> 
> My view is publicly documented at
> https://github.com/schoebel/mars/tree/master/docu and click at
> architecture-guide-geo-redundancy.pdf

That's a 160+ page PDF.  No offence, Thomas, I'm not reading that to
try to understand how you want to use page table sharing.

> In some scenarios, migration or (temporary) co-existence of block devices
> from/between hardware architecture A to/between hardware architecture B
> might become a future requirement for me.

I'm not sure how sharing block devices between systems matches up with
sharing page tables between processes.

> It would be great if msharefs is not only low-footprint, but also would be
> usable from kernelspace.

I don't understand what you want here either.  Kernel threads already
share their page tables.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-22 11:31 ` Mike Rapoport
@ 2022-01-22 18:29   ` Andy Lutomirski
  2022-01-24 18:48   ` Khalid Aziz
  1 sibling, 0 replies; 54+ messages in thread
From: Andy Lutomirski @ 2022-01-22 18:29 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Khalid Aziz, akpm, willy, longpeng2, arnd, dave.hansen, david,
	surenb, linux-kernel, linux-mm, linux-api, Andy Lutomirski

> On Jan 22, 2022, at 3:31 AM, Mike Rapoport <rppt@kernel.org> wrote:
>
> (added linux-api)
>
>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as
>> number of mappings being maintained is small enough, this space
>> consumed by page tables is not objectionable. When very few memory
>> pages are shared between processes, the number of page table entries
>> (PTEs) to maintain is mostly constrained by the number of pages of
>> memory on the system. As the number of shared pages and the number
>> of times pages are shared goes up, amount of memory consumed by page
>> tables starts to become significant.

Sharing PTEs is nice, but merely sharing a chunk of address space
regardless of optimizations is nontrivial.  It’s also quite useful,
potentially.  So I think a good way to start would be to make a nice
design for just sharing address space and then, on top of it, figure
out how to share page tables.

See here for an earlier proposal:

https://lore.kernel.org/all/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com/

Alternatively, one could try to optimize memfd so that large similarly
aligned mappings in different processes could share page tables.

Any of the above will require some interesting thought as to whether
TLB shootdowns are managed by the core rmap code or by mmu notifiers.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-20 19:15       ` Khalid Aziz
@ 2022-01-24 15:15         ` Mark Hemment
  2022-01-24 15:27           ` Matthew Wilcox
  2022-01-24 22:20           ` Khalid Aziz
  0 siblings, 2 replies; 54+ messages in thread
From: Mark Hemment @ 2022-01-24 15:15 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	longpeng2, arnd, dave.hansen, david, rppt, Suren Baghdasaryan,
	linux-kernel, linux-mm

On Thu, 20 Jan 2022 at 19:15, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/20/22 05:49, Mark Hemment wrote:
> > On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> On 1/19/22 04:38, Mark Hemment wrote:
> >>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>>>
> >>>> Page tables in kernel consume some of the memory and as long as
> >>>> number of mappings being maintained is small enough, this space
> >>>> consumed by page tables is not objectionable. When very few memory
> >>>> pages are shared between processes, the number of page table entries
> >>>> (PTEs) to maintain is mostly constrained by the number of pages of
> >>>> memory on the system. As the number of shared pages and the number
> >>>> of times pages are shared goes up, amount of memory consumed by page
> >>>> tables starts to become significant.
> >>>>
> >>>> Some of the field deployments commonly see memory pages shared
> >>>> across 1000s of processes. On x86_64, each page requires a PTE that
> >>>> is only 8 bytes long which is very small compared to the 4K page
> >>>> size. When 2000 processes map the same page in their address space,
> >>>> each one of them requires 8 bytes for its PTE and together that adds
> >>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >>>> database server with 300GB SGA, a system carsh was seen with
> >>>> out-of-memory condition when 1500+ clients tried to share this SGA
> >>>> even though the system had 512GB of memory. On this server, in the
> >>>> worst case scenario of all 1500 processes mapping every page from
> >>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >>>> could be shared, amount of memory saved is very significant.
> >>>>
> >>>> This is a proposal to implement a mechanism in kernel to allow
> >>>> userspace processes to opt into sharing PTEs. The proposal is to add
> >>>> a new system call - mshare(), which can be used by a process to
> >>>> create a region (we will call it mshare'd region) which can be used
> >>>> by other processes to map same pages using shared PTEs. Other
> >>>> process(es), assuming they have the right permissions, can then make
> >>>> the mashare() system call to map the shared pages into their address
> >>>> space using the shared PTEs.  When a process is done using this
> >>>> mshare'd region, it makes a mshare_unlink() system call to end its
> >>>> access. When the last process accessing mshare'd region calls
> >>>> mshare_unlink(), the mshare'd region is torn down and memory used by
> >>>> it is freed.
> >>>>
> >>>>
> >>>> API Proposal
> >>>> ============
> >>>>
> >>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>>>
> >>>> --
> >>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>>>
> >>>> mshare() creates and opens a new, or opens an existing mshare'd
> >>>> region that will be shared at PTE level. "name" refers to shared object
> >>>> name that exists under /sys/fs/mshare. "addr" is the starting address
> >>>> of this shared memory area and length is the size of this area.
> >>>> oflags can be one of:
> >>>>
> >>>> - O_RDONLY opens shared memory area for read only access by everyone
> >>>> - O_RDWR opens shared memory area for read and write access
> >>>> - O_CREAT creates the named shared memory area if it does not exist
> >>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>>>     exists with that name, return an error.
> >>>>
> >>>> mode represents the creation mode for the shared object under
> >>>> /sys/fs/mshare.
> >>>>
> >>>> mshare() returns an error code if it fails, otherwise it returns 0.
> >>>>
> >>>> PTEs are shared at pgdir level and hence it imposes following
> >>>> requirements on the address and size given to the mshare():
> >>>>
> >>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
> >>>> - Size must be a multiple of pgdir size
> >>>> - Any mappings created in this address range at any time become
> >>>>     shared automatically
> >>>> - Shared address range can have unmapped addresses in it. Any access
> >>>>     to unmapped address will result in SIGBUS
> >>>>
> >>>> Mappings within this address range behave as if they were shared
> >>>> between threads, so a write to a MAP_PRIVATE mapping will create a
> >>>> page which is shared between all the sharers. The first process that
> >>>> declares an address range mshare'd can continue to map objects in
> >>>> the shared area. All other processes that want mshare'd access to
> >>>> this memory area can do so by calling mshare(). After this call, the
> >>>> address range given by mshare becomes a shared range in its address
> >>>> space. Anonymous mappings will be shared and not COWed.
> >>>>
> >>>> A file under /sys/fs/mshare can be opened and read from. A read from
> >>>> this file returns two long values - (1) starting address, and (2)
> >>>> size of the mshare'd region.
> >>>>
> >>>> --
> >>>> int mshare_unlink(char *name)
> >>>>
> >>>> A shared address range created by mshare() can be destroyed using
> >>>> mshare_unlink() which removes the  shared named object. Once all
> >>>> processes have unmapped the shared object, the shared address range
> >>>> references are de-allocated and destroyed.
> >>>>
> >>>> mshare_unlink() returns 0 on success or -1 on error.
> >>>>
> >>>>
> >>>> Example Code
> >>>> ============
> >>>>
> >>>> Snippet of the code that a donor process would run looks like below:
> >>>>
> >>>> -----------------
> >>>>           addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> >>>>                           MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> >>>>           if (addr == MAP_FAILED)
> >>>>                   perror("ERROR: mmap failed");
> >>>>
> >>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> >>>>                           GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
> >>>>           if (err < 0) {
> >>>>                   perror("mshare() syscall failed");
> >>>>                   exit(1);
> >>>>           }
> >>>>
> >>>>           strncpy(addr, "Some random shared text",
> >>>>                           sizeof("Some random shared text"));
> >>>> -----------------
> >>>>
> >>>> Snippet of code that a consumer process would execute looks like:
> >>>>
> >>>> -----------------
> >>>>           fd = open("testregion", O_RDONLY);
> >>>>           if (fd < 0) {
> >>>>                   perror("open failed");
> >>>>                   exit(1);
> >>>>           }
> >>>>
> >>>>           if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
> >>>>                   printf("INFO: %ld bytes shared at addr %lx \n",
> >>>>                                   mshare_info[1], mshare_info[0]);
> >>>>           else
> >>>>                   perror("read failed");
> >>>>
> >>>>           close(fd);
> >>>>
> >>>>           addr = (char *)mshare_info[0];
> >>>>           err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
> >>>>                           mshare_info[1], O_RDWR, 600);
> >>>>           if (err < 0) {
> >>>>                   perror("mshare() syscall failed");
> >>>>                   exit(1);
> >>>>           }
> >>>>
> >>>>           printf("Guest mmap at %px:\n", addr);
> >>>>           printf("%s\n", addr);
> >>>>           printf("\nDone\n");
> >>>>
> >>>>           err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> >>>>           if (err < 0) {
> >>>>                   perror("mshare_unlink() failed");
> >>>>                   exit(1);
> >>>>           }
> >>>> -----------------
> >>> ...
> >>> Hi Khalid,
> >>>
> >>> The proposed mshare() appears to be similar to POSIX shared memory,
> >>> but with two extra (related) attributes;
> >>> a) Internally, uses shared page tables.
> >>> b) Shared memory is mapped at same address for all users.
> >>
> >> Hi Mark,
> >>
> >> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
> >> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
> >> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
> >> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
> >> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
> >> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
> >> separate implementation.
> >
> > Ah, I jumped in assuming this was only for anon memory.
> >
> >> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
> >> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.
> >
> > So possible you have already considered the below...which does make
> > the API a little more POSIX shm like.
> >
> > The mshare() syscall does two operations;
> > 1) create/open mshare object
> > 2) export/import the given memory region
> >
> > Would it be better if these were seperate operations?  That is,
> > mshare_open() (say) creates/opens the object returning a file
> > descriptor.  The fd used as the identifier for the export/import after
> > mmap(2); eg.
> > addr = mshare_op(EXPORT, fd, addr, size);
> > addr = mshare_op(IMPORT, fd, NULL, 0);
> > (Not sure about export/import terms..)
> >
> > The benefit of the the separate ops is the file descriptor.  This
> > could be used for fstat(2) (and fchown(2)?), although not sure how
> > much value this would add.
>
> Hi Mark,
>
> That is the question here - what would be the value of fd to mshare_op? The file in msharefs can be opened like a
> regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region.

Hi Khalid,

In your proposed API, the 'importer' of the mshared region does not
open the mshared backing object (when a file is being mapped); instead
it does an open on the msharefs file.
From the code sample in your initial email (simplified), where a
process attaches to the mshared region;
    fd = open("testregion", O_RDONLY);
    read(fd, &mshare_info, sizeof (mshare_info));
    mshare("testregion", addr, len, RDWR, 0600);

Open permission checks are done by the mshare() system call against
the msharefs file ("testregion").

From the code sample in your initial email (simplified), where a
process creates a msharefs file with the anonymous mmap()ed region to
be shared;
    addr = mmap(RDWR, ANON);
    mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600);

Now, consider the case where the mmap() is named (that is, against a
file).  I believe this is the usecase for Oracle's SGA.
My (simplified) code for msharing a named file ("SGA") using your
proposed API (does not matter if the mapping is PRIVATE or SHARED);
    fd = open("SGA", RDWR);
    addr = mmap(RDWR, ..., fd);
    mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600);

If the permissions (usr/grp+perms+ACL) between the "SGA" file and the
"SGA-region" msharefs are different, then it is very likely a serious
security issue.
That is, a user who could not open(2) the "SGA" file might be able to
open the "SGA-region" msharefs file, and so gain at least read
permission on the file.
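
Spelling the concern out in the same simplified style as the snippets above
(nothing here beyond what the proposed API already provides; the comments
describe the scenario I am worried about, not tested behaviour):

	/* A user who is denied access to the backing file directly...       */
	fd = open("SGA", O_RDWR);		/* fails with EACCES           */

	/* ...but who can open the more permissive msharefs file...          */
	fd = open("SGA-region", O_RDONLY);	/* succeeds if msharefs perms are laxer */
	read(fd, &mshare_info, sizeof(mshare_info));

	/* ...can attach to the region and reach the file's pages anyway.    */
	mshare("SGA-region", (void *)mshare_info[0], mshare_info[1], O_RDWR, 0600);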

This is why I was proposing a file descriptor, so the msharefs file
could be set to have the same permissions as the backing file it is
exporting (but I got this wrong).
This would still leave a window between the msharefs file being
created and the permissions being set, where a rogue process could
attach to a region when it should not have permission (this could be
closed by failing a non-creating mshare() while the region is of zero
length - nothing yet shared - until permissions are set and the region
shared).
But relying on userspace to always set the correct permissions on the
msharefs file is dangerous - likely to get it wrong on occasion - and
isn't sufficient.  The msharefs API needs to be bulletproof.

Looking at the patches, I cannot see where extra validation is being
done for a named mapping to ensure any 'importer' has the necessary
permissions on the backing file.
The 'struct file' (->vm_file, and associated inode) in the VMA is
sufficient to perform the required access checks against the file's
perms - the posted patches do not check this (but they are for an RFC,
so don't expect all cases to be handled).  But what about a full path
permission check?  That is, the 'importer' has the necessary
permissions on the backing file, but would not be able to find this
file due to directory permissions?  msharefs would bypass the
directory checks.


> >
> > The 'importer' would use the address/size of the memory region as
> > exported (and stored in msharefs), so no need for /sys file (except
> > for human readable info).
>
> I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size,
> I assume you are not proposing eliminating msharefs.

The 'exporter' of the mshared region specifies the address and length,
which are therefore known by the mshare code.
An 'importer' need only pass NULL/0 for addr/len and is told by
mshare where the region has been attached in its address space.  With
this, the /sys file is no longer part of the API.


> >
> > If the set-up operations are split in two, then would it make sense to
> > also split the teardown as well?  Say, mshare_op(DROP, fd) and
> > mshare_unlink(fd)?
>
> A single op is simpler. Every process can call mshare_unlink() and if last reference is dropped, kernel should take care
> of cleaning up mshare'd region by itself. One of my goals is for mshare to continue to work even if the process that
> created the mshare region dies. In a database context, such mshare'd regions can live for very long time. As a result I
> would rather not make any process be responsible for cleaning up the mshare'd region. It should be as simple as the
> mshare'd region disappearing on its own when all references to it are dropped.
>
> Thanks,
> Khalid

Cheers,
Mark

>
> >
> >>
> >> Does that rationale sound reasonable?
> >
> > It doesn't sound unreasonable.  As msharefs is providing a namespace
> > and perms, it doesn't need much flexibility.  Being able to modifying
> > the perms post namespace creation (fchown(2)), before exporting the
> > memory region, might be useful in some cases - but as I don't have any
> > usecases I'm not claiming it is essential.
> >
> >>
> >> Thanks,
> >> Khalid
> >
> > Cheers,
> > Mark
> >>
> >>>
> >>> Rather than introduce two new system calls, along with /sys/ file to
> >>> communicate global addresses, could mshare() be built on top of shmem
> >>> API?  Thinking of something like the below;
> >>> 1) For shm_open(3), add a new oflag to indicate the properties needed
> >>> for mshare() (say, O_SHARED_PTE - better name?)
> >>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
> >>> in the sizes which can be set.
> >>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
> >>> objects.  On first mmap()ing an appropiate address is assigned,
> >>> otherwise the current 'global' address is used.
> >>> 4) shm_unlink(3) destroys the object when last reference is dropped.
> >>>
> >>> For 3), might be able to weaken the NULL requirement and validate a
> >>> given address on first mapping to ensure it is correctly aligned.
> >>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
> >>> the default behaviour you require.
> >>>
> >>> Internally, the handling of mshare()/O_SHARED_PTE memory might be
> >>> sufficiently different to shmem that there is not much code sharing
> >>> between the two (I haven't thought this through, but the object
> >>> naming/refcounting should be similiar), but using shmem would be a
> >>> familiar API.
> >>>
> >>> Any thoughts?
> >>>
> >>> Cheers,
> >>> Mark
> >>>
> >>
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-24 15:15         ` Mark Hemment
@ 2022-01-24 15:27           ` Matthew Wilcox
  2022-01-24 22:20           ` Khalid Aziz
  1 sibling, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-24 15:27 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Khalid Aziz, Andrew Morton, longpeng2, arnd, dave.hansen, david,
	rppt, Suren Baghdasaryan, linux-kernel, linux-mm

On Mon, Jan 24, 2022 at 03:15:36PM +0000, Mark Hemment wrote:
> From the code sample in your initial email (simplified), where a
> process creates a msharefs file with the anonymous mmap()ed region to
> be shared;
>     addr = mmap(RDWR, ANON);
>     mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600);
> 
> Now, consider the case where the mmap() is named (that is, against a
> file).  I believe this is the usecase for Oracle's SGA.
> My (simplified) code for msharing a named file ("SGA") using your
> proposed API (does not matter if the mapping is PRIVATE or SHARED);
>     fd = open("SGA", RDWR);
>     addr = mmap(RDWR, ..., fd);
>     mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600);

Don't think of an mshared region as containing only one file.
It might easily contain dozens.  Or none at the start.  They're
dynamic; the mshare fd represents a chunk of address space, not
whatever is currently mapped there.

> If the permissions (usr/grp+perms+ACL) between the "SGA" file and the
> "SGA-region" msharefs are different, then it is very likely a serious
> security issue.

Only in the same sense that an application might open() a file that it
has permission to access and then open a pipe/socket to a process that
does not have permission and send the data to it.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-22 11:31 ` Mike Rapoport
  2022-01-22 18:29   ` Andy Lutomirski
@ 2022-01-24 18:48   ` Khalid Aziz
  2022-01-24 19:45     ` Andy Lutomirski
  1 sibling, 1 reply; 54+ messages in thread
From: Khalid Aziz @ 2022-01-24 18:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, willy, longpeng2, arnd, dave.hansen, david, surenb,
	linux-kernel, linux-mm, linux-api

On 1/22/22 04:31, Mike Rapoport wrote:
> (added linux-api)
> 
> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as
>> number of mappings being maintained is small enough, this space
>> consumed by page tables is not objectionable. When very few memory
>> pages are shared between processes, the number of page table entries
>> (PTEs) to maintain is mostly constrained by the number of pages of
>> memory on the system. As the number of shared pages and the number
>> of times pages are shared goes up, amount of memory consumed by page
>> tables starts to become significant.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system carsh was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
>>
>> This is a proposal to implement a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. The proposal is to add
>> a new system call - mshare(), which can be used by a process to
>> create a region (we will call it mshare'd region) which can be used
>> by other processes to map same pages using shared PTEs. Other
>> process(es), assuming they have the right permissions, can then make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API Proposal
>> ============
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
> 
> Did you consider returning a file descriptor from mshare() system call?
> Then there would be no need in mshare_unlink() as close(fd) would work.

That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
for application writers. A close() call with the side-effect of deleting a shared mapping would be odd. One of the use
cases for having files for mshare'd regions is to allow orphaned mshare'd regions to be cleaned up by calling
mshare_unlink() with the region name. This can require calling mshare_unlink() multiple times in the current
implementation to bring the refcount for the mshare'd region to 0, at which point mshare_unlink() finally cleans up the
region. This would be problematic with close() semantics though, unless there was another way to force the refcount to
0. Right?
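
For instance, cleaning up an orphaned region with the current proposal could
look like this (a sketch, reusing the MSHARE_UNLINK_SYSCALL placeholder from
the example code in the cover letter):

	/* Keep dropping references by name; the call that takes the refcount
	 * to 0 tears the mshare'd region down, and the next call fails. */
	while (syscall(MSHARE_UNLINK_SYSCALL, "testregion") == 0)
		;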


>   
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>>    shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>>    to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
> 
> Maybe read should return a structure containing some data identifier and
> the data itself, so that it could be extended in the future.

I like that idea. I will work on it.

Thanks!

--
Khalid

>   
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-24 18:48   ` Khalid Aziz
@ 2022-01-24 19:45     ` Andy Lutomirski
  2022-01-24 22:30       ` Khalid Aziz
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2022-01-24 19:45 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
	surenb, linux-kernel, linux-mm, linux-api

On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/22/22 04:31, Mike Rapoport wrote:
> > (added linux-api)
> >
> > On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> >> Page tables in kernel consume some of the memory and as long as
> >> number of mappings being maintained is small enough, this space
> >> consumed by page tables is not objectionable. When very few memory
> >> pages are shared between processes, the number of page table entries
> >> (PTEs) to maintain is mostly constrained by the number of pages of
> >> memory on the system. As the number of shared pages and the number
> >> of times pages are shared goes up, amount of memory consumed by page
> >> tables starts to become significant.
> >>
> >> Some of the field deployments commonly see memory pages shared
> >> across 1000s of processes. On x86_64, each page requires a PTE that
> >> is only 8 bytes long which is very small compared to the 4K page
> >> size. When 2000 processes map the same page in their address space,
> >> each one of them requires 8 bytes for its PTE and together that adds
> >> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >> database server with 300GB SGA, a system carsh was seen with
> >> out-of-memory condition when 1500+ clients tried to share this SGA
> >> even though the system had 512GB of memory. On this server, in the
> >> worst case scenario of all 1500 processes mapping every page from
> >> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >> could be shared, amount of memory saved is very significant.
> >>
> >> This is a proposal to implement a mechanism in kernel to allow
> >> userspace processes to opt into sharing PTEs. The proposal is to add
> >> a new system call - mshare(), which can be used by a process to
> >> create a region (we will call it mshare'd region) which can be used
> >> by other processes to map same pages using shared PTEs. Other
> >> process(es), assuming they have the right permissions, can then make
> >> the mashare() system call to map the shared pages into their address
> >> space using the shared PTEs.  When a process is done using this
> >> mshare'd region, it makes a mshare_unlink() system call to end its
> >> access. When the last process accessing mshare'd region calls
> >> mshare_unlink(), the mshare'd region is torn down and memory used by
> >> it is freed.
> >>
> >>
> >> API Proposal
> >> ============
> >>
> >> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>
> >> --
> >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>
> >> mshare() creates and opens a new, or opens an existing mshare'd
> >> region that will be shared at PTE level. "name" refers to shared object
> >> name that exists under /sys/fs/mshare. "addr" is the starting address
> >> of this shared memory area and length is the size of this area.
> >> oflags can be one of:
> >>
> >> - O_RDONLY opens shared memory area for read only access by everyone
> >> - O_RDWR opens shared memory area for read and write access
> >> - O_CREAT creates the named shared memory area if it does not exist
> >> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>    exists with that name, return an error.
> >>
> >> mode represents the creation mode for the shared object under
> >> /sys/fs/mshare.
> >>
> >> mshare() returns an error code if it fails, otherwise it returns 0.
> >
> > Did you consider returning a file descriptor from mshare() system call?
> > Then there would be no need in mshare_unlink() as close(fd) would work.
>
> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
> with a close() semantics though unless there was another way to force refcount to 0. Right?
>

I'm not sure I understand the problem.  If you're sharing a portion of
an mm and the mm goes away, then all that should be left are some
struct files that are no longer useful.  They'll go away when their
refcount goes to zero.

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-24 15:15         ` Mark Hemment
  2022-01-24 15:27           ` Matthew Wilcox
@ 2022-01-24 22:20           ` Khalid Aziz
  1 sibling, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-24 22:20 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Andrew Morton, Matthew Wilcox (Oracle),
	longpeng2, arnd, dave.hansen, david, rppt, Suren Baghdasaryan,
	linux-kernel, linux-mm

On 1/24/22 08:15, Mark Hemment wrote:
> On Thu, 20 Jan 2022 at 19:15, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/20/22 05:49, Mark Hemment wrote:
>>> On Wed, 19 Jan 2022 at 17:02, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> On 1/19/22 04:38, Mark Hemment wrote:
>>>>> On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>>>
>>>>>> Page tables in kernel consume some of the memory and as long as
>>>>>> number of mappings being maintained is small enough, this space
>>>>>> consumed by page tables is not objectionable. When very few memory
>>>>>> pages are shared between processes, the number of page table entries
>>>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>>>> memory on the system. As the number of shared pages and the number
>>>>>> of times pages are shared goes up, amount of memory consumed by page
>>>>>> tables starts to become significant.
>>>>>>
>>>>>> Some of the field deployments commonly see memory pages shared
>>>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>>>> is only 8 bytes long which is very small compared to the 4K page
>>>>>> size. When 2000 processes map the same page in their address space,
>>>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >>>>>> database server with 300GB SGA, a system crash was seen with
>>>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>>>> even though the system had 512GB of memory. On this server, in the
>>>>>> worst case scenario of all 1500 processes mapping every page from
>>>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>>>> could be shared, amount of memory saved is very significant.
>>>>>>
>>>>>> This is a proposal to implement a mechanism in kernel to allow
>>>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>>>> a new system call - mshare(), which can be used by a process to
>>>>>> create a region (we will call it mshare'd region) which can be used
>>>>>> by other processes to map same pages using shared PTEs. Other
>>>>>> process(es), assuming they have the right permissions, can then make
> >>>>>> the mshare() system call to map the shared pages into their address
>>>>>> space using the shared PTEs.  When a process is done using this
>>>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>>>> access. When the last process accessing mshare'd region calls
>>>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>>>> it is freed.
>>>>>>
>>>>>>
>>>>>> API Proposal
>>>>>> ============
>>>>>>
>>>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>>>
>>>>>> --
>>>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>>>
>>>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>>>> region that will be shared at PTE level. "name" refers to shared object
>>>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>>>> of this shared memory area and length is the size of this area.
>>>>>> oflags can be one of:
>>>>>>
>>>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>>>> - O_RDWR opens shared memory area for read and write access
>>>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>>>      exists with that name, return an error.
>>>>>>
>>>>>> mode represents the creation mode for the shared object under
>>>>>> /sys/fs/mshare.
>>>>>>
>>>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>>>>
>>>>>> PTEs are shared at pgdir level and hence it imposes following
>>>>>> requirements on the address and size given to the mshare():
>>>>>>
>>>>>> - Starting address must be aligned to pgdir size (512GB on x86_64)
>>>>>> - Size must be a multiple of pgdir size
>>>>>> - Any mappings created in this address range at any time become
>>>>>>      shared automatically
>>>>>> - Shared address range can have unmapped addresses in it. Any access
>>>>>>      to unmapped address will result in SIGBUS
>>>>>>
>>>>>> Mappings within this address range behave as if they were shared
>>>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>>>> page which is shared between all the sharers. The first process that
>>>>>> declares an address range mshare'd can continue to map objects in
>>>>>> the shared area. All other processes that want mshare'd access to
>>>>>> this memory area can do so by calling mshare(). After this call, the
>>>>>> address range given by mshare becomes a shared range in its address
>>>>>> space. Anonymous mappings will be shared and not COWed.
>>>>>>
>>>>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>>>>> this file returns two long values - (1) starting address, and (2)
>>>>>> size of the mshare'd region.
>>>>>>
>>>>>> --
>>>>>> int mshare_unlink(char *name)
>>>>>>
>>>>>> A shared address range created by mshare() can be destroyed using
>>>>>> mshare_unlink() which removes the  shared named object. Once all
>>>>>> processes have unmapped the shared object, the shared address range
>>>>>> references are de-allocated and destroyed.
>>>>>>
>>>>>> mshare_unlink() returns 0 on success or -1 on error.
>>>>>>
>>>>>>
>>>>>> Example Code
>>>>>> ============
>>>>>>
>>>>>> Snippet of the code that a donor process would run looks like below:
>>>>>>
>>>>>> -----------------
>>>>>>            addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>>>>                            MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>>>>>            if (addr == MAP_FAILED)
>>>>>>                    perror("ERROR: mmap failed");
>>>>>>
>>>>>>            err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>>>>                            GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>>>>>            if (err < 0) {
>>>>>>                    perror("mshare() syscall failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>>
>>>>>>            strncpy(addr, "Some random shared text",
>>>>>>                            sizeof("Some random shared text"));
>>>>>> -----------------
>>>>>>
>>>>>> Snippet of code that a consumer process would execute looks like:
>>>>>>
>>>>>> -----------------
>>>>>>            fd = open("testregion", O_RDONLY);
>>>>>>            if (fd < 0) {
>>>>>>                    perror("open failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>>
>>>>>>            if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>>>>>                    printf("INFO: %ld bytes shared at addr %lx \n",
>>>>>>                                    mshare_info[1], mshare_info[0]);
>>>>>>            else
>>>>>>                    perror("read failed");
>>>>>>
>>>>>>            close(fd);
>>>>>>
>>>>>>            addr = (char *)mshare_info[0];
>>>>>>            err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>>>>>                            mshare_info[1], O_RDWR, 600);
>>>>>>            if (err < 0) {
>>>>>>                    perror("mshare() syscall failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>>
>>>>>>            printf("Guest mmap at %px:\n", addr);
>>>>>>            printf("%s\n", addr);
>>>>>>            printf("\nDone\n");
>>>>>>
>>>>>>            err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>>>>>            if (err < 0) {
>>>>>>                    perror("mshare_unlink() failed");
>>>>>>                    exit(1);
>>>>>>            }
>>>>>> -----------------
>>>>> ...
>>>>> Hi Khalid,
>>>>>
>>>>> The proposed mshare() appears to be similar to POSIX shared memory,
>>>>> but with two extra (related) attributes;
>>>>> a) Internally, uses shared page tables.
>>>>> b) Shared memory is mapped at same address for all users.
>>>>
>>>> Hi Mark,
>>>>
>>>> You are right there are a few similarities with POSIX shm but there is one key difference - unlike shm, shared region
>>>> access does not go through a filesystem. msharefs exists to query mshare'd regions and enforce access restrictions.
>>>> mshare is meant to allow sharing any existing regions that might map a file, may be anonymous or map any other object.
>>>> Any consumer process can use the same PTEs to access whatever might be mapped in that region which is quite different
>>>> from what shm does. Because of the similarities between the two, I had started a prototype using POSIX shm API to
>>>> leverage that code but I found myself special casing mshare often enough in shm code that it made sense to go with a
>>>> separate implementation.
>>>
>>> Ah, I jumped in assuming this was only for anon memory.
>>>
>>>> I considered an API very much like POSIX shm but a simple mshare() syscall at any time to share
>>>> a range of addresses that may be fully or partially mapped in is a simpler and more versatile API.
>>>
>>> So possible you have already considered the below...which does make
>>> the API a little more POSIX shm like.
>>>
>>> The mshare() syscall does two operations;
>>> 1) create/open mshare object
>>> 2) export/import the given memory region
>>>
>>> Would it be better if these were separate operations?  That is,
>>> mshare_open() (say) creates/opens the object returning a file
>>> descriptor.  The fd used as the identifier for the export/import after
>>> mmap(2); eg.
>>> addr = mshare_op(EXPORT, fd, addr, size);
>>> addr = mshare_op(IMPORT, fd, NULL, 0);
>>> (Not sure about export/import terms..)
>>>
>>> The benefit of the separate ops is the file descriptor.  This
>>> could be used for fstat(2) (and fchown(2)?), although not sure how
>>> much value this would add.
>>
>> Hi Mark,
>>
>> That is the question here - what would be the value of fd to mshare_op? The file in msharefs can be opened like a
>> regular file and supports fstat, fchown etc which can be used to query/set permissions for the mshare'd region.
> 
> Hi Khalid,
> 
> In your proposed API, the 'importer' of the mshared region does not
> open the mshared backing object (when a file being mapped) instead it
> does an open on the msharefs file.
>  From the code sample in your initial email (simplified), where a
> process attaches to the mshared region;
>      fd = open("testregion", O_RDONLY);
>      read(fd, &mshare_info, sizeof (mshare_info));
>      mshare("testregion", addr, len, RDWR, 0600);
> 
> Open permission checks are done by the mshare() system call against
> the msharefs file ("testregion").
> 
>  From the code sample in your initial email (simplified), where a
> process creates a msharefs file with the anonymous mmap()ed region to
> be shared;
>      addr = mmap(RDWR, ANON);
>      mshare("testregion", addr, len, CREAT|RDWR|EXCL, 0600);
> 
> Now, consider the case where the mmap() is named (that is, against a
> file).  I believe this is the usecase for Oracle's SGA.
> My (simplified) code for msharing a named file ("SGA") using your
> proposed API (does not matter if the mapping is PRIVATE or SHARED);
>      fd = open("SGA", RDWR);
>      addr = mmap(RDWR, ..., fd);
>      mshare("SGA-region", addr, len, CREAT|RDWR|EXCL, 0600);
> 
> If the permissions (usr/grp+perms+ACL) between the "SGA" file and the
> "SGA-region" msharefs are different, then it is very likely a serious
> security issue.
> That is, a user who could not open(2) the "SGA" file might be able to
> open the "SGA-region" msharefs file, and so gain at least read
> permission on the file.

If an app creates an mshare'd region and gives it wider access permissions than those on the objects it has mapped in, or 
maps in in the future, I would read that as the app intending to do so. An mshare'd region can cover any mapped object 
besides files. It could be anonymous memory holding data supplied by an application user, or data captured over the 
network. How would one validate intended permissions in those cases? A uniform check is to ensure that access given to 
the mshare'd region is compliant with the permissions on the region itself, as given by the file under msharefs. Those 
permission checks are not implemented yet in this initial prototype but are definitely planned as continuing work.
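
As an illustration, an app that wants the region's permissions to track a backing file can set that up explicitly 
(sketch only; it assumes the file under /sys/fs/mshare honours fstat()/fchown()/fchmod() like a regular file, as 
mentioned earlier in this thread, and it reuses the SGA/SGA-region names from your example):

        /* Copy the backing file's owner and mode onto the msharefs file so
         * the region is no more accessible than the file it maps.
         */
        struct stat st;
        int sga_fd = open("SGA", O_RDWR);
        int region_fd = open("/sys/fs/mshare/SGA-region", O_RDWR);

        if (sga_fd < 0 || region_fd < 0 || fstat(sga_fd, &st) < 0 ||
            fchown(region_fd, st.st_uid, st.st_gid) < 0 ||
            fchmod(region_fd, st.st_mode & 07777) < 0)
                perror("aligning msharefs permissions with backing file");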

> 
> This is why I was proposing a file descriptor, so the msharefs file
> could be set to have the same permissions as the backing file it is
> exporting (but I got this wrong).
> This would still leave a window between the msharefs file being
> created and the permissions being set, where a rogue process could
> attach to a region when they should not have the permission (this
> could be closed by failing a non-creating mshare() if the region is of
> zero len - nothing yet shared - until permission are set and the
> region shared).
> But relying on userspace to always set the correct permissions on the
> msharefs file is dangerous - likely to get it wrong on occasion - and
> isn't sufficient.  The msharefs API needs to be bullet proof.
> 
> Looking at the patches, I cannot see where extra validation is being
> done for a named mapping to ensure any 'importer' has the necessary
> permission against the backing file.
> The 'struct file' (->vm_file, and associated inode) in the VMA is
> sufficient to perform required access checks against the file's perms
> - the posted patches do not check this (but they are for an RFC, so
> don't expect all cases to be handled).  But what about a full path
> permission check?  That is, the 'importer' has necessary permissions
> on the backing file, but would not be able to find this file due to
> directory permissions?  msharefs would bypass the directory checks.
>  >
>>>
>>> The 'importer' would use the address/size of the memory region as
>>> exported (and stored in msharefs), so no need for /sys file (except
>>> for human readable info).
>>
>> I think we still need /sys/fs/msharefs files, right? Since you said msharefs stores information about address and size,
>> I assume you are not proposing eliminating msharefs.
> 
> The 'exporter' of the mshared region specifies the address and length,
> and is therefore known by the mshare code.
> An 'importer' needs to only pass NULL/0 for addr/len and is told by
> mshare where the region has been attached in its address-space.  With
> this, the /sys file is no longer part of the API.

This would require every importer to map the entire mshare'd range. Should we allow mapping a subset of the region, in 
which case the importer needs to supply a starting address and length? It is probably simpler to only allow mapping the 
entire region, but if there is a use case for partial mapping, designing that flexibility in is useful.
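
A partial attach, if we were to allow it, could look like this from the importer's side (sketch only, not implemented in 
the RFC; it assumes the region info is read from the file under /sys/fs/mshare as in the cover letter example, and that 
any sub-range would still have to be pgdir aligned):

        unsigned long mshare_info[2];
        int fd = open("/sys/fs/mshare/testregion", O_RDONLY);

        read(fd, mshare_info, sizeof(mshare_info));
        close(fd);

        /* Attach only the first half of the exported range. */
        err = syscall(MSHARE_SYSCALL, "testregion",
                      (void *)mshare_info[0], mshare_info[1] / 2,
                      O_RDWR, 0600);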

Thanks,
Khalid

> 
> 
>>>
>>> If the set-up operations are split in two, then would it make sense to
>>> also split the teardown as well?  Say, mshare_op(DROP, fd) and
>>> mshare_unlink(fd)?
>>
>> A single op is simpler. Every process can call mshare_unlink() and if last reference is dropped, kernel should take care
>> of cleaning up mshare'd region by itself. One of my goals is for mshare to continue to work even if the process that
>> created the mshare region dies. In a database context, such mshare'd regions can live for very long time. As a result I
>> would rather not make any process be responsible for cleaning up the mshare'd region. It should be as simple as the
>> mshare'd region disappearing on its own when all references to it are dropped.
>>
>> Thanks,
>> Khalid
> 
> Cheers,
> Mark
> 
>>
>>>
>>>>
>>>> Does that rationale sound reasonable?
>>>
>>> It doesn't sound unreasonable.  As msharefs is providing a namespace
>>> and perms, it doesn't need much flexibility.  Being able to modifying
>>> the perms post namespace creation (fchown(2)), before exporting the
>>> memory region, might be useful in some cases - but as I don't have any
>>> usecases I'm not claiming it is essential.
>>>
>>>>
>>>> Thanks,
>>>> Khalid
>>>
>>> Cheers,
>>> Mark
>>>>
>>>>>
>>>>> Rather than introduce two new system calls, along with /sys/ file to
>>>>> communicate global addresses, could mshare() be built on top of shmem
>>>>> API?  Thinking of something like the below;
>>>>> 1) For shm_open(3), add a new oflag to indicate the properties needed
>>>>> for mshare() (say, O_SHARED_PTE - better name?)
>>>>> 2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
>>>>> in the sizes which can be set.
>>>>> 3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
>>>>> objects.  On first mmap()ing an appropriate address is assigned,
>>>>> otherwise the current 'global' address is used.
>>>>> 4) shm_unlink(3) destroys the object when last reference is dropped.
>>>>>
>>>>> For 3), might be able to weaken the NULL requirement and validate a
>>>>> given address on first mapping to ensure it is correctly aligned.
>>>>> shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
>>>>> the default behaviour you require.
>>>>>
>>>>> Internally, the handling of mshare()/O_SHARED_PTE memory might be
>>>>> sufficiently different to shmem that there is not much code sharing
>>>>> between the two (I haven't thought this through, but the object
>>>>> naming/refcounting should be similar), but using shmem would be a
>>>>> familiar API.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Cheers,
>>>>> Mark
>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-24 19:45     ` Andy Lutomirski
@ 2022-01-24 22:30       ` Khalid Aziz
  2022-01-24 23:16         ` Andy Lutomirski
  0 siblings, 1 reply; 54+ messages in thread
From: Khalid Aziz @ 2022-01-24 22:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
	surenb, linux-kernel, linux-mm, linux-api

On 1/24/22 12:45, Andy Lutomirski wrote:
> On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/22/22 04:31, Mike Rapoport wrote:
>>> (added linux-api)
>>>
>>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>>>> Page tables in kernel consume some of the memory and as long as
>>>> number of mappings being maintained is small enough, this space
>>>> consumed by page tables is not objectionable. When very few memory
>>>> pages are shared between processes, the number of page table entries
>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>> memory on the system. As the number of shared pages and the number
>>>> of times pages are shared goes up, amount of memory consumed by page
>>>> tables starts to become significant.
>>>>
>>>> Some of the field deployments commonly see memory pages shared
>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>> is only 8 bytes long which is very small compared to the 4K page
>>>> size. When 2000 processes map the same page in their address space,
>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>> database server with 300GB SGA, a system crash was seen with
>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>> even though the system had 512GB of memory. On this server, in the
>>>> worst case scenario of all 1500 processes mapping every page from
>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>> could be shared, amount of memory saved is very significant.
>>>>
>>>> This is a proposal to implement a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>> a new system call - mshare(), which can be used by a process to
>>>> create a region (we will call it mshare'd region) which can be used
>>>> by other processes to map same pages using shared PTEs. Other
>>>> process(es), assuming they have the right permissions, can then make
>>>> the mshare() system call to map the shared pages into their address
>>>> space using the shared PTEs.  When a process is done using this
>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>> access. When the last process accessing mshare'd region calls
>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>> it is freed.
>>>>
>>>>
>>>> API Proposal
>>>> ============
>>>>
>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>
>>>> --
>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>
>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>> region that will be shared at PTE level. "name" refers to shared object
>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>> of this shared memory area and length is the size of this area.
>>>> oflags can be one of:
>>>>
>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>> - O_RDWR opens shared memory area for read and write access
>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>     exists with that name, return an error.
>>>>
>>>> mode represents the creation mode for the shared object under
>>>> /sys/fs/mshare.
>>>>
>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>
>>> Did you consider returning a file descriptor from mshare() system call?
>>> Then there would be no need in mshare_unlink() as close(fd) would work.
>>
>> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
>> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
>> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
>> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
>> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
>> with close() semantics unless there was another way to force the refcount to 0. Right?
>>
> 
> I'm not sure I understand the problem.  If you're sharing a portion of
> an mm and the mm goes away, then all that should be left are some
> struct files that are no longer useful.  They'll go away when their
> refcount goes to zero.
> 
> --Andy
> 

The mm that holds the shared PTEs is a separate mm not tied to a task. I started out by sharing a portion of the donor 
process's mm initially, but that necessitated keeping the donor process alive. If the donor process dies, its mm and the 
mshare'd portion go away.

One of the requirements I have is that the process that creates the mshare'd region can terminate, possibly 
involuntarily, while the mshare'd region persists and the rest of the consumer processes continue without a hiccup. So I 
create a separate mm to hold the shared PTEs, and that mm is cleaned up when all references to the mshare'd region go 
away. Each call to mshare() increments the refcount and each call to mshare_unlink() decrements it.
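
From a consumer's point of view that is just a pair of attach/detach calls against the standalone mm (sketch; addr/len 
are assumed to come from the msharefs file as in the cover letter example, and do_work() is a placeholder):

        /* Attach: takes a reference on the separate mm holding the PTEs. */
        err = syscall(MSHARE_SYSCALL, "testregion", addr, len, O_RDWR, 0600);
        if (err == 0) {
                do_work(addr, len);     /* use the shared mappings */

                /* Detach: drops one reference.  The shared mm is only torn
                 * down when the last reference goes away, so the creator
                 * can exit at any time without affecting other consumers.
                 */
                syscall(MSHARE_UNLINK_SYSCALL, "testregion");
        }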

--
Khalid

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-24 22:30       ` Khalid Aziz
@ 2022-01-24 23:16         ` Andy Lutomirski
  2022-01-24 23:44           ` Khalid Aziz
  0 siblings, 1 reply; 54+ messages in thread
From: Andy Lutomirski @ 2022-01-24 23:16 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
	surenb, linux-kernel, linux-mm, linux-api

On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/24/22 12:45, Andy Lutomirski wrote:
> > On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> On 1/22/22 04:31, Mike Rapoport wrote:
> >>> (added linux-api)
> >>>
> >>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> >>>> Page tables in kernel consume some of the memory and as long as
> >>>> number of mappings being maintained is small enough, this space
> >>>> consumed by page tables is not objectionable. When very few memory
> >>>> pages are shared between processes, the number of page table entries
> >>>> (PTEs) to maintain is mostly constrained by the number of pages of
> >>>> memory on the system. As the number of shared pages and the number
> >>>> of times pages are shared goes up, amount of memory consumed by page
> >>>> tables starts to become significant.
> >>>>
> >>>> Some of the field deployments commonly see memory pages shared
> >>>> across 1000s of processes. On x86_64, each page requires a PTE that
> >>>> is only 8 bytes long which is very small compared to the 4K page
> >>>> size. When 2000 processes map the same page in their address space,
> >>>> each one of them requires 8 bytes for its PTE and together that adds
> >>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
> >>>> database server with 300GB SGA, a system crash was seen with
> >>>> out-of-memory condition when 1500+ clients tried to share this SGA
> >>>> even though the system had 512GB of memory. On this server, in the
> >>>> worst case scenario of all 1500 processes mapping every page from
> >>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
> >>>> could be shared, amount of memory saved is very significant.
> >>>>
> >>>> This is a proposal to implement a mechanism in kernel to allow
> >>>> userspace processes to opt into sharing PTEs. The proposal is to add
> >>>> a new system call - mshare(), which can be used by a process to
> >>>> create a region (we will call it mshare'd region) which can be used
> >>>> by other processes to map same pages using shared PTEs. Other
> >>>> process(es), assuming they have the right permissions, can then make
> >>>> the mshare() system call to map the shared pages into their address
> >>>> space using the shared PTEs.  When a process is done using this
> >>>> mshare'd region, it makes a mshare_unlink() system call to end its
> >>>> access. When the last process accessing mshare'd region calls
> >>>> mshare_unlink(), the mshare'd region is torn down and memory used by
> >>>> it is freed.
> >>>>
> >>>>
> >>>> API Proposal
> >>>> ============
> >>>>
> >>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>>>
> >>>> --
> >>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>>>
> >>>> mshare() creates and opens a new, or opens an existing mshare'd
> >>>> region that will be shared at PTE level. "name" refers to shared object
> >>>> name that exists under /sys/fs/mshare. "addr" is the starting address
> >>>> of this shared memory area and length is the size of this area.
> >>>> oflags can be one of:
> >>>>
> >>>> - O_RDONLY opens shared memory area for read only access by everyone
> >>>> - O_RDWR opens shared memory area for read and write access
> >>>> - O_CREAT creates the named shared memory area if it does not exist
> >>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>>>     exists with that name, return an error.
> >>>>
> >>>> mode represents the creation mode for the shared object under
> >>>> /sys/fs/mshare.
> >>>>
> >>>> mshare() returns an error code if it fails, otherwise it returns 0.
> >>>
> >>> Did you consider returning a file descriptor from mshare() system call?
> >>> Then there would be no need in mshare_unlink() as close(fd) would work.
> >>
> >> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
> >> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
> >> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
> >> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
> >> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
> >> with close() semantics unless there was another way to force the refcount to 0. Right?
> >>
> >
> > I'm not sure I understand the problem.  If you're sharing a portion of
> > an mm and the mm goes away, then all that should be left are some
> > struct files that are no longer useful.  They'll go away when their
> > refcount goes to zero.
> >
> > --Andy
> >
>
> The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process
> initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd
> portion go away.
>
> One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but
> the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to
> hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare()
> increments the refcount and each call to mshare_unlink() decrements the refcount.

In general, objects which are kept alive by name tend to be quite
awkward. Things like network namespaces essentially have to work that
way and end up with awkward APIs.  Things like shared memory don't
actually have to be kept alive by name, and the cases that do keep
them alive by name (tmpfs, shmget) can end up being so awkward that
people invent nameless variants like memfd.

So I would strongly suggest you see how the design works out with no
names and no external keep-alive mechanism.  Either have the continued
existence of *any* fd keep the whole thing alive or make it be a pair
of struct files, one that controls the region (and can map things into
it, etc) and one that can map it.  SCM_RIGHTS is pretty good for
passing objects like this around.
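
Passing such an fd to another process is then just the usual SCM_RIGHTS plumbing (sketch; 'sock' is an
already-connected AF_UNIX socket and 'mshare_fd' is the hypothetical region fd):

        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr align;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &mshare_fd, sizeof(int));

        if (sendmsg(sock, &msg, 0) < 0)
                perror("sendmsg(SCM_RIGHTS)");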

--Andy

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-24 23:16         ` Andy Lutomirski
@ 2022-01-24 23:44           ` Khalid Aziz
  0 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-24 23:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
	surenb, linux-kernel, linux-mm, linux-api

On 1/24/22 16:16, Andy Lutomirski wrote:
> On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/24/22 12:45, Andy Lutomirski wrote:
>>> On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> On 1/22/22 04:31, Mike Rapoport wrote:
>>>>> (added linux-api)
>>>>>
>>>>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>>>>>> Page tables in kernel consume some of the memory and as long as
>>>>>> number of mappings being maintained is small enough, this space
>>>>>> consumed by page tables is not objectionable. When very few memory
>>>>>> pages are shared between processes, the number of page table entries
>>>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>>>> memory on the system. As the number of shared pages and the number
>>>>>> of times pages are shared goes up, amount of memory consumed by page
>>>>>> tables starts to become significant.
>>>>>>
>>>>>> Some of the field deployments commonly see memory pages shared
>>>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>>>> is only 8 bytes long which is very small compared to the 4K page
>>>>>> size. When 2000 processes map the same page in their address space,
>>>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>>>> database server with 300GB SGA, a system crash was seen with
>>>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>>>> even though the system had 512GB of memory. On this server, in the
>>>>>> worst case scenario of all 1500 processes mapping every page from
>>>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>>>> could be shared, amount of memory saved is very significant.
>>>>>>
>>>>>> This is a proposal to implement a mechanism in kernel to allow
>>>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>>>> a new system call - mshare(), which can be used by a process to
>>>>>> create a region (we will call it mshare'd region) which can be used
>>>>>> by other processes to map same pages using shared PTEs. Other
>>>>>> process(es), assuming they have the right permissions, can then make
>>>>>> the mshare() system call to map the shared pages into their address
>>>>>> space using the shared PTEs.  When a process is done using this
>>>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>>>> access. When the last process accessing mshare'd region calls
>>>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>>>> it is freed.
>>>>>>
>>>>>>
>>>>>> API Proposal
>>>>>> ============
>>>>>>
>>>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>>>
>>>>>> --
>>>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>>>
>>>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>>>> region that will be shared at PTE level. "name" refers to shared object
>>>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>>>> of this shared memory area and length is the size of this area.
>>>>>> oflags can be one of:
>>>>>>
>>>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>>>> - O_RDWR opens shared memory area for read and write access
>>>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>>>      exists with that name, return an error.
>>>>>>
>>>>>> mode represents the creation mode for the shared object under
>>>>>> /sys/fs/mshare.
>>>>>>
>>>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>>>
>>>>> Did you consider returning a file descriptor from mshare() system call?
>>>>> Then there would be no need in mshare_unlink() as close(fd) would work.
>>>>
>>>> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
>>>> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
>>>> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
>>>> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
>>>> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
>>>> with close() semantics unless there was another way to force the refcount to 0. Right?
>>>>
>>>
>>> I'm not sure I understand the problem.  If you're sharing a portion of
>>> an mm and the mm goes away, then all that should be left are some
>>> struct files that are no longer useful.  They'll go away when their
>>> refcount goes to zero.
>>>
>>> --Andy
>>>
>>
>> The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process
>> initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd
>> portion go away.
>>
>> One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but
>> the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to
>> hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare()
>> increments the refcount and each call to mshare_unlink() decrements the refcount.
> 
> In general, objects which are kept alive by name tend to be quite
> awkward. Things like network namespaces essentially have to work that
> way and end up with awkward APIs.  Things like shared memory don't
> actually have to be kept alive by name, and the cases that do keep
> them alive by name (tmpfs, shmget) can end up being so awkward that
> people invent nameless variants like memfd.
> 
> So I would strongly suggest you see how the design works out with no
> names and no external keep-alive mechanism.  Either have the continued
> existence of *any* fd keep the whole thing alive or make it be a pair
> of struct files, one that controls the region (and can map things into
> it, etc) and one that can map it.  SCM_RIGHTS is pretty good for
> passing objects like this around.
> 
> --Andy
> 

These are certainly good ideas to simplify this feature. My very first implementation of mshare did not have msharefs; 
it was fd based, the fd could be passed to any other process using SCM_RIGHTS, and it required the process creating the 
mshare region to stay alive for the region to exist. That certainly made life simpler in terms of implementation. 
Feedback from my customers for this feature (DB developers and people deploying DB systems) was that this imposes a hard 
dependency on the server process that creates the mshare'd region and passes the fd to other processes needing access to 
it. This dependency creates a weak link in system reliability that is too risky: if the server process dies for any 
reason, the entire system becomes unavailable. They requested a more robust implementation that they can depend upon. I 
then went through the process of implementing this using shmfs, since POSIX shm has those attributes. That turned out to 
be more kludgy than a clean implementation using a separate in-memory msharefs, which brought me to the RFC 
implementation I sent out.

I do agree with you that a name-based persistent object makes the implementation more complex (maintaining a separate mm 
not tied to a process requires quite a bit of work to keep things consistent and to clean the mm up properly as users of 
this shared mm terminate), but I see the reliability point of view. Does that logic resonate with you?

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
                   ` (10 preceding siblings ...)
  2022-01-22 11:31 ` Mike Rapoport
@ 2022-01-25 11:42 ` Kirill A. Shutemov
  2022-01-25 12:09   ` William Kucharski
  2022-01-25 13:23   ` Matthew Wilcox
  11 siblings, 2 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-01-25 11:42 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, longpeng2, arnd, dave.hansen, david, rppt, surenb,
	linux-kernel, linux-mm

On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> Example Code
> ============
> 
> Snippet of the code that a donor process would run looks like below:
> 
> -----------------
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
> 
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> 			GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
> 
>         strncpy(addr, "Some random shared text",
> 			sizeof("Some random shared text"));
> -----------------
> 
> Snippet of code that a consumer process would execute looks like:
> 
> -----------------
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
> 
>         if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>                 printf("INFO: %ld bytes shared at addr %lx \n",
> 				mshare_info[1], mshare_info[0]);
>         else
>                 perror("read failed");
> 
>         close(fd);
> 
>         addr = (char *)mshare_info[0];
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
> 			mshare_info[1], O_RDWR, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
> 
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
> 	printf("\nDone\n");
> 
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------

I wonder if we can get away with zero-API here: we can transparently
create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
size and alignment are suitable. Page tables would be linked to the inode
and freed when the last such mapping goes away. I don't see a need for new
syscalls or flags to existing ones.
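
From userspace it would then be nothing more than a plain mapping
(illustration of the idea, not existing behaviour; the file name and the
GB() helper are made up):

        int fd = open("/path/to/big-data-file", O_RDWR);
        void *addr = mmap(NULL, GB(512), PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("mmap");

        /* The kernel would notice a suitably sized and aligned MAP_SHARED
         * mapping of this inode and share the page tables transparently.
         */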

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 11:42 ` Kirill A. Shutemov
@ 2022-01-25 12:09   ` William Kucharski
  2022-01-25 13:18     ` David Hildenbrand
  2022-01-25 13:23   ` Matthew Wilcox
  1 sibling, 1 reply; 54+ messages in thread
From: William Kucharski @ 2022-01-25 12:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Khalid Aziz, akpm, willy, longpeng2, arnd, dave.hansen, david,
	rppt, surenb, linux-kernel, linux-mm

I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so.

If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action.

> On Jan 25, 2022, at 04:41, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> Example Code
>> ============
>> 
>> Snippet of the code that a donor process would run looks like below:
>> 
>> -----------------
>>        addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>                        MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>        if (addr == MAP_FAILED)
>>                perror("ERROR: mmap failed");
>> 
>>        err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>            GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>        if (err < 0) {
>>                perror("mshare() syscall failed");
>>                exit(1);
>>        }
>> 
>>        strncpy(addr, "Some random shared text",
>>            sizeof("Some random shared text"));
>> -----------------
>> 
>> Snippet of code that a consumer process would execute looks like:
>> 
>> -----------------
>>        fd = open("testregion", O_RDONLY);
>>        if (fd < 0) {
>>                perror("open failed");
>>                exit(1);
>>        }
>> 
>>        if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0))
>>                printf("INFO: %ld bytes shared at addr %lx \n",
>>                mshare_info[1], mshare_info[0]);
>>        else
>>                perror("read failed");
>> 
>>        close(fd);
>> 
>>        addr = (char *)mshare_info[0];
>>        err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>>            mshare_info[1], O_RDWR, 600);
>>        if (err < 0) {
>>                perror("mshare() syscall failed");
>>                exit(1);
>>        }
>> 
>>        printf("Guest mmap at %px:\n", addr);
>>        printf("%s\n", addr);
>>    printf("\nDone\n");
>> 
>>        err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>        if (err < 0) {
>>                perror("mshare_unlink() failed");
>>                exit(1);
>>        }
>> -----------------
> 
> I wonder if we can get away with zero-API here: we can transparently
> create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> size and alignment is suitable. Page tables will be linked to the inode
> and will be freed when the last of such mapping will go away. I don't see
> a need in new syscalls or flags to existing one.
> 
> -- 
> Kirill A. Shutemov
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 12:09   ` William Kucharski
@ 2022-01-25 13:18     ` David Hildenbrand
  2022-01-25 14:01       ` Kirill A. Shutemov
  0 siblings, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2022-01-25 13:18 UTC (permalink / raw)
  To: William Kucharski, Kirill A. Shutemov
  Cc: Khalid Aziz, akpm, willy, longpeng2, arnd, dave.hansen, rppt,
	surenb, linux-kernel, linux-mm

On 25.01.22 13:09, William Kucharski wrote:
> I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so.
> 
> If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action.
> 

I don't have time to look into details right now, but I see various
possible hard-to-handle issues with sharing anon pages via this
mechanism between processes. If we could restrict it to MAP_SHARED and
have some magic toggle to opt in, that would be great. Features like
uffd, which we might soon see on some MAP_SHARED pages, will require not
sharing page tables automatically, I assume.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 11:42 ` Kirill A. Shutemov
  2022-01-25 12:09   ` William Kucharski
@ 2022-01-25 13:23   ` Matthew Wilcox
  2022-01-25 13:59     ` Kirill A. Shutemov
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-25 13:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote:
> I wonder if we can get away with zero-API here: we can transparently
> create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> size and alignment is suitable. Page tables will be linked to the inode
> and will be freed when the last of such mapping will go away. I don't see
> a need in new syscalls or flags to existing one.

That's how HugeTLBfs works today, right?  Would you want that mechanism
hoisted into the real MM?  Because my plan was the opposite -- remove it
from the shadow MM once mshare() is established.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 13:23   ` Matthew Wilcox
@ 2022-01-25 13:59     ` Kirill A. Shutemov
  2022-01-25 14:09       ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-01-25 13:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Tue, Jan 25, 2022 at 01:23:21PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote:
> > I wonder if we can get away with zero-API here: we can transparently
> > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> > size and alignment is suitable. Page tables will be linked to the inode
> > and will be freed when the last of such mapping will go away. I don't see
> > a need in new syscalls or flags to existing one.
> 
> That's how HugeTLBfs works today, right?  Would you want that mechanism
> hoisted into the real MM?  Because my plan was the opposite -- remove it
> from the shadow MM once mshare() is established.

I hate HugeTLBfs because it is a special place with its own rules. mshare() as
proposed creates a new special place. I don't like this.

It's better to find a way to integrate the feature natively into core-mm
and let as many users as possible benefit from it.

I think zero-API approach (plus madvise() hints to tweak it) is worth
considering.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 13:18     ` David Hildenbrand
@ 2022-01-25 14:01       ` Kirill A. Shutemov
  0 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-01-25 14:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: William Kucharski, Khalid Aziz, akpm, willy, longpeng2, arnd,
	dave.hansen, rppt, surenb, linux-kernel, linux-mm

On Tue, Jan 25, 2022 at 02:18:57PM +0100, David Hildenbrand wrote:
> On 25.01.22 13:09, William Kucharski wrote:
> > I would think this should be the case; certainly it seems to be a more effective approach than having to manually enable sharing via the API in every case or via changes to ld.so.
> > 
> > If anything it might be useful to have an API for shutting it off, though there are already multiple areas where the system shares resources in ways that cannot be shut off by user action.
> > 
> 
> I don't have time to look into details right now, but I see various
> possible hard-to-handle issues with sharing anon pages via this
> mechanism between processes.

Right. We should not break invariant that an anonymous page can only be
mapped into a process once. Otherwise may need to deal with new class of
security issues.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 13:59     ` Kirill A. Shutemov
@ 2022-01-25 14:09       ` Matthew Wilcox
  2022-01-25 18:57         ` Kirill A. Shutemov
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-25 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Tue, Jan 25, 2022 at 04:59:17PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 25, 2022 at 01:23:21PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 02:42:12PM +0300, Kirill A. Shutemov wrote:
> > > I wonder if we can get away with zero-API here: we can transparently
> > > create/use shared page tables for any inode on mmap(MAP_SHARED) as long as
> > > size and alignment is suitable. Page tables will be linked to the inode
> > > and will be freed when the last of such mapping will go away. I don't see
> > > a need in new syscalls or flags to existing one.
> > 
> > That's how HugeTLBfs works today, right?  Would you want that mechanism
> > hoisted into the real MM?  Because my plan was the opposite -- remove it
> > from the shadow MM once mshare() is established.
> 
> I hate HugeTLBfs because it is a special place with its own rules. mshare() as
> proposed creates a new special place. I don't like this.

No new special place.  I suppose the only thing it creates that's "new"
is an MM without any threads of its own.  And from the MM point of view,
that's not a new thing at all because the MM simply doesn't care how
many threads share an MM.

> It's better to find a way to integrate the feature natively into core-mm
> and let as many users as possible benefit from it.

That's what mshare is trying to do!

> I think zero-API approach (plus madvise() hints to tweak it) is worth
> considering.

I think the zero-API approach actually misses out on a lot of
possibilities that the mshare() approach offers.  For example, mshare()
allows you to mmap() many small files in the shared region -- you
can't do that with zeroAPI.
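
Something along these lines (sketch only; the file names, offsets and the
TB()/MB() size helpers are made up):

        /* The donor has already established the mshare'd range at TB(2);
         * it keeps mapping small files into it, and every sharer sees them
         * through the same shared PTEs.
         */
        char *base = (char *)TB(2);
        int fd1 = open("lookup-table.dat", O_RDWR);
        int fd2 = open("config.dat", O_RDWR);

        mmap(base, MB(2), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd1, 0);
        mmap(base + MB(2), MB(2), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd2, 0);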

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 14:09       ` Matthew Wilcox
@ 2022-01-25 18:57         ` Kirill A. Shutemov
  2022-01-25 18:59           ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-01-25 18:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > considering.
> 
> I think the zero-API approach actually misses out on a lot of
> possibilities that the mshare() approach offers.  For example, mshare()
> allows you to mmap() many small files in the shared region -- you
> can't do that with zeroAPI.

Do you consider a use-case for many small files to be common? I would
think that the main consumer of the feature would be mmap of huge files.
And in that case, zero enabling burden on the userspace side sounds like a
sweet deal.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 18:57         ` Kirill A. Shutemov
@ 2022-01-25 18:59           ` Matthew Wilcox
  2022-01-26  4:04             ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-25 18:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > > considering.
> > 
> > I think the zero-API approach actually misses out on a lot of
> > possibilities that the mshare() approach offers.  For example, mshare()
> > allows you to mmap() many small files in the shared region -- you
> > can't do that with zeroAPI.
> 
> Do you consider a use-case for many small files to be common? I would
> think that the main consumer of the feature would be mmap of huge files.
> And in this case zero enabling burden on userspace side sounds like a
> sweet deal.

mmap() of huge files is certainly the Oracle use-case.  With occasional
funny business like mprotect() of a single page in the middle of a 1GB
hugepage.

The approach of designating ranges of a process's address space as
sharable with other processes felt like the cleaner & frankly more
interesting approach that opens up use-cases other than "hurr, durr, we
are Oracle, we like big files, kernel get out of way now, transactions
to perform".

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-25 18:59           ` Matthew Wilcox
@ 2022-01-26  4:04             ` Matthew Wilcox
  2022-01-26 10:16               ` David Hildenbrand
  2022-01-26 13:42               ` Kirill A. Shutemov
  0 siblings, 2 replies; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-26  4:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > > > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > > > considering.
> > > 
> > > I think the zero-API approach actually misses out on a lot of
> > > possibilities that the mshare() approach offers.  For example, mshare()
> > > allows you to mmap() many small files in the shared region -- you
> > > can't do that with zeroAPI.
> > 
> > Do you consider a use-case for many small files to be common? I would
> > think that the main consumer of the feature would be mmap of huge files.
> > And in this case zero enabling burden on userspace side sounds like a
> > sweet deal.
> 
> mmap() of huge files is certainly the Oracle use-case.  With occasional
> funny business like mprotect() of a single page in the middle of a 1GB
> hugepage.

Bill and I were talking about this earlier and realised that this is
the key point.  There's a requirement that when one process mprotects
a page, it gets protected in all processes.  You can't do that
without *some* API because that's different behaviour than any existing
API would produce.
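
In other words, the behaviour we need is (sketch; addr_in_region stands for
any page-aligned address inside the shared range):

        /* One sharer changes the protection of a page inside the shared
         * range and, because the PTEs are shared, every attached process
         * sees the change.  A plain mprotect() today only affects the
         * caller, which is why *some* opt-in API is needed.
         */
        if (mprotect(addr_in_region, 4096, PROT_READ) < 0)
                perror("mprotect");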

So how about something like this ...

int mcreate(const char *name, int flags, mode_t mode);

creates a new mm_struct with a refcount of 2.  returns an fd (one
of the two refcounts) and creates a name for it (inside msharefs,
holds the other refcount).

You can then mmap() that fd to attach it to a chunk of your address
space.  Once attached, you can start to populate it by calling
mmap() and specifying an address inside the attached mm as the first
argument to mmap().

Maybe mcreate() is just a library call, and it's really a thin wrapper
around open() that happens to know where msharefs is mounted.
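
Usage would look roughly like this (sketch of the idea above; mcreate(),
the region name and the sizes are all hypothetical):

        int mfd = mcreate("SGA-region", O_CREAT | O_RDWR, 0600);

        /* Attach the shared mm to a 512GB chunk of our address space. */
        char *base = mmap(NULL, GB(512), PROT_READ | PROT_WRITE,
                          MAP_SHARED, mfd, 0);

        /* Populate it: pass an address inside the attached chunk. */
        int fd = open("SGA", O_RDWR);
        mmap(base, GB(300), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);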

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26  4:04             ` Matthew Wilcox
@ 2022-01-26 10:16               ` David Hildenbrand
  2022-01-26 13:38                 ` Matthew Wilcox
  2022-01-26 13:42               ` Kirill A. Shutemov
  1 sibling, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2022-01-26 10:16 UTC (permalink / raw)
  To: Matthew Wilcox, Kirill A. Shutemov
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, rppt, surenb,
	linux-kernel, linux-mm, Peter Xu

On 26.01.22 05:04, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
>> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
>>> On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
>>>>> I think zero-API approach (plus madvise() hints to tweak it) is worth
>>>>> considering.
>>>>
>>>> I think the zero-API approach actually misses out on a lot of
>>>> possibilities that the mshare() approach offers.  For example, mshare()
>>>> allows you to mmap() many small files in the shared region -- you
>>>> can't do that with zeroAPI.
>>>
>>> Do you consider a use-case for many small files to be common? I would
>>> think that the main consumer of the feature to be mmap of huge files.
>>> And in this case zero enabling burden on userspace side sounds like a
>>> sweet deal.
>>
>> mmap() of huge files is certainly the Oracle use-case.  With occasional
>> funny business like mprotect() of a single page in the middle of a 1GB
>> hugepage.
> 
> Bill and I were talking about this earlier and realised that this is
> the key point.  There's a requirement that when one process mprotects
> a page that it gets protected in all processes.  You can't do that
> without *some* API because that's different behaviour than any existing
> API would produce.

A while ago I talked with Peter about an extended uffd (here: WP)
mechanism that would work on fds instead of the process address space.

The rough idea would be to register the uffd (or however that would be
called) handler on an fd instead of a virtual address space of a single
process and write-protect pages in that fd. Once anybody would try
writing to such a protected range (write, mmap, ...), the uffd handler
would fire and user space could handle the event (-> unprotect). The
page cache would have to remember the uffd information ("wp using
uffd"). When (un)protecting pages using this mechanism, all page tables
mapping the page would have to be updated accordingly using the rmap. At
that point, we wouldn't care if it's a single page table (e.g., shared
similar to hugetlb) or simply multiple page tables.

It's a completely rough idea, I just wanted to mention it.
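
For contrast, the existing uffd-wp flow is tied to one process's virtual
address range rather than to an fd. Roughly (anonymous memory and a
kernel with uffd-wp support assumed, error handling trimmed):

#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int wp_range(void *addr, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
		return -1;
	/* Both the registration and the protection apply to this mm only. */
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;
	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

The fd-based idea above would instead hang that state off the file /
page cache, so a single (un)protect would be visible through every
mapping of the file.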

-- 
Thanks,

David / dhildenb



* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26 10:16               ` David Hildenbrand
@ 2022-01-26 13:38                 ` Matthew Wilcox
  2022-01-26 13:55                   ` David Hildenbrand
  2022-01-26 14:12                   ` Mike Rapoport
  0 siblings, 2 replies; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-26 13:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Khalid Aziz, akpm, longpeng2, arnd,
	dave.hansen, rppt, surenb, linux-kernel, linux-mm, Peter Xu

On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
> A while ago I talked with Peter about an extended uffd (here: WP)
> mechanism that would work on fds instead of the process address space.

As far as I can tell, uffd is a grotesque hack that exists to work around
the poor choice to use anonymous memory instead of file-backed memory
in kvm.  Every time I see somebody mention it, I feel pain.


* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26  4:04             ` Matthew Wilcox
  2022-01-26 10:16               ` David Hildenbrand
@ 2022-01-26 13:42               ` Kirill A. Shutemov
  2022-01-26 14:18                 ` Mike Rapoport
  1 sibling, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-01-26 13:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, akpm, longpeng2, arnd, dave.hansen, david, rppt,
	surenb, linux-kernel, linux-mm

On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Jan 25, 2022 at 02:09:47PM +0000, Matthew Wilcox wrote:
> > > > > I think zero-API approach (plus madvise() hints to tweak it) is worth
> > > > > considering.
> > > > 
> > > > I think the zero-API approach actually misses out on a lot of
> > > > possibilities that the mshare() approach offers.  For example, mshare()
> > > > allows you to mmap() many small files in the shared region -- you
> > > > can't do that with zeroAPI.
> > > 
> > > Do you consider a use-case for many small files to be common? I would
> > > think that the main consumer of the feature to be mmap of huge files.
> > > And in this case zero enabling burden on userspace side sounds like a
> > > sweet deal.
> > 
> > mmap() of huge files is certainly the Oracle use-case.  With occasional
> > funny business like mprotect() of a single page in the middle of a 1GB
> > hugepage.
> 
> Bill and I were talking about this earlier and realised that this is
> the key point.  There's a requirement that when one process mprotects
> a page that it gets protected in all processes.  You can't do that
> without *some* API because that's different behaviour than any existing
> API would produce.

"hurr, durr, we are Oracle" :P

Sounds like a very niche requirement. I doubt there will be more than a
single-digit user count for the feature. Maybe only the DB.

> So how about something like this ...
> 
> int mcreate(const char *name, int flags, mode_t mode);
> 
> creates a new mm_struct with a refcount of 2.  returns an fd (one
> of the two refcounts) and creates a name for it (inside msharefs,
> holds the other refcount).
> 
> You can then mmap() that fd to attach it to a chunk of your address
> space.  Once attached, you can start to populate it by calling
> mmap() and specifying an address inside the attached mm as the first
> argument to mmap().

That is not what mmap() would normally do to an existing mapping. So it
requires special treatment.

In general, mmap() of a mm_struct scares me. I can't wrap my head around
the implications.

Like how does it work on fork()?

How does accounting work? What happens on OOM?

What prevents creating loops, like mapping a mm_struct inside itself?

What do mremap()/munmap() do to such a mapping? Will they affect the
mapping of the mm_struct or will they target mappings inside the mm_struct?

Maybe it just didn't click for me, I dunno.

> Maybe mcreate() is just a library call, and it's really a thin wrapper
> around open() that happens to know where msharefs is mounted.

-- 
 Kirill A. Shutemov


* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26 13:38                 ` Matthew Wilcox
@ 2022-01-26 13:55                   ` David Hildenbrand
  2022-01-26 14:12                     ` Matthew Wilcox
  2022-01-26 14:12                   ` Mike Rapoport
  1 sibling, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2022-01-26 13:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Khalid Aziz, akpm, longpeng2, arnd,
	dave.hansen, rppt, surenb, linux-kernel, linux-mm, Peter Xu

On 26.01.22 14:38, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
>> A while ago I talked with Peter about an extended uffd (here: WP)
>> mechanism that would work on fds instead of the process address space.
> 
> As far as I can tell, uffd is a grotesque hack that exists to work around
> the poor choice to use anonymous memory instead of file-backed memory
> in kvm.  Every time I see somebody mention it, I feel pain.
> 

I might be missing something important, because KVM can deal with
file-backed memory just fine and uffd is used heavily outside of hypervisors.

I'd love to learn how to handle what ordinary uffd (handle
missing/unpopulated pages) and uffd-wp (handle write access to pages)
can do with files instead. Because if something like that already
exists, it would be precisely what I am talking about.

Maybe mentioning uffd was a bad choice ;)

-- 
Thanks,

David / dhildenb



* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26 13:38                 ` Matthew Wilcox
  2022-01-26 13:55                   ` David Hildenbrand
@ 2022-01-26 14:12                   ` Mike Rapoport
  1 sibling, 0 replies; 54+ messages in thread
From: Mike Rapoport @ 2022-01-26 14:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Kirill A. Shutemov, Khalid Aziz, akpm,
	longpeng2, arnd, dave.hansen, surenb, linux-kernel, linux-mm,
	Peter Xu

On Wed, Jan 26, 2022 at 01:38:49PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
> > A while ago I talked with Peter about an extended uffd (here: WP)
> > mechanism that would work on fds instead of the process address space.
> 
> As far as I can tell, uffd is a grotesque hack that exists to work around
> the poor choice to use anonymous memory instead of file-backed memory
> in kvm.  Every time I see somebody mention it, I feel pain.

How would file-backed memory have helped with the major use-case of uffd,
which is post-copy migration?

-- 
Sincerely yours,
Mike.


* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26 13:55                   ` David Hildenbrand
@ 2022-01-26 14:12                     ` Matthew Wilcox
  2022-01-26 14:30                       ` David Hildenbrand
  0 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-01-26 14:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Khalid Aziz, akpm, longpeng2, arnd,
	dave.hansen, rppt, surenb, linux-kernel, linux-mm, Peter Xu

On Wed, Jan 26, 2022 at 02:55:10PM +0100, David Hildenbrand wrote:
> On 26.01.22 14:38, Matthew Wilcox wrote:
> > On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
> >> A while ago I talked with Peter about an extended uffd (here: WP)
> >> mechanism that would work on fds instead of the process address space.
> > 
> > As far as I can tell, uffd is a grotesque hack that exists to work around
> > the poor choice to use anonymous memory instead of file-backed memory
> > in kvm.  Every time I see somebody mention it, I feel pain.
> > 
> 
> I might be missing something important, because KVM can deal with
> file-backed memory just fine and uffd is used heavily outside of hypervisors.
> 
> I'd love to learn how to handle what ordinary uffd (handle
> missing/unpopulated pages) and uffd-wp (handle write access to pages)
> can do with files instead. Because if something like that already
> exists, it would be precisely what I am talking about.

Every notification that uffd wants already exists as a notification to
the underlying filesystem.  Something like a uffdfs [1] would be able
to do everything that uffd does without adding extra crap all over the MM.

[1] acronyms are bad, mmmkay?


* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26 13:42               ` Kirill A. Shutemov
@ 2022-01-26 14:18                 ` Mike Rapoport
  2022-01-26 17:33                   ` Khalid Aziz
  0 siblings, 1 reply; 54+ messages in thread
From: Mike Rapoport @ 2022-01-26 14:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, Khalid Aziz, akpm, longpeng2, arnd, dave.hansen,
	david, surenb, linux-kernel, linux-mm

On Wed, Jan 26, 2022 at 04:42:47PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
> > > On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
>
> > So how about something like this ...
> > 
> > int mcreate(const char *name, int flags, mode_t mode);
> > 
> > creates a new mm_struct with a refcount of 2.  returns an fd (one
> > of the two refcounts) and creates a name for it (inside msharefs,
> > holds the other refcount).
> > 
> > You can then mmap() that fd to attach it to a chunk of your address
> > space.  Once attached, you can start to populate it by calling
> > mmap() and specifying an address inside the attached mm as the first
> > argument to mmap().
> 
> That is not what mmap() would normally do to an existing mapping. So it
> requires special treatment.
> 
> In general, mmap() of a mm_struct scares me. I can't wrap my head around
> the implications.
> 
> Like how does it work on fork()?
> 
> How does accounting work? What happens on OOM?
> 
> What prevents creating loops, like mapping a mm_struct inside itself?
> 
> What do mremap()/munmap() do to such a mapping? Will they affect the
> mapping of the mm_struct or will they target mappings inside the mm_struct?
> 
> Maybe it just didn't click for me, I dunno.

My understanding was that the new mm_struct would be rather stripped and
will be used more as an abstraction for the shared page table, maybe I'm
totally wrong :)
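
If that reading is right, the object behind the fd would not need much
more than something like this (purely illustrative sketch; none of these
names come from the patches):

/* Hypothetical msharefs object backing the fd and the name. */
#include <linux/mm_types.h>
#include <linux/refcount.h>

struct mshare_region {
	struct mm_struct *mm;		/* holds only the shared page tables */
	unsigned long	  start;	/* where the region sits */
	unsigned long	  size;
	refcount_t	  refcnt;	/* one ref for the fd, one for the
					 * msharefs name, per the proposal */
};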
 
> > Maybe mcreate() is just a library call, and it's really a thin wrapper
> > around open() that happens to know where msharefs is mounted.
> 
> -- 
>  Kirill A. Shutemov

-- 
Sincerely yours,
Mike.


* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26 14:12                     ` Matthew Wilcox
@ 2022-01-26 14:30                       ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2022-01-26 14:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Khalid Aziz, akpm, longpeng2, arnd,
	dave.hansen, rppt, surenb, linux-kernel, linux-mm, Peter Xu

On 26.01.22 15:12, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 02:55:10PM +0100, David Hildenbrand wrote:
>> On 26.01.22 14:38, Matthew Wilcox wrote:
>>> On Wed, Jan 26, 2022 at 11:16:42AM +0100, David Hildenbrand wrote:
>>>> A while ago I talked with Peter about an extended uffd (here: WP)
>>>> mechanism that would work on fds instead of the process address space.
>>>
>>> As far as I can tell, uffd is a grotesque hack that exists to work around
>>> the poor choice to use anonymous memory instead of file-backed memory
>>> in kvm.  Every time I see somebody mention it, I feel pain.
>>>
>>
>> I might be missing something important, because KVM can deal with
>> file-backed memory just fine and uffd is used heavily outside of hypervisors.
>>
>> I'd love to learn how to handle what ordinary uffd (handle
>> missing/unpopulated pages) and uffd-wp (handle write access to pages)
>> can do with files instead. Because if something like that already
>> exists, it would be precisely what I am talking about.
> 
> Every notification that uffd wants already exists as a notification to
> the underlying filesystem.  Something like a uffdfs [1] would be able
> to do everything that uffd does without adding extra crap all over the MM.

I don't speak "filesystem" fluently, but I assume that could be an
overlay over other fs?

Peter is currently upstreaming uffd-wp for shmem. What would that look
like when doing it the fs way?

-- 
Thanks,

David / dhildenb



* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
  2022-01-26 14:18                 ` Mike Rapoport
@ 2022-01-26 17:33                   ` Khalid Aziz
  0 siblings, 0 replies; 54+ messages in thread
From: Khalid Aziz @ 2022-01-26 17:33 UTC (permalink / raw)
  To: Mike Rapoport, Kirill A. Shutemov
  Cc: Matthew Wilcox, akpm, longpeng2, arnd, dave.hansen, david,
	surenb, linux-kernel, linux-mm

On 1/26/22 07:18, Mike Rapoport wrote:
> On Wed, Jan 26, 2022 at 04:42:47PM +0300, Kirill A. Shutemov wrote:
>> On Wed, Jan 26, 2022 at 04:04:48AM +0000, Matthew Wilcox wrote:
>>> On Tue, Jan 25, 2022 at 06:59:50PM +0000, Matthew Wilcox wrote:
>>>> On Tue, Jan 25, 2022 at 09:57:05PM +0300, Kirill A. Shutemov wrote:
>>
>>> So how about something like this ...
>>>
>>> int mcreate(const char *name, int flags, mode_t mode);
>>>
>>> creates a new mm_struct with a refcount of 2.  returns an fd (one
>>> of the two refcounts) and creates a name for it (inside msharefs,
>>> holds the other refcount).
>>>
>>> You can then mmap() that fd to attach it to a chunk of your address
>>> space.  Once attached, you can start to populate it by calling
>>> mmap() and specifying an address inside the attached mm as the first
>>> argument to mmap().
>>
>> That is not what mmap() would normally do to an existing mapping. So it
>> requires special treatment.
>>
>> In general, mmap() of a mm_struct scares me. I can't wrap my head around
>> the implications.
>>
>> Like how does it work on fork()?
>>
>> How does accounting work? What happens on OOM?
>>
>> What prevents creating loops, like mapping a mm_struct inside itself?
>>
>> What do mremap()/munmap() do to such a mapping? Will they affect the
>> mapping of the mm_struct or will they target mappings inside the mm_struct?
>>
>> Maybe it just didn't click for me, I dunno.
> 
> My understanding was that the new mm_struct would be rather stripped and
> will be used more as an abstraction for the shared page table, maybe I'm
> totally wrong :)

Your understanding is correct for the RFC implementation of mshare().
mcreate() is a different beast that I do not fully understand yet. From
Matthew's explanation, it sounds like what he has in mind is that
mcreate() is a frontend to mshare/msharefs: it uses mshare to create the
shared region and thus allows a user to mprotect a single page inside
the mmap it creates using the fd returned by mcreate. mshare underneath
automagically extends the new page protection to everyone sharing that
page, owing to the shared PTE.
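
In other words, a hypothetical flow (assume both processes have already
created/attached the mshare'd region as above, with base being the
attach address and off a page-aligned offset into it):

/* Process A write-protects one 4K page inside the shared region. */
mprotect(base + off, 4096, PROT_READ);

/* Because that page's PTE lives in the shared page table, process B now
 * sees the page read-only as well -- no mprotect(), and in fact no
 * syscall at all, on B's side. */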

--
Khalid

>   
>>> Maybe mcreate() is just a library call, and it's really a thin wrapper
>>> around open() that happens to know where msharefs is mounted.
>>
>> -- 
>>   Kirill A. Shutemov
> 



Thread overview: 54+ messages
2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 1/6] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 2/6] mm: Add msharefs filesystem Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 3/6] mm: Add read for msharefs Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 4/6] mm: implement mshare_unlink syscall Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 5/6] mm: Add locking to msharefs syscalls Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 6/6] mm: Add basic page table sharing using mshare Khalid Aziz
2022-01-18 21:41 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Dave Hansen
2022-01-18 21:46   ` Matthew Wilcox
2022-01-18 22:47     ` Khalid Aziz
2022-01-18 22:06 ` Dave Hansen
2022-01-18 22:52   ` Khalid Aziz
2022-01-19 11:38 ` Mark Hemment
2022-01-19 17:02   ` Khalid Aziz
2022-01-20 12:49     ` Mark Hemment
2022-01-20 19:15       ` Khalid Aziz
2022-01-24 15:15         ` Mark Hemment
2022-01-24 15:27           ` Matthew Wilcox
2022-01-24 22:20           ` Khalid Aziz
2022-01-21  1:08 ` Barry Song
2022-01-21  2:13   ` Matthew Wilcox
2022-01-21  7:35     ` Barry Song
2022-01-21 14:47       ` Matthew Wilcox
2022-01-21 16:41         ` Khalid Aziz
2022-01-22  1:39           ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-01-22  1:41             ` Matthew Wilcox
2022-01-22 10:18               ` Thomas Schoebel-Theuer
2022-01-22 16:09                 ` Matthew Wilcox
2022-01-22 11:31 ` Mike Rapoport
2022-01-22 18:29   ` Andy Lutomirski
2022-01-24 18:48   ` Khalid Aziz
2022-01-24 19:45     ` Andy Lutomirski
2022-01-24 22:30       ` Khalid Aziz
2022-01-24 23:16         ` Andy Lutomirski
2022-01-24 23:44           ` Khalid Aziz
2022-01-25 11:42 ` Kirill A. Shutemov
2022-01-25 12:09   ` William Kucharski
2022-01-25 13:18     ` David Hildenbrand
2022-01-25 14:01       ` Kirill A. Shutemov
2022-01-25 13:23   ` Matthew Wilcox
2022-01-25 13:59     ` Kirill A. Shutemov
2022-01-25 14:09       ` Matthew Wilcox
2022-01-25 18:57         ` Kirill A. Shutemov
2022-01-25 18:59           ` Matthew Wilcox
2022-01-26  4:04             ` Matthew Wilcox
2022-01-26 10:16               ` David Hildenbrand
2022-01-26 13:38                 ` Matthew Wilcox
2022-01-26 13:55                   ` David Hildenbrand
2022-01-26 14:12                     ` Matthew Wilcox
2022-01-26 14:30                       ` David Hildenbrand
2022-01-26 14:12                   ` Mike Rapoport
2022-01-26 13:42               ` Kirill A. Shutemov
2022-01-26 14:18                 ` Mike Rapoport
2022-01-26 17:33                   ` Khalid Aziz
