* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Eric W. Biederman @ 2020-07-27 17:07 UTC
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar,
	steven.sistare

Anthony Yznaga <anthony.yznaga@oracle.com> writes:

> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

What is wrong with using a file descriptor to a possibly deleted file
and re-mmapping it?

There is already MAP_FIXED that allows you to ensure you have the same
address.

I think all it would take would be a small protocol from one version
to the next to say map file descriptor #N and address #A.  Something
easily passed on the command line.
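
For illustration, a rough sketch of such a handoff (the --inherit-fd and
--inherit-addr flags are hypothetical, and memfd_create() stands in for
the possibly deleted file):

#define _GNU_SOURCE
#include <err.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Old process: back the range with an fd instead of MAP_ANON. */
static void handoff(size_t len)
{
	int fd = memfd_create("guest-ram", 0);	/* no MFD_CLOEXEC: fd survives exec */
	char fdbuf[16], addrbuf[32];
	void *addr;

	if (fd < 0 || ftruncate(fd, len) < 0)
		err(1, "memfd_create/ftruncate");
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		err(1, "mmap");
	/* ... populate memory, quiesce the guest ... */
	snprintf(fdbuf, sizeof(fdbuf), "%d", fd);
	snprintf(addrbuf, sizeof(addrbuf), "%p", addr);
	execl("./new-binary", "new-binary", "--inherit-fd", fdbuf,
	      "--inherit-addr", addrbuf, (char *)NULL);
	err(1, "execl");
}

/* New process: re-establish the mapping at the address passed in.
 * MAP_FIXED_NOREPLACE makes a collision fail instead of silently
 * clobbering whatever is already there. */
static void reattach(int fd, void *addr, size_t len)
{
	if (mmap(addr, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED_NOREPLACE, fd, 0) == MAP_FAILED)
		err(1, "mmap fixed");
}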

There is just enough complexity in exec currently that our
implementation of exec is already teetering.  So if we could use
existing mechanisms it would be good.

And I don't see why this version of sharing a mmap area would be
particularly general.  I do imagine that being able to force a
mmap area into a setuid executable would be a fun attack vector.

Perhaps I missed this in my skim of these patches, but I did not see
anything that guarded this feature against an exec that changes an
application's privileges.

Eric


> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAs, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries in a way that addresses the
> previous issues and could be considered on its own.
>
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
>
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
>
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)



* [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Anthony Yznaga @ 2020-07-27 17:11 UTC
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

This patchset adds support for preserving an anonymous memory range across
exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
sharing memory in this manner, as opposed to re-attaching to a named shared
memory segment, is to ensure it is mapped at the same virtual address in
the new process as it was in the old one.  An intended use for this is to
preserve guest memory for guests using vfio while qemu exec's an updated
version of itself.  By ensuring the memory is preserved at a fixed address,
vfio mappings and their associated kernel data structures can remain valid.
In addition, for the qemu use case, qemu instances that back guest RAM with
anonymous memory can be updated.
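
In outline, the intended flow looks like the sketch below (a sketch
only: MADV_DOEXEC comes from patch 5, the opt-in note from patch 4, and
the helper names are illustrative):

#include <err.h>
#include <sys/mman.h>
#include <unistd.h>

/* Old instance: mark guest RAM to survive exec, then exec the new
 * binary, handing it the range, e.g. on the command line. */
static void reexec_preserving(void *ram, size_t ram_len, char **argv)
{
	if (madvise(ram, ram_len, MADV_DOEXEC))
		err(1, "MADV_DOEXEC");
	execv(argv[0], argv);
	err(1, "execv");
}

/* New instance (built with the preserved-mem-ok ELF note): the range is
 * already mapped at the same address, so vfio mappings and their
 * associated kernel data structures remain valid. */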

Patches 1 and 2 ensure that loading of ELF load segments does not silently
clobber existing VMAs, and remove assumptions that the stack is the only
VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
MAP_FIXED_NOREPLACE to load ELF binaries in a way that addresses the
previous issues and could be considered on its own.

Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
using an ELF note.

Anthony Yznaga (5):
  elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
  mm: do not assume only the stack vma exists in setup_arg_pages()
  mm: introduce VM_EXEC_KEEP
  exec, elf: require opt-in for accepting preserved mem
  mm: introduce MADV_DOEXEC

 arch/x86/Kconfig                       |   1 +
 fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
 fs/exec.c                              |  33 +++++-
 include/linux/binfmts.h                |   7 +-
 include/linux/mm.h                     |   5 +
 include/uapi/asm-generic/mman-common.h |   3 +
 kernel/fork.c                          |   2 +-
 mm/madvise.c                           |  25 +++++
 mm/mmap.c                              |  47 ++++++++
 9 files changed, 266 insertions(+), 53 deletions(-)

-- 
1.8.3.1




* [RFC PATCH 1/5] elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
From: Anthony Yznaga @ 2020-07-27 17:11 UTC
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

Commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") reverted to using MAP_FIXED to map elf load
segments because it was found that the load segments in some binaries
overlap and can cause MAP_FIXED_NOREPLACE to fail.  The original intent
of MAP_FIXED_NOREPLACE was to prevent the silent clobbering of an
existing mapping (e.g. the stack) by the elf image.  To achieve this,
expand on the logic used when loading ET_DYN binaries, which calculates
a total size for the image when the first segment is mapped, maps the
entire image, and then unmaps the remainder before remaining segments
are mapped.  Apply this to ET_EXEC binaries as well as ET_DYN binaries
as is done now, and for both ET_EXEC and ET_DYN+INTERP use
MAP_FIXED_NOREPLACE for the initial total-size mapping and MAP_FIXED for
the remaining mappings.  For ET_DYN without INTERP, continue to map at a
system-selected address in the mmap region.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 fs/binfmt_elf.c | 112 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 64 insertions(+), 48 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 9fe3b51c116a..6445a6dbdb1d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1046,58 +1046,25 @@ static int load_elf_binary(struct linux_binprm *bprm)
 
 		vaddr = elf_ppnt->p_vaddr;
 		/*
-		 * If we are loading ET_EXEC or we have already performed
-		 * the ET_DYN load_addr calculations, proceed normally.
+		 * Map remaining segments with MAP_FIXED once the first
+		 * total size mapping has been done.
 		 */
-		if (elf_ex->e_type == ET_EXEC || load_addr_set) {
+		if (load_addr_set) {
 			elf_flags |= MAP_FIXED;
-		} else if (elf_ex->e_type == ET_DYN) {
-			/*
-			 * This logic is run once for the first LOAD Program
-			 * Header for ET_DYN binaries to calculate the
-			 * randomization (load_bias) for all the LOAD
-			 * Program Headers, and to calculate the entire
-			 * size of the ELF mapping (total_size). (Note that
-			 * load_addr_set is set to true later once the
-			 * initial mapping is performed.)
-			 *
-			 * There are effectively two types of ET_DYN
-			 * binaries: programs (i.e. PIE: ET_DYN with INTERP)
-			 * and loaders (ET_DYN without INTERP, since they
-			 * _are_ the ELF interpreter). The loaders must
-			 * be loaded away from programs since the program
-			 * may otherwise collide with the loader (especially
-			 * for ET_EXEC which does not have a randomized
-			 * position). For example to handle invocations of
-			 * "./ld.so someprog" to test out a new version of
-			 * the loader, the subsequent program that the
-			 * loader loads must avoid the loader itself, so
-			 * they cannot share the same load range. Sufficient
-			 * room for the brk must be allocated with the
-			 * loader as well, since brk must be available with
-			 * the loader.
-			 *
-			 * Therefore, programs are loaded offset from
-			 * ELF_ET_DYN_BASE and loaders are loaded into the
-			 * independently randomized mmap region (0 load_bias
-			 * without MAP_FIXED).
-			 */
-			if (interpreter) {
-				load_bias = ELF_ET_DYN_BASE;
-				if (current->flags & PF_RANDOMIZE)
-					load_bias += arch_mmap_rnd();
-				elf_flags |= MAP_FIXED;
-			} else
-				load_bias = 0;
-
+		} else {
 			/*
-			 * Since load_bias is used for all subsequent loading
-			 * calculations, we must lower it by the first vaddr
-			 * so that the remaining calculations based on the
-			 * ELF vaddrs will be correctly offset. The result
-			 * is then page aligned.
+			 * To ensure loading does not continue if an ELF
+			 * LOAD segment overlaps an existing mapping (e.g.
+			 * the stack), for the first LOAD Program Header
+			 * calculate the entire size of the ELF mapping
+			 * and map it with MAP_FIXED_NOREPLACE. On success,
+			 * the remainder will be unmapped and subsequent
+			 * LOAD segments mapped with MAP_FIXED rather than
+			 * MAP_FIXED_NOREPLACE because some binaries may
+			 * have overlapping segments that would cause the
+			 * mmap to fail.
 			 */
-			load_bias = ELF_PAGESTART(load_bias - vaddr);
+			elf_flags |= MAP_FIXED_NOREPLACE;
 
 			total_size = total_mapping_size(elf_phdata,
 							elf_ex->e_phnum);
@@ -1105,6 +1072,55 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				retval = -EINVAL;
 				goto out_free_dentry;
 			}
+
+			if (elf_ex->e_type == ET_DYN) {
+				/*
+				 * This logic is run once for the first LOAD
+				 * Program Header for ET_DYN binaries to
+				 * calculate the randomization (load_bias) for
+				 * all the LOAD Program Headers.
+				 *
+				 * There are effectively two types of ET_DYN
+				 * binaries: programs (i.e. PIE: ET_DYN with
+				 * INTERP) and loaders (ET_DYN without INTERP,
+				 * since they _are_ the ELF interpreter). The
+				 * loaders must be loaded away from programs
+				 * since the program may otherwise collide with
+				 * the loader (especially for ET_EXEC which does
+				 * not have a randomized position). For example
+				 * to handle invocations of "./ld.so someprog"
+				 * to test out a new version of the loader, the
+				 * subsequent program that the loader loads must
+				 * avoid the loader itself, so they cannot share
+				 * the same load range. Sufficient room for the
+				 * brk must be allocated with the loader as
+				 * well, since brk must be available with the
+				 * loader.
+				 *
+				 * Therefore, programs are loaded offset from
+				 * ELF_ET_DYN_BASE and loaders are loaded into
+				 * the independently randomized mmap region
+				 * (0 load_bias without MAP_FIXED*).
+				 */
+				if (interpreter) {
+					load_bias = ELF_ET_DYN_BASE;
+					if (current->flags & PF_RANDOMIZE)
+						load_bias += arch_mmap_rnd();
+				} else {
+					load_bias = 0;
+					elf_flags &= ~MAP_FIXED_NOREPLACE;
+				}
+
+				/*
+				 * Since load_bias is used for all subsequent
+				 * loading calculations, we must lower it by
+				 * the first vaddr so that the remaining
+				 * calculations based on the ELF vaddrs will
+				 * be correctly offset. The result is then
+				 * page aligned.
+				 */
+				load_bias = ELF_PAGESTART(load_bias - vaddr);
+			}
 		}
 
 		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
-- 
1.8.3.1




* [RFC PATCH 2/5] mm: do not assume only the stack vma exists in setup_arg_pages()
From: Anthony Yznaga @ 2020-07-27 17:11 UTC
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

In preparation for allowing vmas to be preserved across exec, do not
assume that there is no prev vma to pass to mprotect_fixup() in
setup_arg_pages().
Also, setup_arg_pages() expands the initial stack of a process by
128k or to the stack size limit, whichever is smaller.  expand_stack()
assumes there is no vma between the vma passed to it and the
address to expand to, so check before calling it.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 fs/exec.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a70327..262112e5f9f8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -720,7 +720,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	unsigned long stack_shift;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = bprm->vma;
-	struct vm_area_struct *prev = NULL;
+	struct vm_area_struct *prev = vma->vm_prev;
 	unsigned long vm_flags;
 	unsigned long stack_base;
 	unsigned long stack_size;
@@ -819,6 +819,10 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	else
 		stack_base = vma->vm_start - stack_expand;
 #endif
+	if (vma != find_vma(mm, stack_base)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
 	current->mm->start_stack = bprm->p;
 	ret = expand_stack(vma, stack_base);
 	if (ret)
-- 
1.8.3.1




* [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
From: Anthony Yznaga @ 2020-07-27 17:11 UTC
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

A vma with the VM_EXEC_KEEP flag is preserved across exec.  Only anonymous
vmas may be flagged.  For safety, overlap with fixed address VMAs created in
the new mm during exec (e.g. the stack and elf load segments) is not
permitted and will cause the exec to fail.
(We are studying how to guarantee there are no conflicts. Comments welcome.)

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/Kconfig   |  1 +
 fs/exec.c          | 20 ++++++++++++++++++++
 include/linux/mm.h |  5 +++++
 kernel/fork.c      |  2 +-
 mm/mmap.c          | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..fc36eb2f45c0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
 	select SWIOTLB
+	select ARCH_USES_HIGH_VMA_FLAGS
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/fs/exec.c b/fs/exec.c
index 262112e5f9f8..1de09c4eef00 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
 EXPORT_SYMBOL(read_code);
 #endif
 
+static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
+		if (vma->vm_flags & VM_EXEC_KEEP) {
+			ret = vma_dup(vma, new_mm);
+			if (ret)
+				return ret;
+		}
+	return 0;
+}
+
 /*
  * Maps the mm_struct mm into the current task struct.
  * On success, this function returns with the mutex
@@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
 			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
+		ret = vma_dup_some(old_mm, mm);
+		if (ret) {
+			mmap_read_unlock(old_mm);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
+			return ret;
+		}
 	}
 
 	task_lock(tsk);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..1c538ba77f33 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -295,11 +295,15 @@ int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_EXEC_KEEP	BIT(VM_HIGH_ARCH_BIT_5)	/* preserve VMA across exec */
+#else
+#define VM_EXEC_KEEP	VM_NONE
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -2534,6 +2538,7 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
 	unsigned long addr, unsigned long len, pgoff_t pgoff,
 	bool *need_rmap_locks);
 extern void exit_mmap(struct mm_struct *);
+extern int vma_dup(struct vm_area_struct *vma, struct mm_struct *mm);
 
 static inline int check_data_rlimit(unsigned long rlim,
 				    unsigned long new,
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493203ae..15ead613714f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -564,7 +564,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			tmp->anon_vma = NULL;
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT | VM_EXEC_KEEP);
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);
diff --git a/mm/mmap.c b/mm/mmap.c
index 59a4682ebf3f..be2ff53743c3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3279,6 +3279,53 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	return NULL;
 }
 
+int vma_dup(struct vm_area_struct *old_vma, struct mm_struct *mm)
+{
+	unsigned long npages;
+	struct mm_struct *old_mm = old_vma->vm_mm;
+	struct vm_area_struct *vma;
+	int ret = -ENOMEM;
+
+	if (WARN_ON(old_vma->vm_file || old_vma->vm_ops))
+		return -EINVAL;
+
+	vma = find_vma(mm, old_vma->vm_start);
+	if (vma && vma->vm_start < old_vma->vm_end)
+		return -EEXIST;
+
+	npages = vma_pages(old_vma);
+	mm->total_vm += npages;
+
+	vma = vm_area_dup(old_vma);
+	if (!vma)
+		goto fail_nomem;
+
+	ret = vma_dup_policy(old_vma, vma);
+	if (ret)
+		goto fail_nomem_policy;
+
+	vma->vm_mm = mm;
+	ret = anon_vma_fork(vma, old_vma);
+	if (ret)
+		goto fail_nomem_anon_vma_fork;
+
+	vma->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP|VM_EXEC_KEEP);
+	vma->vm_next = vma->vm_prev = NULL;
+	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+	if (is_vm_hugetlb_page(vma))
+		reset_vma_resv_huge_pages(vma);
+	__insert_vm_struct(mm, vma);
+	ret = copy_page_range(mm, old_mm, old_vma);
+	return ret;
+
+fail_nomem_anon_vma_fork:
+	mpol_put(vma_policy(vma));
+fail_nomem_policy:
+	vm_area_free(vma);
+fail_nomem:
+	return -ENOMEM;
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
-- 
1.8.3.1




* [RFC PATCH 4/5] exec, elf: require opt-in for accepting preserved mem
From: Anthony Yznaga @ 2020-07-27 17:11 UTC
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

Don't copy preserved VMAs to the binary being exec'd unless the binary has
a "preserved-mem-ok" ELF note.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
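For reference, a binary could opt in with a top-level asm snippet along
these lines (x86-64 GAS syntax; the section name is an arbitrary choice,
what matters is that the note lands in a PT_NOTE segment and that the
name and type values match the parser added below):

__asm__(
"	.section .note.preserved-mem, \"a\", @note\n"
"	.p2align 2\n"
"	.long 1f - 0f\n"		/* n_namesz */
"	.long 0\n"			/* n_descsz */
"	.long 0x07c1feed\n"		/* n_type checked by the parser */
"0:	.asciz \"preserved-mem-ok\"\n"
"1:	.p2align 2\n"
"	.previous\n"
);
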
 fs/binfmt_elf.c         | 84 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/exec.c               | 17 +++++-----
 include/linux/binfmts.h |  7 ++++-
 3 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6445a6dbdb1d..46248b7b0a75 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -683,6 +683,81 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 	return error;
 }
 
+#define NOTES_SZ			SZ_1K
+#define PRESERVED_MEM_OK_STRING		"preserved-mem-ok"
+#define SZ_PRESERVED_MEM_OK_STRING	sizeof(PRESERVED_MEM_OK_STRING)
+
+static int parse_elf_note(struct linux_binprm *bprm, const char *data, size_t *off, size_t datasz)
+{
+	const struct elf_note *nhdr;
+	const char *name;
+	size_t o;
+
+	o = *off;
+	datasz -= o;
+
+	if (datasz < sizeof(*nhdr))
+		return -ENOEXEC;
+
+	nhdr = (const struct elf_note *)(data + o);
+	o += sizeof(*nhdr);
+	datasz -= sizeof(*nhdr);
+
+	/*
+	 * Currently only the preserved-mem-ok elf note is of interest.
+	 */
+	if (nhdr->n_type != 0x07c1feed)
+		goto next;
+
+	if (nhdr->n_namesz > SZ_PRESERVED_MEM_OK_STRING)
+		return -ENOEXEC;
+
+	name = data + o;
+	if (datasz < SZ_PRESERVED_MEM_OK_STRING ||
+	    strncmp(name, PRESERVED_MEM_OK_STRING, SZ_PRESERVED_MEM_OK_STRING))
+		return -ENOEXEC;
+
+	bprm->accepts_preserved_mem = 1;
+
+next:
+	o += roundup(nhdr->n_namesz, 4) + roundup(nhdr->n_descsz, 4);
+	*off = o;
+
+	return 0;
+}
+
+static int parse_elf_notes(struct linux_binprm *bprm, struct elf_phdr *phdr)
+{
+	char *notes;
+	size_t notes_sz;
+	size_t off = 0;
+	int ret;
+
+	if (!phdr)
+		return 0;
+
+	notes_sz = phdr->p_filesz;
+	if ((notes_sz > NOTES_SZ) || (notes_sz < sizeof(struct elf_note)))
+		return -ENOEXEC;
+
+	notes = kvmalloc(notes_sz, GFP_KERNEL);
+	if (!notes)
+		return -ENOMEM;
+
+	ret = elf_read(bprm->file, notes, notes_sz, phdr->p_offset);
+	if (ret < 0)
+		goto out;
+
+	while (off < notes_sz) {
+		ret = parse_elf_note(bprm, notes, &off, notes_sz);
+		if (ret)
+			break;
+	}
+out:
+	kvfree(notes);
+	return ret;
+}
+
 /*
  * These are the functions used to load ELF style executables and shared
  * libraries.  There is no binary dependent code anywhere else.
@@ -801,6 +876,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	unsigned long error;
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
 	struct elf_phdr *elf_property_phdata = NULL;
+	struct elf_phdr *elf_notes_phdata = NULL;
 	unsigned long elf_bss, elf_brk;
 	int bss_prot = 0;
 	int retval, i;
@@ -909,6 +985,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				executable_stack = EXSTACK_DISABLE_X;
 			break;
 
+		case PT_NOTE:
+			elf_notes_phdata = elf_ppnt;
+			break;
+
 		case PT_LOPROC ... PT_HIPROC:
 			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
 						  bprm->file, false,
@@ -970,6 +1050,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (retval)
 		goto out_free_dentry;
 
+	retval = parse_elf_notes(bprm, elf_notes_phdata);
+	if (retval)
+		goto out_free_dentry;
+
 	/* Flush all traces of the currently running executable */
 	retval = begin_new_exec(bprm);
 	if (retval)
diff --git a/fs/exec.c b/fs/exec.c
index 1de09c4eef00..b2b046fec1f8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1088,10 +1088,11 @@ static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
  * On success, this function returns with the mutex
  * exec_update_mutex locked.
  */
-static int exec_mmap(struct mm_struct *mm)
+static int exec_mmap(struct linux_binprm *bprm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	struct mm_struct *mm = bprm->mm;
 	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
@@ -1118,11 +1119,13 @@ static int exec_mmap(struct mm_struct *mm)
 			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
-		ret = vma_dup_some(old_mm, mm);
-		if (ret) {
-			mmap_read_unlock(old_mm);
-			mutex_unlock(&tsk->signal->exec_update_mutex);
-			return ret;
+		if (bprm->accepts_preserved_mem) {
+			ret = vma_dup_some(old_mm, mm);
+			if (ret) {
+				mmap_read_unlock(old_mm);
+				mutex_unlock(&tsk->signal->exec_update_mutex);
+				return ret;
+			}
 		}
 	}
 
@@ -1386,7 +1389,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * Release all of the old mmap stuff
 	 */
 	acct_arg_size(bprm, 0);
-	retval = exec_mmap(bprm->mm);
+	retval = exec_mmap(bprm);
 	if (retval)
 		goto out;
 
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 4a20b7517dd0..6a66589454c8 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -41,7 +41,12 @@ struct linux_binprm {
 		 * Set when errors can no longer be returned to the
 		 * original userspace.
 		 */
-		point_of_no_return:1;
+		point_of_no_return:1,
+		/*
+		 * Set if the binary being exec'd will accept memory marked
+		 * for preservation by the outgoing process.
+		 */
+		accepts_preserved_mem:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
-- 
1.8.3.1




* [RFC PATCH 5/5] mm: introduce MADV_DOEXEC
From: Anthony Yznaga @ 2020-07-27 17:11 UTC
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

madvise MADV_DOEXEC preserves a memory range across exec.  Initially it
is only supported for non-executable, non-stack, anonymous memory.
MADV_DONTEXEC reverts the effect of a previous MADV_DOEXEC call and
undoes the preservation of the range.  After a successful exec call,
the behavior of all ranges reverts to MADV_DONTEXEC.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
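A sketch of the expected usage and failure mode (assuming the new
constants are visible via updated uapi headers; names are illustrative):

#include <err.h>
#include <sys/mman.h>

static void *alloc_preserved(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		err(1, "mmap");
	/* Fails with EINVAL for file-backed, shared, executable, or
	 * stack mappings. */
	if (madvise(p, len, MADV_DOEXEC))
		err(1, "MADV_DOEXEC");
	return p;
}

/* Undo the preservation before an exec that should not inherit it. */
static void unpreserve(void *p, size_t len)
{
	if (madvise(p, len, MADV_DONTEXEC))
		err(1, "MADV_DONTEXEC");
}
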
 include/uapi/asm-generic/mman-common.h |  3 +++
 mm/madvise.c                           | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..7c5f616b28f7 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,6 +72,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_DOEXEC	22		/* do inherit across exec */
+#define MADV_DONTEXEC	23		/* don't inherit across exec */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index dd1d43cf026d..b447fa748649 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -103,6 +103,26 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	case MADV_KEEPONFORK:
 		new_flags &= ~VM_WIPEONFORK;
 		break;
+	case MADV_DOEXEC:
+		/*
+		 * MADV_DOEXEC is only supported on private, non-executable,
+		 * non-stack anonymous memory and if the VM_EXEC_KEEP flag
+		 * is available.
+		 */
+		if (!VM_EXEC_KEEP || vma->vm_file || vma->vm_flags & (VM_EXEC|VM_SHARED|VM_STACK)) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags = (new_flags & ~VM_MAYEXEC) | VM_EXEC_KEEP;
+		break;
+	case MADV_DONTEXEC:
+		if (!VM_EXEC_KEEP) {
+			error = -EINVAL;
+			goto out;
+		}
+		if (new_flags & VM_EXEC_KEEP)
+			new_flags = (new_flags & ~VM_EXEC_KEEP) | VM_MAYEXEC;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -983,6 +1003,8 @@ static int madvise_inject_error(int behavior,
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
 #endif
+	case MADV_DOEXEC:
+	case MADV_DONTEXEC:
 		return true;
 
 	default:
@@ -1037,6 +1059,9 @@ static int madvise_inject_error(int behavior,
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *  MADV_DOEXEC - On exec, preserve and duplicate this area in the new process
+ *		  if the new process allows it.
+ *  MADV_DONTEXEC - Undo the effect of MADV_DOEXEC.
  *
  * return values:
  *  zero    - success
-- 
1.8.3.1




* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Steven Sistare @ 2020-07-27 18:00 UTC
  To: Eric W. Biederman, Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar

On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> 
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> What is wrong with using a file descriptor to a possibly deleted file
> and re-mmapping it?
> 
> There is already MAP_FIXED that allows you to ensure you have the same
> address.

MAP_FIXED blows away any existing mapping in that range, which is not the 
desired behavior.  We want to preserve the previously created mapping at
the same VA and co-exist with other mappings created by the new process.
There is no way to guarantee availability of a VA post-exec.
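
For example, with an fd-based scheme the new process can at best detect
a collision rather than prevent it (a fragment; addr, len, prot, and fd
as in the scheme above, MAP_FIXED_NOREPLACE semantics per mmap(2)):

/* MAP_FIXED silently unmaps whatever is already at [addr, addr+len). */
p = mmap(addr, len, prot, MAP_SHARED | MAP_FIXED, fd, 0);

/* MAP_FIXED_NOREPLACE at least fails instead of clobbering ... */
p = mmap(addr, len, prot, MAP_SHARED | MAP_FIXED_NOREPLACE, fd, 0);
if (p == MAP_FAILED && errno == EEXIST) {
	/* ... but if the new image or a library already landed in the
	 * range, the address cannot be recovered. */
}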

> I think all it would take would be a small protocol from one version
> to the next to say map file descriptor #N and address #A.  Something
> easily passed on the command line.
> 
> There is just enough complexity in exec currently that our
> implementation of exec is already teetering.  So if we could use
> existing mechanisms it would be good.
> 
> And I don't see why this version of sharing a mmap area would be
> particularly general.  I do imagine that being able to force a
> mmap area into a setuid executable would be a fun attack vector.

Any mmap(MAP_ANON) segment can be preserved.  That is very general, and is 
the case we need to support to upgrade legacy applications that are already
running (such as qemu) -- we cannot recode them before they are updated.

> Perhaps I missed this in my skim of these patches, but I did not see
> anything that guarded this feature against an exec that changes an
> application's privileges.

The ELF opt-in flag must be set, so only selected executables will accept
incoming mappings.  The exec() code verifies that the preserved mappings
do not overlap or become adjacent to text or stack, so it is not possible,
for example, for an attacker to cause stack underflow or overflow to access
injected content.  A gadget invoked by some other attack could access the
preserved content.

- Steve

>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAs, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries in a way that addresses the
>> previous issues and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Kirill Tkhai @ 2020-07-28 11:34 UTC
  To: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On 27.07.2020 20:11, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,

So, the goal is to update the QEMU binary without stopping the virtual machine?

> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.
> 
> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAs, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries in a way that addresses the
> previous issues and could be considered on its own.
> 
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
> 
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
> 
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
> 




* Re: [RFC PATCH 5/5] mm: introduce MADV_DOEXEC
From: Kirill Tkhai @ 2020-07-28 13:22 UTC
  To: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On 27.07.2020 20:11, Anthony Yznaga wrote:
> madvise MADV_DOEXEC preserves a memory range across exec.  Initially it
> is only supported for non-executable, non-stack, anonymous memory.
> MADV_DONTEXEC reverts the effect of a previous MADV_DOEXEC call and
> undoes the preservation of the range.  After a successful exec call,
> the behavior of all ranges reverts to MADV_DONTEXEC.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  mm/madvise.c                           | 25 +++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..7c5f616b28f7 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -72,6 +72,9 @@
>  #define MADV_COLD	20		/* deactivate these pages */
>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>  
> +#define MADV_DOEXEC	22		/* do inherit across exec */
> +#define MADV_DONTEXEC	23		/* don't inherit across exec */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dd1d43cf026d..b447fa748649 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -103,6 +103,26 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  	case MADV_KEEPONFORK:
>  		new_flags &= ~VM_WIPEONFORK;
>  		break;
> +	case MADV_DOEXEC:

For me, MADV_KEEPONEXEC sounds better, as it's symmetric with MADV_KEEPONFORK.

> +		/*
> +		 * MADV_DOEXEC is only supported on private, non-executable,
> +		 * non-stack anonymous memory and if the VM_EXEC_KEEP flag
> +		 * is available.
> +		 */
> +		if (!VM_EXEC_KEEP || vma->vm_file || vma->vm_flags & (VM_EXEC|VM_SHARED|VM_STACK)) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags = (new_flags & ~VM_MAYEXEC) | VM_EXEC_KEEP;
> +		break;
> +	case MADV_DONTEXEC:
> +		if (!VM_EXEC_KEEP) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		if (new_flags & VM_EXEC_KEEP)
> +			new_flags = (new_flags & ~VM_EXEC_KEEP) | VM_MAYEXEC;
> +		break;
>  	case MADV_DONTDUMP:
>  		new_flags |= VM_DONTDUMP;
>  		break;
> @@ -983,6 +1003,8 @@ static int madvise_inject_error(int behavior,
>  	case MADV_SOFT_OFFLINE:
>  	case MADV_HWPOISON:
>  #endif
> +	case MADV_DOEXEC:
> +	case MADV_DONTEXEC:
>  		return true;
>  
>  	default:
> @@ -1037,6 +1059,9 @@ static int madvise_inject_error(int behavior,
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *  MADV_DOEXEC - On exec, preserve and duplicate this area in the new process
> + *		  if the new process allows it.
> + *  MADV_DONTEXEC - Undo the effect of MADV_DOEXEC.
>   *
>   * return values:
>   *  zero    - success
> 




* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
From: Eric W. Biederman @ 2020-07-28 13:38 UTC
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar,
	steven.sistare

Anthony Yznaga <anthony.yznaga@oracle.com> writes:

> A vma with the VM_EXEC_KEEP flag is preserved across exec.  Only anonymous
> vmas may be flagged.  For safety, overlap with fixed address VMAs created in
> the new mm during exec (e.g. the stack and elf load segments) is not
> permitted and will cause the exec to fail.
> (We are studying how to guarantee there are no conflicts. Comments welcome.)
>

> diff --git a/fs/exec.c b/fs/exec.c
> index 262112e5f9f8..1de09c4eef00 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
>  EXPORT_SYMBOL(read_code);
>  #endif
>  
> +static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
> +		if (vma->vm_flags & VM_EXEC_KEEP) {
> +			ret = vma_dup(vma, new_mm);
> +			if (ret)
> +				return ret;
> +		}
> +	return 0;
> +}
> +
>  /*
>   * Maps the mm_struct mm into the current task struct.
>   * On success, this function returns with the mutex
> @@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
>  			mutex_unlock(&tsk->signal->exec_update_mutex);
>  			return -EINTR;
>  		}
> +		ret = vma_dup_some(old_mm, mm);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ouch! An unconditional loop through all of the vmas of the execing
process, just in case there is a VM_EXEC_KEEP vma.

I know we already walk the list in exit_mmap, but I get the feeling this
will slow exec down when this feature is not enabled, especially when
a process with a lot of vmas is calling exec.

                
> +		if (ret) {
> +			mmap_read_unlock(old_mm);
> +			mutex_unlock(&tsk->signal->exec_update_mutex);
> +			return ret;
> +		}
>  	}
>  
>  	task_lock(tsk);



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Christian Brauner @ 2020-07-28 13:40 UTC
  To: Steven Sistare
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, linux-api

On Mon, Jul 27, 2020 at 02:00:17PM -0400, Steven Sistare wrote:
> On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> > Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> > 
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > What is wrong with using a file descriptor to a possibly deleted file
> > and re-mmapping it?
> > 
> > There is already MAP_FIXED that allows you to ensure you have the same
> > address.
> 
> MAP_FIXED blows away any existing mapping in that range, which is not the 
> desired behavior.  We want to preserve the previously created mapping at

There's also MAP_FIXED_NOREPLACE since v4.17 in case that helps.

Note that this should really go to linux-api too.  I won't ask for a
resend, since that would mean spamming everyone's inbox with the same
thread again, but if you send a revised version, please make sure to Cc
linux-api.  The glibc folks are listening on there too.

Thanks!
Christian



* Re: [RFC PATCH 5/5] mm: introduce MADV_DOEXEC
From: Steven Sistare @ 2020-07-28 14:06 UTC
  To: Kirill Tkhai, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar

On 7/28/2020 9:22 AM, Kirill Tkhai wrote:
> On 27.07.2020 20:11, Anthony Yznaga wrote:
>> madvise MADV_DOEXEC preserves a memory range across exec.  Initially it
>> is only supported for non-executable, non-stack, anonymous memory.
>> MADV_DONTEXEC reverts the effect of a previous MADV_DOEXEC call and
>> undoes the preservation of the range.  After a successful exec call,
>> the behavior of all ranges reverts to MADV_DONTEXEC.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>  include/uapi/asm-generic/mman-common.h |  3 +++
>>  mm/madvise.c                           | 25 +++++++++++++++++++++++++
>>  2 files changed, 28 insertions(+)
>>
>> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
>> index f94f65d429be..7c5f616b28f7 100644
>> --- a/include/uapi/asm-generic/mman-common.h
>> +++ b/include/uapi/asm-generic/mman-common.h
>> @@ -72,6 +72,9 @@
>>  #define MADV_COLD	20		/* deactivate these pages */
>>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>>  
>> +#define MADV_DOEXEC	22		/* do inherit across exec */
>> +#define MADV_DONTEXEC	23		/* don't inherit across exec */
>> +
>>  /* compatibility flags */
>>  #define MAP_FILE	0
>>  
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index dd1d43cf026d..b447fa748649 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -103,6 +103,26 @@ static long madvise_behavior(struct vm_area_struct *vma,
>>  	case MADV_KEEPONFORK:
>>  		new_flags &= ~VM_WIPEONFORK;
>>  		break;
>> +	case MADV_DOEXEC:
> 
> For me, MADV_KEEPONEXEC sounds better, as it's symmetric with MADV_KEEPONFORK.

We chose MADV_DOEXEC and MADV_DONTEXEC to match the precedent set by:

#define MADV_DONTFORK   10              /* don't inherit across fork */
#define MADV_DOFORK     11              /* do inherit across fork */

I do like "keep" as a concise description of the operation.  KEEPONFORK is not a perfect 
analog because its opposite is wipe ...

#define MADV_WIPEONFORK 18              /* Zero memory on fork, child only */
#define MADV_KEEPONFORK 19              /* Undo MADV_WIPEONFORK */

... but if folks are ok with that then IMO these are all good choices:

MADV_KEEPONEXEC
MADV_DROPONEXEC

MADV_KEEPEXEC    (shorter)
MADV_DROPEXEC 

MADV_KEEP_EXEC   (more legible, but no existing MADV names use a second underscore)
MADV_DROP_EXEC

Whatever folks like best.

- Steve

>> +		/*
>> +		 * MADV_DOEXEC is only supported on private, non-executable,
>> +		 * non-stack anonymous memory and if the VM_EXEC_KEEP flag
>> +		 * is available.
>> +		 */
>> +		if (!VM_EXEC_KEEP || vma->vm_file || vma->vm_flags & (VM_EXEC|VM_SHARED|VM_STACK)) {
>> +			error = -EINVAL;
>> +			goto out;
>> +		}
>> +		new_flags = (new_flags & ~VM_MAYEXEC) | VM_EXEC_KEEP;
>> +		break;
>> +	case MADV_DONTEXEC:
>> +		if (!VM_EXEC_KEEP) {
>> +			error = -EINVAL;
>> +			goto out;
>> +		}
>> +		if (new_flags & VM_EXEC_KEEP)
>> +			new_flags = (new_flags & ~VM_EXEC_KEEP) | VM_MAYEXEC;
>> +		break;
>>  	case MADV_DONTDUMP:
>>  		new_flags |= VM_DONTDUMP;
>>  		break;
>> @@ -983,6 +1003,8 @@ static int madvise_inject_error(int behavior,
>>  	case MADV_SOFT_OFFLINE:
>>  	case MADV_HWPOISON:
>>  #endif
>> +	case MADV_DOEXEC:
>> +	case MADV_DONTEXEC:
>>  		return true;
>>  
>>  	default:
>> @@ -1037,6 +1059,9 @@ static int madvise_inject_error(int behavior,
>>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>>   *		from being included in its core dump.
>>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
>> + *  MADV_DOEXEC - On exec, preserve and duplicate this area in the new process
>> + *		  if the new process allows it.
>> + *  MADV_DONTEXEC - Undo the effect of MADV_DOEXEC.
>>   *
>>   * return values:
>>   *  zero    - success
>>
> 



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Andy Lutomirski @ 2020-07-28 14:23 UTC
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare



> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> 
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.
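
Something like the following, using the existing prctl (the kernel-side
requirement would be new; the prctl itself is not):

#include <err.h>
#include <sys/prctl.h>

static void lock_down_privs(void)
{
	/* Renounce privilege-gaining execs up front; the kernel could
	 * then refuse MADV_DOEXEC unless this bit is set. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		err(1, "PR_SET_NO_NEW_PRIVS");
}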




* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Steven Sistare @ 2020-07-28 14:30 UTC
  To: Andy Lutomirski, Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar

On 7/28/2020 10:23 AM, Andy Lutomirski wrote:
>> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.

Preservation is disabled across fork, so fork/exec combos are not affected.  We forgot to document that.

- Steve



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: Anthony Yznaga @ 2020-07-28 17:28 UTC
  To: Kirill Tkhai, linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare



On 7/28/20 4:34 AM, Kirill Tkhai wrote:
> On 27.07.2020 20:11, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
> So, the goal is to update the QEMU binary without stopping the virtual machine?
Essentially, yes.  The VM is paused very briefly.

Anthony
>
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
>>
>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAs, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries in a way that addresses the
>> previous issues and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-28 13:38   ` Eric W. Biederman
@ 2020-07-28 17:44     ` Anthony Yznaga
  0 siblings, 0 replies; 74+ messages in thread
From: Anthony Yznaga @ 2020-07-28 17:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar,
	steven.sistare



On 7/28/20 6:38 AM, ebiederm@xmission.com wrote:
> Anthony Yznaga <anthony.yznaga@oracle.com> writes:
>
>> A vma with the VM_EXEC_KEEP flag is preserved across exec.  For anonymous
>> vmas only.  For safety, overlap with fixed address VMAs created in the new
>> mm during exec (e.g. the stack and elf load segments) is not permitted and
>> will cause the exec to fail.
>> (We are studying how to guarantee there are no conflicts. Comments welcome.)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 262112e5f9f8..1de09c4eef00 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
>>  EXPORT_SYMBOL(read_code);
>>  #endif
>>  
>> +static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
>> +{
>> +	struct vm_area_struct *vma;
>> +	int ret;
>> +
>> +	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
>> +		if (vma->vm_flags & VM_EXEC_KEEP) {
>> +			ret = vma_dup(vma, new_mm);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Maps the mm_struct mm into the current task struct.
>>   * On success, this function returns with the mutex
>> @@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
>>  			mutex_unlock(&tsk->signal->exec_update_mutex);
>>  			return -EINTR;
>>  		}
>> +		ret = vma_dup_some(old_mm, mm);
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Ouch! An unconditional loop through all of the vmas of the execing
> process, just in case there is a VM_EXEC_KEEP vma.
>
> I know we already walk the list in exit_mmap, but I get the feeling this
> will slow exec down when this feature is not enabled, especially when
> a process with a lot of vmas is calling exec.
Patch 4 changes this to only call vma_dup_some() if the new
binary has opted in to accepting preserved memory.
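
(A sketch of what that gating might look like; the flag name below is
made up, the real mechanism is whatever patch 4 defines:)

	if (bprm->preserve_mem_ok) {	/* hypothetical flag, set when
					 * the opt-in ELF note is parsed */
		ret = vma_dup_some(old_mm, mm);
		if (ret)
			return ret;	/* after unlocking, as in the
					 * error path above */
	}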

Anthony



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
  2020-07-28 13:38   ` Eric W. Biederman
@ 2020-07-29 13:52   ` Kirill A. Shutemov
  2020-07-29 23:20     ` Anthony Yznaga
  1 sibling, 1 reply; 74+ messages in thread
From: Kirill A. Shutemov @ 2020-07-29 13:52 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare

On Mon, Jul 27, 2020 at 10:11:25AM -0700, Anthony Yznaga wrote:
> A vma with the VM_EXEC_KEEP flag is preserved across exec.  For anonymous
> vmas only.  For safety, overlap with fixed address VMAs created in the new
> mm during exec (e.g. the stack and elf load segments) is not permitted and
> will cause the exec to fail.
> (We are studying how to guarantee there are no conflicts. Comments welcome.)
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  arch/x86/Kconfig   |  1 +
>  fs/exec.c          | 20 ++++++++++++++++++++
>  include/linux/mm.h |  5 +++++
>  kernel/fork.c      |  2 +-
>  mm/mmap.c          | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 74 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 883da0abf779..fc36eb2f45c0 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -30,6 +30,7 @@ config X86_64
>  	select MODULES_USE_ELF_RELA
>  	select NEED_DMA_MAP_STATE
>  	select SWIOTLB
> +	select ARCH_USES_HIGH_VMA_FLAGS
>  
>  config FORCE_DYNAMIC_FTRACE
>  	def_bool y
> diff --git a/fs/exec.c b/fs/exec.c
> index 262112e5f9f8..1de09c4eef00 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
>  EXPORT_SYMBOL(read_code);
>  #endif
>  
> +static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
> +		if (vma->vm_flags & VM_EXEC_KEEP) {
> +			ret = vma_dup(vma, new_mm);
> +			if (ret)
> +				return ret;
> +		}
> +	return 0;
> +}
> +
>  /*
>   * Maps the mm_struct mm into the current task struct.
>   * On success, this function returns with the mutex
> @@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
>  			mutex_unlock(&tsk->signal->exec_update_mutex);
>  			return -EINTR;
>  		}
> +		ret = vma_dup_some(old_mm, mm);
> +		if (ret) {
> +			mmap_read_unlock(old_mm);
> +			mutex_unlock(&tsk->signal->exec_update_mutex);
> +			return ret;
> +		}
>  	}
>  
>  	task_lock(tsk);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dc7b87310c10..1c538ba77f33 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -295,11 +295,15 @@ int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
>  #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>  #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>  #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_EXEC_KEEP	BIT(VM_HIGH_ARCH_BIT_5)	/* preserve VMA across exec */
> +#else
> +#define VM_EXEC_KEEP	VM_NONE
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -2534,6 +2538,7 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
>  	unsigned long addr, unsigned long len, pgoff_t pgoff,
>  	bool *need_rmap_locks);
>  extern void exit_mmap(struct mm_struct *);
> +extern int vma_dup(struct vm_area_struct *vma, struct mm_struct *mm);
>  
>  static inline int check_data_rlimit(unsigned long rlim,
>  				    unsigned long new,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index efc5493203ae..15ead613714f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -564,7 +564,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			tmp->anon_vma = NULL;
>  		} else if (anon_vma_fork(tmp, mpnt))
>  			goto fail_nomem_anon_vma_fork;
> -		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> +		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT | VM_EXEC_KEEP);
>  		file = tmp->vm_file;
>  		if (file) {
>  			struct inode *inode = file_inode(file);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 59a4682ebf3f..be2ff53743c3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3279,6 +3279,53 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  	return NULL;
>  }
>  
> +int vma_dup(struct vm_area_struct *old_vma, struct mm_struct *mm)
> +{
> +	unsigned long npages;
> +	struct mm_struct *old_mm = old_vma->vm_mm;
> +	struct vm_area_struct *vma;
> +	int ret = -ENOMEM;
> +
> +	if (WARN_ON(old_vma->vm_file || old_vma->vm_ops))
> +		return -EINVAL;
> +
> +	vma = find_vma(mm, old_vma->vm_start);
> +	if (vma && vma->vm_start < old_vma->vm_end)
> +		return -EEXIST;
> +
> +	npages = vma_pages(old_vma);
> +	mm->total_vm += npages;

Why only total_vm? Where's exec_vm/stack_vm/data_vm?

> +
> +	vma = vm_area_dup(old_vma);
> +	if (!vma)
> +		goto fail_nomem;
> +
> +	ret = vma_dup_policy(old_vma, vma);
> +	if (ret)
> +		goto fail_nomem_policy;
> +
> +	vma->vm_mm = mm;
> +	ret = anon_vma_fork(vma, old_vma);
> +	if (ret)
> +		goto fail_nomem_anon_vma_fork;

Looks like a duplication of code from dup_mmap().
Any chance to get it in one place?

> +	vma->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP|VM_EXEC_KEEP);
> +	vma->vm_next = vma->vm_prev = NULL;

No need. vm_area_dup() takes care of it.

> +	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

The semantics of VM_EXEC_KEEP vs userfaultfd() deserve a detailed explanation.
I feel these flags have to be mutually exclusive.

> +	if (is_vm_hugetlb_page(vma))
> +		reset_vma_resv_huge_pages(vma);
> +	__insert_vm_struct(mm, vma);
> +	ret = copy_page_range(mm, old_mm, old_vma);
> +	return ret;
> +
> +fail_nomem_anon_vma_fork:
> +	mpol_put(vma_policy(vma));
> +fail_nomem_policy:
> +	vm_area_free(vma);
> +fail_nomem:
> +	return -ENOMEM;
> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
> -- 
> 1.8.3.1
> 
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-29 13:52   ` Kirill A. Shutemov
@ 2020-07-29 23:20     ` Anthony Yznaga
  0 siblings, 0 replies; 74+ messages in thread
From: Anthony Yznaga @ 2020-07-29 23:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare



On 7/29/20 6:52 AM, Kirill A. Shutemov wrote:
> On Mon, Jul 27, 2020 at 10:11:25AM -0700, Anthony Yznaga wrote:
>> [... full patch trimmed; quoted in full in the parent message ...]
>> +	npages = vma_pages(old_vma);
>> +	mm->total_vm += npages;
> Why only total_vm? Where's exec_vm/stack_vm/data_vm?
That was an oversight.  It will be fixed in the next version.
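
(For reference, a minimal sketch of fuller accounting using the existing
vm_stat_account() helper; whether it can be used here as-is is an
assumption:)

	npages = vma_pages(old_vma);
	/* vm_stat_account() bumps total_vm and whichever of
	 * exec_vm/stack_vm/data_vm matches the vma's flags */
	vm_stat_account(mm, old_vma->vm_flags, npages);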

>
>> +
>> +	vma = vm_area_dup(old_vma);
>> +	if (!vma)
>> +		goto fail_nomem;
>> +
>> +	ret = vma_dup_policy(old_vma, vma);
>> +	if (ret)
>> +		goto fail_nomem_policy;
>> +
>> +	vma->vm_mm = mm;
>> +	ret = anon_vma_fork(vma, old_vma);
>> +	if (ret)
>> +		goto fail_nomem_anon_vma_fork;
> Looks like a duplication of code from dup_mmap().
> Any chance to get it in one place?
I looked at that, but with the additional fork-specific code in
dup_mmap() the two are dissimilar enough that I think
readability would suffer.

>
>> +	vma->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP|VM_EXEC_KEEP);
>> +	vma->vm_next = vma->vm_prev = NULL;
> No need. vm_area_dup() takes care of it.
Will fix.

>
>> +	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> The semantics of VM_EXEC_KEEP vs userfaultfd() deserve a detailed explanation.
> I feel these flags have to be mutually exclusive.
Yes, will document this better.
I think it's okay to mark userfaultfd-enabled memory for preservation
as long as it is understood that the memory would need to be re-registered
with userfaultfd() after exec and restore, if desired.  Unless there's
a particular issue you see with this?

Thanks,
Anthony




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (7 preceding siblings ...)
  2020-07-28 14:23 ` Andy Lutomirski
@ 2020-07-30 15:22 ` Matthew Wilcox
  2020-07-30 15:27   ` Christian Brauner
  2020-07-30 15:59   ` Steven Sistare
  2020-07-31 19:41 ` Steven Sistare
  2021-07-08  9:52 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  10 siblings, 2 replies; 74+ messages in thread
From: Matthew Wilcox @ 2020-07-30 15:22 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare

On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I just realised that something else I'm working on might be a suitable
alternative to this.  Apologies for not realising it sooner.

http://www.wil.cx/~willy/linux/sileby.html

To use this, you'd mshare() the anonymous memory range, essentially
detaching the VMA from the current process's mm_struct and reparenting
it to this new mm_struct, which has an fd referencing it.

Then you call exec(), and the exec'ed task gets to call mmap() on that
new fd to attach the memory range to its own address space.

Presto!
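
(A rough sketch of that flow; the mshare() argument list below is a
guess, since the API is still a proposal -- see the link above for the
real interface:)

	/* old process: detach the anon range into its own mm and
	 * get an fd for it; the mshare() signature is guessed */
	int fd = mshare(addr, len);

	/* hand "fd" and "addr" to the new image, e.g. via argv,
	 * then exec */
	execve("/path/to/new/qemu", argv, envp);

	/* new image: reattach the range at the same address */
	void *mem = mmap(addr, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_FIXED, fd, 0);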


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:22 ` Matthew Wilcox
@ 2020-07-30 15:27   ` Christian Brauner
  2020-07-30 15:34     ` Matthew Wilcox
  2020-07-31  9:12     ` Stefan Hajnoczi
  2020-07-30 15:59   ` Steven Sistare
  1 sibling, 2 replies; 74+ messages in thread
From: Christian Brauner @ 2020-07-30 15:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > This patchset adds support for preserving an anonymous memory range across
> > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > sharing memory in this manner, as opposed to re-attaching to a named shared
> > memory segment, is to ensure it is mapped at the same virtual address in
> > the new process as it was in the old one.  An intended use for this is to
> > preserve guest memory for guests using vfio while qemu exec's an updated
> > version of itself.  By ensuring the memory is preserved at a fixed address,
> > vfio mappings and their associated kernel data structures can remain valid.
> > In addition, for the qemu use case, qemu instances that back guest RAM with
> > anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html

Just skimming: make it O_CLOEXEC by default. ;)

> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:27   ` Christian Brauner
@ 2020-07-30 15:34     ` Matthew Wilcox
  2020-07-30 15:54       ` Christian Brauner
  2020-07-31  9:12     ` Stefan Hajnoczi
  1 sibling, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2020-07-30 15:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > This patchset adds support for preserving an anonymous memory range across
> > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > memory segment, is to ensure it is mapped at the same virtual address in
> > > the new process as it was in the old one.  An intended use for this is to
> > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > vfio mappings and their associated kernel data structures can remain valid.
> > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> 
> Just skimming: make it O_CLOEXEC by default. ;)

I appreciate the suggestion, and it makes sense for many 'return an fd'
interfaces, but the point of mshare() is to, well, share.  So sharing
the fd with a child is a common use case, unlike, say, sharing a timerfd.
The only other reason to use mshare() is to pass the fd over a unix
socket to a non-child, and I submit that is far less common than wanting
to share with a child.
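
(As an aside, passing the fd to a non-child is the standard SCM_RIGHTS
dance; a compact sketch with error handling omitted, where unix_sock and
fd are assumed to exist:)

	struct msghdr msg = { 0 };
	struct iovec iov = { .iov_base = "x", .iov_len = 1 };
	char buf[CMSG_SPACE(sizeof(int))] = { 0 };
	struct cmsghdr *cmsg;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	sendmsg(unix_sock, &msg, 0);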



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:34     ` Matthew Wilcox
@ 2020-07-30 15:54       ` Christian Brauner
  0 siblings, 0 replies; 74+ messages in thread
From: Christian Brauner @ 2020-07-30 15:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On Thu, Jul 30, 2020 at 04:34:50PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> > On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > > This patchset adds support for preserving an anonymous memory range across
> > > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > > memory segment, is to ensure it is mapped at the same virtual address in
> > > > the new process as it was in the old one.  An intended use for this is to
> > > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > > vfio mappings and their associated kernel data structures can remain valid.
> > > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > > anonymous memory can be updated.
> > > 
> > > I just realised that something else I'm working on might be a suitable
> > > alternative to this.  Apologies for not realising it sooner.
> > > 
> > > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > Just skimming: make it O_CLOEXEC by default. ;)
> 
> I appreciate the suggestion, and it makes sense for many 'return an fd'
> interfaces, but the point of mshare() is to, well, share.  So sharing
> the fd with a child is a common use case, unlike, say, sharing a timerfd.

Fair point; from reading I thought the main reason was sharing after
fork(), but keeping an fd across exec() may be a good use case too.
(Fwiw, this very much looks like what the memfd_create() API should have
looked like, i.e. mshare() could possibly have encompassed both.)

> The only other reason to use mshare() is to pass the fd over a unix
> socket to a non-child, and I submit that is far less common than wanting
> to share with a child.

Well, we have use cases for that too. E.g. where we need to attach to an
existing pid namespace, which requires a first fork() of an intermediate
process so that the caller doesn't change its pid_for_children
namespace. Then we setns() to the target pid namespace in the
intermediate process, which changes its pid_for_children namespace such
that the next process forked will be a proper member of the target pid
namespace. Finally, we fork() the target process, which is now a full
member of the target pid namespace. We also set CLONE_PARENT so that
grandfather process == father process for the second process, and then
have the intermediate process exit. Some fds we only ever open or create
after the intermediate process has exited, and some fds we can only open
or create after the intermediate process has been created; we then send
those fds via SCM_RIGHTS (since we can't share the fdtable) to the final
process.
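
(A rough sketch of that dance, with error handling omitted; pidns_fd,
child_stack, and target_main() are assumed to exist:)

	/* intermediate child: the caller's own pid_for_children
	 * namespace is left untouched */
	if (fork() == 0) {
		/* join the target pid namespace; this only affects
		 * children of this process */
		setns(pidns_fd, CLONE_NEWPID);
		/* CLONE_PARENT makes the new task a sibling of this
		 * one, so it survives when the intermediate exits */
		clone(target_main, child_stack, CLONE_PARENT | SIGCHLD, NULL);
		_exit(0);
	}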

Christian


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:22 ` Matthew Wilcox
  2020-07-30 15:27   ` Christian Brauner
@ 2020-07-30 15:59   ` Steven Sistare
  2020-07-30 17:12     ` Matthew Wilcox
  1 sibling, 1 reply; 74+ messages in thread
From: Steven Sistare @ 2020-07-30 15:59 UTC (permalink / raw)
  To: Matthew Wilcox, Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar

On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html
> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!

To be suitable for the qemu use case, we need a guarantee that the same VA range
is available in the new process, with nothing else mapped there.  From your spec,
it sounds like the new process could do a series of unrelated mmaps which could
overlap the desired VA range before the sileby mmap(fd) is performed?

Also, we need to support updating legacy processes that already created anon segments.
We inject code that calls MADV_DOEXEC for such segments.
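
(Roughly, the injected stub does the following; the address, length, and
path are placeholders:)

	/* mark the live anon segment so it survives exec, then exec
	 * the updated binary */
	madvise(seg_addr, seg_len, MADV_DOEXEC);
	execve("/path/to/new/qemu", argv, envp);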

- Steve


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:59   ` Steven Sistare
@ 2020-07-30 17:12     ` Matthew Wilcox
  2020-07-30 17:35       ` Steven Sistare
  0 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2020-07-30 17:12 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > To use this, you'd mshare() the anonymous memory range, essentially
> > detaching the VMA from the current process's mm_struct and reparenting
> > it to this new mm_struct, which has an fd referencing it.
> > 
> > Then you call exec(), and the exec'ed task gets to call mmap() on that
> > new fd to attach the memory range to its own address space.
> > 
> > Presto!
> 
> To be suitable for the qemu use case, we need a guarantee that the same VA range
> is available in the new process, with nothing else mapped there.  From your spec,
> it sounds like the new process could do a series of unrelated mmaps which could
> overlap the desired VA range before the sileby mmap(fd) is performed?

That could happen.  E.g. libc might get its text segment mapped there
randomly.  I believe Khalid was working on a solution for reserving
memory ranges.

> Also, we need to support updating legacy processes that already created anon segments.
> We inject code that calls MADV_DOEXEC for such segments.

Yes, I was assuming you'd inject code that called mshare().

Actually, since you're injecting code, why do you need the kernel to
be involved?  You can mmap the new executable and any libraries it depends
upon, set up a new stack and jump to the main() entry point, all without
calling exec().  I appreciate it'd be a fair amount of code, but it'd all
be in userspace and you can probably steal / reuse code from ld.so (I'm
not familiar with the details of how setting up an executable is done).


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 17:12     ` Matthew Wilcox
@ 2020-07-30 17:35       ` Steven Sistare
  2020-07-30 17:49         ` Matthew Wilcox
  0 siblings, 1 reply; 74+ messages in thread
From: Steven Sistare @ 2020-07-30 17:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/30/2020 1:12 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
>> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
>>> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>>>> This patchset adds support for preserving an anonymous memory range across
>>>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>>>> sharing memory in this manner, as opposed to re-attaching to a named shared
>>>> memory segment, is to ensure it is mapped at the same virtual address in
>>>> the new process as it was in the old one.  An intended use for this is to
>>>> preserve guest memory for guests using vfio while qemu exec's an updated
>>>> version of itself.  By ensuring the memory is preserved at a fixed address,
>>>> vfio mappings and their associated kernel data structures can remain valid.
>>>> In addition, for the qemu use case, qemu instances that back guest RAM with
>>>> anonymous memory can be updated.
>>>
>>> I just realised that something else I'm working on might be a suitable
>>> alternative to this.  Apologies for not realising it sooner.
>>>
>>> http://www.wil.cx/~willy/linux/sileby.html
>>>
>>> To use this, you'd mshare() the anonymous memory range, essentially
>>> detaching the VMA from the current process's mm_struct and reparenting
>>> it to this new mm_struct, which has an fd referencing it.
>>>
>>> Then you call exec(), and the exec'ed task gets to call mmap() on that
>>> new fd to attach the memory range to its own address space.
>>>
>>> Presto!
>>
>> To be suitable for the qemu use case, we need a guarantee that the same VA range
>> is available in the new process, with nothing else mapped there.  From your spec,
>> it sounds like the new process could do a series of unrelated mmaps which could
>> overlap the desired VA range before the sileby mmap(fd) is performed?
> 
> That could happen.  eg libc might get its text segment mapped there
> randomly.  I believe Khalid was working on a solution for reserving
> memory ranges.

mshare + VA reservation is another possible solution.

Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

>> Also, we need to support updating legacy processes that already created anon segments.
>> We inject code that calls MADV_DOEXEC for such segments.
> 
> Yes, I was assuming you'd inject code that called mshare().

OK, mshare works on existing memory and builds a new vma.

> Actually, since you're injecting code, why do you need the kernel to
> be involved?  You can mmap the new executable and any libraries it depends
> upon, set up a new stack and jump to the main() entry point, all without
> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> be in userspace and you can probably steal / reuse code from ld.so (I'm
> not familiar with the details of how setting up an executable is done).

Duplicating all the work that the kernel and loader do to exec a process would
be error prone, require ongoing maintenance, and be redundant.  Better to define 
a small kernel extension and leave exec to the kernel.

- Steve


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 17:35       ` Steven Sistare
@ 2020-07-30 17:49         ` Matthew Wilcox
  2020-07-30 18:27           ` Steven Sistare
  0 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2020-07-30 17:49 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
> mshare + VA reservation is another possible solution.
> 
> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

We are.  This is the part of the review process where we explore other
solutions to the problem.

> >> Also, we need to support updating legacy processes that already created anon segments.
> >> We inject code that calls MADV_DOEXEC for such segments.
> > 
> > Yes, I was assuming you'd inject code that called mshare().
> 
> OK, mshare works on existing memory and builds a new vma.

Actually, reparents an existing VMA, and reuses the existing page tables.

> > Actually, since you're injecting code, why do you need the kernel to
> > be involved?  You can mmap the new executable and any libraries it depends
> > upon, set up a new stack and jump to the main() entry point, all without
> > calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> > be in userspace and you can probably steal / reuse code from ld.so (I'm
> > not familiar with the details of how setting up an executable is done).
> 
> Duplicating all the work that the kernel and loader do to exec a process would
> be error prone, require ongoing maintenance, and be redundant.  Better to define 
> a small kernel extension and leave exec to the kernel.

Either this is a one-off kind of thing, in which case it doesn't need
ongoing maintenance, or it's something with broad applicability, in
which case it can live as its own userspace project.  It could even
start off life as part of qemu and then fork into its own project.
The idea of tagging an ELF executable to say "I can cope with having
chunks of my address space provided to me by my executor" is ... odd.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 17:49         ` Matthew Wilcox
@ 2020-07-30 18:27           ` Steven Sistare
  2020-07-30 21:58             ` Eric W. Biederman
  0 siblings, 1 reply; 74+ messages in thread
From: Steven Sistare @ 2020-07-30 18:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>> mshare + VA reservation is another possible solution.
>>
>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
> 
> We are.  This is the part of the review process where we explore other
> solutions to the problem.
> 
>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>
>>> Yes, I was assuming you'd inject code that called mshare().
>>
>> OK, mshare works on existing memory and builds a new vma.
> 
> Actually, reparents an existing VMA, and reuses the existing page tables.
> 
>>> Actually, since you're injecting code, why do you need the kernel to
>>> be involved?  You can mmap the new executable and any libraries it depends
>>> upon, set up a new stack and jump to the main() entry point, all without
>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>> not familiar with the details of how setting up an executable is done).
>>
>> Duplicating all the work that the kernel and loader do to exec a process would
>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>> a small kernel extension and leave exec to the kernel.
> 
> Either this is a one-off kind of thing, in which case it doesn't need
> ongoing maintenance, or it's something with broad applicability, in
> which case it can live as its own userspace project.  It could even
> start off life as part of qemu and then fork into its own project.

exec will be enhanced over time in the kernel.  A separate user space implementation
would need to track that.

Reimplementing exec in userland would be a big gross mess.  Not a good solution when
we have simple and concise ways of solving the problem.

> The idea of tagging an ELF executable to say "I can cope with having
> chunks of my address space provided to me by my executor" is ... odd.

I don't disagree.  But it is useful.  We already pass a block of data containing
environment variables and arguments from one process to the next.  Preserving 
additional segments is not a big leap from there.

- Steve


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 18:27           ` Steven Sistare
@ 2020-07-30 21:58             ` Eric W. Biederman
  2020-07-31 14:57               ` Steven Sistare
  0 siblings, 1 reply; 74+ messages in thread
From: Eric W. Biederman @ 2020-07-30 21:58 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>> mshare + VA reservation is another possible solution.
>>>
>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>> 
>> We are.  This is the part of the review process where we explore other
>> solutions to the problem.
>> 
>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>
>>>> Yes, I was assuming you'd inject code that called mshare().
>>>
>>> OK, mshare works on existing memory and builds a new vma.
>> 
>> Actually, reparents an existing VMA, and reuses the existing page tables.
>> 
>>>> Actually, since you're injecting code, why do you need the kernel to
>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>> not familiar with the details of how setting up an executable is done).
>>>
>>> Duplicating all the work that the kernel and loader do to exec a process would
>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>> a small kernel extension and leave exec to the kernel.
>> 
>> Either this is a one-off kind of thing, in which case it doesn't need
>> ongoing maintenance, or it's something with broad applicability, in
>> which case it can live as its own userspace project.  It could even
>> start off life as part of qemu and then fork into its own project.
>
> exec will be enhanced over time in the kernel.  A separate user space implementation
> would need to track that.
>
> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
> we have simple and concise ways of solving the problem.

The interesting work is not done in exec.  The interesting work of
loading an executable is done in ld.so, and ld.so lives in userspace.

>> The idea of tagging an ELF executable to say "I can cope with having
>> chunks of my address space provided to me by my executor" is ... odd.
>
> I don't disagree.  But it is useful.  We already pass a block of data containing
> environment variables and arguments from one process to the next.  Preserving 
> additional segments is not a big leap from there.

But we randomize where that block lives.

It has been found to be very useful to have randomization by default and
you are arguing against that kind of randomization.



Here is another suggestion.

Have a very simple program that does:

	for (;;) {
		void *handle = dlopen("/my/real/program", RTLD_NOW);
		int (*real_main)(int, char **, char **) =
			(int (*)(int, char **, char **))dlsym(handle, "main");

		real_main(argc, argv, envp);
		dlclose(handle);
	}

With whatever obvious adjustments are needed to fit your usecase.

That should give the same level of functionality, be portable to all
unices, and not require you to duplicate code.  I believe it limits you
to not upgrading libc or librt, but that is a comparatively small
limitation.


Given that in general the interesting work is done in userspace and that
userspace has provided an interface for reusing that work already,
I don't see the justification for adding anything to exec at this point.


Eric


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:27   ` Christian Brauner
  2020-07-30 15:34     ` Matthew Wilcox
@ 2020-07-31  9:12     ` Stefan Hajnoczi
  1 sibling, 0 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2020-07-31  9:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare, Andrea Arcangeli, jasowang,
	Philippe Mathieu-Daudé,
	Peter Xu, Alex Williamson, Alexander Graf, vgoyal

Hi,
Sileby looks interesting! I had just written up the following idea which
seems similar but includes a mechanism for revoking mappings.

Alexander Graf recently brought up an idea that solves the following
problem:

When process A passes shared memory file descriptors to process B there
is no way for process A to revoke access or change page protection bits
after passing the fd.

I'll describe the idea (not sure if it's exactly what Alexander had in
mind).

Memory view driver
------------------
The memory view driver allows process A to control the page table
entries of an mmap in process B. It is a character device driver that
process A opens like this:

  int fd = open("/dev/memory-view", O_RDWR);

This returns a file descriptor to a new memory view.

Next process A sets the size of the memory view:

  ftruncate(fd, 16 * GiB);

The size determines how large the memory view will be. The size is a
virtual memory concept and does not consume resources (there is no
physical memory backing this).

Process A populates the memory view with ranges from file descriptors it
wishes to share. The file descriptor can be a shared memory file
descriptor:

  int memfd = memfd_create("guest-ram", 0);
  ftruncate(memfd, 32 * GiB);

  /* Map [8GB, 10GB) at 8GB into the memory view */
  struct memview_map_fd_info map_fd_info = {
      .fd = memfd,
      .fd_offset = 8 * GiB,
      .size = 2 * GiB,
      .mem_offset = 8 * GiB,
      .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

It is also possible to populate the memory view from the page cache:

  int filefd = open("big-file.iso", O_RDONLY);

  /* Map [4GB, 12GB) at 0B into the memory view */
  struct memview_map_fd_info map_fd_info = {
      .fd = filefd,
      .fd_offset = 4 * GiB,
      .size = 8 * GiB,
      .mem_offset = 0,
      .flags = MEMVIEW_MAP_READ,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

The memory view has now been populated like this:

Range (GiB)   Fd               Permissions
0-8           big-file.iso     read
8-10          guest-ram        read+write
10-16         <none>           <none>

Now process A gets the "view" file descriptor for this memory view. The
view file descriptor does not allow ioctls. It can be safely passed to
process B in the knowledge that process B can only mmap or close it:

  int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD);

  ...pass viewfd to process B...

Process B receives viewfd and mmaps it:

  void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED,
                   viewfd, 0);

When process B accesses a page in the mmap region the memory view
driver resolves the page fault by checking if the page is mapped to an
fd and what its permissions are.

For example, accessing the page at 4GB from the start of the memory view
is an access at 8GB into big-file.iso. That's because 8GB of
big-file.iso was mapped at 0 with fd_offset 4GB.

To summarize, there is one vma in process B and the memory view driver
maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by
process A.

Page protection bits are the AND of the mmap
PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's
MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the
mapping in question.

Does vmf_insert_mixed_prot() or a similar Linux API allow this?

Can the memory view driver map pages from fds without pinning the pages?

Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also
ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires
zapping affected process B ptes. When process B accesses those pages
again the fault handler will handle the page fault based on the latest
memory view layout.

If process B accesses a page with incorrect permissions or that has not
been configured by process A ioctl calls, a SIGSEGV/SIGBUS signal is
raised.

When process B uses mprotect(2) and other virtual memory syscalls it
is unable to increase page permissions. Instead it can only reduce them
because the pte protection bits are the AND of the mmap flags and the
memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC
flags.
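
In other words, something like this (sketch, using the MEMVIEW_MAP_*
flags proposed above):

  static int effective_prot(int mmap_prot, int view_flags)
  {
          int prot = PROT_NONE;

          if ((mmap_prot & PROT_READ) && (view_flags & MEMVIEW_MAP_READ))
                  prot |= PROT_READ;
          if ((mmap_prot & PROT_WRITE) && (view_flags & MEMVIEW_MAP_WRITE))
                  prot |= PROT_WRITE;
          if ((mmap_prot & PROT_EXEC) && (view_flags & MEMVIEW_MAP_EXEC))
                  prot |= PROT_EXEC;
          return prot;
  }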

Use cases
---------
How to use the memory view driver for vhost-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user and other out-of-process device emulation interfaces need a
way for the VMM to enforce the IOMMU mappings on the device emulation
process.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later. With the
memory view driver the VMM will pass one or more memory view fds instead
of the actual guest RAM fds. This allows the VMM to invoke
ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke
access.

How to use the memory view driver for virtio-fs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process creates a memory view for the device's
DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory
region so that the guest directly accesses the memory view.

Now virtiofsd can map portions of files into the DAX Window without
coordinating with the QEMU process. This simplifies the virtio-fs code
and should also improve DAX map/unmap performance.

Stefan


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 21:58             ` Eric W. Biederman
@ 2020-07-31 14:57               ` Steven Sistare
  2020-07-31 15:27                 ` Matthew Wilcox
  2020-08-03 15:28                 ` Eric W. Biederman
  0 siblings, 2 replies; 74+ messages in thread
From: Steven Sistare @ 2020-07-31 14:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>>> mshare + VA reservation is another possible solution.
>>>>
>>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>>>
>>> We are.  This is the part of the review process where we explore other
>>> solutions to the problem.
>>>
>>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>>
>>>>> Yes, I was assuming you'd inject code that called mshare().
>>>>
>>>> OK, mshare works on existing memory and builds a new vma.
>>>
>>> Actually, reparents an existing VMA, and reuses the existing page tables.
>>>
>>>>> Actually, since you're injecting code, why do you need the kernel to
>>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>>> not familiar with the details of how setting up an executable is done).
>>>>
>>>> Duplicating all the work that the kernel and loader do to exec a process would
>>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>>> a small kernel extension and leave exec to the kernel.
>>>
>>> Either this is a one-off kind of thing, in which case it doesn't need
>>> ongoing maintenance, or it's something with broad applicability, in
>>> which case it can live as its own userspace project.  It could even
>>> start off life as part of qemu and then fork into its own project.
>>
>> exec will be enhanced over time in the kernel.  A separate user space implementation
>> would need to track that.
>>
>> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
>> we have simple and concise ways of solving the problem.
> 
> The interesting work is not done in exec.  The interesting work of
> loading an executable is done in ld.so, and ld.so lives in userspace.

And yet a smart guy named Eric said:
  "There is just enough complexity in exec currently that our implementation of exec is 
  already teetering."
which suggests that a user-space re-implementation will also be complex.
In complexity lie bugs.

>>> The idea of tagging an ELF executable to say "I can cope with having
>>> chunks of my address space provided to me by my executor" is ... odd.
>>
>> I don't disagree.  But it is useful.  We already pass a block of data containing
>> environment variables and arguments from one process to the next.  Preserving 
>> additional segments is not a big leap from there.
> 
> But we randomize where that block lives.
> 
> It has been found to be very useful to have randomization by default and
> you are arguing against that kind of randomization.

In the normal use case the addresses of the preserved anon segments were
already randomized during allocation in the pre-exec process.  Preservation
across exec does not make the process less safe from an external attack.

Now, if the attacker injects content at known addresses by preserving and
exec'ing, then the post-exec process must decide whether to trust the
pre-exec process.  Let's prevent preservation when exec'ing setuid binaries.
That is not part of this patch series, but it is a good idea.  With that, the
uid does not change across exec.  Do I trust myself?  Yes, I do.  If the uid is trusted,
as it must be in the qemu use case which accesses system resources, then
that is sufficient.  If the exec raises caps, and the uid is not trusted,
then the post-exec process must do additional work to verify the pre-exec
credentials.  The method is left to the app.  We defined the ELF opt-in so
that processes that do not use preserved memory will never receive any and
will never need to deal with such verification.

Matthew's sileby/mshare proposal has the same issue.  If a process opts in
and mmap's an address in the shared region, then content becomes mapped at
a VA that was known to the pre-fork or pre-exec process.  Trust must still
be established.

> Here is another suggestion.
> 
> Have a very simple program that does:
> 
> 	for (;;) {
> 		handle = dlopen("/my/real/program");
> 		real_main = dlsym(handle, "main");
> 		real_main(argc, argv, envp);
> 		dlclose(handle);
> 	}
> 
> With whatever obvious adjustments are needed to fit your usecase.
> 
> That should give the same level of functionality, be portable to all
> unices, and not require you to duplicate code.  I believe it limits you
> to not upgrading libc or librt, but that is a comparatively small
> limitation.
> 
> 
> Given that in general the interesting work is done in userspace and that
> userspace has provided an interface for reusing that work already.
> I don't see the justification for adding anything to exec at this point. 
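
For reference, here is the loop above fleshed out into a compilable form (a
sketch only: it assumes /my/real/program is built as a position-independent
shared object that exports a main-like symbol, and it must be linked with
-ldl):

	/* Restarter loop per the suggestion above.  Assumes the real
	 * program is a PIC shared object exporting "main"; link with -ldl. */
	#include <dlfcn.h>
	#include <stdio.h>
	#include <stdlib.h>

	typedef int (*main_fn)(int, char **, char **);

	int main(int argc, char **argv, char **envp)
	{
		for (;;) {
			void *handle = dlopen("/my/real/program", RTLD_NOW);
			if (!handle) {
				fprintf(stderr, "dlopen: %s\n", dlerror());
				exit(1);
			}
			main_fn real_main = (main_fn)dlsym(handle, "main");
			if (!real_main) {
				fprintf(stderr, "dlsym: %s\n", dlerror());
				exit(1);
			}
			real_main(argc, argv, envp);
			dlclose(handle);
		}
	}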

Thanks for the suggestion.  That is clever, and would make a fun project,
but I would not trust it for production.  These few lines are just
the first of many that it would take to reset the environment to the
well-defined post-exec initial conditions that all executables expect,
and incrementally tearing down state will be prone to bugs.  Getting a
clean slate from a kernel exec is a much more reliable design.  The use
case is creating long-lived apps that never go down, and the simplest
implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
simple, and does not even require a new system call, and the kernel already
knows how to exec without bugs.

- Steve



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 14:57               ` Steven Sistare
@ 2020-07-31 15:27                 ` Matthew Wilcox
  2020-07-31 16:11                   ` Steven Sistare
  2020-08-03 15:28                 ` Eric W. Biederman
  1 sibling, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2020-07-31 15:27 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> Matthew's sileby/mshare proposal has the same issue.  If a process opts in
> and mmap's an address in the shared region, then content becomes mapped at
> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> be established.

It's up to the recipient whether they try to map it at the same address
or at a fresh address.  The intended use case is a "semi-shared" address
space between two processes (ie partway between a threaded, fully-shared
address space and a forked un-shared address space), in which case
there's a certain amount of trust and cooperation between the processes.

Your preservation-across-exec use-case might or might not need the
VMA to be mapped at the same address.  I don't know whether qemu stores
pointers in this VMA which are absolute within the qemu address space.
If it's just the emulated process's address space, then everything will
be absolute within its own address space and everything will be opaque
to qemu.  If qemu is storing its own pointers in it, then it has to be
mapped at the same address.

> > Here is another suggestion.
> > 
> > Have a very simple program that does:
> > 
> > 	for (;;) {
> > 		handle = dlopen("/my/real/program");
> > 		real_main = dlsym(handle, "main");
> > 		real_main(argc, argv, envp);
> > 		dlclose(handle);
> > 	}
> > 
> > With whatever obvious adjustments are needed to fit your usecase.
> > 
> > That should give the same level of functionality, be portable to all
> > unices, and not require you to duplicate code.  I believe it limits you
> > to not upgrading libc or librt, but that is a comparatively small
> > limitation.
> > 
> > 
> > Given that in general the interesting work is done in userspace and that
> > userspace has provided an interface for reusing that work already.
> > I don't see the justification for adding anything to exec at this point. 
> 
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.  Getting a
> clean slate from a kernel exec is a much more reliable design.  The use
> case is creating long-lived apps that never go down, and the simplest
> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
> simple, and does not even require a new system call, and the kernel already
> knows how to exec without bugs.

It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
code removes 200 lines of kernel code, I think I know which I prefer ...



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 15:27                 ` Matthew Wilcox
@ 2020-07-31 16:11                   ` Steven Sistare
  2020-07-31 16:56                     ` Jason Gunthorpe
  2020-07-31 17:23                     ` Matthew Wilcox
  0 siblings, 2 replies; 74+ messages in thread
From: Steven Sistare @ 2020-07-31 16:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
>> Matthew's sileby/mshare proposal has the same issue.  If a process opts in
>> and mmap's an address in the shared region, then content becomes mapped at
>> a VA that was known to the pre-fork or pre-exec process.  Trust must still
>> be established.
> 
> It's up to the recipient whether they try to map it at the same address
> or at a fresh address.  The intended use case is a "semi-shared" address
> space between two processes (ie partway between a threaded, fully-shared
> address space and a forked un-shared address space), in which case
> there's a certain amount of trust and cooperation between the processes.

Understood, but the recipient does map at the same addresses, which is the whole
point, because you want to share the page table.  The trust relationship is no
different than for the live update case.
 
> Your preservation-across-exec use-case might or might not need the
> VMA to be mapped at the same address.  

It does.  qemu registers memory with vfio which remembers the va's in kernel
metadata for the device.

> I don't know whether qemu stores
> pointers in this VMA which are absolute within the qemu address space.
> If it's just the emulated process's address space, then everything will
> be absolute within its own address space and everything will be opaque
> to qemu.  If qemu is storing its own pointers in it, then it has to be
> mapped at the same address.

qemu does not do the latter but that could be a nice way for apps to use
preserved memory.

>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  I believe it limits you
>>> to not upgrading libc or librt, but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.  Getting a
>> clean slate from a kernel exec is a much more reliable design.  The use
>> case is creating long-lived apps that never go down, and the simplest
>> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
>> simple, and does not even require a new system call, and the kernel already
>> knows how to exec without bugs.
> 
> It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> code removes 200 lines of kernel code, I think I know which I prefer ...

It will be *far* more than 4 lines.
Much of the 200 lines is for the ELF opt-in, and much of the ELF code is from
Anthony reviving an earlier patch that uses MAP_FIXED_NOREPLACE during segment setup.

- Steve





* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 16:11                   ` Steven Sistare
@ 2020-07-31 16:56                     ` Jason Gunthorpe
  2020-07-31 17:15                       ` Steven Sistare
  2020-07-31 17:23                     ` Matthew Wilcox
  1 sibling, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2020-07-31 16:56 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> > Your preservation-across-exec use-case might or might not need the
> > VMA to be mapped at the same address.  
> 
> It does.  qemu registers memory with vfio which remembers the va's in kernel
> metadata for the device.

Once the memory is registered with vfio the VA doesn't matter, vfio
will keep the iommu pointing at the same physical pages no matter
where they are mapped.

Jason



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 16:56                     ` Jason Gunthorpe
@ 2020-07-31 17:15                       ` Steven Sistare
  2020-07-31 17:48                         ` Jason Gunthorpe
  0 siblings, 1 reply; 74+ messages in thread
From: Steven Sistare @ 2020-07-31 17:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>> Your preservation-across-exec use-case might or might not need the
>>> VMA to be mapped at the same address.  
>>
>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>> metadata for the device.
> 
> Once the memory is registered with vfio the VA doesn't matter, vfio
> will keep the iommu pointing at the same physical pages no matter
> where they are mapped.

Yes, but there are other code paths that compute and use offsets between va and the
base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

- Steve



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 16:11                   ` Steven Sistare
  2020-07-31 16:56                     ` Jason Gunthorpe
@ 2020-07-31 17:23                     ` Matthew Wilcox
  1 sibling, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2020-07-31 17:23 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> > On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> >> Matthew's sileby/mshare proposal has the same issue.  If a process opts in
> >> and mmap's an address in the shared region, then content becomes mapped at
> >> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> >> be established.
> > 
> > It's up to the recipient whether they try to map it at the same address
> > or at a fresh address.  The intended use case is a "semi-shared" address
> > space between two processes (ie partway between a threaded, fully-shared
> > address space and a forked un-shared address space), in which case
> > there's a certain amount of trust and cooperation between the processes.
> 
> Understood, but the recipient does map at the same addresses, which is the whole
> point, because you want to share the page table.  The trust relationship is no
> different than for the live update case.

You don't have to map at the same address to share the page tables.
For example, on x86 if you share an 8GB region, that must be aligned at
1GB in both the donor and the recipient, but they need not be mapped at
the same address.
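
To make the alignment rule concrete, a minimal sketch (assuming x86-64
4-level paging, where one PUD entry maps 1GB, so page tables below the PUD
level are shareable whenever both mappings start on a 1GB boundary):

	#include <stdint.h>
	#include <stdio.h>

	#define PUD_SIZE (1ULL << 30)	/* one PUD entry maps 1GB on x86-64 */

	/* Page tables below the PUD level can be shared if both mappings
	 * begin on a 1GB boundary; the two VAs need not be equal. */
	static int pud_shareable(uint64_t donor_va, uint64_t recipient_va)
	{
		return (donor_va % PUD_SIZE == 0) &&
		       (recipient_va % PUD_SIZE == 0);
	}

	int main(void)
	{
		/* Different VAs, both 1GB-aligned: the tables can be shared. */
		printf("%d\n", pud_shareable(0x40000000ULL, 0x100000000ULL));
		return 0;
	}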

> > It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> > code removes 200 lines of kernel code, I think I know which I prefer ...
> 
> It will be *far* more than 4 lines.
> Much of the 200 lines is for the ELF opt-in, and much of the ELF code is from
> Anthony reviving an earlier patch that uses MAP_FIXED_NOREPLACE during segment setup.

It doesn't really matter how much of it is for the opt-in and how much
is for the exec path itself.  The MAP_FIXED_NOREPLACE patch is only net
+16 lines, so that's not the problem.



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 17:15                       ` Steven Sistare
@ 2020-07-31 17:48                         ` Jason Gunthorpe
  2020-07-31 17:55                           ` Steven Sistare
  0 siblings, 1 reply; 74+ messages in thread
From: Jason Gunthorpe @ 2020-07-31 17:48 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> > On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> >>> Your preservation-across-exec use-case might or might not need the
> >>> VMA to be mapped at the same address.  
> >>
> >> It does.  qemu registers memory with vfio which remembers the va's in kernel
> >> metadata for the device.
> > 
> > Once the memory is registered with vfio the VA doesn't matter, vfio
> > will keep the iommu pointing at the same physical pages no matter
> > where they are mapped.
> 
> Yes, but there are other code paths that compute and use offsets between va and the
> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

Maybe you could fix vfio instead of having this adventure, if vfio is
the only motivation.

Jason



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 17:48                         ` Jason Gunthorpe
@ 2020-07-31 17:55                           ` Steven Sistare
  0 siblings, 0 replies; 74+ messages in thread
From: Steven Sistare @ 2020-07-31 17:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On 7/31/2020 1:48 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
>> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
>>> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>>>> Your preservation-across-exec use-case might or might not need the
>>>>> VMA to be mapped at the same address.  
>>>>
>>>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>>>> metadata for the device.
>>>
>>> Once the memory is registered with vfio the VA doesn't matter, vfio
>>> will keep the iommu pointing at the same physical pages no matter
>>> where they are mapped.
>>
>> Yes, but there are other code paths that compute and use offsets between va and the
>> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.
> 
> Maybe you could fix vfio instead of having this adventure, if vfio is
> the only motivation.

Maybe.  We still need to preserve an anonymous segment, though.  MADV_DOEXEC, or mshare,
or something else.  And I think the ability to preserve memory containing pointers to itself
is an interesting use case, though not ours.

- Steve



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (8 preceding siblings ...)
  2020-07-30 15:22 ` Matthew Wilcox
@ 2020-07-31 19:41 ` Steven Sistare
  2021-07-08  9:52 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  10 siblings, 0 replies; 74+ messages in thread
From: Steven Sistare @ 2020-07-31 19:41 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar

On 7/27/2020 1:11 PM, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I forgot to mention, our use case is not just theoretical.  It has been implemented
and is pretty cool (but I am biased).  The pause time for the guest is in the
100 - 200 msec range.  We submitted qemu patches for review based on the MADV_DOEXEC
proposal.  In case you are curious:
  https://lore.kernel.org/qemu-devel/1596122076-341293-1-git-send-email-steven.sistare@oracle.com/

- Steve




* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 14:57               ` Steven Sistare
  2020-07-31 15:27                 ` Matthew Wilcox
@ 2020-08-03 15:28                 ` Eric W. Biederman
  2020-08-03 15:42                   ` James Bottomley
  2020-08-03 19:29                   ` Steven Sistare
  1 sibling, 2 replies; 74+ messages in thread
From: Eric W. Biederman @ 2020-08-03 15:28 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>> Here is another suggestion.
>> 
>> Have a very simple program that does:
>> 
>> 	for (;;) {
>> 		handle = dlopen("/my/real/program");
>> 		real_main = dlsym(handle, "main");
>> 		real_main(argc, argv, envp);
>> 		dlclose(handle);
>> 	}
>> 
>> With whatever obvious adjustments are needed to fit your usecase.
>> 
>> That should give the same level of functionality, be portable to all
>> unices, and not require you to duplicate code.  I believe it limits you
>> to not upgrading libc or librt, but that is a comparatively small
>> limitation.
>> 
>> 
>> Given that in general the interesting work is done in userspace and that
>> userspace has provided an interface for reusing that work already.
>> I don't see the justification for adding anything to exec at this point. 
>
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.

Agreed.

> Getting a clean slate from a kernel exec is a much more reliable
> design.

Except you are explicitly throwing that out the window, by preserving
VMAs.  You very much need to have a clean, bug-free shutdown to pass VMAs
reliably.

> The use case is creating long-lived apps that never go down, and the
> simplest implementation will have the fewest bugs and is the best.
> MADV_DOEXEC is simple, and does not even require a new system call,
> and the kernel already knows how to exec without bugs.

*ROFL*  I wish the kernel knew how to exec things without bugs.
The bugs are hard to hit, but the ones I am aware of are not
straightforward to fix.

MADV_DOEXEC is not conceptually simple.  It completely violates the
guarantees that exec is known to make about the contents of the memory
of the new process.  This makes it very difficult to reason about.  Nor
will MADV_DOEXEC be tested very much as it has only one or two users.
Which means in the fullness of time it is likely someone will change
something that will break the implementation subtly, and the bug report
probably won't come in for 3 years, or maybe a decade.  At which point
it won't be clear if the bug even can be fixed as something else might
rely on it.

Why does live migration between one qemu process and another
qemu process on the same machine not work for this use case?

Just reusing live migration would seem to be the simplest path of all,
as the code is already implemented.  Further if something goes wrong
with the live migration you can fallback to the existing process.  With
exec there is no fallback if the new version does not properly support
the handoff protocol of the old version.

Eric



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-08-03 15:28                 ` Eric W. Biederman
@ 2020-08-03 15:42                   ` James Bottomley
  2020-08-03 20:03                     ` Steven Sistare
       [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
  2020-08-03 19:29                   ` Steven Sistare
  1 sibling, 2 replies; 74+ messages in thread
From: James Bottomley @ 2020-08-03 15:42 UTC (permalink / raw)
  To: Eric W. Biederman, Steven Sistare
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
[...]
> Why does live migration between one qemu process and
> another qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of
> all, as the code is already implemented.  Further if something goes
> wrong with the live migration you can fallback to the existing
> process.  With exec there is no fallback if the new version does not
> properly support the handoff protocol of the old version.

Actually, could I ask this another way: the other patch set you sent to
the KVM list was to snapshot the VM to a PKRAM capsule preserved across
kexec using zero copy for extremely fast save/restore.  The original
idea was to use this as part of a CRIU based snapshot, kexec to new
system, restore.  However, why can't you do a local snapshot, restart
qemu, restore using the PKRAM capsule to achieve exactly the same as
MADV_DOEXEC does but using a system that's easy to reason about?  It
may be slightly slower, but I think we're still talking milliseconds.

James




* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-08-03 15:28                 ` Eric W. Biederman
  2020-08-03 15:42                   ` James Bottomley
@ 2020-08-03 19:29                   ` Steven Sistare
  1 sibling, 0 replies; 74+ messages in thread
From: Steven Sistare @ 2020-08-03 19:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 8/3/2020 11:28 AM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  I believe it limits you
>>> to not upgrading libc or librt, but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.
> 
> Agreed.
> 
>> Getting a clean slate from a kernel exec is a much more reliable
>> design.
> 
> Except you are explicitly throwing that out the window, by preserving
> VMAs.  You very much need to have a clean bug free shutdown to pass VMAs
> reliably.

Sure.  The whole community relies on you and others to provide a bug-free exec.

>> The use case is creating long-lived apps that never go down, and the
>> simplest implementation will have the fewest bugs and is the best.
>> MADV_DOEXEC is simple, and does not even require a new system call,
>> and the kernel already knows how to exec without bugs.
> 
> *ROFL*  I wish the kernel knew how to exec things without bugs.

Essentially you are saying you would argue against any enhancement to exec.
Surely that is too high a bar.  We must continue to evolve and innovate, and
balance risk against reward.  This use case matters to our business a lot,
and to others as well, see below.  That is the reward.  I feel you are 
overstating the risk.  Surely there is some early point in the development
cycle of some release where this can be integrated and get enough test
time and soak time to be proven reliable.

> The bugs are hard to hit, but the ones I am aware of are not
> straightforward to fix.
> 
> MADV_DOEXEC is not conceptually simple.  It completely violates the
> guarantees that exec is known to make about the contents of the memory
> of the new process.  This makes it very difficult to reason about.  

I am having trouble seeing the difficulty.  Perhaps I am too familiar with
it, but the semantics are few and easy to explain, and it does not introduce
new concepts: the post-exec process is born with a few more mappings than
previously, and non-fixed further mmaps choose addresses in the holes.

> Nor
> will MADV_DOEXEC be tested very much as it has only one or two users.
> Which means in the fullness of time it is likely someone will change
> something that will break the implementation subtly, and the bug report
> probably won't come in for 3 years, or maybe a decade.  At which point
> it won't be clear if the bug even can be fixed as something else might
> rely on it.

That's on us; we need to provide kernel tests and be diligent about testing
new releases.  This matters to our business and we will do so.

> Why does live migration between one qemu process and another
> qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of all,
> as the code is already implemented.  Further if something goes wrong
> with the live migration you can fallback to the existing process.  With
> exec there is no fallback if the new version does not properly support
> the handoff protocol of the old version.

This is less resource intensive than live migration.  The latter ties up two
hosts, consumes lots of memory and network bandwidth, may take a long time
to converge on a busy system, and is infeasible for guests with a huge amount
of local storage (which we call dense I/O shapes).  Live update takes less than
1 second total, and the guest pause time is 100 - 200 msecs.  It is a very
attractive solution that other cloud vendors have implemented as well, with
their own private modifications to exec and fork.  We have been independently
working in this area, and we are offering our implementation to the community.

- Steve



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-08-03 15:42                   ` James Bottomley
@ 2020-08-03 20:03                     ` Steven Sistare
       [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Steven Sistare @ 2020-08-03 20:03 UTC (permalink / raw)
  To: James Bottomley, Eric W. Biederman
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 8/3/2020 11:42 AM, James Bottomley wrote:
> On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> [...]
>> Why does live migration between one qemu process and
>> another qemu process on the same machine not work for this use case?
>>
>> Just reusing live migration would seem to be the simplest path of
>> all, as the code is already implemented.  Further if something goes
>> wrong with the live migration you can fallback to the existing
>> process.  With exec there is no fallback if the new version does not
>> properly support the handoff protocol of the old version.
> 
> Actually, could I ask this another way: the other patch set you sent to
> the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> kexec using zero copy for extremely fast save/restore.  The original
> idea was to use this as part of a CRIU based snapshot, kexec to new
> system, restore.  However, why can't you do a local snapshot, restart
> qemu, restore using the PKRAM capsule to achieve exactly the same as
> MADV_DOEXEC does but using a system that's easy to reason about?  It
> may be slightly slower, but I think we're still talking milliseconds.

Hi James, good to hear from you.  PKRAM or SysV shm could be used for
a restart in that manner, but it would only support SR-IOV guests if the
guest exports an agent that supports suspend-to-ram, and if all guest
drivers support the suspend-to-ram method.  I have done this using a Linux
guest and qemu guest agent, and IIRC the guest pause time is 500 - 1000 msec.
With MADV_DOEXEC, pause time is 100 - 200 msec.  The pause time is a handful
of seconds if the guest uses an NVMe drive because CC.SHN takes so long
to persist metadata to stable storage.

We could instead pass vfio descriptors from the old process to a 3rd-party escrow
process and pass them back to the new qemu process, but the shm that vfio has
already registered must be remapped at the same VA as the previous process, and
there is no interface to guarantee that.  MAP_FIXED blows away existing mappings
and breaks the app.  MAP_FIXED_NOREPLACE respects existing mappings but cannot map
the shm and breaks the app.  Adding a feature that reserves VAs would fix that; we
have experimented with one.  Fixing the vfio kernel implementation to not use the
original VA base would also work, but I don't know how doable/difficult that would be.
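
To illustrate the MAP_FIXED_NOREPLACE behavior described above, a minimal
sketch (MAP_FIXED_NOREPLACE needs _GNU_SOURCE and a 4.17+ kernel):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 4096;
		/* The first mapping claims a kernel-chosen address. */
		void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* Placing a second mapping at the same VA without clobbering
		 * the first fails, which is what breaks the escrow scheme. */
		void *b = mmap(a, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
			       -1, 0);
		if (b == MAP_FAILED && errno == EEXIST)
			printf("VA %p already taken, as expected\n", a);
		return 0;
	}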

Both solutions would require a qemu instance to be stopped and relaunched using shm
as guest ram, and its guest rebooted, so they do not let us update legacy
already-running instances that use anon memory.  That problem solves itself if we
get these RFEs into Linux and qemu, and eventually users shut down the legacy
instances, but that takes years and we need to do it sooner.

- Steve



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
       [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
@ 2020-08-04 11:13                       ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2020-08-04 11:13 UTC (permalink / raw)
  To: David Laight
  Cc: 'James Bottomley',
	Eric W. Biederman, Steven Sistare, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar

On Tue, Aug 04, 2020 at 08:44:42AM +0000, David Laight wrote:
> From: James Bottomley
> > Sent: 03 August 2020 16:43
> > 
> > On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> > [...]
> > > Why does live migration between one qemu process and
> > > another qemu process on the same machine not work for this use case?
> > >
> > > Just reusing live migration would seem to be the simplest path of
> > > all, as the code is already implemented.  Further if something goes
> > > wrong with the live migration you can fallback to the existing
> > > process.  With exec there is no fallback if the new version does not
> > > properly support the handoff protocol of the old version.
> > 
> > Actually, could I ask this another way: the other patch set you sent to
> > the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> > kexec using zero copy for extremely fast save/restore.  The original
> > idea was to use this as part of a CRIU based snapshot, kexec to new
> > system, restore.  However, why can't you do a local snapshot, restart
> > qemu, restore using the PKRAM capsule to achieve exactly the same as
> > MADV_DOEXEC does but using a system that's easy to reason about?  It
> > may be slightly slower, but I think we're still talking milliseconds.
> 
> 
> I've had another idea (that is probably impossible...).
> What about a 'reverse mmap' operation:
> something that creates an fd whose contents are a chunk of the
> process's address space.

http://www.wil.cx/~willy/linux/sileby.html



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (9 preceding siblings ...)
  2020-07-31 19:41 ` Steven Sistare
@ 2021-07-08  9:52 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-07-08 12:48   ` Steven Sistare
  10 siblings, 1 reply; 74+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2021-07-08  9:52 UTC (permalink / raw)
  To: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare, Gonglei (Arei)

Hi Anthony and Steven,

On 2020/7/28 1:11, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.
> 

We have a requirement like yours, but ours seems more complex. We want to
isolate some memory regions from the VM's memory space and then start a child
process that will use these memory regions.

I've written a draft to support this feature, but I find that my draft is
quite similar to yours.

It seems that you've already abandoned this patchset. Why?

> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
> 
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
> 
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
> 
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
> 

-- 
Sincerely yours,
Longpeng(Mike)



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-07-08  9:52 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2021-07-08 12:48   ` Steven Sistare
  2021-07-12  1:05     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  0 siblings, 1 reply; 74+ messages in thread
From: Steven Sistare @ 2021-07-08 12:48 UTC (permalink / raw)
  To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, Gonglei (Arei)

On 7/8/2021 5:52 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> Hi Anthony and Steven,
> 
> On 2020/7/28 1:11, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> We have a requirement like yours, but ours seems more complex. We want to
> isolate some memory regions from the VM's memory space and then start a child
> process that will use these memory regions.
> 
> I've written a draft to support this feature, but I find that my draft is
> quite similar to yours.
> 
> It seems that you've already abandoned this patchset. Why?

Hi Longpeng,
  The reviewers did not like the proposal for several reasons, but the showstopper
was that they did not want to add complexity to the exec path in the kernel.  You
can read the email archive for details.

We solved part of our problem by adding new vfio interfaces: VFIO_DMA_UNMAP_FLAG_VADDR
and VFIO_DMA_MAP_FLAG_VADDR.  That solves the vfio problem for shared memory, but not
for mmap MAP_ANON memory.
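
For the curious, a sketch of how those flags are meant to be used for a live
update, assuming the interfaces as later merged into <linux/vfio.h> (error
handling omitted; container_fd, iova, size, and new_vaddr are placeholders):

	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static void update_vaddr(int container_fd, __u64 iova, __u64 size,
				 __u64 new_vaddr)
	{
		struct vfio_iommu_type1_dma_unmap unmap = {
			.argsz = sizeof(unmap),
			.flags = VFIO_DMA_UNMAP_FLAG_VADDR, /* invalidate vaddr only */
			.iova  = iova,
			.size  = size,
		};
		struct vfio_iommu_type1_dma_map map = {
			.argsz = sizeof(map),
			.flags = VFIO_DMA_MAP_FLAG_VADDR,   /* supply the new vaddr */
			.vaddr = new_vaddr,
			.iova  = iova,
			.size  = size,
		};

		/* The IOMMU keeps pointing at the same physical pages
		 * throughout; only the user VA bookkeeping is updated. */
		ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
		ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
	}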

- Steve

>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>
> 



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-07-08 12:48   ` Steven Sistare
@ 2021-07-12  1:05     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-07-12  1:30       ` Matthew Wilcox
  0 siblings, 1 reply; 74+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2021-07-12  1:05 UTC (permalink / raw)
  To: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm
  Cc: Gonglei (Arei), willy

Hi Steven,

On 2021/7/8 20:48, Steven Sistare wrote:
> On 7/8/2021 5:52 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>> Hi Anthony and Steven,
>>
>> On 2020/7/28 1:11, Anthony Yznaga wrote:
>>> This patchset adds support for preserving an anonymous memory range across
>>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>>> sharing memory in this manner, as opposed to re-attaching to a named shared
>>> memory segment, is to ensure it is mapped at the same virtual address in
>>> the new process as it was in the old one.  An intended use for this is to
>>> preserve guest memory for guests using vfio while qemu exec's an updated
>>> version of itself.  By ensuring the memory is preserved at a fixed address,
>>> vfio mappings and their associated kernel data structures can remain valid.
>>> In addition, for the qemu use case, qemu instances that back guest RAM with
>>> anonymous memory can be updated.
>>
>> We have a requirement like yours, but ours seems more complex. We want to
>> isolate some memory regions from the VM's memory space and then start a child
>> process that will use these memory regions.
>>
>> I've written a draft to support this feature, but I find that my draft is
>> quite similar to yours.
>>
>> It seems that you've already abandoned this patchset. Why?
> 
> Hi Longpeng,
>   The reviewers did not like the proposal for several reasons, but the showstopper
> was that they did not want to add complexity to the exec path in the kernel.  You
> can read the email archive for details.
> 
I've read the archive and done some study these days; maybe this solution is
more suitable for my use case.

Let me describe my use case more clearly (just ignore if you're not interested
in it):

1. Prog A mmap()s 4GB of memory (anon or file-mapping); suppose the allocated VA
range is [0x40000000,0x140000000)

2. Prog A specifies that [0x48000000,0x50000000) and [0x80000000,0x100000000) will be
shared by its child.

3. Prog A fork()s Prog B and then Prog B exec()s a new ELF binary.

4. Prog B notices the shared ranges (e.g. by input parameters or ...) and remaps
them to a contiguous VA range.

Do you have any suggestions?
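
For concreteness, a sketch of steps 1-3 using the MADV_DOEXEC advice proposed
in this RFC (the advice value was never merged, so the constant below is a
placeholder; step 4 is not covered, since the proposal preserves ranges in
place rather than remapping them):

	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MADV_DOEXEC
	#define MADV_DOEXEC 22		/* placeholder value, not in UAPI */
	#endif

	int main(void)
	{
		/* 1. Allocate the 4GB area; the kernel picks the base VA. */
		char *base = mmap(NULL, 4UL << 30, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* 2. Mark the two subranges from step 2 for preservation,
		 * expressed as offsets from the base of the 4GB area. */
		madvise(base + 0x08000000UL, 0x08000000UL, MADV_DOEXEC);
		madvise(base + 0x40000000UL, 0x80000000UL, MADV_DOEXEC);

		/* 3. fork() and exec() the child binary, which must opt in
		 * via the ELF note; the marked ranges then reappear at the
		 * same VAs in the new image. */
		if (fork() == 0)
			execl("/path/to/prog_b", "prog_b", (char *)NULL);
		return 0;
	}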

> We solved part of our problem by adding new vfio interfaces: VFIO_DMA_UNMAP_FLAG_VADDR
> and VFIO_DMA_MAP_FLAG_VADDR.  That solves the vfio problem for shared memory, but not
> for mmap MAP_ANON memory.
> 
> - Steve
> 
>>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>>> clobber existing VMAS, and remove assumptions that the stack is the only
>>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>>> and could be considered on its own.
>>>
>>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>>> using an ELF note.
>>>
>>> Anthony Yznaga (5):
>>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>>   mm: introduce VM_EXEC_KEEP
>>>   exec, elf: require opt-in for accepting preserved mem
>>>   mm: introduce MADV_DOEXEC
>>>
>>>  arch/x86/Kconfig                       |   1 +
>>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>>  fs/exec.c                              |  33 +++++-
>>>  include/linux/binfmts.h                |   7 +-
>>>  include/linux/mm.h                     |   5 +
>>>  include/uapi/asm-generic/mman-common.h |   3 +
>>>  kernel/fork.c                          |   2 +-
>>>  mm/madvise.c                           |  25 +++++
>>>  mm/mmap.c                              |  47 ++++++++
>>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>>
>>
> .
> 



* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-07-12  1:05     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2021-07-12  1:30       ` Matthew Wilcox
  2021-07-13  0:57         ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  0 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2021-07-12  1:30 UTC (permalink / raw)
  To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm, Gonglei (Arei)

On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> Let me describe my use case more clearly (just ignore if you're not interested
> in it):
> 
> 1. Prog A mmap()s 4GB of memory (anon or file-mapping); suppose the allocated VA
> range is [0x40000000,0x140000000)
> 
> 2. Prog A specifies that [0x48000000,0x50000000) and [0x80000000,0x100000000) will be
> shared by its child.
> 
> 3. Prog A fork()s Prog B and then Prog B exec()s a new ELF binary.
> 
> 4. Prog B notices the shared ranges (e.g. by input parameters or ...) and remaps
> them to a contiguous VA range.

This is dangerous.  There must be an active step for Prog B to accept
Prog A's ranges into its address space.  Otherwise Prog A could almost
completely fill Prog B's address space and so control where Prog B
places its mappings.  It could also provoke a latent bug in Prog B
if it doesn't handle address space exhaustion gracefully.

I had a proposal to handle this.  Would it meet your requirements?
https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/



* RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-07-12  1:30       ` Matthew Wilcox
@ 2021-07-13  0:57         ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-08-13 19:49           ` Khalid Aziz
  0 siblings, 1 reply; 74+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2021-07-13  0:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm, Gonglei (Arei)

Hi Matthew,

> -----Original Message-----
> From: Matthew Wilcox [mailto:willy@infradead.org]
> Sent: Monday, July 12, 2021 9:30 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>
> Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> 
> On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
> Infrastructure Service Product Dept.) wrote:
> > Let me describe my use case more clearly (just ignore if you're not
> > interested in it):
> >
> > 1. Prog A mmap()s 4GB of memory (anon or file-mapping); suppose the
> > allocated VA range is [0x40000000,0x140000000)
> >
> > 2. Prog A specifies that [0x48000000,0x50000000) and
> > [0x80000000,0x100000000) will be shared by its child.
> >
> > 3. Prog A fork()s Prog B and then Prog B exec()s a new ELF binary.
> >
> > 4. Prog B notices the shared ranges (e.g. by input parameters or ...)
> > and remaps them to a contiguous VA range.
> 
> This is dangerous.  There must be an active step for Prog B to accept Prog A's
> ranges into its address space.  Otherwise Prog A could almost completely fill
> Prog B's address space and so control where Prog B places its mappings.  It
> could also provoke a latent bug in Prog B if it doesn't handle address space
> exhaustion gracefully.
> 
> I had a proposal to handle this.  Would it meet your requirements?
> https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/

I noticed your proposal of project Sileby and I think it can meet Steven's requirement, but I'm not sure whether it's suitable for mine because there's no sample code yet. Is it in progress?

According to the abstract of Sileby, I have two questions:
1. Do you plan to support file-mapping memory sharing? e.g. Prog A's 4G memory is backed with 2M hugetlb pages.
2. Does each mshare fd contain only one shared VMA? For a large-memory process (1T~4T in our env), there may be hundreds of memory ranges that need to be shared; won't this take too much fd space?




* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-07-13  0:57         ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2021-08-13 19:49           ` Khalid Aziz
  2021-08-14 20:07             ` David Laight
                               ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Khalid Aziz @ 2021-08-13 19:49 UTC (permalink / raw)
  To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	Matthew Wilcox
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm, Gonglei (Arei)

On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure
Service Product Dept.) wrote:
> Hi Matthew,
> 
> > -----Original Message-----
> > From: Matthew Wilcox [mailto:willy@infradead.org]
> > Sent: Monday, July 12, 2021 9:30 AM
> > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > <longpeng2@huawei.com>
> > Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> > <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> > Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> > 
> > On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
> > Infrastructure Service Product Dept.) wrote:
> > > Let me describe my use case more clearly (just ignore if you're not
> > > interested in it):
> > > 
> > > 1. Prog A mmap()s 4GB of memory (anon or file-mapping); suppose the
> > > allocated VA range is [0x40000000,0x140000000)
> > > 
> > > 2. Prog A specifies that [0x48000000,0x50000000) and
> > > [0x80000000,0x100000000) will be shared by its child.
> > > 
> > > 3. Prog A fork()s Prog B and then Prog B exec()s a new ELF binary.
> > > 
> > > 4. Prog B notices the shared ranges (e.g. by input parameters or
> > > ...) and remaps them to a contiguous VA range.
> > 
> > This is dangerous.  There must be an active step for Prog B to accept
> > Prog A's
> > ranges into its address space.  Otherwise Prog A could almost
> > completely fill
> > Prog B's address space and so control where Prog B places its
> > mappings.  It
> > could also provoke a latent bug in Prog B if it doesn't handle
> > address space
> > exhaustion gracefully.
> > 
> > I had a proposal to handle this.  Would it meet your requirements?
> > https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/
> 
> I noticed your proposal of project Sileby and I think it can meet
> Steven's requirement, but I'm not sure whether it's suitable for mine
> because there's no sample code yet. Is it in progress?

Hi Mike,

I am working on refining the ideas from project Sileby. I am also
working on designing the implementation. Since the original concept,
the mshare API has evolved further. Here is what it looks like:

The mshare API consists of two system calls - mshare() and
mshare_unlink()

mshare
======

int mshare(char *name, void *addr, size_t length, int oflags,
           mode_t mode)

mshare() creates and opens a new, or opens an existing shared memory
area that will be shared at PTE level. name refers to shared object
name that exists under /dev/mshare (this is subject to change. There
might be better ways to manage the names for mshare'd areas). addr is
the starting address of this shared memory area and length is the size
of this area. oflags can be one of:

    O_RDONLY opens shared memory area for read only access by everyone
    O_RDWR opens shared memory area for read and write access
    O_CREAT creates the named shared memory area if it does not exist
    O_EXCL If O_CREAT was also specified, and a shared memory area
        exists with that name, return an error.

mode represents the creation mode for the shared object under
/dev/mshare.

Return Value
------------

mshare() returns a file descriptor. A read from this file descriptor
returns two long values - (1) starting address, and (2) size of the
shared memory area.

Notes
-----

PTEs are shared at the pgdir level, and hence this imposes the following
requirements on the address and size given to mshare():

    - Starting address must be aligned to pgdir size (512GB on x86_64)
    - Size must be a multiple of pgdir size
    - Any mappings created in this address range at any time become
    shared automatically
    - Shared address range can have unmapped addresses in it. Any
    access to an unmapped address will result in SIGBUS

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in the
shared area. All other processes that want mshare'd access to this
memory area can do so by calling mshare(). After this call, the
address range given by mshare becomes a shared range in its address
space. Anonymous mappings will be shared and not COWed.


mshare_unlink
=============

int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink() which removes the shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

Return Value
------------

mshare_unlink() returns 0 on success or -1 on error.


Example
=======

A process can create an mshare'd area and map objects into it as
follows:

    fd = mshare("junk",  TB(1), GB(512), O_CREAT|O_RDWR, 0600);

    /* Map objects in the shared address space and/or Write data */

    mshare_unlink("junk");

Another process can then access this shared memory area with another
call to mshare():

    fd = mshare("junk", TB(1), GB(512), O_RDWR, 0600);

    /* Read and write data in the TB(1)-(TB(1)+GB(512)) range */

    mshare_unlink("junk");

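(For illustration, a minimal sketch of the second process's side under the
proposed API; mshare()/mshare_unlink() are proposed syscalls that do not
exist in mainline, and the TB()/GB() helpers are assumptions:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define GB(x) ((unsigned long)(x) << 30)
    #define TB(x) ((unsigned long)(x) << 40)

    /* Proposed API, assumed declared elsewhere. */
    int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode);
    int mshare_unlink(char *name);

    int main(void)
    {
        long info[2];    /* (1) starting address, (2) size, as described above */

        int fd = mshare("junk", (void *)TB(1), GB(512), O_RDWR, 0600);
        if (fd < 0)
            return 1;

        /* A read from the fd reports where the shared range lives. */
        if (read(fd, info, sizeof(info)) != sizeof(info))
            return 1;
        printf("shared range: %#lx, size %#lx\n", info[0], info[1]);

        /* ... read and write data in [info[0], info[0] + info[1]) ... */

        mshare_unlink("junk");
        return 0;
    }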

> 
> According to the abstract of Sileby, I have two questions:
> 1. Would you plan to support the file-mapping memory sharing ? e.g.
> Prog A's 4G memory is backend with 2M hugetlb.

Yes, file-mapped memory sharing support is planned.

> 2. Does each mshare fd only containe one sharing VMA ? For large
> memory process (1T~4T in our env), maybe there is hundreds of memory
> ranges need to be shared, this will take too much fd space if so ?
> 

No, each fd can support all VMAs covered by the address range with a
size that is a multiple of pgdir size.

--
Khalid



^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-13 19:49           ` Khalid Aziz
@ 2021-08-14 20:07             ` David Laight
  2021-08-16  0:26               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-08-16  6:54             ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-08-16  8:02             ` David Hildenbrand
  2 siblings, 1 reply; 74+ messages in thread
From: David Laight @ 2021-08-14 20:07 UTC (permalink / raw)
  To: 'Khalid Aziz',
	Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	Matthew Wilcox
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm, Gonglei (Arei)

...
> > > > Let me describe my use case more clearly (just ignore if you're not
> > > > interested in it):
> > > >
> > > > 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> > > > allocated VA range is [0x40000000,0x140000000)
> > > >
> > > > 2. Prog A specifies [0x48000000,0x50000000) and
> > > > [0x80000000,0x100000000) will be shared by its child.
> > > >
> > > > 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> > > >
> > > > 4. Prog B notices the shared ranges (e.g. by input parameters or
> > > > ...)
> > > > and remaps them to a contiguous VA range.

Remapping to contiguous VA is going to be difficult in the
general case for (IIRC) VIVT caches.
The required cache coherence may only be attainable by
using uncached mappings.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-14 20:07             ` David Laight
@ 2021-08-16  0:26               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-08-16  8:07                 ` David Laight
  0 siblings, 1 reply; 74+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2021-08-16  0:26 UTC (permalink / raw)
  To: David Laight, 'Khalid Aziz', Matthew Wilcox
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm, Gonglei (Arei)

Hi David,

On 2021/8/15 4:07, David Laight wrote:
> ...
>>>>> Let me describe my use case more clearly (just ignore if you're not
>>>>> interested in it):
>>>>>
>>>>> 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
>>>>> allocated VA range is [0x40000000,0x140000000)
>>>>>
>>>>> 2. Prog A specifies [0x48000000,0x50000000) and
>>>>> [0x80000000,0x100000000) will be shared by its child.
>>>>>
>>>>> 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
>>>>>
> >>>>> 4. Prog B notices the shared ranges (e.g. by input parameters or
> >>>>> ...)
> >>>>> and remaps them to a contiguous VA range.
> 
> Remapping to contiguous VA is going to be difficult in the
> general case for (IIRC) VIVT caches.
> The required cache coherence may only be attainable by
> using uncached mappings.
> 

Prog B uses the mremap() syscall to remap the shared ranges to other
places; this is a common use of mremap in userspace.
Cache coherence should already be handled in the mremap core logic;
otherwise something may be wrong in mremap().

> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-13 19:49           ` Khalid Aziz
  2021-08-14 20:07             ` David Laight
@ 2021-08-16  6:54             ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-08-16  8:02             ` David Hildenbrand
  2 siblings, 0 replies; 74+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2021-08-16  6:54 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Matthew Wilcox, Gonglei (Arei)

Hi Khalid,

Thanks for your reply, PSB :)

> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@oracle.com]
> Sent: Saturday, August 14, 2021 3:49 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>; Matthew Wilcox <willy@infradead.org>
> Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> 
> On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.) wrote:
> > Hi Matthew,
> >
> > > -----Original Message-----
> > > From: Matthew Wilcox [mailto:willy@infradead.org]
> > > Sent: Monday, July 12, 2021 9:30 AM
> > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > <longpeng2@huawei.com>
> > > Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> > > <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> > > linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> > >
> > > On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
> > > Infrastructure Service Product Dept.) wrote:
> > > > Let me describe my use case more clearly (just ignore if you're
> > > > not interested in it):
> > > >
> > > > 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> > > > allocated VA range is [0x40000000,0x140000000)
> > > >
> > > > 2. Prog A specifies [0x48000000,0x50000000) and
> > > > [0x80000000,0x100000000) will be shared by its child.
> > > >
> > > > 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> > > >
> > > > 4. Prog B notices the shared ranges (e.g. by input parameters or
> > > > ...)
> > > > and remaps them to a contiguous VA range.
> > >
> > > This is dangerous.  There must be an active step for Prog B to
> > > accept Prog A's ranges into its address space.  Otherwise Prog A
> > > could almost completely fill Prog B's address space and so control
> > > where Prog B places its mappings.  It could also provoke a latent
> > > bug in Prog B if it doesn't handle address space exhaustion
> > > gracefully.
> > >
> > > I had a proposal to handle this.  Would it meet your requirements?
> > > https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead
> > > .org/
> >
> > I noticed your proposal of project Sileby and I think it can meet
> > Steven's requirement, but I'm not sure whether it's suitable for mine
> > because there's no sample code yet. Is it in progress?
> 
> Hi Mike,
> 
> I am working on refining the ideas from project Sileby. I am also working on
> designing the implementation. Since the original concept, the mshare API has
> evolved further. Here is what it looks like:
> 

What are the actual use cases of Sileby? We can already share anonymous
memory via memfd.

> The mshare API consists of two system calls - mshare() and
> mshare_unlink()
> 
> mshare
> ======
> 
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t
> mode)
> 
> mshare() creates and opens a new, or opens an existing shared memory area that
> will be shared at PTE level. name refers to shared object name that exists under
> /dev/mshare (this is subject to change. There might be better ways to manage the
> names for mshare'd areas). addr is the starting address of this shared memory
> area and length is the size of this area. oflags can be one of:
> 
>     O_RDONLY opens shared memory area for read only access by everyone
>     O_RDWR opens shared memory area for read and write access
>     O_CREAT creates the named shared memory area if it does not exist
>     O_EXCL If O_CREAT was also specified, and a shared memory area
>         exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under /dev/mshare.
> 
> Return Value
> ------------
> 
> mshare() returns a file descriptor. A read from this file descriptor returns two long
> values - (1) starting address, and (2) size of the shared memory area.
> 
> Notes
> -----
> 
> PTEs are shared at pgdir level and hence it imposes the following requirements on the
> address and size given to mshare():
> 
>     - Starting address must be aligned to pgdir size (512GB on x86_64)

This limitation seems quite restrictive. Why is it needed?

>     - Size must be a multiple of pgdir size
>     - Any mappings created in this address range at any time become
>     shared automatically
>     - Shared address range can have unmapped addresses in it. Any
>     access to an unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared between
> threads, so a write to a MAP_PRIVATE mapping will create a page which is shared
> between all the sharers. The first process that declares an address range mshare'd
> can continue to map objects in the shared area. All other processes that want
> mshare'd access to this memory area can do so by calling mshare(). After this call,
> the address range given by mshare becomes a shared range in its address space.
> Anonymous mappings will be shared and not COWed.
> 
> 
> mshare_unlink
> =============
> 
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the shared named object. Once all processes
> have unmapped the shared object, the shared address range references are
> de-allocated and destroyed.
> 
> Return Value
> ------------
> 
> mshare_unlink() returns 0 on success or -1 on error.
> 
> 
> Example
> =======
> 
> A process can create an mshare'd area and map objects into it as
> follows:
> 
>     fd = mshare("junk",  TB(1), GB(512), O_CREAT|O_RDWR, 0600);
> 
>     /* Map objects in the shared address space and/or Write data */
> 
>     mshare_unlink("junk");
> 

Using the name to identify the range seems insecure and easy to attack.
How about using the fd? We can pass the fd to another process that is
permitted to access the mshare'd memory.

> Another process can then access this shared memory area with another call to
> mshare():
> 
>     fd = mshare("junk", TB(1), GB(512), O_RDWR, 0600);
> 
>     /* Read and write data in the TB(1)-(TB(1)+GB(512)) range */
> 
>     mshare_unlink("junk");
> 
> 
> >
> > According to the abstract of Sileby, I have two questions:
> > 1. Would you plan to support the file-mapping memory sharing ? e.g.
> > Prog A's 4G memory is backend with 2M hugetlb.
> 
> Yes, file-mapped memory sharing support is planned.
> 
> > 2. Does each mshare fd only containe one sharing VMA ? For large
> > memory process (1T~4T in our env), maybe there is hundreds of memory
> > ranges need to be shared, this will take too much fd space if so ?
> >
> 
> No, each fd can support all VMAs covered by the address range with a size that is
> a multiple of pgdir size.
> 

I also made a proposal internally to meet our requirements.
Our requirements are:
1. Process A can specify some ranges to be shared by another process B.
2. The ranges can be shared in the order specified by process A.
3. The ranges can span more than one VMA; some of the VMAs can be anon and
  others can be file-mapped.

The proposal introduces a char device named /dev/mshare that interacts with users
through ioctl interfaces. It currently supports three commands:
- MSHARE_SIGN: specify the range want to be shared
- MSHARE_INFO: get the info of the shared ranges
- MSHARE_MMAP: map the shared ranges into the address space

Here is an example to show how to use it.

1. Suppose process A wants to share four ranges:
  a. 0x1000 ~ 0x2000 --> Anonymous
  b. 0x4000 ~ 0x8000 --> PFN mapped range
  c. 0x200000 ~ 0x400000 --> 2M hugetlb
  d. 0x500000 ~ 0x501000 --> Anonymous


2. Process A writes data into these four ranges in the following sequence:
  "b -- c -- a -- d"
  (so process B should map the ranges in the same order)


3. Process A specifies the ranges:

  FD = open( /dev/mshare );
  ioctl(FD, MSHARE_SIGN, range b);
  ioctl(FD, MSHARE_SIGN, range c);
  ioctl(FD, MSHARE_SIGN, range a);
  ioctl(FD, MSHARE_SIGN, range d);


4. Process A makes sure the O_CLOEXEC flag on FD is clear


5. Process A now fork+exec process B with parameter shared_fd=FD:

  ./bin_B shared_fd=FD


6. Process B gets the info via the MSHARE_INFO ioctl:

  ioctl( shared_fd, MSHARE_INFO, &info );

  The info includes:
    num: the count of the shared ranges
    totalvm: the total vm size of these shared ranges
    align: alignment requirement


7. Process B uses MSHARE_MMAP to map these ranges into its address space:

  p = mmap ( info.totalvm, info.align );
  ioctl(shared_fd, MSHARE_MMAP, {MMAP_ALL, address fixed, p} );

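(Putting steps 6 and 7 together, a rough sketch of process B's side; the
MSHARE_* ioctl numbers, MMAP_ALL, and the two structs are hypothetical,
reconstructed from the description above:)

  struct mshare_info {
      unsigned long num;      /* count of the shared ranges */
      unsigned long totalvm;  /* total vm size of the ranges */
      unsigned long align;    /* alignment requirement */
  };

  struct mshare_map {
      int mode;               /* e.g. MMAP_ALL: map every signed range */
      void *addr;             /* fixed address to place the ranges at */
  };

  struct mshare_info info;
  struct mshare_map map;

  ioctl(shared_fd, MSHARE_INFO, &info);

  /* Reserve a VA window large enough for all the ranges, then ask the
     driver to populate it, preserving the order A signed them in. */
  map.mode = MMAP_ALL;
  map.addr = mmap(NULL, info.totalvm, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  ioctl(shared_fd, MSHARE_MMAP, &map);
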

I have written a draft for it and it works locally. But I'm not sure the concept is right,
so I'm glad to know that you're working on Sileby; maybe I can refer to your work.


> --
> Khalid


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-13 19:49           ` Khalid Aziz
  2021-08-14 20:07             ` David Laight
  2021-08-16  6:54             ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2021-08-16  8:02             ` David Hildenbrand
  2021-08-16 12:07               ` Matthew Wilcox
  2 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16  8:02 UTC (permalink / raw)
  To: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Matthew Wilcox
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm, Gonglei (Arei)

On 13.08.21 21:49, Khalid Aziz wrote:
> On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure
> Service Product Dept.) wrote:
>> Hi Matthew,
>>
>>> -----Original Message-----
>>> From: Matthew Wilcox [mailto:willy@infradead.org]
>>> Sent: Monday, July 12, 2021 9:30 AM
>>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>> <longpeng2@huawei.com>
>>> Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
>>> <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
>>> linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
>>> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
>>>
>>> On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
>>> Infrastructure Service Product Dept.) wrote:
>>>> Let me describe my use case more clearly (just ignore if you're not
>>>> interested in it):
>>>>
>>>> 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
>>>> allocated VA range is [0x40000000,0x140000000)
>>>>
>>>> 2. Prog A specifies [0x48000000,0x50000000) and
>>>> [0x80000000,0x100000000) will be shared by its child.
>>>>
>>>> 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
>>>>
>>>> 4. Prog B notices the shared ranges (e.g. by input parameters or
>>>> ...)
>>>> and remaps them to a contiguous VA range.
>>>
>>> This is dangerous.  There must be an active step for Prog B to accept
>>> Prog A's ranges into its address space.  Otherwise Prog A could almost
>>> completely fill Prog B's address space and so control where Prog B
>>> places its mappings.  It could also provoke a latent bug in Prog B if
>>> it doesn't handle address space exhaustion gracefully.
>>>
>>> I had a proposal to handle this.  Would it meet your requirements?
>>> https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/
>>
>> I noticed your proposal of project Sileby and I think it can meet
>> Steven's requirement, but I'm not sure whether it's suitable for mine
>> because there's no sample code yet. Is it in progress?
> 
> Hi Mike,
> 
> I am working on refining the ideas from project Sileby. I am also
> working on designing the implementation. Since the original concept,
> the mshare API has evolved further. Here is what it looks like:
> 
> The mshare API consists of two system calls - mshare() and
> mshare_unlink()
> 
> mshare
> ======
> 
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t
> mode)
> 
> mshare() creates and opens a new, or opens an existing shared memory
> area that will be shared at PTE level. name refers to the shared object
> name that exists under /dev/mshare (this is subject to change. There
> might be better ways to manage the names for mshare'd areas). addr is
> the starting address of this shared memory area and length is the size
> of this area. oflags can be one of:
> 
>      O_RDONLY opens shared memory area for read only access by everyone
>      O_RDWR opens shared memory area for read and write access
>      O_CREAT creates the named shared memory area if it does not exist
>      O_EXCL If O_CREAT was also specified, and a shared memory area
>          exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under
> /dev/mshare.
> 
> Return Value
> ------------
> 
> mshare() returns a file descriptor. A read from this file descriptor
> returns two long values - (1) starting address, and (2) size of the
> shared memory area.
> 
> Notes
> -----
> 
> PTEs are shared at pgdir level and hence it imposes the following
> requirements on the address and size given to mshare():
> 
>      - Starting address must be aligned to pgdir size (512GB on x86_64)
>      - Size must be a multiple of pgdir size
>      - Any mappings created in this address range at any time become
>      shared automatically
>      - Shared address range can have unmapped addresses in it. Any
>      access to an unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in the
> shared area. All other processes that want mshare'd access to this
> memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.

Did I understand correctly that you want to share actual page tables 
between processes and consequently different MMs? That sounds like a 
very bad idea.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16  0:26               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2021-08-16  8:07                 ` David Laight
  0 siblings, 0 replies; 74+ messages in thread
From: David Laight @ 2021-08-16  8:07 UTC (permalink / raw)
  To: 'Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.)',
	'Khalid Aziz',
	Matthew Wilcox
  Cc: Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm, Gonglei (Arei)

From: Longpeng
> Sent: 16 August 2021 01:26
> Hi David,
> 
> On 2021/8/15 4:07, David Laight wrote:
> > ...
> >>>>> Let me describe my use case more clearly (just ignore if you're not
> >>>>> interested in it):
> >>>>>
> >>>>> 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> >>>>> allocated VA range is [0x40000000,0x140000000)
> >>>>>
> >>>>> 2. Prog A specifies [0x48000000,0x50000000) and
> >>>>> [0x80000000,0x100000000) will be shared by its child.
> >>>>>
> >>>>> 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> >>>>>
> >>>>> 4. Prog B notices the shared ranges (e.g. by input parameters or
> >>>>> ...)
> >>>>> and remaps them to a contiguous VA range.
> >
> > Remapping to contiguous VA is going to be difficult in the
> > general case for (IIRC) VIVT caches.
> > The required cache coherence may only be attainable by
> > using uncached mappings.
> >
> 
> Prog B uses the mremap() syscall to remap the shared ranges to other
> places; this is a common use of mremap in userspace.
> Cache coherence should already be handled in the mremap core logic;
> otherwise something may be wrong in mremap().

Maybe it does, and probably mremap() makes it work.
But with VIVT caches, if a page gets mapped in two processes
at the same time at different offsets into the cache, then
both mappings end up being uncached.
This was always a problem with normal file mmap.
I don't know if Linux manages to pick the VA after finding the
page has another mapping - SVR4 didn't.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16  8:02             ` David Hildenbrand
@ 2021-08-16 12:07               ` Matthew Wilcox
  2021-08-16 12:20                 ` David Hildenbrand
                                   ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-16 12:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > Mappings within this address range behave as if they were shared
> > between threads, so a write to a MAP_PRIVATE mapping will create a
> > page which is shared between all the sharers. The first process that
> > declares an address range mshare'd can continue to map objects in the
> > shared area. All other processes that want mshare'd access to this
> > memory area can do so by calling mshare(). After this call, the
> > address range given by mshare becomes a shared range in its address
> > space. Anonymous mappings will be shared and not COWed.
> 
> Did I understand correctly that you want to share actual page tables between
> processes and consequently different MMs? That sounds like a very bad idea.

That is the entire point.  Consider a machine with 10,000 instances
of an application running (process model, not thread model).  If each
application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
tables per process or 40GB of RAM for the whole machine.
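
(For reference, the arithmetic behind those figures, assuming 8-byte
page-table entries:)

  1 TB / 2 MB       = 524,288 PMD-level entries per process
  524,288 * 8 bytes = 4 MiB of page tables per process
  10,000 * 4 MiB    = ~40 GB for the whole machine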

There's a reason hugetlbfs was enhanced to allow this page table sharing.
I'm not a fan of the implementation as it gets some locks upside down,
so this is an attempt to generalise the concept beyond hugetlbfs.

Think of it like partial threading.  You get to share some parts, but not
all, of your address space with your fellow processes.  Obviously you
don't want to expose this to random other processes, only to other
instances of yourself being run as the same user.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 12:07               ` Matthew Wilcox
@ 2021-08-16 12:20                 ` David Hildenbrand
  2021-08-16 12:42                   ` David Hildenbrand
  2021-08-16 12:46                   ` Matthew Wilcox
  2021-08-16 12:27                 ` [private] " David Hildenbrand
  2021-08-17  0:47                 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2 siblings, 2 replies; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 12:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 14:07, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in the
>>> shared area. All other processes that want mshare'd access to this
>>> memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>
>> Did I understand correctly that you want to share actual page tables between
>> processes and consequently different MMs? That sounds like a very bad idea.
> 
> That is the entire point.  Consider a machine with 10,000 instances
> of an application running (process model, not thread model).  If each
> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> tables per process or 40GB of RAM for the whole machine.

What speaks against 1 GB pages then?

> 
> There's a reason hugetlbfs was enhanced to allow this page table sharing.
> I'm not a fan of the implementation as it gets some locks upside down,
> so this is an attempt to generalise the concept beyond hugetlbfs.

Who do we account the page tables to? What are MADV_DONTNEED semantics? 
Who cleans up the page tables? What happens during munmap? How does the 
rmap even work? How do we actually synchronize page table walkers?

See how hugetlbfs just doesn't raise these problems because we are 
sharing pages and not page tables?

TBH, I quite dislike just thinking about sharing page tables between 
processes.

> 
> Think of it like partial threading.  You get to share some parts, but not
> all, of your address space with your fellow processes.  Obviously you
> don't want to expose this to random other processes, only to other
> instances of yourself being run as the same user.

Sounds like a nice way to over-complicate MM to optimize for some 
special use cases. I know, I'm probably wrong. :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* [private] Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 12:07               ` Matthew Wilcox
  2021-08-16 12:20                 ` David Hildenbrand
@ 2021-08-16 12:27                 ` David Hildenbrand
  2021-08-16 12:30                   ` David Hildenbrand
  2021-08-17  0:47                 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 12:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 14:07, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in the
>>> shared area. All other processes that want mshare'd access to this
>>> memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>
>> Did I understand correctly that you want to share actual page tables between
>> processes and consequently different MMs? That sounds like a very bad idea.
> 
> That is the entire point.  Consider a machine with 10,000 instances
> of an application running (process model, not thread model).  If each
> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> tables per process or 40GB of RAM for the whole machine.

Note that I am working on asynchronous reclaim of page tables, whereby I 
would even reclaim !anonymous page tables under memory pressure.

Assuming your processes don't touch all memory all the time of course 
... of course, it's a research project and will still require quite some 
work because devil is in the detail (locking).

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [private] Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 12:27                 ` [private] " David Hildenbrand
@ 2021-08-16 12:30                   ` David Hildenbrand
  0 siblings, 0 replies; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 12:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 14:27, David Hildenbrand wrote:
> On 16.08.21 14:07, Matthew Wilcox wrote:
>> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>>> Mappings within this address range behave as if they were shared
>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>> page which is shared between all the sharers. The first process that
>>>> declares an address range mshare'd can continue to map objects in the
>>>> shared area. All other processes that want mshare'd access to this
>>>> memory area can do so by calling mshare(). After this call, the
>>>> address range given by mshare becomes a shared range in its address
>>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> Did I understand correctly that you want to share actual page tables between
>>> processes and consequently different MMs? That sounds like a very bad idea.
>>
>> That is the entire point.  Consider a machine with 10,000 instances
>> of an application running (process model, not thread model).  If each
>> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
>> tables per process or 40GB of RAM for the whole machine.
> 
> Note that I am working on asynchronous reclaim of page tables, whereby I
> would even reclaim !anonymous page tables under memory pressure.
> 
> Assuming your processes don't touch all memory all the time of course
> ... of course, it's a research project and will still require quite some
> work because devil is in the detail (locking).
> 

Well, that comment did turn out not-so-private, so it's certainly a good 
discussion-starter.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 12:20                 ` David Hildenbrand
@ 2021-08-16 12:42                   ` David Hildenbrand
  2021-08-16 12:46                   ` Matthew Wilcox
  1 sibling, 0 replies; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 12:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 14:20, David Hildenbrand wrote:
> On 16.08.21 14:07, Matthew Wilcox wrote:
>> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>>> Mappings within this address range behave as if they were shared
>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>> page which is shared between all the sharers. The first process that
>>>> declares an address range mshare'd can continue to map objects in the
>>>> shared area. All other processes that want mshare'd access to this
>>>> memory area can do so by calling mshare(). After this call, the
>>>> address range given by mshare becomes a shared range in its address
>>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> Did I understand correctly that you want to share actual page tables between
>>> processes and consequently different MMs? That sounds like a very bad idea.
>>
>> That is the entire point.  Consider a machine with 10,000 instances
>> of an application running (process model, not thread model).  If each
>> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
>> tables per process or 40GB of RAM for the whole machine.
> 
> What speaks against 1 GB pages then?
> 
>>
>> There's a reason hugetlbfs was enhanced to allow this page table sharing.
>> I'm not a fan of the implementation as it gets some locks upside down,
>> so this is an attempt to generalise the concept beyond hugetlbfs.
> 
> Who do we account the page tables to? What are MADV_DONTNEED semantics?
> Who cleans up the page tables? What happens during munmap? How does the
> rmap even work? How do we actually synchronize page table walkers?
> 
> See how hugetlbfs just doesn't raise these problems because we are
> sharing pages and not page tables?

I found what you were referring to: CONFIG_ARCH_WANT_HUGE_PMD_SHARE

I was not aware that we have such a monstrosity in the kernel.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 12:20                 ` David Hildenbrand
  2021-08-16 12:42                   ` David Hildenbrand
@ 2021-08-16 12:46                   ` Matthew Wilcox
  2021-08-16 13:24                     ` David Hildenbrand
  1 sibling, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-16 12:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On Mon, Aug 16, 2021 at 02:20:43PM +0200, David Hildenbrand wrote:
> On 16.08.21 14:07, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > > > Mappings within this address range behave as if they were shared
> > > > between threads, so a write to a MAP_PRIVATE mapping will create a
> > > > page which is shared between all the sharers. The first process that
> > > > declares an address range mshare'd can continue to map objects in the
> > > > shared area. All other processes that want mshare'd access to this
> > > > memory area can do so by calling mshare(). After this call, the
> > > > address range given by mshare becomes a shared range in its address
> > > > space. Anonymous mappings will be shared and not COWed.
> > > 
> > > Did I understand correctly that you want to share actual page tables between
> > > processes and consequently different MMs? That sounds like a very bad idea.
> > 
> > That is the entire point.  Consider a machine with 10,000 instances
> > of an application running (process model, not thread model).  If each
> > application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> > tables per process or 40GB of RAM for the whole machine.
> 
> What speaks against 1 GB pages then?

Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
still have customers using that generation of CPUs.  2MB pages perform
better than 1GB pages on the previous generation of hardware, and I
haven't seen numbers for the next generation yet.

> > There's a reason hugetlbfs was enhanced to allow this page table sharing.
> > I'm not a fan of the implementation as it gets some locks upside down,
> > so this is an attempt to generalise the concept beyond hugetlbfs.
> 
> Who do we account the page tables to? What are MADV_DONTNEED semantics? Who
> cleans up the page tables? What happens during munmap? How does the rmap
> even work? How do we actually synchronize page table walkers?
> 
> See how hugetlbfs just doesn't raise these problems because we are sharing
> pages and not page tables?

No, really, hugetlbfs shares page tables already.  You just didn't
notice that yet.

> > Think of it like partial threading.  You get to share some parts, but not
> > all, of your address space with your fellow processes.  Obviously you
> > don't want to expose this to random other processes, only to other
> > instances of yourself being run as the same user.
> 
> Sounds like a nice way to over-complicate MM to optimize for some special
> use cases. I know, I'm probably wrong. :)

It's really not as bad as you seem to think it is.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 12:46                   ` Matthew Wilcox
@ 2021-08-16 13:24                     ` David Hildenbrand
  2021-08-16 13:32                       ` Matthew Wilcox
  0 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 13:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 14:46, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 02:20:43PM +0200, David Hildenbrand wrote:
>> On 16.08.21 14:07, Matthew Wilcox wrote:
>>> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>>>> Mappings within this address range behave as if they were shared
>>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>>> page which is shared between all the sharers. The first process that
>>>>> declares an address range mshare'd can continue to map objects in the
>>>>> shared area. All other processes that want mshare'd access to this
>>>>> memory area can do so by calling mshare(). After this call, the
>>>>> address range given by mshare becomes a shared range in its address
>>>>> space. Anonymous mappings will be shared and not COWed.
>>>>
>>>> Did I understand correctly that you want to share actual page tables between
>>>> processes and consequently different MMs? That sounds like a very bad idea.
>>>
>>> That is the entire point.  Consider a machine with 10,000 instances
>>> of an application running (process model, not thread model).  If each
>>> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
>>> tables per process or 40GB of RAM for the whole machine.
>>
>> What speaks against 1 GB pages then?
> 
> Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
> still have customers using that generation of CPUs.  2MB pages perform
> better than 1GB pages on the previous generation of hardware, and I
> haven't seen numbers for the next generation yet.

I read that somewhere else before, yet we have heavy 1 GiB page users, 
especially in the context of VMs and DPDK.

> 
>>> There's a reason hugetlbfs was enhanced to allow this page table sharing.
>>> I'm not a fan of the implementation as it gets some locks upside down,
>>> so this is an attempt to generalise the concept beyond hugetlbfs.
>>
>> Who do we account the page tables to? What are MADV_DONTNEED semantics? Who
>> cleans up the page tables? What happens during munmap? How does the rmap
>> even work? How do we actually synchronize page table walkers?
>>
>> See how hugetlbfs just doesn't raise these problems because we are sharing
>> pages and not page tables?
> 
> No, really, hugetlbfs shares page tables already.  You just didn't
> notice that yet.

So, it only works for hugetlbfs in case uffd is not in place (-> no 
per-process data in the page table) and we have actual shared 
mappings. When unsharing, we zap the PUD entry, which will result in 
allocating a per-process page table on next fault.

I will rephrase my previous statement "hugetlbfs just doesn't raise 
these problems because we are special casing it all over the place 
already". For example, not allowing to swap such pages. Disallowing 
MADV_DONTNEED. Special hugetlbfs locking.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 13:24                     ` David Hildenbrand
@ 2021-08-16 13:32                       ` Matthew Wilcox
  2021-08-16 14:10                         ` David Hildenbrand
  0 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-16 13:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On Mon, Aug 16, 2021 at 03:24:38PM +0200, David Hildenbrand wrote:
> On 16.08.21 14:46, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 02:20:43PM +0200, David Hildenbrand wrote:
> > > On 16.08.21 14:07, Matthew Wilcox wrote:
> > > > On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > > > > > Mappings within this address range behave as if they were shared
> > > > > > between threads, so a write to a MAP_PRIVATE mapping will create a
> > > > > > page which is shared between all the sharers. The first process that
> > > > > > declares an address range mshare'd can continue to map objects in the
> > > > > > shared area. All other processes that want mshare'd access to this
> > > > > > memory area can do so by calling mshare(). After this call, the
> > > > > > address range given by mshare becomes a shared range in its address
> > > > > > space. Anonymous mappings will be shared and not COWed.
> > > > > 
> > > > > Did I understand correctly that you want to share actual page tables between
> > > > > processes and consequently different MMs? That sounds like a very bad idea.
> > > > 
> > > > That is the entire point.  Consider a machine with 10,000 instances
> > > > of an application running (process model, not thread model).  If each
> > > > application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> > > > tables per process or 40GB of RAM for the whole machine.
> > > 
> > > What speaks against 1 GB pages then?
> > 
> > Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
> > still have customers using that generation of CPUs.  2MB pages perform
> > better than 1GB pages on the previous generation of hardware, and I
> > haven't seen numbers for the next generation yet.
> 
> I read that somewhere else before, yet we have heavy 1 GiB page users,
> especially in the context of VMs and DPDK.

I wonder if those users actually benchmarked.  Or whether the memory
savings worked out so well for them that the loss of TLB performance
didn't matter.

> So, it only works for hugetlbfs in case uffd is not in place (-> no
> per-process data in the page table) and we have actual shared mappings.
> When unsharing, we zap the PUD entry, which will result in allocating a
> per-process page table on next fault.

I think uffd was a huge mistake.  It should have been a filesystem
instead of a hack on the side of anonymous memory.

> I will rephrase my previous statement "hugetlbfs just doesn't raise these
> problems because we are special casing it all over the place already". For
> example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
> hugetlbfs locking.

Sure, that's why I want to drag this feature out of "oh this is a
hugetlb special case" and into "this is something Linux supports".


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 13:32                       ` Matthew Wilcox
@ 2021-08-16 14:10                         ` David Hildenbrand
  2021-08-16 14:27                           ` Matthew Wilcox
  0 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 14:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

>>> Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
>>> still have customers using that generation of CPUs.  2MB pages perform
>>> better than 1GB pages on the previous generation of hardware, and I
>>> haven't seen numbers for the next generation yet.
>>
>> I read that somewhere else before, yet we have heavy 1 GiB page users,
>> especially in the context of VMs and DPDK.
> 
> I wonder if those users actually benchmarked.  Or whether the memory
> savings worked out so well for them that the loss of TLB performance
> didn't matter.

These applications are extremely performance sensitive (i.e., RT
workloads); that's why I'm wondering. I recall that they are most
certainly using more than 4 GiB of memory in real applications.

E.g., the doc [1] even has a note that "For 64-bit applications, it is 
recommended to use 1 GB hugepages if the platform supports them."


[1] https://doc.dpdk.org/guides-16.04/linux_gsg/sys_reqs.html

> 
>> So, it only works for hugetlbfs in case uffd is not in place (-> no
>> per-process data in the page table) and we have actual shared mappings.
>> When unsharing, we zap the PUD entry, which will result in allocating a
>> per-process page table on next fault.
> 
> I think uffd was a huge mistake.  It should have been a filesystem
> instead of a hack on the side of anonymous memory.

Yes it was. Especially looking at all the special-casing, for example
even in mm/pagewalk.c.

> 
>> I will rephrase my previous statement "hugetlbfs just doesn't raise these
>> problems because we are special casing it all over the place already". For
>> example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
>> hugetlbfs locking.
> 
> Sure, that's why I want to drag this feature out of "oh this is a
> hugetlb special case" and into "this is something Linux supports".

I would have understood the move to optimize SHMEM internally - similar 
to how we seem to optimize hugetlbfs SHMEM right now internally. 
(although sharing page tables for shmem can still be quite tricky)

I did not follow why we have to play games with MAP_PRIVATE, and having 
private anonymous pages shared between processes that don't COW, 
introducing new syscalls etc.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 14:10                         ` David Hildenbrand
@ 2021-08-16 14:27                           ` Matthew Wilcox
  2021-08-16 14:33                             ` David Hildenbrand
  0 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-16 14:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On Mon, Aug 16, 2021 at 04:10:28PM +0200, David Hildenbrand wrote:
> > > > Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
> > > > still have customers using that generation of CPUs.  2MB pages perform
> > > > better than 1GB pages on the previous generation of hardware, and I
> > > > haven't seen numbers for the next generation yet.
> > > 
> > > I read that somewhere else before, yet we have heavy 1 GiB page users,
> > > especially in the context of VMs and DPDK.
> > 
> > I wonder if those users actually benchmarked.  Or whether the memory
> > savings worked out so well for them that the loss of TLB performance
> > didn't matter.
> 
> These applications are extremely performance sensitive (i.e., RT workloads),

"real time does not mean real fast".  it means predictable latency.

> > > I will rephrase my previous statement "hugetlbfs just doesn't raise these
> > > problems because we are special casing it all over the place already". For
> > > example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
> > > hugetlbfs locking.
> > 
> > Sure, that's why I want to drag this feature out of "oh this is a
> > hugetlb special case" and into "this is something Linux supports".
> 
> I would have understood the move to optimize SHMEM internally - similar to
> how we seem to optimize hugetlbfs SHMEM right now internally. (although
> sharing page tables for shmem can still be quite tricky)
> 
> I did not follow why we have to play games with MAP_PRIVATE, and having
> private anonymous pages shared between processes that don't COW, introducing
> new syscalls etc.

It's not about SHMEM, it's about file-backed pages on regular
filesystems.  I don't want to have XFS, ext4 and btrfs all with their
own implementations of ARCH_WANT_HUGE_PMD_SHARE.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 14:27                           ` Matthew Wilcox
@ 2021-08-16 14:33                             ` David Hildenbrand
  2021-08-16 14:40                               ` Matthew Wilcox
  0 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 14:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 16:27, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 04:10:28PM +0200, David Hildenbrand wrote:
>>>>> Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
>>>>> still have customers using that generation of CPUs.  2MB pages perform
>>>>> better than 1GB pages on the previous generation of hardware, and I
>>>>> haven't seen numbers for the next generation yet.
>>>>
>>>> I read that somewhere else before, yet we have heavy 1 GiB page users,
>>>> especially in the context of VMs and DPDK.
>>>
>>> I wonder if those users actually benchmarked.  Or whether the memory
>>> savings worked out so well for them that the loss of TLB performance
>>> didn't matter.
>>
>> These applications are extremely performance sensitive (i.e., RT workloads),
> 
> "real time does not mean real fast".  it means predictable latency.

I know, but that doesn't explain why you would use 2 MB vs 1 GiB.

(most of these applications also want a low, predictable latency)

> 
>>>> I will rephrase my previous statement "hugetlbfs just doesn't raise these
>>>> problems because we are special casing it all over the place already". For
>>>> example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
>>>> hugetlbfs locking.
>>>
>>> Sure, that's why I want to drag this feature out of "oh this is a
>>> hugetlb special case" and into "this is something Linux supports".
>>
>> I would have understood the move to optimize SHMEM internally - similar to
>> how we seem to optimize hugetlbfs SHMEM right now internally. (although
>> sharing page tables for shmem can still be quite tricky)
>>
>> I did not follow why we have to play games with MAP_PRIVATE, and having
>> private anonymous pages shared between processes that don't COW, introducing
>> new syscalls etc.
> 
> It's not about SHMEM, it's about file-backed pages on regular
> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> own implementations of ARCH_WANT_HUGE_PMD_SHARE.

Let me ask this way: why do we have to play such games with MAP_PRIVATE?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 14:33                             ` David Hildenbrand
@ 2021-08-16 14:40                               ` Matthew Wilcox
  2021-08-16 15:01                                 ` David Hildenbrand
  0 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-16 14:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
> > > I did not follow why we have to play games with MAP_PRIVATE, and having
> > > private anonymous pages shared between processes that don't COW, introducing
> > > new syscalls etc.
> > 
> > It's not about SHMEM, it's about file-backed pages on regular
> > filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> > own implementations of ARCH_WANT_HUGE_PMD_SHARE.
> 
> Let me ask this way: why do we have to play such games with MAP_PRIVATE?

Are you referring to this?

: Mappings within this address range behave as if they were shared
: between threads, so a write to a MAP_PRIVATE mapping will create a
: page which is shared between all the sharers.

If so, that's a misunderstanding, because there are no games being played.
What Khalid's saying there is that because the page tables are already
shared for that range of address space, the COW of a MAP_PRIVATE will
create a new page, but that page will be shared between all the sharers.
The second write to a MAP_PRIVATE page (by any of the sharers) will not
create a COW situation.  Just like if all the sharers were threads of
the same process.
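
(A sketch of those semantics, assuming the proposed mshare() behaviour;
T stands for some pgdir-aligned address both processes have mshare'd, so
it is an assumption, not part of any existing API:)

  /* In process A, inside the mshare'd range: */
  char *p = mmap((void *)T, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

  p[0] = 1;  /* A's write COWs once; the new page lands in the shared
                page table, so B immediately sees p[0] == 1 */

  /* In process B, at the same address: */
  p[0] = 2;  /* no second COW: B writes the very same page, and A now
                sees p[0] == 2, just as between threads */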


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 14:40                               ` Matthew Wilcox
@ 2021-08-16 15:01                                 ` David Hildenbrand
  2021-08-16 15:59                                   ` Matthew Wilcox
  0 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 15:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 16:40, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
>>>> I did not follow why we have to play games with MAP_PRIVATE, and having
>>>> private anonymous pages shared between processes that don't COW, introducing
>>>> new syscalls etc.
>>>
>>> It's not about SHMEM, it's about file-backed pages on regular
>>> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
>>> own implementations of ARCH_WANT_HUGE_PMD_SHARE.
>>
>> Let me ask this way: why do we have to play such games with MAP_PRIVATE?
> 
> Are you referring to this?

Yes

> 
> : Mappings within this address range behave as if they were shared
> : between threads, so a write to a MAP_PRIVATE mapping will create a
> : page which is shared between all the sharers.
> 
> If so, that's a misunderstanding, because there are no games being played.
> What Khalid's saying there is that because the page tables are already
> shared for that range of address space, the COW of a MAP_PRIVATE will
> create a new page, but that page will be shared between all the sharers.
> The second write to a MAP_PRIVATE page (by any of the sharers) will not
> create a COW situation.  Just like if all the sharers were threads of
> the same process.
> 

It actually seems to be just like I understood it. We'll have multiple 
processes share anonymous pages writable, even though they are not using 
shared memory.

IMHO, sharing page tables to optimize for something kernel-internal 
(page table consumption) should be completely transparent to user space. 
Just like ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing 
something important.

The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me 
assume that we really only optimize for MAP_SHARED right now, never for 
MAP_PRIVATE.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 15:01                                 ` David Hildenbrand
@ 2021-08-16 15:59                                   ` Matthew Wilcox
  2021-08-16 16:06                                     ` Khalid Aziz
  2021-08-16 16:13                                     ` David Hildenbrand
  0 siblings, 2 replies; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-16 15:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
> On 16.08.21 16:40, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
> > > > > I did not follow why we have to play games with MAP_PRIVATE, and having
> > > > > private anonymous pages shared between processes that don't COW, introducing
> > > > > new syscalls etc.
> > > > 
> > > > It's not about SHMEM, it's about file-backed pages on regular
> > > > filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> > > > own implementations of ARCH_WANT_HUGE_PMD_SHARE.
> > > 
> > > Let me ask this way: why do we have to play such games with MAP_PRIVATE?
> > 
> > : Mappings within this address range behave as if they were shared
> > : between threads, so a write to a MAP_PRIVATE mapping will create a
> > : page which is shared between all the sharers.
> > 
> > If so, that's a misunderstanding, because there are no games being played.
> > What Khalid's saying there is that because the page tables are already
> > shared for that range of address space, the COW of a MAP_PRIVATE will
> > create a new page, but that page will be shared between all the sharers.
> > The second write to a MAP_PRIVATE page (by any of the sharers) will not
> > create a COW situation.  Just like if all the sharers were threads of
> > the same process.
> > 
> 
> It actually seems to be just like I understood it. We'll have multiple
> processes share anonymous pages writably, even though they are not using
> shared memory.
> 
> IMHO, sharing page tables to optimize for something kernel-internal (page
> table consumption) should be completely transparent to user space. Just like
> ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
> important.
> 
> The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
> that we really only optimize for MAP_SHARED right now, never for
> MAP_PRIVATE.

It's definitely *not* about being transparent to userspace.  It's about
giving userspace new functionality where multiple processes can choose
to share a portion of their address space with each other.  Whatever any
process changes in that range, every sharing process sees.
mmap(), munmap(), mprotect(), mremap(), everything.
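
Concretely (a hypothetical sketch -- shared_base and fd come from the
mshare() setup, which is itself only a proposal):

#include <sys/mman.h>

void demo(char *shared_base, int fd)
{
	/* Process A, operating inside the already-shared range: */
	mmap(shared_base, 4096, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, fd, 0);

	/* Process B did nothing, yet the mapping now exists in its
	 * address space too -- same address, same permissions.  And: */
	mprotect(shared_base, 4096, PROT_READ);	/* narrows B's access too */
	munmap(shared_base, 4096);	/* unmaps it for every sharer */
}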



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 15:59                                   ` Matthew Wilcox
@ 2021-08-16 16:06                                     ` Khalid Aziz
  2021-08-16 16:15                                       ` Matthew Wilcox
  2021-08-16 16:13                                     ` David Hildenbrand
  1 sibling, 1 reply; 74+ messages in thread
From: Khalid Aziz @ 2021-08-16 16:06 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand
  Cc: Longpeng (Mike, Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 8/16/21 9:59 AM, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
>> On 16.08.21 16:40, Matthew Wilcox wrote:
>>> On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
>>>>>> I did not follow why we have to play games with MAP_PRIVATE, and having
>>>>>> private anonymous pages shared between processes that don't COW, introducing
>>>>>> new syscalls etc.
>>>>>
>>>>> It's not about SHMEM, it's about file-backed pages on regular
>>>>> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
>>>>> own implementations of ARCH_WANT_HUGE_PMD_SHARE.
>>>>
>>>> Let me ask this way: why do we have to play such games with MAP_PRIVATE?
>>>
>>> : Mappings within this address range behave as if they were shared
>>> : between threads, so a write to a MAP_PRIVATE mapping will create a
>>> : page which is shared between all the sharers.
>>>
>>> If so, that's a misunderstanding, because there are no games being played.
>>> What Khalid's saying there is that because the page tables are already
>>> shared for that range of address space, the COW of a MAP_PRIVATE will
>>> create a new page, but that page will be shared between all the sharers.
>>> The second write to a MAP_PRIVATE page (by any of the sharers) will not
>>> create a COW situation.  Just like if all the sharers were threads of
>>> the same process.
>>>
>>
>> It actually seems to be just like I understood it. We'll have multiple
>> processes share anonymous pages writably, even though they are not using
>> shared memory.
>>
>> IMHO, sharing page tables to optimize for something kernel-internal (page
>> table consumption) should be completely transparent to user space. Just like
>> ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
>> important.
>>
>> The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
>> that we really only optimize for MAP_SHARED right now, never for
>> MAP_PRIVATE.
> 
> It's definitely *not* about being transparent to userspace.  It's about
> giving userspace new functionality where multiple processes can choose
> to share a portion of their address space with each other.  Whatever any
> process changes in that range, every sharing process sees.
> mmap(), munmap(), mprotect(), mremap(), everything.
> 

Exactly, and to elaborate further: once a process calls mshare() to
declare its intent to share PTEs for a range of addresses, and another
process accepts that sharing by calling mshare() itself, the two (or
more) processes have agreed to share PTEs for that entire address range.
A MAP_PRIVATE mapping in this address range goes against the original
intent of sharing, and what we are saying is that the original intent of
sharing takes precedence in case of this conflict.
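
In call terms (a hypothetical sketch -- the signature and flags are
invented purely for illustration):

/* Process A declares [base, base+len) shareable: */
mshare(base, len, MSHARE_CREATE);	/* hypothetical flag */

/* Process B opts in to the same range: */
mshare(base, len, MSHARE_ATTACH);	/* hypothetical flag */

/* From here on, A and B share the PTEs for the entire range; even a
 * MAP_PRIVATE mapping made inside it lands in those shared PTEs. */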

--
Khalid


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 15:59                                   ` Matthew Wilcox
  2021-08-16 16:06                                     ` Khalid Aziz
@ 2021-08-16 16:13                                     ` David Hildenbrand
  1 sibling, 0 replies; 74+ messages in thread
From: David Hildenbrand @ 2021-08-16 16:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On 16.08.21 17:59, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
>> On 16.08.21 16:40, Matthew Wilcox wrote:
>>> On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
>>>>>> I did not follow why we have to play games with MAP_PRIVATE, and having
>>>>>> private anonymous pages shared between processes that don't COW, introducing
>>>>>> new syscalls etc.
>>>>>
>>>>> It's not about SHMEM, it's about file-backed pages on regular
>>>>> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
>>>>> own implementations of ARCH_WANT_HUGE_PMD_SHARE.
>>>>
>>>> Let me ask this way: why do we have to play such games with MAP_PRIVATE?
>>>
>>> : Mappings within this address range behave as if they were shared
>>> : between threads, so a write to a MAP_PRIVATE mapping will create a
>>> : page which is shared between all the sharers.
>>>
>>> If so, that's a misunderstanding, because there are no games being played.
>>> What Khalid's saying there is that because the page tables are already
>>> shared for that range of address space, the COW of a MAP_PRIVATE will
>>> create a new page, but that page will be shared between all the sharers.
>>> The second write to a MAP_PRIVATE page (by any of the sharers) will not
>>> create a COW situation.  Just like if all the sharers were threads of
>>> the same process.
>>>
>>
>> It actually seems to be just like I understood it. We'll have multiple
>> processes share anonymous pages writably, even though they are not using
>> shared memory.
>>
>> IMHO, sharing page tables to optimize for something kernel-internal (page
>> table consumption) should be completely transparent to user space. Just like
>> ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
>> important.
>>
>> The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
>> that we really only optimize for MAP_SHARED right now, never for
>> MAP_PRIVATE.
> 
> It's definitely *not* about being transparent to userspace.  It's about
> giving userspace new functionality where multiple processes can choose
> to share a portion of their address space with each other.  Whatever any
> process changes in that range, every sharing process sees.
> mmap(), munmap(), mprotect(), mremap(), everything.

Oh okay, so it's actually much more complex than I thought. Thanks for 
clarifying that! I recall virtiofsd had similar requirements for sharing 
memory with the QEMU main process, though I might be wrong.

"existing shared memory area" and your initial page table example made 
me assume that we are simply dealing with sharing the page tables of 
MAP_SHARED mappings.

It's actually something like a VMA container that you share between 
processes, and whatever VMAs are currently inside that VMA container are 
mirrored to the other processes. I assume sharing page tables could 
actually be an implementation detail, especially when keeping MAP_PRIVATE 
(confusing in that context!) and other features that will give you 
surprises (uffd) out of the picture.
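
Something like the following mental model, I suppose (purely conceptual 
-- the struct and its fields are invented for illustration):

/* Hypothetical: one shared "VMA container", referenced by every
 * attached process. */
struct mshare_region {
	struct mm_struct *host_mm;	/* owns the VMAs and the page
					 * tables for the shared range */
	unsigned long start, size;	/* the shared address range */
	refcount_t refs;		/* one per attached process */
};

Each attached process's page tables for [start, start+size) would simply 
point into host_mm's, so a VMA change made through one sharer is what 
every other sharer sees.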

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 16:06                                     ` Khalid Aziz
@ 2021-08-16 16:15                                       ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-16 16:15 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: David Hildenbrand, Longpeng (Mike,
	Cloud Infrastructure Service Product Dept.),
	Steven Sistare, Anthony Yznaga, linux-kernel, linux-mm,
	Gonglei (Arei)

On Mon, Aug 16, 2021 at 10:06:47AM -0600, Khalid Aziz wrote:
> On 8/16/21 9:59 AM, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
> > > On 16.08.21 16:40, Matthew Wilcox wrote:
> > > > On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
> > > > > > > I did not follow why we have to play games with MAP_PRIVATE, and having
> > > > > > > private anonymous pages shared between processes that don't COW, introducing
> > > > > > > new syscalls etc.
> > > > > > 
> > > > > > It's not about SHMEM, it's about file-backed pages on regular
> > > > > > filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> > > > > > own implementations of ARCH_WANT_HUGE_PMD_SHARE.
> > > > > 
> > > > > Let me ask this way: why do we have to play such games with MAP_PRIVATE?
> > > > 
> > > > : Mappings within this address range behave as if they were shared
> > > > : between threads, so a write to a MAP_PRIVATE mapping will create a
> > > > : page which is shared between all the sharers.
> > > > 
> > > > If so, that's a misunderstanding, because there are no games being played.
> > > > What Khalid's saying there is that because the page tables are already
> > > > shared for that range of address space, the COW of a MAP_PRIVATE will
> > > > create a new page, but that page will be shared between all the sharers.
> > > > The second write to a MAP_PRIVATE page (by any of the sharers) will not
> > > > create a COW situation.  Just like if all the sharers were threads of
> > > > the same process.
> > > > 
> > > 
> > > It actually seems to be just like I understood it. We'll have multiple
> > > processes share anonymous pages writably, even though they are not using
> > > shared memory.
> > > 
> > > IMHO, sharing page tables to optimize for something kernel-internal (page
> > > table consumption) should be completely transparent to user space. Just like
> > > ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
> > > important.
> > > 
> > > The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
> > > that we really only optimize for MAP_SHARED right now, never for
> > > MAP_PRIVATE.
> > 
> > It's definitely *not* about being transparent to userspace.  It's about
> > giving userspace new functionality where multiple processes can choose
> > to share a portion of their address space with each other.  Whatever any
> > process changes in that range, every sharing process sees.
> > mmap(), munmap(), mprotect(), mremap(), everything.
> > 
> 
> Exactly, and to elaborate further: once a process calls mshare() to
> declare its intent to share PTEs for a range of addresses, and another
> process accepts that sharing by calling mshare() itself, the two (or
> more) processes have agreed to share PTEs for that entire address range.
> A MAP_PRIVATE mapping in this address range goes against the original
> intent of sharing, and what we are saying is that the original intent of
> sharing takes precedence in case of this conflict.

I don't know that it's against the original intent ... I think
MAP_PRIVATE in this context means "Private to this process and every
process sharing this chunk of address space".  So a store doesn't go
through to the page cache, as it would with MAP_SHARED, but it is
visible to the other processes sharing these page tables.
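
A hypothetical illustration of that distinction (shared_base lies inside
the shared range; "somefile" is assumed not to contain 'x' at offset 0):

#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void demo(char *shared_base)
{
	int fd = open("somefile", O_RDWR);
	char *p = mmap(shared_base, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_FIXED, fd, 0);

	p[0] = 'x';	/* COW: a new anon page, in the shared page tables */

	/* Every sharer now reads 'x' through the mapping, yet the store
	 * never reached the page cache or the file itself: */
	char c;
	pread(fd, &c, 1, 0);
	assert(c != 'x');
}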


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-16 12:07               ` Matthew Wilcox
  2021-08-16 12:20                 ` David Hildenbrand
  2021-08-16 12:27                 ` [private] " David Hildenbrand
@ 2021-08-17  0:47                 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  2021-08-17  0:55                   ` Matthew Wilcox
  2 siblings, 1 reply; 74+ messages in thread
From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) @ 2021-08-17  0:47 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand
  Cc: Khalid Aziz, Steven Sistare, Anthony Yznaga, linux-kernel,
	linux-mm, Gonglei (Arei)



> -----Original Message-----
> From: Matthew Wilcox [mailto:willy@infradead.org]
> Sent: Monday, August 16, 2021 8:08 PM
> To: David Hildenbrand <david@redhat.com>
> Cc: Khalid Aziz <khalid.aziz@oracle.com>; Longpeng (Mike, Cloud Infrastructure
> Service Product Dept.) <longpeng2@huawei.com>; Steven Sistare
> <steven.sistare@oracle.com>; Anthony Yznaga <anthony.yznaga@oracle.com>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; Gonglei (Arei)
> <arei.gonglei@huawei.com>
> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> 
> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > > Mappings within this address range behave as if they were shared
> > > between threads, so a write to a MAP_PRIVATE mapping will create a
> > > page which is shared between all the sharers. The first process that
> > > declares an address range mshare'd can continue to map objects in
> > > the shared area. All other processes that want mshare'd access to
> > > this memory area can do so by calling mshare(). After this call, the
> > > address range given by mshare becomes a shared range in its address
> > > space. Anonymous mappings will be shared and not COWed.
> >
> > Did I understand correctly that you want to share actual page tables
> > between processes and consequently different MMs? That sounds like a very bad
> idea.
> 
> That is the entire point.  Consider a machine with 10,000 instances of an
> application running (process model, not thread model).  If each application wants
> to map 1TB of RAM using 2MB pages, that's 4MB of page tables per process or
> 40GB of RAM for the whole machine.
> 
> There's a reason hugetlbfs was enhanced to allow this page table sharing.
> I'm not a fan of the implementation as it gets some locks upside down, so this is an
> attempt to generalise the concept beyond hugetlbfs.
> 
> Think of it like partial threading.  You get to share some parts, but not all, of your
> address space with your fellow processes.  Obviously you don't want to expose
> this to random other processes, only to other instances of yourself being run as the
> same user.

I understand your intent now: you want to share memory ranges by sharing
the relevant pgtable pages.

I had implemented a similar idea to support QEMU live upgrade about four
years ago (in late 2017):

https://patents.google.com/patent/US20210089345A1

"""
[0131]
In a first possible implementation, the generation unit includes a copying subunit configured
to copy an entry corresponding to the virtual memory area in a PGD page table of the first
virtual machine to an entry corresponding to the virtual memory area in a PGD page table of
the second virtual machine.
"""

We wanted to share the anonymous memory between the old QEMU process and
the new one, so we limited QEMU to mmap the VM's memory in the 4T-8T
address range and then shared the memory by directly copying the PGD
entries (the implementation is much more complicated than this makes it
sound).

Besides saving memory, a large memory range can be shared quickly in this
way.
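
In kernel terms, the core of the trick was something like this (a much
simplified sketch, not our actual code -- locking, TLB maintenance and
error handling are omitted, and the constants are assumptions about the
layout):

/* Mirror the PGD entries covering the guest range [4T, 8T) from the old
 * QEMU's mm into the new one.  Each PGD entry covers PGDIR_SIZE bytes,
 * so every lower-level table (PUD/PMD/PTE) under it is shared wholesale. */
#define GUEST_BASE	(4UL << 40)	/* 4T */
#define GUEST_END	(8UL << 40)	/* 8T */

static void share_guest_pgd(struct mm_struct *old_mm, struct mm_struct *new_mm)
{
	unsigned long addr;

	for (addr = GUEST_BASE; addr < GUEST_END; addr += PGDIR_SIZE) {
		pgd_t *src = pgd_offset(old_mm, addr);
		pgd_t *dst = pgd_offset(new_mm, addr);

		if (!pgd_none(*src))
			set_pgd(dst, *src);
	}
}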



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2021-08-17  0:47                 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
@ 2021-08-17  0:55                   ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2021-08-17  0:55 UTC (permalink / raw)
  To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
  Cc: David Hildenbrand, Khalid Aziz, Steven Sistare, Anthony Yznaga,
	linux-kernel, linux-mm, Gonglei (Arei)

On Tue, Aug 17, 2021 at 12:47:19AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> I understand your intent now: you want to share memory ranges by sharing
> the relevant pgtable pages.
> 
> I had implemented a similar idea to support QEMU live upgrade about four
> years ago (in late 2017):
> 
> https://patents.google.com/patent/US20210089345A1

I am not going to read a patent.  This conversation is now over until
I have had a conversation with my patent attorney.


^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2021-08-17  0:56 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
2020-07-27 17:07 ` Eric W. Biederman
2020-07-27 18:00   ` Steven Sistare
2020-07-28 13:40     ` Christian Brauner
2020-07-27 17:11 ` [RFC PATCH 1/5] elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 2/5] mm: do not assume only the stack vma exists in setup_arg_pages() Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
2020-07-28 13:38   ` Eric W. Biederman
2020-07-28 17:44     ` Anthony Yznaga
2020-07-29 13:52   ` Kirill A. Shutemov
2020-07-29 23:20     ` Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 4/5] exec, elf: require opt-in for accepting preserved mem Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 5/5] mm: introduce MADV_DOEXEC Anthony Yznaga
2020-07-28 13:22   ` Kirill Tkhai
2020-07-28 14:06     ` Steven Sistare
2020-07-28 11:34 ` [RFC PATCH 0/5] madvise MADV_DOEXEC Kirill Tkhai
2020-07-28 17:28   ` Anthony Yznaga
2020-07-28 14:23 ` Andy Lutomirski
2020-07-28 14:30   ` Steven Sistare
2020-07-30 15:22 ` Matthew Wilcox
2020-07-30 15:27   ` Christian Brauner
2020-07-30 15:34     ` Matthew Wilcox
2020-07-30 15:54       ` Christian Brauner
2020-07-31  9:12     ` Stefan Hajnoczi
2020-07-30 15:59   ` Steven Sistare
2020-07-30 17:12     ` Matthew Wilcox
2020-07-30 17:35       ` Steven Sistare
2020-07-30 17:49         ` Matthew Wilcox
2020-07-30 18:27           ` Steven Sistare
2020-07-30 21:58             ` Eric W. Biederman
2020-07-31 14:57               ` Steven Sistare
2020-07-31 15:27                 ` Matthew Wilcox
2020-07-31 16:11                   ` Steven Sistare
2020-07-31 16:56                     ` Jason Gunthorpe
2020-07-31 17:15                       ` Steven Sistare
2020-07-31 17:48                         ` Jason Gunthorpe
2020-07-31 17:55                           ` Steven Sistare
2020-07-31 17:23                     ` Matthew Wilcox
2020-08-03 15:28                 ` Eric W. Biederman
2020-08-03 15:42                   ` James Bottomley
2020-08-03 20:03                     ` Steven Sistare
     [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
2020-08-04 11:13                       ` Matthew Wilcox
2020-08-03 19:29                   ` Steven Sistare
2020-07-31 19:41 ` Steven Sistare
2021-07-08  9:52 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-07-08 12:48   ` Steven Sistare
2021-07-12  1:05     ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-07-12  1:30       ` Matthew Wilcox
2021-07-13  0:57         ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-08-13 19:49           ` Khalid Aziz
2021-08-14 20:07             ` David Laight
2021-08-16  0:26               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-08-16  8:07                 ` David Laight
2021-08-16  6:54             ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-08-16  8:02             ` David Hildenbrand
2021-08-16 12:07               ` Matthew Wilcox
2021-08-16 12:20                 ` David Hildenbrand
2021-08-16 12:42                   ` David Hildenbrand
2021-08-16 12:46                   ` Matthew Wilcox
2021-08-16 13:24                     ` David Hildenbrand
2021-08-16 13:32                       ` Matthew Wilcox
2021-08-16 14:10                         ` David Hildenbrand
2021-08-16 14:27                           ` Matthew Wilcox
2021-08-16 14:33                             ` David Hildenbrand
2021-08-16 14:40                               ` Matthew Wilcox
2021-08-16 15:01                                 ` David Hildenbrand
2021-08-16 15:59                                   ` Matthew Wilcox
2021-08-16 16:06                                     ` Khalid Aziz
2021-08-16 16:15                                       ` Matthew Wilcox
2021-08-16 16:13                                     ` David Hildenbrand
2021-08-16 12:27                 ` [private] " David Hildenbrand
2021-08-16 12:30                   ` David Hildenbrand
2021-08-17  0:47                 ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2021-08-17  0:55                   ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).