Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
@ 2020-07-27 17:07 ` Eric W. Biederman
  2020-07-27 18:00   ` Steven Sistare
  2020-07-27 17:11 ` [RFC PATCH 1/5] elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings Anthony Yznaga
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Eric W. Biederman @ 2020-07-27 17:07 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar,
	steven.sistare

Anthony Yznaga <anthony.yznaga@oracle.com> writes:

> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

What is wrong with using a file descriptor to a possibly deleted file
and re-mmaping it?

There is already MAP_FIXED that allows you to ensure you have the same
address.

I think all it would take would be a small protocol from one version
to the next to say map file descriptor #N and address #A.  Something
easily passed on the command line.

There is just enough complexity in exec currently that our
implementation of exec is already teetering.  So if we could use
existing mechanisms it would be good.

And I don't see why this version of sharing a mmap area would be
particularly general.  I do imagine that being able to force a
mmap area into a setuid executable would be a fun attack vector.

Perhaps I missed this in my skim of these patches but I did not see
anything that guarded this feature against an exec that changes an
applications privileges.

Eric


> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
>
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
>
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
>
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH 0/5] madvise MADV_DOEXEC
@ 2020-07-27 17:11 Anthony Yznaga
  2020-07-27 17:07 ` Eric W. Biederman
                   ` (9 more replies)
  0 siblings, 10 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-27 17:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

This patchset adds support for preserving an anonymous memory range across
exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
sharing memory in this manner, as opposed to re-attaching to a named shared
memory segment, is to ensure it is mapped at the same virtual address in
the new process as it was in the old one.  An intended use for this is to
preserve guest memory for guests using vfio while qemu exec's an updated
version of itself.  By ensuring the memory is preserved at a fixed address,
vfio mappings and their associated kernel data structures can remain valid.
In addition, for the qemu use case, qemu instances that back guest RAM with
anonymous memory can be updated.

Patches 1 and 2 ensure that loading of ELF load segments does not silently
clobber existing VMAS, and remove assumptions that the stack is the only
VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
and could be considered on its own.

Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
using an ELF note.

Anthony Yznaga (5):
  elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
  mm: do not assume only the stack vma exists in setup_arg_pages()
  mm: introduce VM_EXEC_KEEP
  exec, elf: require opt-in for accepting preserved mem
  mm: introduce MADV_DOEXEC

 arch/x86/Kconfig                       |   1 +
 fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
 fs/exec.c                              |  33 +++++-
 include/linux/binfmts.h                |   7 +-
 include/linux/mm.h                     |   5 +
 include/uapi/asm-generic/mman-common.h |   3 +
 kernel/fork.c                          |   2 +-
 mm/madvise.c                           |  25 +++++
 mm/mmap.c                              |  47 ++++++++
 9 files changed, 266 insertions(+), 53 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH 1/5] elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
  2020-07-27 17:07 ` Eric W. Biederman
@ 2020-07-27 17:11 ` Anthony Yznaga
  2020-07-27 17:11 ` [RFC PATCH 2/5] mm: do not assume only the stack vma exists in setup_arg_pages() Anthony Yznaga
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-27 17:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

Commit b212921b13bd ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") reverted back to using MAP_FIXED to map elf load
segments because it was found that the load segments in some binaries
overlap and can cause MAP_FIXED_NOREPLACE to fail.  The original intent
of MAP_FIXED_NOREPLACE was to prevent the silent clobbering of an
existing mapping (e.g. the stack) by the elf image.  To achieve this,
expand on the logic used when loading ET_DYN binaries which calculates a
total size for the image when the first segment is mapped, maps the
entire image, and then unmaps the remainder before remaining segments
are mapped.  Apply this to ET_EXEC binaries as well as ET_DYN binaries
as is done now, and for both ET_EXEC and ET_DYN+INTERP use
MAP_FIXED_NOREPLACE for the initial total size mapping and MAP_FIXED for
remaining mappings.  For ET_DYN w/out INTERP, continue to map at a
system-selected address in the mmap region.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 fs/binfmt_elf.c | 112 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 64 insertions(+), 48 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 9fe3b51c116a..6445a6dbdb1d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1046,58 +1046,25 @@ static int load_elf_binary(struct linux_binprm *bprm)
 
 		vaddr = elf_ppnt->p_vaddr;
 		/*
-		 * If we are loading ET_EXEC or we have already performed
-		 * the ET_DYN load_addr calculations, proceed normally.
+		 * Map remaining segments with MAP_FIXED once the first
+		 * total size mapping has been done.
 		 */
-		if (elf_ex->e_type == ET_EXEC || load_addr_set) {
+		if (load_addr_set) {
 			elf_flags |= MAP_FIXED;
-		} else if (elf_ex->e_type == ET_DYN) {
-			/*
-			 * This logic is run once for the first LOAD Program
-			 * Header for ET_DYN binaries to calculate the
-			 * randomization (load_bias) for all the LOAD
-			 * Program Headers, and to calculate the entire
-			 * size of the ELF mapping (total_size). (Note that
-			 * load_addr_set is set to true later once the
-			 * initial mapping is performed.)
-			 *
-			 * There are effectively two types of ET_DYN
-			 * binaries: programs (i.e. PIE: ET_DYN with INTERP)
-			 * and loaders (ET_DYN without INTERP, since they
-			 * _are_ the ELF interpreter). The loaders must
-			 * be loaded away from programs since the program
-			 * may otherwise collide with the loader (especially
-			 * for ET_EXEC which does not have a randomized
-			 * position). For example to handle invocations of
-			 * "./ld.so someprog" to test out a new version of
-			 * the loader, the subsequent program that the
-			 * loader loads must avoid the loader itself, so
-			 * they cannot share the same load range. Sufficient
-			 * room for the brk must be allocated with the
-			 * loader as well, since brk must be available with
-			 * the loader.
-			 *
-			 * Therefore, programs are loaded offset from
-			 * ELF_ET_DYN_BASE and loaders are loaded into the
-			 * independently randomized mmap region (0 load_bias
-			 * without MAP_FIXED).
-			 */
-			if (interpreter) {
-				load_bias = ELF_ET_DYN_BASE;
-				if (current->flags & PF_RANDOMIZE)
-					load_bias += arch_mmap_rnd();
-				elf_flags |= MAP_FIXED;
-			} else
-				load_bias = 0;
-
+		} else {
 			/*
-			 * Since load_bias is used for all subsequent loading
-			 * calculations, we must lower it by the first vaddr
-			 * so that the remaining calculations based on the
-			 * ELF vaddrs will be correctly offset. The result
-			 * is then page aligned.
+			 * To ensure loading does not continue if an ELF
+			 * LOAD segment overlaps an existing mapping (e.g.
+			 * the stack), for the first LOAD Program Header
+			 * calculate the the entire size of the ELF mapping
+			 * and map it with MAP_FIXED_NOREPLACE. On success,
+			 * the remainder will be unmapped and subsequent
+			 * LOAD segments mapped with MAP_FIXED rather than
+			 * MAP_FIXED_NOREPLACE because some binaries may
+			 * have overlapping segments that would cause the
+			 * mmap to fail.
 			 */
-			load_bias = ELF_PAGESTART(load_bias - vaddr);
+			elf_flags |= MAP_FIXED_NOREPLACE;
 
 			total_size = total_mapping_size(elf_phdata,
 							elf_ex->e_phnum);
@@ -1105,6 +1072,55 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				retval = -EINVAL;
 				goto out_free_dentry;
 			}
+
+			if (elf_ex->e_type == ET_DYN) {
+				/*
+				 * This logic is run once for the first LOAD
+				 * Program Header for ET_DYN binaries to
+				 * calculate the randomization (load_bias) for
+				 * all the LOAD Program Headers.
+				 *
+				 * There are effectively two types of ET_DYN
+				 * binaries: programs (i.e. PIE: ET_DYN with
+				 * INTERP) and loaders (ET_DYN without INTERP,
+				 * since they _are_ the ELF interpreter). The
+				 * loaders must be loaded away from programs
+				 * since the program may otherwise collide with
+				 * the loader (especially for ET_EXEC which does
+				 * not have a randomized position). For example
+				 * to handle invocations of "./ld.so someprog"
+				 * to test out a new version of the loader, the
+				 * subsequent program that the loader loads must
+				 * avoid the loader itself, so they cannot share
+				 * the same load range. Sufficient room for the
+				 * brk must be allocated with the loader as
+				 * well, since brk must be available with the
+				 * loader.
+				 *
+				 * Therefore, programs are loaded offset from
+				 * ELF_ET_DYN_BASE and loaders are loaded into
+				 * the independently randomized mmap region
+				 * (0 load_bias without MAP_FIXED*).
+				 */
+				if (interpreter) {
+					load_bias = ELF_ET_DYN_BASE;
+					if (current->flags & PF_RANDOMIZE)
+						load_bias += arch_mmap_rnd();
+				} else {
+					load_bias = 0;
+					elf_flags &= ~MAP_FIXED_NOREPLACE;
+				}
+
+				/*
+				 * Since load_bias is used for all subsequent
+				 * loading calculations, we must lower it by
+				 * the first vaddr so that the remaining
+				 * calculations based on the ELF vaddrs will
+				 * be correctly offset. The result is then
+				 * page aligned.
+				 */
+				load_bias = ELF_PAGESTART(load_bias - vaddr);
+			}
 		}
 
 		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH 2/5] mm: do not assume only the stack vma exists in setup_arg_pages()
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
  2020-07-27 17:07 ` Eric W. Biederman
  2020-07-27 17:11 ` [RFC PATCH 1/5] elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings Anthony Yznaga
@ 2020-07-27 17:11 ` Anthony Yznaga
  2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-27 17:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

In preparation for allowing vmas to be preserved across exec do not
assume that there is no prev vma to pass to mprotect_fixup() in
setup_arg_pages().
Also, setup_arg_pages() expands the initial stack of a process by
128k or to the stack size limit, whichever is smaller.  expand_stack()
assumes there is no vma between the vma passed to it and the
address to expand to, so check before calling it.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 fs/exec.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a70327..262112e5f9f8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -720,7 +720,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	unsigned long stack_shift;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = bprm->vma;
-	struct vm_area_struct *prev = NULL;
+	struct vm_area_struct *prev = vma->vm_prev;
 	unsigned long vm_flags;
 	unsigned long stack_base;
 	unsigned long stack_size;
@@ -819,6 +819,10 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	else
 		stack_base = vma->vm_start - stack_expand;
 #endif
+	if (vma != find_vma(mm, stack_base)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
 	current->mm->start_stack = bprm->p;
 	ret = expand_stack(vma, stack_base);
 	if (ret)
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (2 preceding siblings ...)
  2020-07-27 17:11 ` [RFC PATCH 2/5] mm: do not assume only the stack vma exists in setup_arg_pages() Anthony Yznaga
@ 2020-07-27 17:11 ` Anthony Yznaga
  2020-07-28 13:38   ` Eric W. Biederman
  2020-07-29 13:52   ` Kirill A. Shutemov
  2020-07-27 17:11 ` [RFC PATCH 4/5] exec, elf: require opt-in for accepting preserved mem Anthony Yznaga
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-27 17:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

A vma with the VM_EXEC_KEEP flag is preserved across exec.  For anonymous
vmas only.  For safety, overlap with fixed address VMAs created in the new
mm during exec (e.g. the stack and elf load segments) is not permitted and
will cause the exec to fail.
(We are studying how to guarantee there are no conflicts. Comments welcome.)

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/Kconfig   |  1 +
 fs/exec.c          | 20 ++++++++++++++++++++
 include/linux/mm.h |  5 +++++
 kernel/fork.c      |  2 +-
 mm/mmap.c          | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..fc36eb2f45c0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
 	select SWIOTLB
+	select ARCH_USES_HIGH_VMA_FLAGS
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/fs/exec.c b/fs/exec.c
index 262112e5f9f8..1de09c4eef00 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
 EXPORT_SYMBOL(read_code);
 #endif
 
+static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
+		if (vma->vm_flags & VM_EXEC_KEEP) {
+			ret = vma_dup(vma, new_mm);
+			if (ret)
+				return ret;
+		}
+	return 0;
+}
+
 /*
  * Maps the mm_struct mm into the current task struct.
  * On success, this function returns with the mutex
@@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
 			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
+		ret = vma_dup_some(old_mm, mm);
+		if (ret) {
+			mmap_read_unlock(old_mm);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
+			return ret;
+		}
 	}
 
 	task_lock(tsk);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..1c538ba77f33 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -295,11 +295,15 @@ int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_EXEC_KEEP	BIT(VM_HIGH_ARCH_BIT_5)	/* preserve VMA across exec */
+#else
+#define VM_EXEC_KEEP	VM_NONE
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -2534,6 +2538,7 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
 	unsigned long addr, unsigned long len, pgoff_t pgoff,
 	bool *need_rmap_locks);
 extern void exit_mmap(struct mm_struct *);
+extern int vma_dup(struct vm_area_struct *vma, struct mm_struct *mm);
 
 static inline int check_data_rlimit(unsigned long rlim,
 				    unsigned long new,
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493203ae..15ead613714f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -564,7 +564,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			tmp->anon_vma = NULL;
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT | VM_EXEC_KEEP);
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);
diff --git a/mm/mmap.c b/mm/mmap.c
index 59a4682ebf3f..be2ff53743c3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3279,6 +3279,53 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	return NULL;
 }
 
+int vma_dup(struct vm_area_struct *old_vma, struct mm_struct *mm)
+{
+	unsigned long npages;
+	struct mm_struct *old_mm = old_vma->vm_mm;
+	struct vm_area_struct *vma;
+	int ret = -ENOMEM;
+
+	if (WARN_ON(old_vma->vm_file || old_vma->vm_ops))
+		return -EINVAL;
+
+	vma = find_vma(mm, old_vma->vm_start);
+	if (vma && vma->vm_start < old_vma->vm_end)
+		return -EEXIST;
+
+	npages = vma_pages(old_vma);
+	mm->total_vm += npages;
+
+	vma = vm_area_dup(old_vma);
+	if (!vma)
+		goto fail_nomem;
+
+	ret = vma_dup_policy(old_vma, vma);
+	if (ret)
+		goto fail_nomem_policy;
+
+	vma->vm_mm = mm;
+	ret = anon_vma_fork(vma, old_vma);
+	if (ret)
+		goto fail_nomem_anon_vma_fork;
+
+	vma->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP|VM_EXEC_KEEP);
+	vma->vm_next = vma->vm_prev = NULL;
+	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+	if (is_vm_hugetlb_page(vma))
+		reset_vma_resv_huge_pages(vma);
+	__insert_vm_struct(mm, vma);
+	ret = copy_page_range(mm, old_mm, old_vma);
+	return ret;
+
+fail_nomem_anon_vma_fork:
+	mpol_put(vma_policy(vma));
+fail_nomem_policy:
+	vm_area_free(vma);
+fail_nomem:
+	return -ENOMEM;
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH 4/5] exec, elf: require opt-in for accepting preserved mem
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (3 preceding siblings ...)
  2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
@ 2020-07-27 17:11 ` Anthony Yznaga
  2020-07-27 17:11 ` [RFC PATCH 5/5] mm: introduce MADV_DOEXEC Anthony Yznaga
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-27 17:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

Don't copy preserved VMAs to the binary being exec'd unless the binary has
a "preserved-mem-ok" ELF note.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 fs/binfmt_elf.c         | 84 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/exec.c               | 17 +++++-----
 include/linux/binfmts.h |  7 ++++-
 3 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6445a6dbdb1d..46248b7b0a75 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -683,6 +683,81 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 	return error;
 }
 
+#define NOTES_SZ			SZ_1K
+#define PRESERVED_MEM_OK_STRING		"preserved-mem-ok"
+#define SZ_PRESERVED_MEM_OK_STRING	sizeof(PRESERVED_MEM_OK_STRING)
+
+static int parse_elf_note(struct linux_binprm *bprm, const char *data, size_t *off, size_t datasz)
+{
+	const struct elf_note *nhdr;
+	const char *name;
+	size_t o;
+
+	o = *off;
+	datasz -= o;
+
+	if (datasz < sizeof(*nhdr))
+		return -ENOEXEC;
+
+	nhdr = (const struct elf_note *)(data + o);
+	o += sizeof(*nhdr);
+	datasz -= sizeof(*nhdr);
+
+	/*
+	 * Currently only the preserved-mem-ok elf note is of interest.
+	 */
+	if (nhdr->n_type != 0x07c1feed)
+		goto next;
+
+	if (nhdr->n_namesz > SZ_PRESERVED_MEM_OK_STRING)
+		return -ENOEXEC;
+
+	name =  data + o;
+	if (datasz < SZ_PRESERVED_MEM_OK_STRING ||
+	    strncmp(name, PRESERVED_MEM_OK_STRING, SZ_PRESERVED_MEM_OK_STRING))
+		return -ENOEXEC;
+
+	bprm->accepts_preserved_mem = 1;
+
+next:
+	o += roundup(nhdr->n_namesz, 4) + roundup(nhdr->n_descsz, 4);
+	*off = o;
+
+	return 0;
+}
+
+static int parse_elf_notes(struct linux_binprm *bprm, struct elf_phdr *phdr)
+{
+	char *notes;
+	size_t notes_sz;
+	size_t off = 0;
+	int ret;
+
+	if (!phdr)
+		return 0;
+
+	notes_sz = phdr->p_filesz;
+	if ((notes_sz > NOTES_SZ) || (notes_sz < sizeof(struct elf_note)))
+		return -ENOEXEC;
+
+	notes = kvmalloc(notes_sz, GFP_KERNEL);
+	if (!notes)
+		return -ENOMEM;
+
+	ret = elf_read(bprm->file, notes, notes_sz, phdr->p_offset);
+	if (ret < 0)
+		goto out;
+
+	while (off < notes_sz) {
+		ret = parse_elf_note(bprm, notes, &off, notes_sz);
+		if (ret)
+			break;
+	}
+out:
+	kvfree(notes);
+	return ret;
+}
+
 /*
  * These are the functions used to load ELF style executables and shared
  * libraries.  There is no binary dependent code anywhere else.
@@ -801,6 +876,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	unsigned long error;
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
 	struct elf_phdr *elf_property_phdata = NULL;
+	struct elf_phdr *elf_notes_phdata = NULL;
 	unsigned long elf_bss, elf_brk;
 	int bss_prot = 0;
 	int retval, i;
@@ -909,6 +985,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				executable_stack = EXSTACK_DISABLE_X;
 			break;
 
+		case PT_NOTE:
+			elf_notes_phdata = elf_ppnt;
+			break;
+
 		case PT_LOPROC ... PT_HIPROC:
 			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
 						  bprm->file, false,
@@ -970,6 +1050,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (retval)
 		goto out_free_dentry;
 
+	retval = parse_elf_notes(bprm, elf_notes_phdata);
+	if (retval)
+		goto out_free_dentry;
+
 	/* Flush all traces of the currently running executable */
 	retval = begin_new_exec(bprm);
 	if (retval)
diff --git a/fs/exec.c b/fs/exec.c
index 1de09c4eef00..b2b046fec1f8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1088,10 +1088,11 @@ static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
  * On success, this function returns with the mutex
  * exec_update_mutex locked.
  */
-static int exec_mmap(struct mm_struct *mm)
+static int exec_mmap(struct linux_binprm *bprm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	struct mm_struct *mm = bprm->mm;
 	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
@@ -1118,11 +1119,13 @@ static int exec_mmap(struct mm_struct *mm)
 			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
-		ret = vma_dup_some(old_mm, mm);
-		if (ret) {
-			mmap_read_unlock(old_mm);
-			mutex_unlock(&tsk->signal->exec_update_mutex);
-			return ret;
+		if (bprm->accepts_preserved_mem) {
+			ret = vma_dup_some(old_mm, mm);
+			if (ret) {
+				mmap_read_unlock(old_mm);
+				mutex_unlock(&tsk->signal->exec_update_mutex);
+				return ret;
+			}
 		}
 	}
 
@@ -1386,7 +1389,7 @@ int begin_new_exec(struct linux_binprm * bprm)
 	 * Release all of the old mmap stuff
 	 */
 	acct_arg_size(bprm, 0);
-	retval = exec_mmap(bprm->mm);
+	retval = exec_mmap(bprm);
 	if (retval)
 		goto out;
 
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 4a20b7517dd0..6a66589454c8 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -41,7 +41,12 @@ struct linux_binprm {
 		 * Set when errors can no longer be returned to the
 		 * original userspace.
 		 */
-		point_of_no_return:1;
+		point_of_no_return:1,
+		/*
+		 * Set if the binary being exec'd will accept memory marked
+		 * for preservation by the outgoing process.
+		 */
+		accepts_preserved_mem:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH 5/5] mm: introduce MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (4 preceding siblings ...)
  2020-07-27 17:11 ` [RFC PATCH 4/5] exec, elf: require opt-in for accepting preserved mem Anthony Yznaga
@ 2020-07-27 17:11 ` Anthony Yznaga
  2020-07-28 13:22   ` Kirill Tkhai
  2020-07-28 11:34 ` [RFC PATCH 0/5] madvise MADV_DOEXEC Kirill Tkhai
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-27 17:11 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar, steven.sistare

madvise MADV_DOEXEC preserves a memory range across exec.  Initially
only supported for non-executable, non-stack, anonymous memory.
MADV_DONTEXEC reverts the effect of a previous MADV_DOXEXEC call and
undoes the preservation of the range.  After a successful exec call,
the behavior of all ranges reverts to MADV_DONTEXEC.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/uapi/asm-generic/mman-common.h |  3 +++
 mm/madvise.c                           | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..7c5f616b28f7 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,6 +72,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_DOEXEC	22		/* do inherit across exec */
+#define MADV_DONTEXEC	23		/* don't inherit across exec */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index dd1d43cf026d..b447fa748649 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -103,6 +103,26 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	case MADV_KEEPONFORK:
 		new_flags &= ~VM_WIPEONFORK;
 		break;
+	case MADV_DOEXEC:
+		/*
+		 * MADV_DOEXEC is only supported on private, non-executable,
+		 * non-stack anonymous memory and if the VM_EXEC_KEEP flag
+		 * is available.
+		 */
+		if (!VM_EXEC_KEEP || vma->vm_file || vma->vm_flags & (VM_EXEC|VM_SHARED|VM_STACK)) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= (new_flags & ~VM_MAYEXEC) | VM_EXEC_KEEP;
+		break;
+	case MADV_DONTEXEC:
+		if (!VM_EXEC_KEEP) {
+			error = -EINVAL;
+			goto out;
+		}
+		if (new_flags & VM_EXEC_KEEP)
+			new_flags |= (new_flags & ~VM_EXEC_KEEP) | VM_MAYEXEC;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -983,6 +1003,8 @@ static int madvise_inject_error(int behavior,
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
 #endif
+	case MADV_DOEXEC:
+	case MADV_DONTEXEC:
 		return true;
 
 	default:
@@ -1037,6 +1059,9 @@ static int madvise_inject_error(int behavior,
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *  MADV_DOEXEC - On exec, preserve and duplicate this area in the new process
+ *		  if the new process allows it.
+ *  MADV_DONTEXEC - Undo the effect of MADV_DOEXEC.
  *
  * return values:
  *  zero    - success
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:07 ` Eric W. Biederman
@ 2020-07-27 18:00   ` Steven Sistare
  2020-07-28 13:40     ` Christian Brauner
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2020-07-27 18:00 UTC (permalink / raw)
  To: Eric W. Biederman, Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar

On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> 
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> What is wrong with using a file descriptor to a possibly deleted file
> and re-mmaping it?
> 
> There is already MAP_FIXED that allows you to ensure you have the same
> address.

MAP_FIXED blows away any existing mapping in that range, which is not the 
desired behavior.  We want to preserve the previously created mapping at
the same VA and co-exist with other mappings created by the new process.
There is no way to guarantee availability of a VA post-exec.

> I think all it would take would be a small protocol from one version
> to the next to say map file descriptor #N and address #A.  Something
> easily passed on the command line.
> 
> There is just enough complexity in exec currently that our
> implementation of exec is already teetering.  So if we could use
> existing mechanisms it would be good.
> 
> And I don't see why this version of sharing a mmap area would be
> particularly general.  I do imagine that being able to force a
> mmap area into a setuid executable would be a fun attack vector.

Any mmap(MAP_ANON) segment can be preserved.  That is very general, and is 
the case we need to support to upgrade legacy applications that are already
running (such as qemu) -- we cannot recode them before they are updated.

> Perhaps I missed this in my skim of these patches but I did not see
> anything that guarded this feature against an exec that changes an
> applications privileges.

The ELF opt-in flag must be set, so only selected executables will accept
incoming mappings.  The exec() code verifies that the preserved mappings 
do not overlap or become adjacent to text or stack, so it is not possible for
example for an attacker to cause stack underflow or overflow to access injected
content.  A gadget invoked by some other attack could access the preserved content.

- Steve

>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (5 preceding siblings ...)
  2020-07-27 17:11 ` [RFC PATCH 5/5] mm: introduce MADV_DOEXEC Anthony Yznaga
@ 2020-07-28 11:34 ` Kirill Tkhai
  2020-07-28 17:28   ` Anthony Yznaga
  2020-07-28 14:23 ` Andy Lutomirski
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Kirill Tkhai @ 2020-07-28 11:34 UTC (permalink / raw)
  To: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On 27.07.2020 20:11, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,

So, the goal is an update of QEMU binary without a stopping of virtual machine?

> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.
> 
> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
> 
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
> 
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
> 
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 5/5] mm: introduce MADV_DOEXEC
  2020-07-27 17:11 ` [RFC PATCH 5/5] mm: introduce MADV_DOEXEC Anthony Yznaga
@ 2020-07-28 13:22   ` Kirill Tkhai
  2020-07-28 14:06     ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Kirill Tkhai @ 2020-07-28 13:22 UTC (permalink / raw)
  To: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On 27.07.2020 20:11, Anthony Yznaga wrote:
> madvise MADV_DOEXEC preserves a memory range across exec.  Initially
> only supported for non-executable, non-stack, anonymous memory.
> MADV_DONTEXEC reverts the effect of a previous MADV_DOXEXEC call and
> undoes the preservation of the range.  After a successful exec call,
> the behavior of all ranges reverts to MADV_DONTEXEC.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  mm/madvise.c                           | 25 +++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..7c5f616b28f7 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -72,6 +72,9 @@
>  #define MADV_COLD	20		/* deactivate these pages */
>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>  
> +#define MADV_DOEXEC	22		/* do inherit across exec */
> +#define MADV_DONTEXEC	23		/* don't inherit across exec */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dd1d43cf026d..b447fa748649 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -103,6 +103,26 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  	case MADV_KEEPONFORK:
>  		new_flags &= ~VM_WIPEONFORK;
>  		break;
> +	case MADV_DOEXEC:

For me MADV_KEEPONEXEC sounds better as it's symmetric to MADV_KEEPONFORK.

> +		/*
> +		 * MADV_DOEXEC is only supported on private, non-executable,
> +		 * non-stack anonymous memory and if the VM_EXEC_KEEP flag
> +		 * is available.
> +		 */
> +		if (!VM_EXEC_KEEP || vma->vm_file || vma->vm_flags & (VM_EXEC|VM_SHARED|VM_STACK)) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags |= (new_flags & ~VM_MAYEXEC) | VM_EXEC_KEEP;
> +		break;
> +	case MADV_DONTEXEC:
> +		if (!VM_EXEC_KEEP) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		if (new_flags & VM_EXEC_KEEP)
> +			new_flags |= (new_flags & ~VM_EXEC_KEEP) | VM_MAYEXEC;
> +		break;
>  	case MADV_DONTDUMP:
>  		new_flags |= VM_DONTDUMP;
>  		break;
> @@ -983,6 +1003,8 @@ static int madvise_inject_error(int behavior,
>  	case MADV_SOFT_OFFLINE:
>  	case MADV_HWPOISON:
>  #endif
> +	case MADV_DOEXEC:
> +	case MADV_DONTEXEC:
>  		return true;
>  
>  	default:
> @@ -1037,6 +1059,9 @@ static int madvise_inject_error(int behavior,
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *  MADV_DOEXEC - On exec, preserve and duplicate this area in the new process
> + *		  if the new process allows it.
> + *  MADV_DONTEXEC - Undo the effect of MADV_DOEXEC.
>   *
>   * return values:
>   *  zero    - success
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
@ 2020-07-28 13:38   ` Eric W. Biederman
  2020-07-28 17:44     ` Anthony Yznaga
  2020-07-29 13:52   ` Kirill A. Shutemov
  1 sibling, 1 reply; 44+ messages in thread
From: Eric W. Biederman @ 2020-07-28 13:38 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar,
	steven.sistare

Anthony Yznaga <anthony.yznaga@oracle.com> writes:

> A vma with the VM_EXEC_KEEP flag is preserved across exec.  For anonymous
> vmas only.  For safety, overlap with fixed address VMAs created in the new
> mm during exec (e.g. the stack and elf load segments) is not permitted and
> will cause the exec to fail.
> (We are studying how to guarantee there are no conflicts. Comments welcome.)
>

> diff --git a/fs/exec.c b/fs/exec.c
> index 262112e5f9f8..1de09c4eef00 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
>  EXPORT_SYMBOL(read_code);
>  #endif
>  
> +static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
> +		if (vma->vm_flags & VM_EXEC_KEEP) {
> +			ret = vma_dup(vma, new_mm);
> +			if (ret)
> +				return ret;
> +		}
> +	return 0;
> +}
> +
>  /*
>   * Maps the mm_struct mm into the current task struct.
>   * On success, this function returns with the mutex
> @@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
>  			mutex_unlock(&tsk->signal->exec_update_mutex);
>  			return -EINTR;
>  		}
> +		ret = vma_dup_some(old_mm, mm);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ouch! An unconditional loop through all of the vmas of the execing
process, just in case there is a VM_EXEC_KEEP vma.

I know we already walk the list in exit_mmap, but I get the feeling this
will slow exec down when this feature is not enabled, especially when
a process with a lot of vmas is calling exec.

                
> +		if (ret) {
> +			mmap_read_unlock(old_mm);
> +			mutex_unlock(&tsk->signal->exec_update_mutex);
> +			return ret;
> +		}
>  	}
>  
>  	task_lock(tsk);


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 18:00   ` Steven Sistare
@ 2020-07-28 13:40     ` Christian Brauner
  0 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2020-07-28 13:40 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, linux-api

On Mon, Jul 27, 2020 at 02:00:17PM -0400, Steven Sistare wrote:
> On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> > Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> > 
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > What is wrong with using a file descriptor to a possibly deleted file
> > and re-mmaping it?
> > 
> > There is already MAP_FIXED that allows you to ensure you have the same
> > address.
> 
> MAP_FIXED blows away any existing mapping in that range, which is not the 
> desired behavior.  We want to preserve the previously created mapping at

There's also MAP_FIXED_NOREPLACE since v4.17 in case that helps.

Note that this should really go to linux-api too. I won't argue to
resend it since that would mean spamming everyone's inbox with the same
thread again but in case you send a revised version, please ensure to Cc
linux-api. The glibc folks are listening on there too.

Thanks!
Christian


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 5/5] mm: introduce MADV_DOEXEC
  2020-07-28 13:22   ` Kirill Tkhai
@ 2020-07-28 14:06     ` Steven Sistare
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2020-07-28 14:06 UTC (permalink / raw)
  To: Kirill Tkhai, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar

On 7/28/2020 9:22 AM, Kirill Tkhai wrote:
> On 27.07.2020 20:11, Anthony Yznaga wrote:
>> madvise MADV_DOEXEC preserves a memory range across exec.  Initially
>> only supported for non-executable, non-stack, anonymous memory.
>> MADV_DONTEXEC reverts the effect of a previous MADV_DOXEXEC call and
>> undoes the preservation of the range.  After a successful exec call,
>> the behavior of all ranges reverts to MADV_DONTEXEC.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>  include/uapi/asm-generic/mman-common.h |  3 +++
>>  mm/madvise.c                           | 25 +++++++++++++++++++++++++
>>  2 files changed, 28 insertions(+)
>>
>> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
>> index f94f65d429be..7c5f616b28f7 100644
>> --- a/include/uapi/asm-generic/mman-common.h
>> +++ b/include/uapi/asm-generic/mman-common.h
>> @@ -72,6 +72,9 @@
>>  #define MADV_COLD	20		/* deactivate these pages */
>>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>>  
>> +#define MADV_DOEXEC	22		/* do inherit across exec */
>> +#define MADV_DONTEXEC	23		/* don't inherit across exec */
>> +
>>  /* compatibility flags */
>>  #define MAP_FILE	0
>>  
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index dd1d43cf026d..b447fa748649 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -103,6 +103,26 @@ static long madvise_behavior(struct vm_area_struct *vma,
>>  	case MADV_KEEPONFORK:
>>  		new_flags &= ~VM_WIPEONFORK;
>>  		break;
>> +	case MADV_DOEXEC:
> 
> For me MADV_KEEPONEXEC sounds better as it's symmetric to MADV_KEEPONFORK.

We chose MADV_DOEXEC and MADV_DONTEXEC to match the precedent set by:

#define MADV_DONTFORK   10              /* don't inherit across fork */


#define MADV_DOFORK     11              /* do inherit across fork */


I do like "keep" as a concise description of the operation.  KEEPONFORK is not a perfect 
analog because its opposite is wipe ...

#define MADV_WIPEONFORK 18              /* Zero memory on fork, child only */
#define MADV_KEEPONFORK 19              /* Undo MADV_WIPEONFORK */

... but if folks are ok with that then IMO these are all good choices:

MADV_KEEPONEXEC
MADV_DROPONEXEC

MADV_KEEPEXEC    (shorter)
MADV_DROPEXEC 

MADV_KEEP_EXEC   (more legible, but no existing MADV names use 2nd underscores)
MADV_DROP_EXEC

Whatever folks like best.

- Steve

>> +		/*
>> +		 * MADV_DOEXEC is only supported on private, non-executable,
>> +		 * non-stack anonymous memory and if the VM_EXEC_KEEP flag
>> +		 * is available.
>> +		 */
>> +		if (!VM_EXEC_KEEP || vma->vm_file || vma->vm_flags & (VM_EXEC|VM_SHARED|VM_STACK)) {
>> +			error = -EINVAL;
>> +			goto out;
>> +		}
>> +		new_flags |= (new_flags & ~VM_MAYEXEC) | VM_EXEC_KEEP;
>> +		break;
>> +	case MADV_DONTEXEC:
>> +		if (!VM_EXEC_KEEP) {
>> +			error = -EINVAL;
>> +			goto out;
>> +		}
>> +		if (new_flags & VM_EXEC_KEEP)
>> +			new_flags |= (new_flags & ~VM_EXEC_KEEP) | VM_MAYEXEC;
>> +		break;
>>  	case MADV_DONTDUMP:
>>  		new_flags |= VM_DONTDUMP;
>>  		break;
>> @@ -983,6 +1003,8 @@ static int madvise_inject_error(int behavior,
>>  	case MADV_SOFT_OFFLINE:
>>  	case MADV_HWPOISON:
>>  #endif
>> +	case MADV_DOEXEC:
>> +	case MADV_DONTEXEC:
>>  		return true;
>>  
>>  	default:
>> @@ -1037,6 +1059,9 @@ static int madvise_inject_error(int behavior,
>>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>>   *		from being included in its core dump.
>>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
>> + *  MADV_DOEXEC - On exec, preserve and duplicate this area in the new process
>> + *		  if the new process allows it.
>> + *  MADV_DONTEXEC - Undo the effect of MADV_DOEXEC.
>>   *
>>   * return values:
>>   *  zero    - success
>>
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (6 preceding siblings ...)
  2020-07-28 11:34 ` [RFC PATCH 0/5] madvise MADV_DOEXEC Kirill Tkhai
@ 2020-07-28 14:23 ` Andy Lutomirski
  2020-07-28 14:30   ` Steven Sistare
  2020-07-30 15:22 ` Matthew Wilcox
  2020-07-31 19:41 ` Steven Sistare
  9 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2020-07-28 14:23 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare



> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> 
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-28 14:23 ` Andy Lutomirski
@ 2020-07-28 14:30   ` Steven Sistare
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2020-07-28 14:30 UTC (permalink / raw)
  To: Andy Lutomirski, Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar

On 7/28/2020 10:23 AM, Andy Lutomirski wrote:
>> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.

Preservation is disabled across fork, so fork/exec combo's are not affected.  We forgot to document that.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-28 11:34 ` [RFC PATCH 0/5] madvise MADV_DOEXEC Kirill Tkhai
@ 2020-07-28 17:28   ` Anthony Yznaga
  0 siblings, 0 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-28 17:28 UTC (permalink / raw)
  To: Kirill Tkhai, linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, christian.brauner, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare



On 7/28/20 4:34 AM, Kirill Tkhai wrote:
> On 27.07.2020 20:11, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
> So, the goal is an update of QEMU binary without a stopping of virtual machine?
Essentially, yes.  The VM is paused very briefly.

Anthony
>
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
>>
>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-28 13:38   ` Eric W. Biederman
@ 2020-07-28 17:44     ` Anthony Yznaga
  0 siblings, 0 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-28 17:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar,
	steven.sistare



On 7/28/20 6:38 AM, ebiederm@xmission.com wrote:
> Anthony Yznaga <anthony.yznaga@oracle.com> writes:
>
>> A vma with the VM_EXEC_KEEP flag is preserved across exec.  For anonymous
>> vmas only.  For safety, overlap with fixed address VMAs created in the new
>> mm during exec (e.g. the stack and elf load segments) is not permitted and
>> will cause the exec to fail.
>> (We are studying how to guarantee there are no conflicts. Comments welcome.)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 262112e5f9f8..1de09c4eef00 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
>>  EXPORT_SYMBOL(read_code);
>>  #endif
>>  
>> +static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
>> +{
>> +	struct vm_area_struct *vma;
>> +	int ret;
>> +
>> +	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
>> +		if (vma->vm_flags & VM_EXEC_KEEP) {
>> +			ret = vma_dup(vma, new_mm);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Maps the mm_struct mm into the current task struct.
>>   * On success, this function returns with the mutex
>> @@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
>>  			mutex_unlock(&tsk->signal->exec_update_mutex);
>>  			return -EINTR;
>>  		}
>> +		ret = vma_dup_some(old_mm, mm);
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Ouch! An unconditional loop through all of the vmas of the execing
> process, just in case there is a VM_EXEC_KEEP vma.
>
> I know we already walk the list in exit_mmap, but I get the feeling this
> will slow exec down when this feature is not enabled, especially when
> a process with a lot of vmas is calling exec.
Patch 4 changes this to only call vma_dup_some() if the new
binary has opted in to accepting preserved memory.

Anthony
>
>                 
>> +		if (ret) {
>> +			mmap_read_unlock(old_mm);
>> +			mutex_unlock(&tsk->signal->exec_update_mutex);
>> +			return ret;
>> +		}
>>  	}
>>  
>>  	task_lock(tsk);



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
  2020-07-28 13:38   ` Eric W. Biederman
@ 2020-07-29 13:52   ` Kirill A. Shutemov
  2020-07-29 23:20     ` Anthony Yznaga
  1 sibling, 1 reply; 44+ messages in thread
From: Kirill A. Shutemov @ 2020-07-29 13:52 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare

On Mon, Jul 27, 2020 at 10:11:25AM -0700, Anthony Yznaga wrote:
> A vma with the VM_EXEC_KEEP flag is preserved across exec.  For anonymous
> vmas only.  For safety, overlap with fixed address VMAs created in the new
> mm during exec (e.g. the stack and elf load segments) is not permitted and
> will cause the exec to fail.
> (We are studying how to guarantee there are no conflicts. Comments welcome.)
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  arch/x86/Kconfig   |  1 +
>  fs/exec.c          | 20 ++++++++++++++++++++
>  include/linux/mm.h |  5 +++++
>  kernel/fork.c      |  2 +-
>  mm/mmap.c          | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 74 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 883da0abf779..fc36eb2f45c0 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -30,6 +30,7 @@ config X86_64
>  	select MODULES_USE_ELF_RELA
>  	select NEED_DMA_MAP_STATE
>  	select SWIOTLB
> +	select ARCH_USES_HIGH_VMA_FLAGS
>  
>  config FORCE_DYNAMIC_FTRACE
>  	def_bool y
> diff --git a/fs/exec.c b/fs/exec.c
> index 262112e5f9f8..1de09c4eef00 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
>  EXPORT_SYMBOL(read_code);
>  #endif
>  
> +static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
> +		if (vma->vm_flags & VM_EXEC_KEEP) {
> +			ret = vma_dup(vma, new_mm);
> +			if (ret)
> +				return ret;
> +		}
> +	return 0;
> +}
> +
>  /*
>   * Maps the mm_struct mm into the current task struct.
>   * On success, this function returns with the mutex
> @@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
>  			mutex_unlock(&tsk->signal->exec_update_mutex);
>  			return -EINTR;
>  		}
> +		ret = vma_dup_some(old_mm, mm);
> +		if (ret) {
> +			mmap_read_unlock(old_mm);
> +			mutex_unlock(&tsk->signal->exec_update_mutex);
> +			return ret;
> +		}
>  	}
>  
>  	task_lock(tsk);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dc7b87310c10..1c538ba77f33 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -295,11 +295,15 @@ int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
>  #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>  #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>  #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_EXEC_KEEP	BIT(VM_HIGH_ARCH_BIT_5)	/* preserve VMA across exec */
> +#else
> +#define VM_EXEC_KEEP	VM_NONE
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -2534,6 +2538,7 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
>  	unsigned long addr, unsigned long len, pgoff_t pgoff,
>  	bool *need_rmap_locks);
>  extern void exit_mmap(struct mm_struct *);
> +extern int vma_dup(struct vm_area_struct *vma, struct mm_struct *mm);
>  
>  static inline int check_data_rlimit(unsigned long rlim,
>  				    unsigned long new,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index efc5493203ae..15ead613714f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -564,7 +564,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			tmp->anon_vma = NULL;
>  		} else if (anon_vma_fork(tmp, mpnt))
>  			goto fail_nomem_anon_vma_fork;
> -		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> +		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT | VM_EXEC_KEEP);
>  		file = tmp->vm_file;
>  		if (file) {
>  			struct inode *inode = file_inode(file);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 59a4682ebf3f..be2ff53743c3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3279,6 +3279,53 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  	return NULL;
>  }
>  
> +int vma_dup(struct vm_area_struct *old_vma, struct mm_struct *mm)
> +{
> +	unsigned long npages;
> +	struct mm_struct *old_mm = old_vma->vm_mm;
> +	struct vm_area_struct *vma;
> +	int ret = -ENOMEM;
> +
> +	if (WARN_ON(old_vma->vm_file || old_vma->vm_ops))
> +		return -EINVAL;
> +
> +	vma = find_vma(mm, old_vma->vm_start);
> +	if (vma && vma->vm_start < old_vma->vm_end)
> +		return -EEXIST;
> +
> +	npages = vma_pages(old_vma);
> +	mm->total_vm += npages;

Why only total_vm? Where's exec_vm/stack_vm/data_vm?

> +
> +	vma = vm_area_dup(old_vma);
> +	if (!vma)
> +		goto fail_nomem;
> +
> +	ret = vma_dup_policy(old_vma, vma);
> +	if (ret)
> +		goto fail_nomem_policy;
> +
> +	vma->vm_mm = mm;
> +	ret = anon_vma_fork(vma, old_vma);
> +	if (ret)
> +		goto fail_nomem_anon_vma_fork;

Looks like a duplication of code form dup_mmap().
Any chance to get in one place?

> +	vma->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP|VM_EXEC_KEEP);
> +	vma->vm_next = vma->vm_prev = NULL;

No need. vm_area_dup() takes care of it.

> +	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

Semantics of VM_EXEC_KEEP vs userfaultfd() deserves a detailed explanation.
I feel these flags has to be mutually exclusive.

> +	if (is_vm_hugetlb_page(vma))
> +		reset_vma_resv_huge_pages(vma);
> +	__insert_vm_struct(mm, vma);
> +	ret = copy_page_range(mm, old_mm, old_vma);
> +	return ret;
> +
> +fail_nomem_anon_vma_fork:
> +	mpol_put(vma_policy(vma));
> +fail_nomem_policy:
> +	vm_area_free(vma);
> +fail_nomem:
> +	return -ENOMEM;
> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
> -- 
> 1.8.3.1
> 
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP
  2020-07-29 13:52   ` Kirill A. Shutemov
@ 2020-07-29 23:20     ` Anthony Yznaga
  0 siblings, 0 replies; 44+ messages in thread
From: Anthony Yznaga @ 2020-07-29 23:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare



On 7/29/20 6:52 AM, Kirill A. Shutemov wrote:
> On Mon, Jul 27, 2020 at 10:11:25AM -0700, Anthony Yznaga wrote:
>> A vma with the VM_EXEC_KEEP flag is preserved across exec.  For anonymous
>> vmas only.  For safety, overlap with fixed address VMAs created in the new
>> mm during exec (e.g. the stack and elf load segments) is not permitted and
>> will cause the exec to fail.
>> (We are studying how to guarantee there are no conflicts. Comments welcome.)
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>  arch/x86/Kconfig   |  1 +
>>  fs/exec.c          | 20 ++++++++++++++++++++
>>  include/linux/mm.h |  5 +++++
>>  kernel/fork.c      |  2 +-
>>  mm/mmap.c          | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>>  5 files changed, 74 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 883da0abf779..fc36eb2f45c0 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -30,6 +30,7 @@ config X86_64
>>  	select MODULES_USE_ELF_RELA
>>  	select NEED_DMA_MAP_STATE
>>  	select SWIOTLB
>> +	select ARCH_USES_HIGH_VMA_FLAGS
>>  
>>  config FORCE_DYNAMIC_FTRACE
>>  	def_bool y
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 262112e5f9f8..1de09c4eef00 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1069,6 +1069,20 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
>>  EXPORT_SYMBOL(read_code);
>>  #endif
>>  
>> +static int vma_dup_some(struct mm_struct *old_mm, struct mm_struct *new_mm)
>> +{
>> +	struct vm_area_struct *vma;
>> +	int ret;
>> +
>> +	for (vma = old_mm->mmap; vma; vma = vma->vm_next)
>> +		if (vma->vm_flags & VM_EXEC_KEEP) {
>> +			ret = vma_dup(vma, new_mm);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Maps the mm_struct mm into the current task struct.
>>   * On success, this function returns with the mutex
>> @@ -1104,6 +1118,12 @@ static int exec_mmap(struct mm_struct *mm)
>>  			mutex_unlock(&tsk->signal->exec_update_mutex);
>>  			return -EINTR;
>>  		}
>> +		ret = vma_dup_some(old_mm, mm);
>> +		if (ret) {
>> +			mmap_read_unlock(old_mm);
>> +			mutex_unlock(&tsk->signal->exec_update_mutex);
>> +			return ret;
>> +		}
>>  	}
>>  
>>  	task_lock(tsk);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index dc7b87310c10..1c538ba77f33 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -295,11 +295,15 @@ int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
>>  #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>>  #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>>  #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
>> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>>  #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>>  #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>>  #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>>  #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>>  #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
>> +#define VM_EXEC_KEEP	BIT(VM_HIGH_ARCH_BIT_5)	/* preserve VMA across exec */
>> +#else
>> +#define VM_EXEC_KEEP	VM_NONE
>>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>>  
>>  #ifdef CONFIG_ARCH_HAS_PKEYS
>> @@ -2534,6 +2538,7 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
>>  	unsigned long addr, unsigned long len, pgoff_t pgoff,
>>  	bool *need_rmap_locks);
>>  extern void exit_mmap(struct mm_struct *);
>> +extern int vma_dup(struct vm_area_struct *vma, struct mm_struct *mm);
>>  
>>  static inline int check_data_rlimit(unsigned long rlim,
>>  				    unsigned long new,
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index efc5493203ae..15ead613714f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -564,7 +564,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  			tmp->anon_vma = NULL;
>>  		} else if (anon_vma_fork(tmp, mpnt))
>>  			goto fail_nomem_anon_vma_fork;
>> -		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>> +		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT | VM_EXEC_KEEP);
>>  		file = tmp->vm_file;
>>  		if (file) {
>>  			struct inode *inode = file_inode(file);
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 59a4682ebf3f..be2ff53743c3 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3279,6 +3279,53 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>  	return NULL;
>>  }
>>  
>> +int vma_dup(struct vm_area_struct *old_vma, struct mm_struct *mm)
>> +{
>> +	unsigned long npages;
>> +	struct mm_struct *old_mm = old_vma->vm_mm;
>> +	struct vm_area_struct *vma;
>> +	int ret = -ENOMEM;
>> +
>> +	if (WARN_ON(old_vma->vm_file || old_vma->vm_ops))
>> +		return -EINVAL;
>> +
>> +	vma = find_vma(mm, old_vma->vm_start);
>> +	if (vma && vma->vm_start < old_vma->vm_end)
>> +		return -EEXIST;
>> +
>> +	npages = vma_pages(old_vma);
>> +	mm->total_vm += npages;
> Why only total_vm? Where's exec_vm/stack_vm/data_vm?
That was oversight.  Will be fixed in the next version.

>
>> +
>> +	vma = vm_area_dup(old_vma);
>> +	if (!vma)
>> +		goto fail_nomem;
>> +
>> +	ret = vma_dup_policy(old_vma, vma);
>> +	if (ret)
>> +		goto fail_nomem_policy;
>> +
>> +	vma->vm_mm = mm;
>> +	ret = anon_vma_fork(vma, old_vma);
>> +	if (ret)
>> +		goto fail_nomem_anon_vma_fork;
> Looks like a duplication of code form dup_mmap().
> Any chance to get in one place?
I looked at that, but dup_mmap() is dissimilar enough with
the additional fork-specific code in dup_mmap() that I think
readability would suffer.

>
>> +	vma->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP|VM_EXEC_KEEP);
>> +	vma->vm_next = vma->vm_prev = NULL;
> No need. vm_area_dup() takes care of it.
Will fix.

>
>> +	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> Semantics of VM_EXEC_KEEP vs userfaultfd() deserves a detailed explanation.
> I feel these flags has to be mutually exclusive.
Yes, will document this better.
I think it's okay to mark userfaultfd-enabled memory for preservation
as long as it is understood that memory would need to re-registered
with userfaultfd() after exec and restore, if desired.  Unless there's
a particular issue you see with this?

Thanks,
Anthony

>
>> +	if (is_vm_hugetlb_page(vma))
>> +		reset_vma_resv_huge_pages(vma);
>> +	__insert_vm_struct(mm, vma);
>> +	ret = copy_page_range(mm, old_mm, old_vma);
>> +	return ret;
>> +
>> +fail_nomem_anon_vma_fork:
>> +	mpol_put(vma_policy(vma));
>> +fail_nomem_policy:
>> +	vm_area_free(vma);
>> +fail_nomem:
>> +	return -ENOMEM;
>> +}
>> +
>>  /*
>>   * Return true if the calling process may expand its vm space by the passed
>>   * number of pages
>> -- 
>> 1.8.3.1
>>
>>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (7 preceding siblings ...)
  2020-07-28 14:23 ` Andy Lutomirski
@ 2020-07-30 15:22 ` Matthew Wilcox
  2020-07-30 15:27   ` Christian Brauner
  2020-07-30 15:59   ` Steven Sistare
  2020-07-31 19:41 ` Steven Sistare
  9 siblings, 2 replies; 44+ messages in thread
From: Matthew Wilcox @ 2020-07-30 15:22 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar, steven.sistare

On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I just realised that something else I'm working on might be a suitable
alternative to this.  Apologies for not realising it sooner.

http://www.wil.cx/~willy/linux/sileby.html

To use this, you'd mshare() the anonymous memory range, essentially
detaching the VMA from the current process's mm_struct and reparenting
it to this new mm_struct, which has an fd referencing it.

Then you call exec(), and the exec'ed task gets to call mmap() on that
new fd to attach the memory range to its own address space.

Presto!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:22 ` Matthew Wilcox
@ 2020-07-30 15:27   ` Christian Brauner
  2020-07-30 15:34     ` Matthew Wilcox
  2020-07-31  9:12     ` Stefan Hajnoczi
  2020-07-30 15:59   ` Steven Sistare
  1 sibling, 2 replies; 44+ messages in thread
From: Christian Brauner @ 2020-07-30 15:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > This patchset adds support for preserving an anonymous memory range across
> > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > sharing memory in this manner, as opposed to re-attaching to a named shared
> > memory segment, is to ensure it is mapped at the same virtual address in
> > the new process as it was in the old one.  An intended use for this is to
> > preserve guest memory for guests using vfio while qemu exec's an updated
> > version of itself.  By ensuring the memory is preserved at a fixed address,
> > vfio mappings and their associated kernel data structures can remain valid.
> > In addition, for the qemu use case, qemu instances that back guest RAM with
> > anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html

Just skimming: make it O_CLOEXEC by default. ;)

> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:27   ` Christian Brauner
@ 2020-07-30 15:34     ` Matthew Wilcox
  2020-07-30 15:54       ` Christian Brauner
  2020-07-31  9:12     ` Stefan Hajnoczi
  1 sibling, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2020-07-30 15:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > This patchset adds support for preserving an anonymous memory range across
> > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > memory segment, is to ensure it is mapped at the same virtual address in
> > > the new process as it was in the old one.  An intended use for this is to
> > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > vfio mappings and their associated kernel data structures can remain valid.
> > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> 
> Just skimming: make it O_CLOEXEC by default. ;)

I appreciate the suggestion, and it makes sense for many 'return an fd'
interfaces, but the point of mshare() is to, well, share.  So sharing
the fd with a child is a common usecase, unlike say sharing a timerfd.
The only other reason to use mshare() is to pass the fd over a unix
socket to a non-child, and I submit that is far less common than wanting
to share with a child.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:34     ` Matthew Wilcox
@ 2020-07-30 15:54       ` Christian Brauner
  0 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2020-07-30 15:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare

On Thu, Jul 30, 2020 at 04:34:50PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> > On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > > This patchset adds support for preserving an anonymous memory range across
> > > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > > memory segment, is to ensure it is mapped at the same virtual address in
> > > > the new process as it was in the old one.  An intended use for this is to
> > > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > > vfio mappings and their associated kernel data structures can remain valid.
> > > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > > anonymous memory can be updated.
> > > 
> > > I just realised that something else I'm working on might be a suitable
> > > alternative to this.  Apologies for not realising it sooner.
> > > 
> > > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > Just skimming: make it O_CLOEXEC by default. ;)
> 
> I appreciate the suggestion, and it makes sense for many 'return an fd'
> interfaces, but the point of mshare() is to, well, share.  So sharing
> the fd with a child is a common usecase, unlike say sharing a timerfd.

Fair point, from reading I thought the main reason was share after
fork() but having an fd over exec() may be a good use-case too.
(Fwiw, this very much looks like what the memfd_create() api should have
looked like, i.e. mshare() could've possibly encompassed both.)

> The only other reason to use mshare() is to pass the fd over a unix
> socket to a non-child, and I submit that is far less common than wanting
> to share with a child.

Well, we have use-cases for that too. E.g. where we need to attach to an
existing pid namespace which requires a first fork() of an intermediate
process so that the caller doesn't change the pid_for_children
namespace. Then we setns() to the target pid namespace in the
intermediate process which changes the pid_ns_children namespace such
that the next process will be a proper member of the target pid
namespace. Finally, we fork() the target process that is now a full
member of the target pid namespace. We also set CLONE_PARENT so
grandfather process == father process for the second process and then
have the intermediate process exit. Some fds we only ever open or create
after the intermediate process exited and some fds we can only open or
create after the intermediate process has been created and then send the
fds via scm (since we can't share the fdtable) to the final process.

Christian


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:22 ` Matthew Wilcox
  2020-07-30 15:27   ` Christian Brauner
@ 2020-07-30 15:59   ` Steven Sistare
  2020-07-30 17:12     ` Matthew Wilcox
  1 sibling, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2020-07-30 15:59 UTC (permalink / raw)
  To: Matthew Wilcox, Anthony Yznaga
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-arch, mhocko, tglx,
	mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm, keescook, gerg,
	ktkhai, christian.brauner, peterz, esyr, jgg, christian, areber,
	cyphar

On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html
> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!

To be suitable for the qemu use case, we need a guarantee that the same VA range
is available in the new process, with nothing else mapped there.  From your spec,
it sounds like the new process could do a series of unrelated mmap's which could
overlap the desired va range before the silby mmap(fd) is performed??

Also, we need to support updating legacy processes that already created anon segments.
We inject code that calls MADV_DOEXEC for such segments.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:59   ` Steven Sistare
@ 2020-07-30 17:12     ` Matthew Wilcox
  2020-07-30 17:35       ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2020-07-30 17:12 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > To use this, you'd mshare() the anonymous memory range, essentially
> > detaching the VMA from the current process's mm_struct and reparenting
> > it to this new mm_struct, which has an fd referencing it.
> > 
> > Then you call exec(), and the exec'ed task gets to call mmap() on that
> > new fd to attach the memory range to its own address space.
> > 
> > Presto!
> 
> To be suitable for the qemu use case, we need a guarantee that the same VA range
> is available in the new process, with nothing else mapped there.  From your spec,
> it sounds like the new process could do a series of unrelated mmap's which could
> overlap the desired va range before the silby mmap(fd) is performed??

That could happen.  eg libc might get its text segment mapped there
randomly.  I believe Khalid was working on a solution for reserving
memory ranges.

> Also, we need to support updating legacy processes that already created anon segments.
> We inject code that calls MADV_DOEXEC for such segments.

Yes, I was assuming you'd inject code that called mshare().

Actually, since you're injecting code, why do you need the kernel to
be involved?  You can mmap the new executable and any libraries it depends
upon, set up a new stack and jump to the main() entry point, all without
calling exec().  I appreciate it'd be a fair amount of code, but it'd all
be in userspace and you can probably steal / reuse code from ld.so (I'm
not familiar with the details of how setting up an executable is done).


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 17:12     ` Matthew Wilcox
@ 2020-07-30 17:35       ` Steven Sistare
  2020-07-30 17:49         ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2020-07-30 17:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/30/2020 1:12 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
>> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
>>> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>>>> This patchset adds support for preserving an anonymous memory range across
>>>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>>>> sharing memory in this manner, as opposed to re-attaching to a named shared
>>>> memory segment, is to ensure it is mapped at the same virtual address in
>>>> the new process as it was in the old one.  An intended use for this is to
>>>> preserve guest memory for guests using vfio while qemu exec's an updated
>>>> version of itself.  By ensuring the memory is preserved at a fixed address,
>>>> vfio mappings and their associated kernel data structures can remain valid.
>>>> In addition, for the qemu use case, qemu instances that back guest RAM with
>>>> anonymous memory can be updated.
>>>
>>> I just realised that something else I'm working on might be a suitable
>>> alternative to this.  Apologies for not realising it sooner.
>>>
>>> http://www.wil.cx/~willy/linux/sileby.html
>>>
>>> To use this, you'd mshare() the anonymous memory range, essentially
>>> detaching the VMA from the current process's mm_struct and reparenting
>>> it to this new mm_struct, which has an fd referencing it.
>>>
>>> Then you call exec(), and the exec'ed task gets to call mmap() on that
>>> new fd to attach the memory range to its own address space.
>>>
>>> Presto!
>>
>> To be suitable for the qemu use case, we need a guarantee that the same VA range
>> is available in the new process, with nothing else mapped there.  From your spec,
>> it sounds like the new process could do a series of unrelated mmap's which could
>> overlap the desired va range before the silby mmap(fd) is performed??
> 
> That could happen.  eg libc might get its text segment mapped there
> randomly.  I believe Khalid was working on a solution for reserving
> memory ranges.

mshare + VA reservation is another possible solution.

Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

>> Also, we need to support updating legacy processes that already created anon segments.
>> We inject code that calls MADV_DOEXEC for such segments.
> 
> Yes, I was assuming you'd inject code that called mshare().

OK, mshare works on existing memory and builds a new vma.

> Actually, since you're injecting code, why do you need the kernel to
> be involved?  You can mmap the new executable and any libraries it depends
> upon, set up a new stack and jump to the main() entry point, all without
> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> be in userspace and you can probably steal / reuse code from ld.so (I'm
> not familiar with the details of how setting up an executable is done).

Duplicating all the work that the kernel and loader do to exec a process would
be error prone, require ongoing maintenance, and be redundant.  Better to define 
a small kernel extension and leave exec to the kernel.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 17:35       ` Steven Sistare
@ 2020-07-30 17:49         ` Matthew Wilcox
  2020-07-30 18:27           ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2020-07-30 17:49 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
> mshare + VA reservation is another possible solution.
> 
> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

We are.  This is the part of the review process where we explore other
solutions to the problem.

> >> Also, we need to support updating legacy processes that already created anon segments.
> >> We inject code that calls MADV_DOEXEC for such segments.
> > 
> > Yes, I was assuming you'd inject code that called mshare().
> 
> OK, mshare works on existing memory and builds a new vma.

Actually, reparents an existing VMA, and reuses the existing page tables.

> > Actually, since you're injecting code, why do you need the kernel to
> > be involved?  You can mmap the new executable and any libraries it depends
> > upon, set up a new stack and jump to the main() entry point, all without
> > calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> > be in userspace and you can probably steal / reuse code from ld.so (I'm
> > not familiar with the details of how setting up an executable is done).
> 
> Duplicating all the work that the kernel and loader do to exec a process would
> be error prone, require ongoing maintenance, and be redundant.  Better to define 
> a small kernel extension and leave exec to the kernel.

Either this is a one-off kind of thing, in which case it doesn't need
ongoing maintenance, or it's something with broad applicability, in
which case it can live as its own userspace project.  It could even
start off life as part of qemu and then fork into its own project.
The idea of tagging an ELF executable to say "I can cope with having
chunks of my address space provided to me by my executor" is ... odd.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 17:49         ` Matthew Wilcox
@ 2020-07-30 18:27           ` Steven Sistare
  2020-07-30 21:58             ` Eric W. Biederman
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2020-07-30 18:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>> mshare + VA reservation is another possible solution.
>>
>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
> 
> We are.  This is the part of the review process where we explore other
> solutions to the problem.
> 
>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>
>>> Yes, I was assuming you'd inject code that called mshare().
>>
>> OK, mshare works on existing memory and builds a new vma.
> 
> Actually, reparents an existing VMA, and reuses the existing page tables.
> 
>>> Actually, since you're injecting code, why do you need the kernel to
>>> be involved?  You can mmap the new executable and any libraries it depends
>>> upon, set up a new stack and jump to the main() entry point, all without
>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>> not familiar with the details of how setting up an executable is done).
>>
>> Duplicating all the work that the kernel and loader do to exec a process would
>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>> a small kernel extension and leave exec to the kernel.
> 
> Either this is a one-off kind of thing, in which case it doesn't need
> ongoing maintenance, or it's something with broad applicability, in
> which case it can live as its own userspace project.  It could even
> start off life as part of qemu and then fork into its own project.

exec will be enhanced over time in the kernel.  A separate user space implementation
would need to track that.

Reimplementing exec in userland would be a big gross mess.  Not a good solution when
we have simple and concise ways of solving the problem.

> The idea of tagging an ELF executable to say "I can cope with having
> chunks of my address space provided to me by my executor" is ... odd.

I don't disagree.  But it is useful.  We already pass a block of data containing
environment variables and arguments from one process to the next.  Preserving 
additional segments is not a big leap from there.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 18:27           ` Steven Sistare
@ 2020-07-30 21:58             ` Eric W. Biederman
  2020-07-31 14:57               ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Eric W. Biederman @ 2020-07-30 21:58 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>> mshare + VA reservation is another possible solution.
>>>
>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>> 
>> We are.  This is the part of the review process where we explore other
>> solutions to the problem.
>> 
>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>
>>>> Yes, I was assuming you'd inject code that called mshare().
>>>
>>> OK, mshare works on existing memory and builds a new vma.
>> 
>> Actually, reparents an existing VMA, and reuses the existing page tables.
>> 
>>>> Actually, since you're injecting code, why do you need the kernel to
>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>> not familiar with the details of how setting up an executable is done).
>>>
>>> Duplicating all the work that the kernel and loader do to exec a process would
>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>> a small kernel extension and leave exec to the kernel.
>> 
>> Either this is a one-off kind of thing, in which case it doesn't need
>> ongoing maintenance, or it's something with broad applicability, in
>> which case it can live as its own userspace project.  It could even
>> start off life as part of qemu and then fork into its own project.
>
> exec will be enhanced over time in the kernel.  A separate user space implementation
> would need to track that.
>
> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
> we have simple and concise ways of solving the problem.

The interesting work is not done in exec.  The interesting work of
loading an executable is done in ld.so, and ld.so lives in userspace.

>> The idea of tagging an ELF executable to say "I can cope with having
>> chunks of my address space provided to me by my executor" is ... odd.
>
> I don't disagree.  But it is useful.  We already pass a block of data containing
> environment variables and arguments from one process to the next.  Preserving 
> additional segments is not a big leap from there.

But we randomize where that block lives.

It has been found to be very useful to have randomization by default and
you are arguing against that kind of randomization.



Here is another suggestion.

Have a very simple program that does:

	for (;;) {
		handle = dlopen("/my/real/program");
		real_main = dlsym(handle, "main");
		real_main(argc, argv, envp);
		dlclose(handle);
	}

With whatever obvious adjustments are needed to fit your usecase.

That should give the same level of functionality, be portable to all
unices, and not require you to duplicate code.  It belive it limits you
to not upgrading libc, or librt but that is a comparatively small
limitation.


Given that in general the interesting work is done in userspace and that
userspace has provided an interface for reusing that work already.
I don't see the justification for adding anything to exec at this point.


Eric


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 15:27   ` Christian Brauner
  2020-07-30 15:34     ` Matthew Wilcox
@ 2020-07-31  9:12     ` Stefan Hajnoczi
  1 sibling, 0 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2020-07-31  9:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Anthony Yznaga, linux-kernel, linux-fsdevel, linux-mm,
	linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd,
	ebiederm, keescook, gerg, ktkhai, peterz, esyr, jgg, christian,
	areber, cyphar, steven.sistare, Andrea Arcangeli, jasowang,
	Philippe Mathieu-Daudé,
	Peter Xu, Alex Williamson, Alexander Graf, vgoyal


[-- Attachment #1: Type: text/plain, Size: 5254 bytes --]

Hi,
Sileby looks interesting! I had just written up the following idea which
seems similar but includes a mechanism for revoking mappings.

Alexander Graf recently brought up an idea that solves the following
problem:

When process A passes shared memory file descriptors to process B there
is no way for process A to revoke access or change page protection bits
after passing the fd.

I'll describe the idea (not sure if it's exactly what Alexander had in
mind).

Memory view driver
------------------
The memory view driver allows process A to control the page table
entries of an mmap in process B. It is a character device driver that
process A opens like this:

  int fd = open("/dev/memory-view", O_RDWR);

This returns a file descriptor to a new memory view.

Next process A sets the size of the memory view:

  ftruncate(fd, 16 * GiB);

The size determines how large the memory view will be. The size is a
virtual memory concept and does not consume resources (there is no
physical memory backing this).

Process A populates the memory view with ranges from file descriptors it
wishes to share. The file descriptor can be a shared memory file
descriptor:

  int memfd = memfd_create("guest-ram, 0);
  ftruncate(memfd, 32 * GiB);

  /* Map [8GB, 10GB) at 8GB into the memory view */
  struct memview_map_fd_info = {
      .fd = memfd,
      .fd_offset = 8 * GiB,
      .size = 2 * GiB,
      .mem_offset = 8 * GiB,
      .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

It is also possible to populate the memory view from the page cache:

  int filefd = open("big-file.iso", O_RDONLY);

  /* Map [4GB, 12GB) at 0B into the memory view */
  struct memview_map_fd_info = {
      .fd = filefd,
      .fd_offset = 4 * GiB,
      .size = 8 * GiB,
      .mem_offset = 0,
      .flags = MEMVIEW_MAP_READ,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

The memory view has now been populated like this:

Range (GiB)   Fd               Permissions
0-8           big-file.iso     read
8-10          guest-ram        read+write
10-16         <none>           <none>

Now process A gets the "view" file descriptor for this memory view. The
view file descriptor does not allow ioctls. It can be safely passed to
process B in the knowledge that process B can only mmap or close it:

  int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD);

  ...pass viewfd to process B...

Process B receives viewfd and mmaps it:

  void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED,
                   viewfd, 0);

When process B accesses a page in the mmap region the memory view
driver resolves the page fault by checking if the page is mapped to an
fd and what its permissions are.

For example, accessing the page at 4GB from the start of the memory view
is an access at 8GB into big-file.iso. That's because 8GB of
big-file.iso was mapped at 0 with fd_offset 4GB.

To summarize, there is one vma in process B and the memory view driver
maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by
process A.

Page protection bits are the AND of the mmap
PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's
MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the
mapping in question.

Does vmf_insert_mixed_prot() or a similar Linux API allow this?

Can the memory view driver map pages from fds without pinning the pages?

Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also
ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires
zapping affected process B ptes. When process B accesses those pages
again the fault handler will handle the page fault based on the latest
memory view layout.

If process B accesses a page with incorrect permissions or that has not
been configured by process A ioctl calls, a SIGSEGV/SIGBUS signal is
raised.

When process B uses mprotect(2) and other virtual memory syscalls it
is unable to increase page permissions. Instead it can only reduce them
because the pte protection bits are the AND of the mmap flags and the
memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC
flags.

Use cases
---------
How to use the memory view driver for vhost-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user and other out-of-process device emulation interfaces need a
way for the VMM to enforce the IOMMU mappings on the device emulation
process.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later. With the
memory view driver the VMM will pass one or more memory view fds instead
of the actual guest RAM fds. This allows the VMM to invoke
ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke
access.

How to use the memory view driver for virtio-fs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process creates a memory view for the device's
DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory
region so that the guest directly accesses the memory view.

Now virtiofsd can map portions of files into the DAX Window without
coordinating with the QEMU process. This simplifies the virtio-fs code
and should also improve DAX map/unmap performance.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-30 21:58             ` Eric W. Biederman
@ 2020-07-31 14:57               ` Steven Sistare
  2020-07-31 15:27                 ` Matthew Wilcox
  2020-08-03 15:28                 ` Eric W. Biederman
  0 siblings, 2 replies; 44+ messages in thread
From: Steven Sistare @ 2020-07-31 14:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>>> mshare + VA reservation is another possible solution.
>>>>
>>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>>>
>>> We are.  This is the part of the review process where we explore other
>>> solutions to the problem.
>>>
>>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>>
>>>>> Yes, I was assuming you'd inject code that called mshare().
>>>>
>>>> OK, mshare works on existing memory and builds a new vma.
>>>
>>> Actually, reparents an existing VMA, and reuses the existing page tables.
>>>
>>>>> Actually, since you're injecting code, why do you need the kernel to
>>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>>> not familiar with the details of how setting up an executable is done).
>>>>
>>>> Duplicating all the work that the kernel and loader do to exec a process would
>>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>>> a small kernel extension and leave exec to the kernel.
>>>
>>> Either this is a one-off kind of thing, in which case it doesn't need
>>> ongoing maintenance, or it's something with broad applicability, in
>>> which case it can live as its own userspace project.  It could even
>>> start off life as part of qemu and then fork into its own project.
>>
>> exec will be enhanced over time in the kernel.  A separate user space implementation
>> would need to track that.
>>
>> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
>> we have simple and concise ways of solving the problem.
> 
> The interesting work is not done in exec.  The interesting work of
> loading an executable is done in ld.so, and ld.so lives in userspace.

And yet a smart guy named Eric said:
  "There is just enough complexity in exec currently that our implementation of exec is 
  already teetering."
which suggests that a user-space re-implementation will also be complex.
In complexity lies bugs.

>>> The idea of tagging an ELF executable to say "I can cope with having
>>> chunks of my address space provided to me by my executor" is ... odd.
>>
>> I don't disagree.  But it is useful.  We already pass a block of data containing
>> environment variables and arguments from one process to the next.  Preserving 
>> additional segments is not a big leap from there.
> 
> But we randomize where that block lives.
> 
> It has been found to be very useful to have randomization by default and
> you are arguing against that kind of randomization.

In the normal use case the addresses of the preserved anon segments were
already randomized during allocation in the pre-exec process.  Preservation
across exec does not make the process less safe from an external attack.

Now, if the attacker injects content at known addresses by preserving and
exec'ing, then the post-exec process must decide whether to trust the
pre-exec process.  Let's prevent preservation when exec'ing setuid binaries.
That is not part of this patch series, but is a good idea.  So, uid does not
change across exec.  Do I trust myself?  Yes I do.  If the uid is trusted,
as it must be in the qemu use case which accesses system resources, then
that is sufficient.  If the exec raises caps, and the uid is not trusted,
then the post-exec process must do additional work to verify the pre-exec
credentials.  The method is left to the app.  We defined the ELF opt-in so
that processes that do not use preserved memory will never receive any and
will never need to deal with such verification.

Matthews sileby/mshare proposal has the same issue.  If a process opts-in
and mmap's an address in the shared region, then content becomes mapped at
a VA that was known to the pre-fork or pre-exec process.  Trust must still
be established.

> Here is another suggestion.
> 
> Have a very simple program that does:
> 
> 	for (;;) {
> 		handle = dlopen("/my/real/program");
> 		real_main = dlsym(handle, "main");
> 		real_main(argc, argv, envp);
> 		dlclose(handle);
> 	}
> 
> With whatever obvious adjustments are needed to fit your usecase.
> 
> That should give the same level of functionality, be portable to all
> unices, and not require you to duplicate code.  It belive it limits you
> to not upgrading libc, or librt but that is a comparatively small
> limitation.
> 
> 
> Given that in general the interesting work is done in userspace and that
> userspace has provided an interface for reusing that work already.
> I don't see the justification for adding anything to exec at this point. 

Thanks for the suggestion.  That is clever, and would make a fun project,
but I would not trust it for production.  These few lines are just
the first of many that it would take to reset the environment to the
well-defined post-exec initial conditions that all executables expect,
and incrementally tearing down state will be prone to bugs.  Getting a
clean slate from a kernel exec is a much more reliable design.  The use
case is creating long-lived apps that never go down, and the simplest
implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
simple, and does not even require a new system call, and the kernel already
knows how to exec without bugs.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 14:57               ` Steven Sistare
@ 2020-07-31 15:27                 ` Matthew Wilcox
  2020-07-31 16:11                   ` Steven Sistare
  2020-08-03 15:28                 ` Eric W. Biederman
  1 sibling, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2020-07-31 15:27 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
> and mmap's an address in the shared region, then content becomes mapped at
> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> be established.

It's up to the recipient whether they try to map it at the same address
or at a fresh address.  The intended use case is a "semi-shared" address
space between two processes (ie partway between a threaded, fully-shared
address space and a forked un-shared address space), in which case
there's a certain amount of trust and cooperation between the processes.

Your preservation-across-exec use-case might or might not need the
VMA to be mapped at the same address.  I don't know whether qemu stores
pointers in this VMA which are absolute within the qemu address space.
If it's just the emulated process's address space, then everything will
be absolute within its own address space and everything will be opaque
to qemu.  If qemu is storing its own pointers in it, then it has to be
mapped at the same address.

> > Here is another suggestion.
> > 
> > Have a very simple program that does:
> > 
> > 	for (;;) {
> > 		handle = dlopen("/my/real/program");
> > 		real_main = dlsym(handle, "main");
> > 		real_main(argc, argv, envp);
> > 		dlclose(handle);
> > 	}
> > 
> > With whatever obvious adjustments are needed to fit your usecase.
> > 
> > That should give the same level of functionality, be portable to all
> > unices, and not require you to duplicate code.  It belive it limits you
> > to not upgrading libc, or librt but that is a comparatively small
> > limitation.
> > 
> > 
> > Given that in general the interesting work is done in userspace and that
> > userspace has provided an interface for reusing that work already.
> > I don't see the justification for adding anything to exec at this point. 
> 
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.  Getting a
> clean slate from a kernel exec is a much more reliable design.  The use
> case is creating long-lived apps that never go down, and the simplest
> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
> simple, and does not even require a new system call, and the kernel already
> knows how to exec without bugs.

It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
code removes 200 lines of kernel code, I think I know which I prefer ...


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 15:27                 ` Matthew Wilcox
@ 2020-07-31 16:11                   ` Steven Sistare
  2020-07-31 16:56                     ` Jason Gunthorpe
  2020-07-31 17:23                     ` Matthew Wilcox
  0 siblings, 2 replies; 44+ messages in thread
From: Steven Sistare @ 2020-07-31 16:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
>> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
>> and mmap's an address in the shared region, then content becomes mapped at
>> a VA that was known to the pre-fork or pre-exec process.  Trust must still
>> be established.
> 
> It's up to the recipient whether they try to map it at the same address
> or at a fresh address.  The intended use case is a "semi-shared" address
> space between two processes (ie partway between a threaded, fully-shared
> address space and a forked un-shared address space), in which case
> there's a certain amount of trust and cooperation between the processes.

Understood, but if the recipient does map at any of the same, which is the whole
point because you want to share the page table.  The trust relationship is no
different than for the live update case.  
 
> Your preservation-across-exec use-case might or might not need the
> VMA to be mapped at the same address.  

It does.  qemu registers memory with vfio which remembers the va's in kernel
metadata for the device.

> I don't know whether qemu stores
> pointers in this VMA which are absolute within the qemu address space.
> If it's just the emulated process's address space, then everything will
> be absolute within its own address space and everything will be opaque
> to qemu.  If qemu is storing its own pointers in it, then it has to be
> mapped at the same address.

qemu does not do the latter but that could be a nice way for apps to use
preserved memory.

>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  It belive it limits you
>>> to not upgrading libc, or librt but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.  Getting a
>> clean slate from a kernel exec is a much more reliable design.  The use
>> case is creating long-lived apps that never go down, and the simplest
>> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
>> simple, and does not even require a new system call, and the kernel already
>> knows how to exec without bugs.
> 
> It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> code removes 200 lines of kernel code, I think I know which I prefer ...

It will be *far* more than 4 lines.
Much of the 200 lines is mostly for the elf opt in, and much of the elf code is from
anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment setup.

- Steve




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 16:11                   ` Steven Sistare
@ 2020-07-31 16:56                     ` Jason Gunthorpe
  2020-07-31 17:15                       ` Steven Sistare
  2020-07-31 17:23                     ` Matthew Wilcox
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2020-07-31 16:56 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> > Your preservation-across-exec use-case might or might not need the
> > VMA to be mapped at the same address.  
> 
> It does.  qemu registers memory with vfio which remembers the va's in kernel
> metadata for the device.

Once the memory is registered with vfio the VA doesn't matter, vfio
will keep the iommu pointing at the same physical pages no matter
where they are mapped.

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 16:56                     ` Jason Gunthorpe
@ 2020-07-31 17:15                       ` Steven Sistare
  2020-07-31 17:48                         ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2020-07-31 17:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>> Your preservation-across-exec use-case might or might not need the
>>> VMA to be mapped at the same address.  
>>
>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>> metadata for the device.
> 
> Once the memory is registered with vfio the VA doesn't matter, vfio
> will keep the iommu pointing at the same physical pages no matter
> where they are mapped.

Yes, but there are other code paths that compute and use offsets between va and the
base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 16:11                   ` Steven Sistare
  2020-07-31 16:56                     ` Jason Gunthorpe
@ 2020-07-31 17:23                     ` Matthew Wilcox
  1 sibling, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2020-07-31 17:23 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Eric W. Biederman, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> > On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> >> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
> >> and mmap's an address in the shared region, then content becomes mapped at
> >> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> >> be established.
> > 
> > It's up to the recipient whether they try to map it at the same address
> > or at a fresh address.  The intended use case is a "semi-shared" address
> > space between two processes (ie partway between a threaded, fully-shared
> > address space and a forked un-shared address space), in which case
> > there's a certain amount of trust and cooperation between the processes.
> 
> Understood, but if the recipient does map at any of the same, which is the whole
> point because you want to share the page table.  The trust relationship is no
> different than for the live update case.  

You don't have to map at the same address to share the page tables.
For example, on x86 if you share an 8GB region, that must be aligned at
1GB in both the donor and the recipient, but they need not be mapped at
the same address.

> > It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> > code removes 200 lines of kernel code, I think I know which I prefer ...
> 
> It will be *far* more than 4 lines.
> Much of the 200 lines is mostly for the elf opt in, and much of the elf code is from
> anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment setup.

It doesn't really matter how much of it is for the opt-in and how much
is for the exec path itself.  The MAP_FIXED_NOREPLACE patch is only net
+16 lines, so that's not the problem.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 17:15                       ` Steven Sistare
@ 2020-07-31 17:48                         ` Jason Gunthorpe
  2020-07-31 17:55                           ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2020-07-31 17:48 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> > On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> >>> Your preservation-across-exec use-case might or might not need the
> >>> VMA to be mapped at the same address.  
> >>
> >> It does.  qemu registers memory with vfio which remembers the va's in kernel
> >> metadata for the device.
> > 
> > Once the memory is registered with vfio the VA doesn't matter, vfio
> > will keep the iommu pointing at the same physical pages no matter
> > where they are mapped.
> 
> Yes, but there are other code paths that compute and use offsets between va and the
> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

Maybe you could fix vfio instead of having this adventure, if vfio is
the only motivation.

Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 17:48                         ` Jason Gunthorpe
@ 2020-07-31 17:55                           ` Steven Sistare
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2020-07-31 17:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Eric W. Biederman, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, christian, areber, cyphar

On 7/31/2020 1:48 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
>> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
>>> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>>>> Your preservation-across-exec use-case might or might not need the
>>>>> VMA to be mapped at the same address.  
>>>>
>>>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>>>> metadata for the device.
>>>
>>> Once the memory is registered with vfio the VA doesn't matter, vfio
>>> will keep the iommu pointing at the same physical pages no matter
>>> where they are mapped.
>>
>> Yes, but there are other code paths that compute and use offsets between va and the
>> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.
> 
> Maybe you could fix vfio instead of having this adventure, if vfio is
> the only motivation.

Maybe.  We still need to preserve an anonymous segment, though.  MADV_DOEXEC, or mshare,
or something else.  And I think the ability to preserve memory containing pointers to itself
is an interesting use case, though not ours.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
                   ` (8 preceding siblings ...)
  2020-07-30 15:22 ` Matthew Wilcox
@ 2020-07-31 19:41 ` Steven Sistare
  9 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2020-07-31 19:41 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-arch
  Cc: mhocko, tglx, mingo, bp, x86, hpa, viro, akpm, arnd, ebiederm,
	keescook, gerg, ktkhai, christian.brauner, peterz, esyr, jgg,
	christian, areber, cyphar

On 7/27/2020 1:11 PM, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I forgot to mention, our use case is not just theoretical.  It has been implemented
and is pretty cool (but I am biased).  The pause time for the guest is in the
100 - 200 msec range.  We submitted qemu patches for review based on the MADV_DOEXEC
proposal.  In case you are curious:
  https://lore.kernel.org/qemu-devel/1596122076-341293-1-git-send-email-steven.sistare@oracle.com/

- Steve



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-07-31 14:57               ` Steven Sistare
  2020-07-31 15:27                 ` Matthew Wilcox
@ 2020-08-03 15:28                 ` Eric W. Biederman
  2020-08-03 15:42                   ` James Bottomley
  2020-08-03 19:29                   ` Steven Sistare
  1 sibling, 2 replies; 44+ messages in thread
From: Eric W. Biederman @ 2020-08-03 15:28 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>> Here is another suggestion.
>> 
>> Have a very simple program that does:
>> 
>> 	for (;;) {
>> 		handle = dlopen("/my/real/program");
>> 		real_main = dlsym(handle, "main");
>> 		real_main(argc, argv, envp);
>> 		dlclose(handle);
>> 	}
>> 
>> With whatever obvious adjustments are needed to fit your usecase.
>> 
>> That should give the same level of functionality, be portable to all
>> unices, and not require you to duplicate code.  It belive it limits you
>> to not upgrading libc, or librt but that is a comparatively small
>> limitation.
>> 
>> 
>> Given that in general the interesting work is done in userspace and that
>> userspace has provided an interface for reusing that work already.
>> I don't see the justification for adding anything to exec at this point. 
>
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.

Agreed.

> Getting a clean slate from a kernel exec is a much more reliable
> design.

Except you are explicitly throwing that out the window, by preserving
VMAs.  You very much need to have a clean bug free shutdown to pass VMAs
reliably.

> The use case is creating long-lived apps that never go down, and the
> simplest implementation will have the fewest bugs and is the best.
> MADV_DOEXEC is simple, and does not even require a new system call,
> and the kernel already knows how to exec without bugs.

*ROFL*  I wish the kernel knew how to exec things without bugs.
The bugs are hard to hit but the ones I am aware of are not straight
forward to fix.

MADV_DOEXEC is not conceptually simple.  It completely violates the
guarantees that exec is known to make about the contents of the memory
of the new process.  This makes it very difficult to reason about.  Nor
will MADV_DOEXEC be tested very much as it has only one or two users.
Which means in the fullness of time it is likely someone will change
something that will break the implementation subtlely and the bug report
probably won't come in for 3 years, or maybe a decade.  At which point
it won't be clear if the bug even can be fixed as something else might
rely on it.

What is wrong with live migration between one qemu process and another
qemu process on the same machine not work for this use case?

Just reusing live migration would seem to be the simplest path of all,
as the code is already implemented.  Further if something goes wrong
with the live migration you can fallback to the existing process.  With
exec there is no fallback if the new version does not properly support
the handoff protocol of the old version.

Eric


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-08-03 15:28                 ` Eric W. Biederman
@ 2020-08-03 15:42                   ` James Bottomley
  2020-08-03 20:03                     ` Steven Sistare
       [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
  2020-08-03 19:29                   ` Steven Sistare
  1 sibling, 2 replies; 44+ messages in thread
From: James Bottomley @ 2020-08-03 15:42 UTC (permalink / raw)
  To: Eric W. Biederman, Steven Sistare
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
[...]
> What is wrong with live migration between one qemu process and
> another qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of
> all, as the code is already implemented.  Further if something goes
> wrong with the live migration you can fallback to the existing
> process.  With exec there is no fallback if the new version does not
> properly support the handoff protocol of the old version.

Actually, could I ask this another way: the other patch set you sent to
the KVM list was to snapshot the VM to a PKRAM capsule preserved across
kexec using zero copy for extremely fast save/restore.  The original
idea was to use this as part of a CRIU based snapshot, kexec to new
system, restore.  However, why can't you do a local snapshot, restart
qemu, restore using the PKRAM capsule to achieve exactly the same as
MADV_DOEXEC does but using a system that's easy to reason about?  It
may be slightly slower, but I think we're still talking milliseconds.

James



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-08-03 15:28                 ` Eric W. Biederman
  2020-08-03 15:42                   ` James Bottomley
@ 2020-08-03 19:29                   ` Steven Sistare
  1 sibling, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2020-08-03 19:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 8/3/2020 11:28 AM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  It belive it limits you
>>> to not upgrading libc, or librt but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.
> 
> Agreed.
> 
>> Getting a clean slate from a kernel exec is a much more reliable
>> design.
> 
> Except you are explicitly throwing that out the window, by preserving
> VMAs.  You very much need to have a clean bug free shutdown to pass VMAs
> reliably.

Sure.  The whole community relies on you and others to provide a bug free exec.

>> The use case is creating long-lived apps that never go down, and the
>> simplest implementation will have the fewest bugs and is the best.
>> MADV_DOEXEC is simple, and does not even require a new system call,
>> and the kernel already knows how to exec without bugs.
> 
> *ROFL*  I wish the kernel knew how to exec things without bugs.

Essentially you are saying you would argue against any enhancement to exec.
Surely that is too high a bar.  We must continue to evolve an innovate and
balance risk against reward.  This use case matters to our business a lot,
and to others as well, see below.  That is the reward.  I feel you are 
overstating the risk.  Surely there is some early point in the development
cycle of some release where this can be integrated and get enough test
time and soak time to be proven reliable.

> The bugs are hard to hit but the ones I am aware of are not straight
> forward to fix.
> 
> MADV_DOEXEC is not conceptually simple.  It completely violates the
> guarantees that exec is known to make about the contents of the memory
> of the new process.  This makes it very difficult to reason about.  

I have having trouble see the difficulty.  Perhaps I am too familar with
it, but the semantics are few and easy to explain, and it does not introduce
new concepts: the post-exec process is born with a few more mappings than
previously, and non-fixed further mmaps choose addresses in the holes.

> Nor
> will MADV_DOEXEC be tested very much as it has only one or two users.
> Which means in the fullness of time it is likely someone will change
> something that will break the implementation subtlely and the bug report
> probably won't come in for 3 years, or maybe a decade.  At which point
> it won't be clear if the bug even can be fixed as something else might
> rely on it.

That's on us; we need to provide kernel tests and be diligent about testing
new releases.  This matters to our business and we will do so. 
> What is wrong with live migration between one qemu process and another
> qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of all,
> as the code is already implemented.  Further if something goes wrong
> with the live migration you can fallback to the existing process.  With
> exec there is no fallback if the new version does not properly support
> the handoff protocol of the old version.

This is less resource intensive than live migration.  The latter ties up two
hosts, consumes lots of memory and network bandwidth, may take a long time
to converge on a busy system, and is unfeasible for guests with a huge amount
of local storeage (which we call dense I/O shapes).  Live update takes less than
1 second total, and the guest pause time is 100 - 200 msecs.  It is a very
attractive solution that other cloud vendors have implemented as well, with
their own private modifications to exec and and fork.  We have been independently
working in this area, and we are offering our implementation to the community.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
  2020-08-03 15:42                   ` James Bottomley
@ 2020-08-03 20:03                     ` Steven Sistare
       [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
  1 sibling, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2020-08-03 20:03 UTC (permalink / raw)
  To: James Bottomley, Eric W. Biederman
  Cc: Matthew Wilcox, Anthony Yznaga, linux-kernel, linux-fsdevel,
	linux-mm, linux-arch, mhocko, tglx, mingo, bp, x86, hpa, viro,
	akpm, arnd, keescook, gerg, ktkhai, christian.brauner, peterz,
	esyr, jgg, christian, areber, cyphar

On 8/3/2020 11:42 AM, James Bottomley wrote:
> On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> [...]
>> What is wrong with live migration between one qemu process and
>> another qemu process on the same machine not work for this use case?
>>
>> Just reusing live migration would seem to be the simplest path of
>> all, as the code is already implemented.  Further if something goes
>> wrong with the live migration you can fallback to the existing
>> process.  With exec there is no fallback if the new version does not
>> properly support the handoff protocol of the old version.
> 
> Actually, could I ask this another way: the other patch set you sent to
> the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> kexec using zero copy for extremely fast save/restore.  The original
> idea was to use this as part of a CRIU based snapshot, kexec to new
> system, restore.  However, why can't you do a local snapshot, restart
> qemu, restore using the PKRAM capsule to achieve exactly the same as
> MADV_DOEXEC does but using a system that's easy to reason about?  It
> may be slightly slower, but I think we're still talking milliseconds.

Hi James, good to hear from you.  PKRAM or SysV shm could be used for
a restart in that manner, but it would only support sriov guests if the
guest exports an agent that supports suspend-to-ram, and if all guest
drivers support the suspend-to-ram method.  I have done this using a linux
guest and qemu guest agent, and IIRC the guest pause time is 500 - 1000 msec.
With MADV_DOEXEC, pause time is 100 - 200 msec.  The pause time is a handful
of seconds if the guest uses an nvme drive because CC.SHN takes so long
to persist metadata to stable storage.

We could instead pass vfio descriptors from the old process to a 3rd party escrow 
process and pass  them back to the new qemu process, but the shm that vfio has 
already registered must be remapped at the same VA as the previous process, and 
there is no interface to guarantee that.  MAP_FIXED blows away existing mappings 
and breaks the app. MAP_FIXED_NOREPLACE respects existing mappings but cannot map 
the shm and breaks the app.  Adding a feature that reserves VAs would fix that, we 
have experimnted with one.  Fixing the vfio kernel implementation to not use the 
original VA base would also work, but I don't know how doable/difficult that would be.

Both solutions would require a qemu instance to be stopped and relaunched using shm
as guest ram, and its guest rebooted, so they do not let us update legacy 
already-running instances that use anon memory.  That problem solves itself if we 
get these rfe's into linux and qemu, and eventually users shut down the legacy
instances, but that takes years and we need to do it sooner.

- Steve


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
       [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
@ 2020-08-04 11:13                       ` Matthew Wilcox
  0 siblings, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2020-08-04 11:13 UTC (permalink / raw)
  To: David Laight
  Cc: 'James Bottomley',
	Eric W. Biederman, Steven Sistare, Anthony Yznaga, linux-kernel,
	linux-fsdevel, linux-mm, linux-arch, mhocko, tglx, mingo, bp,
	x86, hpa, viro, akpm, arnd, keescook, gerg, ktkhai,
	christian.brauner, peterz, esyr, jgg, christian, areber, cyphar

On Tue, Aug 04, 2020 at 08:44:42AM +0000, David Laight wrote:
> From: James Bottomley
> > Sent: 03 August 2020 16:43
> > 
> > On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> > [...]
> > > What is wrong with live migration between one qemu process and
> > > another qemu process on the same machine not work for this use case?
> > >
> > > Just reusing live migration would seem to be the simplest path of
> > > all, as the code is already implemented.  Further if something goes
> > > wrong with the live migration you can fallback to the existing
> > > process.  With exec there is no fallback if the new version does not
> > > properly support the handoff protocol of the old version.
> > 
> > Actually, could I ask this another way: the other patch set you sent to
> > the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> > kexec using zero copy for extremely fast save/restore.  The original
> > idea was to use this as part of a CRIU based snapshot, kexec to new
> > system, restore.  However, why can't you do a local snapshot, restart
> > qemu, restore using the PKRAM capsule to achieve exactly the same as
> > MADV_DOEXEC does but using a system that's easy to reason about?  It
> > may be slightly slower, but I think we're still talking milliseconds.
> 
> 
> I've had another idea (that is probably impossible...).
> What about a 'reverse mmap' operation.
> Something that creates an fd whose contents are a chunk of the
> processes address space.

http://www.wil.cx/~willy/linux/sileby.html


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, back to index

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27 17:11 [RFC PATCH 0/5] madvise MADV_DOEXEC Anthony Yznaga
2020-07-27 17:07 ` Eric W. Biederman
2020-07-27 18:00   ` Steven Sistare
2020-07-28 13:40     ` Christian Brauner
2020-07-27 17:11 ` [RFC PATCH 1/5] elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 2/5] mm: do not assume only the stack vma exists in setup_arg_pages() Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 3/5] mm: introduce VM_EXEC_KEEP Anthony Yznaga
2020-07-28 13:38   ` Eric W. Biederman
2020-07-28 17:44     ` Anthony Yznaga
2020-07-29 13:52   ` Kirill A. Shutemov
2020-07-29 23:20     ` Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 4/5] exec, elf: require opt-in for accepting preserved mem Anthony Yznaga
2020-07-27 17:11 ` [RFC PATCH 5/5] mm: introduce MADV_DOEXEC Anthony Yznaga
2020-07-28 13:22   ` Kirill Tkhai
2020-07-28 14:06     ` Steven Sistare
2020-07-28 11:34 ` [RFC PATCH 0/5] madvise MADV_DOEXEC Kirill Tkhai
2020-07-28 17:28   ` Anthony Yznaga
2020-07-28 14:23 ` Andy Lutomirski
2020-07-28 14:30   ` Steven Sistare
2020-07-30 15:22 ` Matthew Wilcox
2020-07-30 15:27   ` Christian Brauner
2020-07-30 15:34     ` Matthew Wilcox
2020-07-30 15:54       ` Christian Brauner
2020-07-31  9:12     ` Stefan Hajnoczi
2020-07-30 15:59   ` Steven Sistare
2020-07-30 17:12     ` Matthew Wilcox
2020-07-30 17:35       ` Steven Sistare
2020-07-30 17:49         ` Matthew Wilcox
2020-07-30 18:27           ` Steven Sistare
2020-07-30 21:58             ` Eric W. Biederman
2020-07-31 14:57               ` Steven Sistare
2020-07-31 15:27                 ` Matthew Wilcox
2020-07-31 16:11                   ` Steven Sistare
2020-07-31 16:56                     ` Jason Gunthorpe
2020-07-31 17:15                       ` Steven Sistare
2020-07-31 17:48                         ` Jason Gunthorpe
2020-07-31 17:55                           ` Steven Sistare
2020-07-31 17:23                     ` Matthew Wilcox
2020-08-03 15:28                 ` Eric W. Biederman
2020-08-03 15:42                   ` James Bottomley
2020-08-03 20:03                     ` Steven Sistare
     [not found]                     ` <9371b8272fd84280ae40b409b260bab3@AcuMS.aculab.com>
2020-08-04 11:13                       ` Matthew Wilcox
2020-08-03 19:29                   ` Steven Sistare
2020-07-31 19:41 ` Steven Sistare

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git