linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
@ 2017-08-06 14:04 riel
  2017-08-06 14:04 ` [PATCH 1/2] x86,mpx: make mpx depend on x86-64 to free up VMA flag riel
                   ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: riel @ 2017-08-06 14:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: mike.kravetz, linux-mm, fweimer, colm, akpm, keescook, luto, wad,
	mingo, kirill, dave.hansen

v2: fix MAP_SHARED case and kbuild warnings

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 1/2] x86,mpx: make mpx depend on x86-64 to free up VMA flag
  2017-08-06 14:04 [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK riel
@ 2017-08-06 14:04 ` riel
  2017-08-06 14:04 ` [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK riel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: riel @ 2017-08-06 14:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: mike.kravetz, linux-mm, fweimer, colm, akpm, keescook, luto, wad,
	mingo, kirill, dave.hansen

From: Rik van Riel <riel@redhat.com>

MPX only seems to be available on 64 bit CPUs, starting with Skylake
and Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags,
in order to free up a VMA flag.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/Kconfig   | 4 +++-
 include/linux/mm.h | 8 ++++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 781521b7cf9e..6dff14fadc6f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1756,7 +1756,9 @@ config X86_SMAP
 config X86_INTEL_MPX
 	prompt "Intel MPX (Memory Protection Extensions)"
 	def_bool n
-	depends on CPU_SUP_INTEL
+	# Note: only available in 64-bit mode due to VMA flags shortage
+	depends on CPU_SUP_INTEL && X86_64
+	select ARCH_USES_HIGH_VMA_FLAGS
 	---help---
 	  MPX provides hardware features that can be used in
 	  conjunction with compiler-instrumented code to check
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..7550eeb06ccf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,10 +208,12 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
@@ -235,9 +237,11 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MAPPED_COPY	VM_ARCH_1	/* T if mapped copy of data (nommu mmap) */
 #endif
 
-#if defined(CONFIG_X86)
+#if defined(CONFIG_X86_INTEL_MPX)
 /* MPX specific bounds table or bounds directory */
-# define VM_MPX		VM_ARCH_2
+# define VM_MPX		VM_HIGH_ARCH_BIT_4
+#else
+# define VM_MPX		VM_NONE
 #endif
 
 #ifndef VM_GROWSUP
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-06 14:04 [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK riel
  2017-08-06 14:04 ` [PATCH 1/2] x86,mpx: make mpx depend on x86-64 to free up VMA flag riel
@ 2017-08-06 14:04 ` riel
  2017-08-10 15:23   ` Michal Hocko
  2017-08-07 13:22 ` [PATCH v2 0/2] mm,fork,security: " Michal Hocko
  2017-08-07 18:23 ` Mike Kravetz
  3 siblings, 1 reply; 58+ messages in thread
From: riel @ 2017-08-06 14:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: mike.kravetz, linux-mm, fweimer, colm, akpm, keescook, luto, wad,
	mingo, kirill, dave.hansen

From: Rik van Riel <riel@redhat.com>

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 +++
 arch/mips/include/uapi/asm/mman.h      |  3 +++
 arch/parisc/include/uapi/asm/mman.h    |  3 +++
 arch/xtensa/include/uapi/asm/mman.h    |  3 +++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  2 +-
 include/trace/events/mmflags.h         |  8 +-------
 include/uapi/asm-generic/mman-common.h |  3 +++
 kernel/fork.c                          |  7 +++++++
 mm/madvise.c                           |  8 ++++++++
 mm/memory.c                            | 10 ++++++++++
 11 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 02760f6e6ca4..2a708a792882 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -64,6 +64,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 655e2fb5395b..d59c57d60d7d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -91,6 +91,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 5979745815a5..e205e0179642 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 24365b30aae9..ed23e0a1b30d 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..2591e70216ff 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_NORESERVE)]	= "nr",
 		[ilog2(VM_HUGETLB)]	= "ht",
 		[ilog2(VM_ARCH_1)]	= "ar",
+		[ilog2(VM_WIPEONFORK)]	= "wf",
 		[ilog2(VM_DONTDUMP)]	= "dd",
 #ifdef CONFIG_MEM_SOFT_DIRTY
 		[ilog2(VM_SOFTDIRTY)]	= "sd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7550eeb06ccf..58788c1b9e9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
-#define VM_ARCH_2	0x02000000
+#define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 8e50d01c645f..4c2e4737d7bc 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,12 +125,6 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
 #define __VM_ARCH_SPECIFIC_1 {VM_ARCH_1,	"arch_1"	}
 #endif
 
-#if defined(CONFIG_X86)
-#define __VM_ARCH_SPECIFIC_2 {VM_MPX,		"mpx"		}
-#else
-#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2,	"arch_2"	}
-#endif
-
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
 #else
@@ -162,7 +156,7 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
 	{VM_NORESERVE,			"noreserve"	},		\
 	{VM_HUGETLB,			"hugetlb"	},		\
 	__VM_ARCH_SPECIFIC_1				,		\
-	__VM_ARCH_SPECIFIC_2				,		\
+	{VM_WIPEONFORK,			"wipeonfork"	},		\
 	{VM_DONTDUMP,			"dontdump"	},		\
 IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..49e2b1d78093 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -58,6 +58,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..db1fb2802ecc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
+
+		/* With VM_WIPEONFORK, the child gets an empty VMA. */
+		if (tmp->vm_flags & VM_WIPEONFORK) {
+			tmp->vm_file = file = NULL;
+			tmp->vm_ops = NULL;
+		}
+
 		if (file) {
 			struct inode *inode = file_inode(file);
 			struct address_space *mapping = file->f_mapping;
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..9e644c0ed4dc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -80,6 +80,12 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		}
 		new_flags &= ~VM_DONTCOPY;
 		break;
+	case MADV_WIPEONFORK:
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -689,6 +695,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_WIPEONFORK:
+	case MADV_KEEPONFORK:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..f9b0ad7feb57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			!vma->anon_vma)
 		return 0;
 
+	/*
+	 * With VM_WIPEONFORK, the child inherits the VMA from the
+	 * parent, but not its contents.
+	 *
+	 * A child accessing VM_WIPEONFORK memory will see all zeroes;
+	 * a child accessing VM_DONTCOPY memory receives a segfault.
+	 */
+	if (vma->vm_flags & VM_WIPEONFORK)
+		return 0;
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-06 14:04 [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK riel
  2017-08-06 14:04 ` [PATCH 1/2] x86,mpx: make mpx depend on x86-64 to free up VMA flag riel
  2017-08-06 14:04 ` [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK riel
@ 2017-08-07 13:22 ` Michal Hocko
  2017-08-07 13:46   ` Michal Hocko
  2017-08-07 18:23 ` Mike Kravetz
  3 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-07 13:22 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mike.kravetz, linux-mm, fweimer, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api

This is an user visible API so make sure you CC linux-api (added)

On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> v2: fix MAP_SHARED case and kbuild warnings
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.
> 
> This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> 
>     https://man.openbsd.org/minherit.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-07 13:22 ` [PATCH v2 0/2] mm,fork,security: " Michal Hocko
@ 2017-08-07 13:46   ` Michal Hocko
  2017-08-07 14:19     ` Florian Weimer
                       ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-07 13:46 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mike.kravetz, linux-mm, fweimer, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api

On Mon 07-08-17 15:22:57, Michal Hocko wrote:
> This is an user visible API so make sure you CC linux-api (added)
> 
> On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> > v2: fix MAP_SHARED case and kbuild warnings
> > 
> > Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> > empty in the child process after fork. This differs from MADV_DONTFORK
> > in one important way.
> > 
> > If a child process accesses memory that was MADV_WIPEONFORK, it
> > will get zeroes. The address ranges are still valid, they are just empty.
> > 
> > If a child process accesses memory that was MADV_DONTFORK, it will
> > get a segmentation fault, since those address ranges are no longer
> > valid in the child after fork.
> > 
> > Since MADV_DONTFORK also seems to be used to allow very large
> > programs to fork in systems with strict memory overcommit restrictions,
> > changing the semantics of MADV_DONTFORK might break existing programs.
> > 
> > The use case is libraries that store or cache information, and
> > want to know that they need to regenerate it in the child process
> > after fork.

How do they know that they need to regenerate if they do not get SEGV?
Are they going to assume that a read of zeros is a "must init again"? Isn't
that too fragile? Or do they play other tricks like parse /proc/self/smaps
and read in the flag?
 
> > Examples of this would be:
> > - systemd/pulseaudio API checks (fail after fork)
> >   (replacing a getpid check, which is too slow without a PID cache)
> > - PKCS#11 API reinitialization check (mandated by specification)
> > - glibc's upcoming PRNG (reseed after fork)
> > - OpenSSL PRNG (reseed after fork)
> > 
> > The security benefits of a forking server having a re-inialized
> > PRNG in every child process are pretty obvious. However, due to
> > libraries having all kinds of internal state, and programs getting
> > compiled with many different versions of each library, it is
> > unreasonable to expect calling programs to re-initialize everything
> > manually after fork.
> > 
> > A further complication is the proliferation of clone flags,
> > programs bypassing glibc's functions to call clone directly,
> > and programs calling unshare, causing the glibc pthread_atfork
> > hook to not get called.
> > 
> > It would be better to have the kernel take care of this automatically.
> > 
> > This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> > 
> >     https://man.openbsd.org/minherit.2

I would argue that a MAP_$FOO flag would be more appropriate. Or do you
see any cases where such a special mapping would need to change the
semantic and inherit the content over the fork again?

I do not like the madvise because it is an advise and as such it can be
ignored/not implemented and that shouldn't have any correctness effects
on the child process.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-07 13:46   ` Michal Hocko
@ 2017-08-07 14:19     ` Florian Weimer
  2017-08-10 13:06       ` Michal Hocko
  2017-08-07 14:59     ` Rik van Riel
       [not found]     ` <CAAF6GDcNoDUaDSxV6N12A_bOzo8phRUX5b8-OBteuN0AmeCv0g@mail.gmail.com>
  2 siblings, 1 reply; 58+ messages in thread
From: Florian Weimer @ 2017-08-07 14:19 UTC (permalink / raw)
  To: Michal Hocko, riel
  Cc: linux-kernel, mike.kravetz, linux-mm, colm, akpm, keescook, luto,
	wad, mingo, kirill, dave.hansen, linux-api

On 08/07/2017 03:46 PM, Michal Hocko wrote:
> How do they know that they need to regenerate if they do not get SEGV?
> Are they going to assume that a read of zeros is a "must init again"? Isn't
> that too fragile?

Why would it be fragile?  Some level of synchronization is needed to set
things up, of course, but I think it's possible to write a lock-free
algorithm to maintain the state even without strong guarantees of memory
ordering from fork.

In the DRBG uniqueness case, you don't care if you reinitialize because
it's the first use, or because a fork just happened.

In the API-mandated fork check, a detection false positive before a fork
is not acceptable (because it would prevent legitimate API use), but I
think you can deal with this case if you publish a pointer to a
pre-initialized, non-zero mapping.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-07 13:46   ` Michal Hocko
  2017-08-07 14:19     ` Florian Weimer
@ 2017-08-07 14:59     ` Rik van Riel
  2017-08-09  9:59       ` Kirill A. Shutemov
  2017-08-10 13:05       ` Michal Hocko
       [not found]     ` <CAAF6GDcNoDUaDSxV6N12A_bOzo8phRUX5b8-OBteuN0AmeCv0g@mail.gmail.com>
  2 siblings, 2 replies; 58+ messages in thread
From: Rik van Riel @ 2017-08-07 14:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, mike.kravetz, linux-mm, fweimer, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api

On Mon, 2017-08-07 at 15:46 +0200, Michal Hocko wrote:
> On Mon 07-08-17 15:22:57, Michal Hocko wrote:
> > This is an user visible API so make sure you CC linux-api (added)
> > 
> > On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> > > 
> > > A further complication is the proliferation of clone flags,
> > > programs bypassing glibc's functions to call clone directly,
> > > and programs calling unshare, causing the glibc pthread_atfork
> > > hook to not get called.
> > > 
> > > It would be better to have the kernel take care of this
> > > automatically.
> > > 
> > > This is similar to the OpenBSD minherit syscall with
> > > MAP_INHERIT_ZERO:
> > > 
> > >     https://man.openbsd.org/minherit.2
> 
> I would argue that a MAP_$FOO flag would be more appropriate. Or do
> you
> see any cases where such a special mapping would need to change the
> semantic and inherit the content over the fork again?
> 
> I do not like the madvise because it is an advise and as such it can
> be
> ignored/not implemented and that shouldn't have any correctness
> effects
> on the child process.

Too late for that. VM_DONTFORK is already implemented
through MADV_DONTFORK & MADV_DOFORK, in a way that is
very similar to the MADV_WIPEONFORK from these patches.

I wonder if that was done because MAP_* flags are a
bitmap, with a very limited number of values as a result,
while MADV_* constants have an essentially unlimited
numerical namespace available.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
       [not found]     ` <CAAF6GDcNoDUaDSxV6N12A_bOzo8phRUX5b8-OBteuN0AmeCv0g@mail.gmail.com>
@ 2017-08-07 16:02       ` Colm MacCárthaigh
  2017-08-10 13:21       ` Michal Hocko
  1 sibling, 0 replies; 58+ messages in thread
From: Colm MacCárthaigh @ 2017-08-07 16:02 UTC (permalink / raw)
  To: linux-api, linux-kernel

[ Resent as text/plain, after sacrificing an offering to the RFC822 gods]

On Mon, Aug 7, 2017 at 3:46 PM, Michal Hocko <mhocko@kernel.org> wrote:
>
>
> > > The use case is libraries that store or cache information, and
> > > want to know that they need to regenerate it in the child process
> > > after fork.
>
> How do they know that they need to regenerate if they do not get SEGV?
> Are they going to assume that a read of zeros is a "must init again"?
> Isn't
> that too fragile? Or do they play other tricks like parse /proc/self/smaps
> and read in the flag?

Hi from a user space crypto maintainer :) Here's how we do exactly this it
in s2n:

https://github.com/awslabs/s2n/blob/master/utils/s2n_random.c , lines 62 - 91

and here's how LibreSSL does it:

https://github.com/libressl-portable/openbsd/blob/57dcd4329d83bff3dd67a293d5c4a53b795c587e/src/lib/libc/crypt/arc4random.h
(lines 37 on)
https://github.com/libressl-portable/openbsd/blob/57dcd4329d83bff3dd67a293d5c4a53b795c587e/src/lib/libc/crypt/arc4random.c
(Line 110)

OpenSSL and libc are in the process of adding similar DRBGs and would use a
WIPEONFORK. BoringSSL's maintainers are also interested as it adds
robustness.  I also recall it being a topic of discussion at the High
Assurance Cryptography Symposium (HACS) where many crypto maintainers meet
and several more maintainers there indicated it would be nice to have.

Right now on Linux we all either use pthread_atfork() to zero the memory on
fork, or getpid() and getppid() guards. The former can be evaded by direct
syscall() and other tricks (which things like Language VMs are prone to
doing), and the latter check is probabilistic as pids can repeat, though if
you use both getpid() and getppid() - which is slow! - the probability of
both PIDs colliding is very low indeed.

The result at the moment on Linux there's no bulletproof way to detect a
fork and erase a key or DRBG state. It would really be nice to be able to
match what we can do with MAP_INHERIT_ZERO and minherit() on BSD.  madvise()
does seem like the established idiom for behavior like this on Linux.  I
don't imagine it will be hard to use in practice, we can fall back to
existing behavior if the flag isn't accepted.


-- 
Colm

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-06 14:04 [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK riel
                   ` (2 preceding siblings ...)
  2017-08-07 13:22 ` [PATCH v2 0/2] mm,fork,security: " Michal Hocko
@ 2017-08-07 18:23 ` Mike Kravetz
  2017-08-08  9:58   ` Florian Weimer
  3 siblings, 1 reply; 58+ messages in thread
From: Mike Kravetz @ 2017-08-07 18:23 UTC (permalink / raw)
  To: riel, linux-kernel
  Cc: linux-mm, fweimer, colm, akpm, keescook, luto, wad, mingo,
	kirill, dave.hansen

On 08/06/2017 07:04 AM, riel@redhat.com wrote:
> v2: fix MAP_SHARED case and kbuild warnings
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.

It seems that the target use case might be private anonymous mappings.
If a shared or file backed mapping exists, one would assume that it
was created with the intention of sharing, even across fork.  So,
setting MADV_DONTFORK on such a mapping seems to change the meaning
and conflict with the original intention of the mapping.

If my thoughts above are correct, what about returning EINVAL if one
attempts to set MADV_DONTFORK on mappings set up for sharing?

If not, and you really want this to be applicable to all mappings, then
you should be more specific about what happens at fork time.  Do they
all get turned into anonymous mappings?  What happens to file references?
What about the really ugly case of hugetlb mappings?  Do they get
'transformed' to non-hugetlb mappings?  Or, do you create a separate
hugetlb mapping for the child?

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-07 18:23 ` Mike Kravetz
@ 2017-08-08  9:58   ` Florian Weimer
  2017-08-08 13:15     ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Florian Weimer @ 2017-08-08  9:58 UTC (permalink / raw)
  To: Mike Kravetz, riel, linux-kernel
  Cc: linux-mm, colm, akpm, keescook, luto, wad, mingo, kirill, dave.hansen

On 08/07/2017 08:23 PM, Mike Kravetz wrote:
> If my thoughts above are correct, what about returning EINVAL if one
> attempts to set MADV_DONTFORK on mappings set up for sharing?

That's my preference as well.  If there is a use case for shared or
non-anonymous mappings, then we can implement MADV_DONTFORK with the
semantics for this use case.  If we pick some arbitrary semantics now,
without any use case, we might end up with something that's not actually
useful.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08  9:58   ` Florian Weimer
@ 2017-08-08 13:15     ` Rik van Riel
  2017-08-08 15:19       ` Mike Kravetz
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2017-08-08 13:15 UTC (permalink / raw)
  To: Florian Weimer, Mike Kravetz, linux-kernel
  Cc: linux-mm, colm, akpm, keescook, luto, wad, mingo, kirill, dave.hansen

On Tue, 2017-08-08 at 11:58 +0200, Florian Weimer wrote:
> On 08/07/2017 08:23 PM, Mike Kravetz wrote:
> > If my thoughts above are correct, what about returning EINVAL if
> > one
> > attempts to set MADV_DONTFORK on mappings set up for sharing?
> 
> That's my preference as well.  If there is a use case for shared or
> non-anonymous mappings, then we can implement MADV_DONTFORK with the
> semantics for this use case.  If we pick some arbitrary semantics
> now,
> without any use case, we might end up with something that's not
> actually
> useful.

MADV_DONTFORK is existing semantics, and it is enforced
on shared, non-anonymous mappings. It is frequently used
for things like device mappings, which should not be
inherited by a child process, because the device can only
be used by one process at a time.

When someone requests MADV_DONTFORK on a shared VMA, they
will get it. The later madvise request overrides the mmap
flags that were used earlier.

The question is, should MADV_WIPEONFORK (introduced by
this series) have not just different semantics, but also
totally different behavior from MADV_DONTFORK?

Does the principle of least surprise dictate that the
last request determines the policy on an area, or should
later requests not be able to override policy that was
set at mmap time?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08 13:15     ` Rik van Riel
@ 2017-08-08 15:19       ` Mike Kravetz
  2017-08-08 15:22         ` Florian Weimer
  2017-08-08 15:46         ` Rik van Riel
  0 siblings, 2 replies; 58+ messages in thread
From: Mike Kravetz @ 2017-08-08 15:19 UTC (permalink / raw)
  To: Rik van Riel, Florian Weimer, linux-kernel
  Cc: linux-mm, colm, akpm, keescook, luto, wad, mingo, kirill, dave.hansen

On 08/08/2017 06:15 AM, Rik van Riel wrote:
> On Tue, 2017-08-08 at 11:58 +0200, Florian Weimer wrote:
>> On 08/07/2017 08:23 PM, Mike Kravetz wrote:
>>> If my thoughts above are correct, what about returning EINVAL if
>>> one
>>> attempts to set MADV_DONTFORK on mappings set up for sharing?
>>
>> That's my preference as well.  If there is a use case for shared or
>> non-anonymous mappings, then we can implement MADV_DONTFORK with the
>> semantics for this use case.  If we pick some arbitrary semantics
>> now,
>> without any use case, we might end up with something that's not
>> actually
>> useful.
> 
> MADV_DONTFORK is existing semantics, and it is enforced
> on shared, non-anonymous mappings. It is frequently used
> for things like device mappings, which should not be
> inherited by a child process, because the device can only
> be used by one process at a time.
> 
> When someone requests MADV_DONTFORK on a shared VMA, they
> will get it. The later madvise request overrides the mmap
> flags that were used earlier.
> 
> The question is, should MADV_WIPEONFORK (introduced by
> this series) have not just different semantics, but also
> totally different behavior from MADV_DONTFORK?

Sorry for the confusion.  I accidentally used MADV_DONTFORK instead
of MADV_WIPEONFORK in my reply (which Florian commented on).

> Does the principle of least surprise dictate that the
> last request determines the policy on an area, or should
> later requests not be able to override policy that was
> set at mmap time?

That is the question.

The other question I was trying to bring up is "What does MADV_WIPEONFORK
mean for various types of mappings?"  For example, if we allow
MADV_WIPEONFORK on a file backed mapping what does that mapping look
like in the child after fork?  Does it have any connection at all to the
file?  Or, do we drop all references to the file and essentially transform
it to a private (or shared?) anonymous mapping after fork.  What about
System V shared memory?  What about hugetlb?

If the use case is fairly specific, then perhaps it makes sense to
make MADV_WIPEONFORK not applicable (EINVAL) for mappings where the
result is 'questionable'.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08 15:19       ` Mike Kravetz
@ 2017-08-08 15:22         ` Florian Weimer
  2017-08-08 15:46         ` Rik van Riel
  1 sibling, 0 replies; 58+ messages in thread
From: Florian Weimer @ 2017-08-08 15:22 UTC (permalink / raw)
  To: Mike Kravetz, Rik van Riel, linux-kernel
  Cc: linux-mm, colm, akpm, keescook, luto, wad, mingo, kirill, dave.hansen

On 08/08/2017 05:19 PM, Mike Kravetz wrote:
> On 08/08/2017 06:15 AM, Rik van Riel wrote:
>> On Tue, 2017-08-08 at 11:58 +0200, Florian Weimer wrote:
>>> On 08/07/2017 08:23 PM, Mike Kravetz wrote:
>>>> If my thoughts above are correct, what about returning EINVAL if
>>>> one
>>>> attempts to set MADV_DONTFORK on mappings set up for sharing?
>>>
>>> That's my preference as well.  If there is a use case for shared or
>>> non-anonymous mappings, then we can implement MADV_DONTFORK with the
>>> semantics for this use case.  If we pick some arbitrary semantics
>>> now,
>>> without any use case, we might end up with something that's not
>>> actually
>>> useful.
>>
>> MADV_DONTFORK is existing semantics, and it is enforced
>> on shared, non-anonymous mappings. It is frequently used
>> for things like device mappings, which should not be
>> inherited by a child process, because the device can only
>> be used by one process at a time.
>>
>> When someone requests MADV_DONTFORK on a shared VMA, they
>> will get it. The later madvise request overrides the mmap
>> flags that were used earlier.
>>
>> The question is, should MADV_WIPEONFORK (introduced by
>> this series) have not just different semantics, but also
>> totally different behavior from MADV_DONTFORK?
> 
> Sorry for the confusion.  I accidentally used MADV_DONTFORK instead
> of MADV_WIPEONFORK in my reply (which Florian commented on).

Yes, I made the same mistake.  I meant MADV_WIPEONFORK as well.

Florian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08 15:19       ` Mike Kravetz
  2017-08-08 15:22         ` Florian Weimer
@ 2017-08-08 15:46         ` Rik van Riel
  2017-08-08 16:48           ` Colm MacCárthaigh
  2017-08-08 16:52           ` Matthew Wilcox
  1 sibling, 2 replies; 58+ messages in thread
From: Rik van Riel @ 2017-08-08 15:46 UTC (permalink / raw)
  To: Mike Kravetz, Florian Weimer, linux-kernel
  Cc: linux-mm, colm, akpm, keescook, luto, wad, mingo, kirill, dave.hansen

On Tue, 2017-08-08 at 08:19 -0700, Mike Kravetz wrote:

> The other question I was trying to bring up is "What does
> MADV_WIPEONFORK
> mean for various types of mappings?"  For example, if we allow
> MADV_WIPEONFORK on a file backed mapping what does that mapping look
> like in the child after fork?  Does it have any connection at all to
> the
> file?  Or, do we drop all references to the file and essentially
> transform
> it to a private (or shared?) anonymous mapping after fork.  What
> about
> System V shared memory?  What about hugetlb?

My current patch turns any file-backed VMA into an empty
anonymous VMA if MADV_WIPEONFORK was used on that VMA.

> If the use case is fairly specific, then perhaps it makes sense to
> make MADV_WIPEONFORK not applicable (EINVAL) for mappings where the
> result is 'questionable'.

That would be a question for Florian and Colm.

If they are OK with MADV_WIPEONFORK only working on
anonymous VMAs (no file mapping), that certainly could
be implemented.

On the other hand, I am not sure that introducing cases
where MADV_WIPEONFORK does not implement wipe-on-fork
semantics would reduce user confusion...

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08 15:46         ` Rik van Riel
@ 2017-08-08 16:48           ` Colm MacCárthaigh
  2017-08-08 16:52           ` Matthew Wilcox
  1 sibling, 0 replies; 58+ messages in thread
From: Colm MacCárthaigh @ 2017-08-08 16:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mike Kravetz, Florian Weimer, linux-kernel, linux-mm, akpm,
	Kees Cook, luto, Will Drewry, mingo, kirill, dave.hansen

On Tue, Aug 8, 2017 at 5:46 PM, Rik van Riel <riel@redhat.com> wrote:

>> If the use case is fairly specific, then perhaps it makes sense to
>> make MADV_WIPEONFORK not applicable (EINVAL) for mappings where the
>> result is 'questionable'.
>
> That would be a question for Florian and Colm.
>
> If they are OK with MADV_WIPEONFORK only working on
> anonymous VMAs (no file mapping), that certainly could
> be implemented.

Anonymous would be sufficient for all of the Crypto-cases that I've
come across. But I can imagine someone wanting to initialize all
application state from a saved file, or share it between processes.

The comparable minherit call sidesteps all of this by simply
documenting that it results in a new anonymous page after fork, and so
the previous state doesn't matter.

Maybe the problem here is the poor name (my fault). WIPEONFORK
suggests an action being taken ... like a user might think that it
literally zeroes a file, for example.  At the risk of bike shedding:
maybe ZEROESONFORK would resolve that small ambiguity?

-- 
Colm

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08 15:46         ` Rik van Riel
  2017-08-08 16:48           ` Colm MacCárthaigh
@ 2017-08-08 16:52           ` Matthew Wilcox
  2017-08-08 18:45             ` Rik van Riel
  1 sibling, 1 reply; 58+ messages in thread
From: Matthew Wilcox @ 2017-08-08 16:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mike Kravetz, Florian Weimer, linux-kernel, linux-mm, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen

On Tue, Aug 08, 2017 at 11:46:08AM -0400, Rik van Riel wrote:
> On Tue, 2017-08-08 at 08:19 -0700, Mike Kravetz wrote:
> > If the use case is fairly specific, then perhaps it makes sense to
> > make MADV_WIPEONFORK not applicable (EINVAL) for mappings where the
> > result is 'questionable'.
> 
> That would be a question for Florian and Colm.
> 
> If they are OK with MADV_WIPEONFORK only working on
> anonymous VMAs (no file mapping), that certainly could
> be implemented.
> 
> On the other hand, I am not sure that introducing cases
> where MADV_WIPEONFORK does not implement wipe-on-fork
> semantics would reduce user confusion...

It'll simply do exactly what it does today, so it won't introduce any
new fallback code.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08 16:52           ` Matthew Wilcox
@ 2017-08-08 18:45             ` Rik van Riel
  2017-08-10 15:31               ` Michal Hocko
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2017-08-08 18:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Kravetz, Florian Weimer, linux-kernel, linux-mm, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen

On Tue, 2017-08-08 at 09:52 -0700, Matthew Wilcox wrote:
> On Tue, Aug 08, 2017 at 11:46:08AM -0400, Rik van Riel wrote:
> > On Tue, 2017-08-08 at 08:19 -0700, Mike Kravetz wrote:
> > > If the use case is fairly specific, then perhaps it makes sense
> > > to
> > > make MADV_WIPEONFORK not applicable (EINVAL) for mappings where
> > > the
> > > result is 'questionable'.
> > 
> > That would be a question for Florian and Colm.
> > 
> > If they are OK with MADV_WIPEONFORK only working on
> > anonymous VMAs (no file mapping), that certainly could
> > be implemented.
> > 
> > On the other hand, I am not sure that introducing cases
> > where MADV_WIPEONFORK does not implement wipe-on-fork
> > semantics would reduce user confusion...
> 
> It'll simply do exactly what it does today, so it won't introduce any
> new fallback code.

Sure, but actually implementing MADV_WIPEONFORK in a
way that turns file mapped VMAs into zero page backed
anonymous VMAs after fork takes no more code than
implementing it in a way that refuses to work on VMAs
that have a file backing.

There is no complexity argument for or against either
approach.

The big question is, what is the best for users?

Should we return -EINVAL when MADV_WIPEONFORK is called
on a VMA that has a file backing, and only succeed on
anonymous VMAs?

Or, should we simply turn every memory range that has
MADV_WIPEONFORK done to it into an anonymous VMA in the
child process?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-07 14:59     ` Rik van Riel
@ 2017-08-09  9:59       ` Kirill A. Shutemov
  2017-08-09 12:31         ` Rik van Riel
  2017-08-09 12:42         ` Florian Weimer
  2017-08-10 13:05       ` Michal Hocko
  1 sibling, 2 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2017-08-09  9:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michal Hocko, linux-kernel, mike.kravetz, linux-mm, fweimer,
	colm, akpm, keescook, luto, wad, mingo, dave.hansen, linux-api

On Mon, Aug 07, 2017 at 10:59:51AM -0400, Rik van Riel wrote:
> On Mon, 2017-08-07 at 15:46 +0200, Michal Hocko wrote:
> > On Mon 07-08-17 15:22:57, Michal Hocko wrote:
> > > This is an user visible API so make sure you CC linux-api (added)
> > > 
> > > On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> > > > 
> > > > A further complication is the proliferation of clone flags,
> > > > programs bypassing glibc's functions to call clone directly,
> > > > and programs calling unshare, causing the glibc pthread_atfork
> > > > hook to not get called.
> > > > 
> > > > It would be better to have the kernel take care of this
> > > > automatically.
> > > > 
> > > > This is similar to the OpenBSD minherit syscall with
> > > > MAP_INHERIT_ZERO:
> > > > 
> > > >     https://man.openbsd.org/minherit.2
> > 
> > I would argue that a MAP_$FOO flag would be more appropriate. Or do
> > you
> > see any cases where such a special mapping would need to change the
> > semantic and inherit the content over the fork again?
> > 
> > I do not like the madvise because it is an advise and as such it can
> > be
> > ignored/not implemented and that shouldn't have any correctness
> > effects
> > on the child process.
> 
> Too late for that. VM_DONTFORK is already implemented
> through MADV_DONTFORK & MADV_DOFORK, in a way that is
> very similar to the MADV_WIPEONFORK from these patches.

It's not obvious to me what would break if kernel would ignore
MADV_DONTFORK or MADV_DONTDUMP.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-09  9:59       ` Kirill A. Shutemov
@ 2017-08-09 12:31         ` Rik van Riel
  2017-08-09 12:42         ` Florian Weimer
  1 sibling, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2017-08-09 12:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Michal Hocko, linux-kernel, mike.kravetz, linux-mm, fweimer,
	colm, akpm, keescook, luto, wad, mingo, dave.hansen, linux-api

On Wed, 2017-08-09 at 12:59 +0300, Kirill A. Shutemov wrote:
> On Mon, Aug 07, 2017 at 10:59:51AM -0400, Rik van Riel wrote:
> > On Mon, 2017-08-07 at 15:46 +0200, Michal Hocko wrote:
> > > On Mon 07-08-17 15:22:57, Michal Hocko wrote:
> > > > This is an user visible API so make sure you CC linux-api
> > > > (added)
> > > > 
> > > > On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> > > > > 
> > > > > A further complication is the proliferation of clone flags,
> > > > > programs bypassing glibc's functions to call clone directly,
> > > > > and programs calling unshare, causing the glibc
> > > > > pthread_atfork
> > > > > hook to not get called.
> > > > > 
> > > > > It would be better to have the kernel take care of this
> > > > > automatically.
> > > > > 
> > > > > This is similar to the OpenBSD minherit syscall with
> > > > > MAP_INHERIT_ZERO:
> > > > > 
> > > > >     https://man.openbsd.org/minherit.2
> > > 
> > > I would argue that a MAP_$FOO flag would be more appropriate. Or
> > > do
> > > you
> > > see any cases where such a special mapping would need to change
> > > the
> > > semantic and inherit the content over the fork again?
> > > 
> > > I do not like the madvise because it is an advise and as such it
> > > can
> > > be
> > > ignored/not implemented and that shouldn't have any correctness
> > > effects
> > > on the child process.
> > 
> > Too late for that. VM_DONTFORK is already implemented
> > through MADV_DONTFORK & MADV_DOFORK, in a way that is
> > very similar to the MADV_WIPEONFORK from these patches.
> 
> It's not obvious to me what would break if kernel would ignore
> MADV_DONTFORK or MADV_DONTDUMP.
> 
You might end up with multiple processes having a device open
which can only handle one process at a time.

Another thing that could go wrong is that if overcommit_memory=2,
a very large process with MADV_DONTFORK on a large memory area
suddenly fails to fork (due to there not being enough available
memory), and is unable to start a helper process.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-09  9:59       ` Kirill A. Shutemov
  2017-08-09 12:31         ` Rik van Riel
@ 2017-08-09 12:42         ` Florian Weimer
  1 sibling, 0 replies; 58+ messages in thread
From: Florian Weimer @ 2017-08-09 12:42 UTC (permalink / raw)
  To: Kirill A. Shutemov, Rik van Riel
  Cc: Michal Hocko, linux-kernel, mike.kravetz, linux-mm, colm, akpm,
	keescook, luto, wad, mingo, dave.hansen, linux-api

On 08/09/2017 11:59 AM, Kirill A. Shutemov wrote:
> It's not obvious to me what would break if kernel would ignore
> MADV_DONTFORK or MADV_DONTDUMP.

Ignoring MADV_DONTDUMP could cause secrets to be written to disk,
contrary to the expected security policy of the system.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-07 14:59     ` Rik van Riel
  2017-08-09  9:59       ` Kirill A. Shutemov
@ 2017-08-10 13:05       ` Michal Hocko
  2017-08-10 13:23         ` Colm MacCárthaigh
  1 sibling, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 13:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, mike.kravetz, linux-mm, fweimer, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api

On Mon 07-08-17 10:59:51, Rik van Riel wrote:
> On Mon, 2017-08-07 at 15:46 +0200, Michal Hocko wrote:
> > On Mon 07-08-17 15:22:57, Michal Hocko wrote:
> > > This is an user visible API so make sure you CC linux-api (added)
> > > 
> > > On Sun 06-08-17 10:04:23, Rik van Riel wrote:
> > > > 
> > > > A further complication is the proliferation of clone flags,
> > > > programs bypassing glibc's functions to call clone directly,
> > > > and programs calling unshare, causing the glibc pthread_atfork
> > > > hook to not get called.
> > > > 
> > > > It would be better to have the kernel take care of this
> > > > automatically.
> > > > 
> > > > This is similar to the OpenBSD minherit syscall with
> > > > MAP_INHERIT_ZERO:
> > > > 
> > > >     https://man.openbsd.org/minherit.2
> > 
> > I would argue that a MAP_$FOO flag would be more appropriate. Or do
> > you
> > see any cases where such a special mapping would need to change the
> > semantic and inherit the content over the fork again?
> > 
> > I do not like the madvise because it is an advise and as such it can
> > be
> > ignored/not implemented and that shouldn't have any correctness
> > effects
> > on the child process.
> 
> Too late for that. VM_DONTFORK is already implemented
> through MADV_DONTFORK & MADV_DOFORK, in a way that is
> very similar to the MADV_WIPEONFORK from these patches.

Yeah, those two seem to be breaking the "madvise as an advise" semantic as
well but that doesn't mean we should follow that pattern any further.

> I wonder if that was done because MAP_* flags are a
> bitmap, with a very limited number of values as a result,
> while MADV_* constants have an essentially unlimited
> numerical namespace available.

That might have been the reason or it could have been simply because it
is easier to put something into madvise than mmap...

So back to the question. Is there any real usecase where you want to
have this on/off like or would a simple MAP_ZERO_ON_FORK be sufficient.
There should be some bits left between from my quick grep over arch
mman.h.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-07 14:19     ` Florian Weimer
@ 2017-08-10 13:06       ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 13:06 UTC (permalink / raw)
  To: Florian Weimer
  Cc: riel, linux-kernel, mike.kravetz, linux-mm, colm, akpm, keescook,
	luto, wad, mingo, kirill, dave.hansen, linux-api

On Mon 07-08-17 16:19:18, Florian Weimer wrote:
> On 08/07/2017 03:46 PM, Michal Hocko wrote:
> > How do they know that they need to regenerate if they do not get SEGV?
> > Are they going to assume that a read of zeros is a "must init again"? Isn't
> > that too fragile?
> 
> Why would it be fragile?  Some level of synchronization is needed to set
> things up, of course, but I think it's possible to write a lock-free
> algorithm to maintain the state even without strong guarantees of memory
> ordering from fork.

Yeah, that is what I meant as fragile... I am not question this is
impossible.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
       [not found]     ` <CAAF6GDcNoDUaDSxV6N12A_bOzo8phRUX5b8-OBteuN0AmeCv0g@mail.gmail.com>
  2017-08-07 16:02       ` Colm MacCárthaigh
@ 2017-08-10 13:21       ` Michal Hocko
  2017-08-10 14:11         ` Michal Hocko
  1 sibling, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 13:21 UTC (permalink / raw)
  To: Colm MacCárthaigh
  Cc: Rik van Riel, linux-kernel, mike.kravetz, linux-mm,
	Florian Weimer, akpm, keescook, luto, wad, mingo, kirill,
	dave.hansen, linux-api

On Mon 07-08-17 17:55:45, Colm MacCárthaigh wrote:
> On Mon, Aug 7, 2017 at 3:46 PM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> >
> > > > The use case is libraries that store or cache information, and
> > > > want to know that they need to regenerate it in the child process
> > > > after fork.
> >
> > How do they know that they need to regenerate if they do not get SEGV?
> > Are they going to assume that a read of zeros is a "must init again"? Isn't
> > that too fragile? Or do they play other tricks like parse /proc/self/smaps
> > and read in the flag?
> >
> 
> Hi from a user space crypto maintainer :) Here's how we do exactly this it
> in s2n:
> 
> https://github.com/awslabs/s2n/blob/master/utils/s2n_random.c , lines 62 -
> 91
> 
> and here's how LibreSSL does it:
> 
> https://github.com/libressl-portable/openbsd/blob/57dcd4329d83bff3dd67a293d5c4a53b795c587e/src/lib/libc/crypt/arc4random.h
> (lines 37 on)
> https://github.com/libressl-portable/openbsd/blob/57dcd4329d83bff3dd67a293d5c4a53b795c587e/src/lib/libc/crypt/arc4random.c
> (Line 110)
> 
> OpenSSL and libc are in the process of adding similar DRBGs and would use a
> WIPEONFORK. BoringSSL's maintainers are also interested as it adds
> robustness.  I also recall it being a topic of discussion at the High
> Assurance Cryptography Symposium (HACS) where many crypto maintainers meet
> and several more maintainers there indicated it would be nice to have.
> 
> Right now on Linux we all either use pthread_atfork() to zero the memory on
> fork, or getpid() and getppid() guards. The former can be evaded by direct
> syscall() and other tricks (which things like Language VMs are prone to
> doing), and the latter check is probabilistic as pids can repeat, though if
> you use both getpid() and getppid() - which is slow! - the probability of
> both PIDs colliding is very low indeed.

Thanks, these references are really useful to build a picture. I would
probably use an unlinked fd with O_CLOEXEC to dect this but I can see
how this is not the greatest option for a library.

> The result at the moment on Linux there's no bulletproof way to detect a
> fork and erase a key or DRBG state. It would really be nice to be able to
> match what we can do with MAP_INHERIT_ZERO and minherit() on BSD.
>  madvise() does seem like the established idiom for behavior like this on
> Linux.  I don't imagine it will be hard to use in practice, we can fall
> back to existing behavior if the flag isn't accepted.

The reason why I dislike madvise, as already said, is that it should be
an advise rather than something correctness related. Sure we do have
some exceptions there but that doesn't mean we should repeat the same
error. If anything an mmap MAP_$FOO sounds like a better approach to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-10 13:05       ` Michal Hocko
@ 2017-08-10 13:23         ` Colm MacCárthaigh
  2017-08-10 15:36           ` Michal Hocko
  0 siblings, 1 reply; 58+ messages in thread
From: Colm MacCárthaigh @ 2017-08-10 13:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, linux-kernel, Mike Kravetz, linux-mm,
	Florian Weimer, akpm, Kees Cook, luto, Will Drewry, mingo,
	kirill, dave.hansen, linux-api

On Thu, Aug 10, 2017 at 3:05 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> Too late for that. VM_DONTFORK is already implemented
>> through MADV_DONTFORK & MADV_DOFORK, in a way that is
>> very similar to the MADV_WIPEONFORK from these patches.
>
> Yeah, those two seem to be breaking the "madvise as an advise" semantic as
> well but that doesn't mean we should follow that pattern any further.

I would imagine that many of the crypto applications using
MADV_WIPEONFORK will also be using MADV_DONTDUMP. In cases where it's
for protecting secret keys, I'd like to use both in my code, for
example. Though that doesn't really help decide this.

There is also at least one case for being able to turn WIPEONFORK
on/off with an existing page; a process that uses privilege separation
often goes through the following flow:

1. [ Access privileged keys as a power user and initialize memory ]
2. [ Fork a child process that actually does the work ]
3. [ Child drops privileges and uses the memory to do work ]
4. [ Parent hangs around to re-spawn a child if it crashes ]

In that mode it would be convenient to be able to mark the memory as
WIPEONFORK in the child, but not the parent.

-- 
Colm

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-10 13:21       ` Michal Hocko
@ 2017-08-10 14:11         ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 14:11 UTC (permalink / raw)
  To: Colm MacCárthaigh
  Cc: Rik van Riel, linux-kernel, mike.kravetz, linux-mm,
	Florian Weimer, akpm, keescook, luto, wad, mingo, kirill,
	dave.hansen, linux-api

On Thu 10-08-17 15:21:10, Michal Hocko wrote:
[...]
> Thanks, these references are really useful to build a picture. I would
> probably use an unlinked fd with O_CLOEXEC to dect this but I can see
> how this is not the greatest option for a library.

Blee, brainfart on my end. For some reason I mixed fork/exec...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-06 14:04 ` [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK riel
@ 2017-08-10 15:23   ` Michal Hocko
  2017-08-11 15:23     ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 15:23 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mike.kravetz, linux-mm, fweimer, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen

On Sun 06-08-17 10:04:25, Rik van Riel wrote:
[...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..db1fb2802ecc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>  		tmp->vm_next = tmp->vm_prev = NULL;
>  		file = tmp->vm_file;
> +
> +		/* With VM_WIPEONFORK, the child gets an empty VMA. */
> +		if (tmp->vm_flags & VM_WIPEONFORK) {
> +			tmp->vm_file = file = NULL;
> +			tmp->vm_ops = NULL;
> +		}

What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around? At
least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
where those flags are cleared?

> +
>  		if (file) {
>  			struct inode *inode = file_inode(file);
>  			struct address_space *mapping = file->f_mapping;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-08 18:45             ` Rik van Riel
@ 2017-08-10 15:31               ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 15:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Matthew Wilcox, Mike Kravetz, Florian Weimer, linux-kernel,
	linux-mm, colm, akpm, keescook, luto, wad, mingo, kirill,
	dave.hansen

On Tue 08-08-17 14:45:14, Rik van Riel wrote:
> On Tue, 2017-08-08 at 09:52 -0700, Matthew Wilcox wrote:
> > On Tue, Aug 08, 2017 at 11:46:08AM -0400, Rik van Riel wrote:
> > > On Tue, 2017-08-08 at 08:19 -0700, Mike Kravetz wrote:
> > > > If the use case is fairly specific, then perhaps it makes sense
> > > > to
> > > > make MADV_WIPEONFORK not applicable (EINVAL) for mappings where
> > > > the
> > > > result is 'questionable'.
> > > 
> > > That would be a question for Florian and Colm.
> > > 
> > > If they are OK with MADV_WIPEONFORK only working on
> > > anonymous VMAs (no file mapping), that certainly could
> > > be implemented.
> > > 
> > > On the other hand, I am not sure that introducing cases
> > > where MADV_WIPEONFORK does not implement wipe-on-fork
> > > semantics would reduce user confusion...
> > 
> > It'll simply do exactly what it does today, so it won't introduce any
> > new fallback code.
> 
> Sure, but actually implementing MADV_WIPEONFORK in a
> way that turns file mapped VMAs into zero page backed
> anonymous VMAs after fork takes no more code than
> implementing it in a way that refuses to work on VMAs
> that have a file backing.
> 
> There is no complexity argument for or against either
> approach.
> 
> The big question is, what is the best for users?
> 
> Should we return -EINVAL when MADV_WIPEONFORK is called
> on a VMA that has a file backing, and only succeed on
> anonymous VMAs?

I would rather be conservative and implement the bare minimum until
there is a reasonable usecase to demand the feature for shared mappings
as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-10 13:23         ` Colm MacCárthaigh
@ 2017-08-10 15:36           ` Michal Hocko
       [not found]             ` <CAAF6GDeno6RpHf1KORVSxUL7M-CQfbWFFdyKK8LAWd_6PcJ55Q@mail.gmail.com>
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 15:36 UTC (permalink / raw)
  To: Colm MacCárthaigh
  Cc: Rik van Riel, linux-kernel, Mike Kravetz, linux-mm,
	Florian Weimer, akpm, Kees Cook, luto, Will Drewry, mingo,
	kirill, dave.hansen, linux-api

On Thu 10-08-17 15:23:05, Colm MacCárthaigh wrote:
> On Thu, Aug 10, 2017 at 3:05 PM, Michal Hocko <mhocko@kernel.org> wrote:
> >> Too late for that. VM_DONTFORK is already implemented
> >> through MADV_DONTFORK & MADV_DOFORK, in a way that is
> >> very similar to the MADV_WIPEONFORK from these patches.
> >
> > Yeah, those two seem to be breaking the "madvise as an advise" semantic as
> > well but that doesn't mean we should follow that pattern any further.
> 
> I would imagine that many of the crypto applications using
> MADV_WIPEONFORK will also be using MADV_DONTDUMP. In cases where it's
> for protecting secret keys, I'd like to use both in my code, for
> example. Though that doesn't really help decide this.
> 
> There is also at least one case for being able to turn WIPEONFORK
> on/off with an existing page; a process that uses privilege separation
> often goes through the following flow:
> 
> 1. [ Access privileged keys as a power user and initialize memory ]
> 2. [ Fork a child process that actually does the work ]
> 3. [ Child drops privileges and uses the memory to do work ]
> 4. [ Parent hangs around to re-spawn a child if it crashes ]
> 
> In that mode it would be convenient to be able to mark the memory as
> WIPEONFORK in the child, but not the parent.

I am not sure I understand. The child will have an own VMA so chaging
the attribute will not affect parent. Or did I misunderstand your
example?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
       [not found]             ` <CAAF6GDeno6RpHf1KORVSxUL7M-CQfbWFFdyKK8LAWd_6PcJ55Q@mail.gmail.com>
@ 2017-08-10 17:01               ` Michal Hocko
  2017-08-10 22:09                 ` Colm MacCárthaigh
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-10 17:01 UTC (permalink / raw)
  To: Colm MacCárthaigh
  Cc: Florian Weimer, Kees Cook, Mike Kravetz, Rik van Riel,
	Will Drewry, akpm, dave.hansen, kirill, linux-api, linux-kernel,
	linux-mm, luto, mingo

On Thu 10-08-17 16:17:18, Colm MacCárthaigh wrote:
> On Déar 10 Lún 2017 at 17:36 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Thu 10-08-17 15:23:05, Colm MacCįrthaigh wrote:
> > > On Thu, Aug 10, 2017 at 3:05 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > > >> Too late for that. VM_DONTFORK is already implemented
> > > >> through MADV_DONTFORK & MADV_DOFORK, in a way that is
> > > >> very similar to the MADV_WIPEONFORK from these patches.
> > > >
> > > > Yeah, those two seem to be breaking the "madvise as an advise"
> > semantic as
> > > > well but that doesn't mean we should follow that pattern any further.
> > >
> > > I would imagine that many of the crypto applications using
> > > MADV_WIPEONFORK will also be using MADV_DONTDUMP. In cases where it's
> > > for protecting secret keys, I'd like to use both in my code, for
> > > example. Though that doesn't really help decide this.
> > >
> > > There is also at least one case for being able to turn WIPEONFORK
> > > on/off with an existing page; a process that uses privilege separation
> > > often goes through the following flow:
> > >
> > > 1. [ Access privileged keys as a power user and initialize memory ]
> > > 2. [ Fork a child process that actually does the work ]
> > > 3. [ Child drops privileges and uses the memory to do work ]
> > > 4. [ Parent hangs around to re-spawn a child if it crashes ]
> > >
> > > In that mode it would be convenient to be able to mark the memory as
> > > WIPEONFORK in the child, but not the parent.
> >
> > I am not sure I understand. The child will have an own VMA so chaging
> > the attribute will not affect parent. Or did I misunderstand your
> > example?
> >
> 
> Typically with privilege separation the parent has to share some minimal
> state with the child. In this case that's why the page is left alone.
> Though a smart parent could unset and set just immediately around the fork.
> 
> The point then of protecting it in the child is to ensure that a grandchild
> doesn't inherit the secret data.

Does anybody actually do that using the minherit BSD interface?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-10 17:01               ` Michal Hocko
@ 2017-08-10 22:09                 ` Colm MacCárthaigh
  2017-08-11 14:06                   ` Michal Hocko
  0 siblings, 1 reply; 58+ messages in thread
From: Colm MacCárthaigh @ 2017-08-10 22:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Florian Weimer, Kees Cook, Mike Kravetz, Rik van Riel,
	Will Drewry, akpm, dave.hansen, kirill, linux-api, linux-kernel,
	linux-mm, luto, mingo

On Thu, Aug 10, 2017 at 7:01 PM, Michal Hocko <mhocko@kernel.org> wrote:
> Does anybody actually do that using the minherit BSD interface?

I can't find any OSS examples. I just thought of it in response to
your question, but now that I have, I do want to use it that way in
privsep code.

As a mere user, fwiw it would make /my/ code less complex (in
Kolmogorov terms) to be an madvise option. Here's what that would look
like in user space:

mmap()

#if MAP_INHERIT_ZERO
    minherit() || pthread_atfork(workaround_fptr);
#elif MADVISE_WIPEONFORK
    madvise() || pthread_atfork(workaround_fptr);
#else
    pthread_atfork(workaround_fptr);
#endif

Vs:

#if MAP_WIPEONFORK
    mmap( ... WIPEONFORK) || pthread_atfork(workaround_fptr);
#else
    mmap()
#endif

#if MAP_INHERIT_ZERO
    madvise() || pthread_atfork(workaround_fptr);
#endif

#if !defined(MAP_WIPEONFORK) && !defined(MAP_INHERIT_ZERO)
    pthread_atfork(workaround_fptr);
#endif

The former is neater, and also a lot easier to stay structured if the
code is separated across different functional units. Allocation is
often handled in special functions.

For me, madvise() is the principle of least surprise, following
existing DONTDUMP semantics.

-- 
Colm

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-10 22:09                 ` Colm MacCárthaigh
@ 2017-08-11 14:06                   ` Michal Hocko
  2017-08-11 14:11                     ` Florian Weimer
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-11 14:06 UTC (permalink / raw)
  To: Colm MacCárthaigh
  Cc: Florian Weimer, Kees Cook, Mike Kravetz, Rik van Riel,
	Will Drewry, akpm, dave.hansen, kirill, linux-api, linux-kernel,
	linux-mm, luto, mingo

On Fri 11-08-17 00:09:57, Colm MacCárthaigh wrote:
> On Thu, Aug 10, 2017 at 7:01 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > Does anybody actually do that using the minherit BSD interface?
> 
> I can't find any OSS examples. I just thought of it in response to
> your question, but now that I have, I do want to use it that way in
> privsep code.
> 
> As a mere user, fwiw it would make /my/ code less complex (in
> Kolmogorov terms) to be an madvise option. Here's what that would look
> like in user space:
> 
> mmap()
> 
> #if MAP_INHERIT_ZERO
>     minherit() || pthread_atfork(workaround_fptr);
> #elif MADVISE_WIPEONFORK
>     madvise() || pthread_atfork(workaround_fptr);
> #else
>     pthread_atfork(workaround_fptr);
> #endif
> 
> Vs:
> 
> #if MAP_WIPEONFORK
>     mmap( ... WIPEONFORK) || pthread_atfork(workaround_fptr);
> #else
>     mmap()
> #endif
> 
> #if MAP_INHERIT_ZERO
>     madvise() || pthread_atfork(workaround_fptr);
> #endif
> 
> #if !defined(MAP_WIPEONFORK) && !defined(MAP_INHERIT_ZERO)
>     pthread_atfork(workaround_fptr);
> #endif
> 
> The former is neater, and also a lot easier to stay structured if the
> code is separated across different functional units. Allocation is
> often handled in special functions.

OK, I guess I see your point. Thanks for the clarification.
 
> For me, madvise() is the principle of least surprise, following
> existing DONTDUMP semantics.

I am sorry to look too insisting here (I have still hard time to reconcile
myself with the madvise (ab)use) but if we in fact want minherit like
interface why don't we simply add minherit and make the code which wants
to use that interface easier to port? Is the only reason that hooking
into madvise is less code? If yes is that a sufficient reason to justify
the (ab)use of madvise? If there is a general consensus on that part I
will shut up and won't object anymore. Arguably MADV_DONTFORK would fit
into minherit API better as well. MADV_DONTDUMP is a differnet storry of
course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-11 14:06                   ` Michal Hocko
@ 2017-08-11 14:11                     ` Florian Weimer
  2017-08-11 14:24                       ` Michal Hocko
  0 siblings, 1 reply; 58+ messages in thread
From: Florian Weimer @ 2017-08-11 14:11 UTC (permalink / raw)
  To: Michal Hocko, Colm MacCárthaigh
  Cc: Kees Cook, Mike Kravetz, Rik van Riel, Will Drewry, akpm,
	dave.hansen, kirill, linux-api, linux-kernel, linux-mm, luto,
	mingo

On 08/11/2017 04:06 PM, Michal Hocko wrote:

> I am sorry to look too insisting here (I have still hard time to reconcile
> myself with the madvise (ab)use) but if we in fact want minherit like
> interface why don't we simply add minherit and make the code which wants
> to use that interface easier to port? Is the only reason that hooking
> into madvise is less code? If yes is that a sufficient reason to justify
> the (ab)use of madvise? If there is a general consensus on that part I
> will shut up and won't object anymore. Arguably MADV_DONTFORK would fit
> into minherit API better as well.

It does, OpenBSD calls it MAP_INHERIT_NONE.

Could you implement MAP_INHERIT_COPY and MAP_INHERIT_SHARE as well?  Or
is changing from MAP_SHARED to MAP_PRIVATE and back impossible?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-11 14:11                     ` Florian Weimer
@ 2017-08-11 14:24                       ` Michal Hocko
  2017-08-11 15:24                         ` Florian Weimer
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2017-08-11 14:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Colm MacCárthaigh, Kees Cook, Mike Kravetz, Rik van Riel,
	Will Drewry, akpm, dave.hansen, kirill, linux-api, linux-kernel,
	linux-mm, luto, mingo

On Fri 11-08-17 16:11:44, Florian Weimer wrote:
> On 08/11/2017 04:06 PM, Michal Hocko wrote:
> 
> > I am sorry to look too insisting here (I have still hard time to reconcile
> > myself with the madvise (ab)use) but if we in fact want minherit like
> > interface why don't we simply add minherit and make the code which wants
> > to use that interface easier to port? Is the only reason that hooking
> > into madvise is less code? If yes is that a sufficient reason to justify
> > the (ab)use of madvise? If there is a general consensus on that part I
> > will shut up and won't object anymore. Arguably MADV_DONTFORK would fit
> > into minherit API better as well.
> 
> It does, OpenBSD calls it MAP_INHERIT_NONE.
> 
> Could you implement MAP_INHERIT_COPY and MAP_INHERIT_SHARE as well?  Or
> is changing from MAP_SHARED to MAP_PRIVATE and back impossible?

I haven't explored those two very much. Their semantic seems rather
awkward, especially map_inherit_share one. I guess MAP_INHERIT_COPY
would be doable. Do we have to support all modes or a missing support
would disqualify the syscall completely?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-10 15:23   ` Michal Hocko
@ 2017-08-11 15:23     ` Rik van Riel
  2017-08-11 16:36       ` Mike Kravetz
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2017-08-11 15:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, mike.kravetz, linux-mm, fweimer, colm, akpm,
	keescook, luto, wad, mingo, kirill, dave.hansen

On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> [...]
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 17921b0390b4..db1fb2802ecc 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct
> > mm_struct *mm,
> >  		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> >  		tmp->vm_next = tmp->vm_prev = NULL;
> >  		file = tmp->vm_file;
> > +
> > +		/* With VM_WIPEONFORK, the child gets an empty
> > VMA. */
> > +		if (tmp->vm_flags & VM_WIPEONFORK) {
> > +			tmp->vm_file = file = NULL;
> > +			tmp->vm_ops = NULL;
> > +		}
> 
> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around?
> At
> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
> where those flags are cleared?

Huh, good spotting.  That makes me wonder why the test case that
Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
and returned zero-filled memory when read by the child process.

OK, I'll do a minimal implementation for now, which will return
-EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
and/or an mmapped file.

It will work the way it is supposed to with anonymous MAP_PRIVATE
memory, which is likely the only memory it will be used on, anyway.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-11 14:24                       ` Michal Hocko
@ 2017-08-11 15:24                         ` Florian Weimer
  2017-08-11 15:31                           ` Michal Hocko
  0 siblings, 1 reply; 58+ messages in thread
From: Florian Weimer @ 2017-08-11 15:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Colm MacCárthaigh, Kees Cook, Mike Kravetz, Rik van Riel,
	Will Drewry, akpm, dave.hansen, kirill, linux-api, linux-kernel,
	linux-mm, luto, mingo

On 08/11/2017 04:24 PM, Michal Hocko wrote:
> On Fri 11-08-17 16:11:44, Florian Weimer wrote:
>> On 08/11/2017 04:06 PM, Michal Hocko wrote:
>>
>>> I am sorry to look too insisting here (I have still hard time to reconcile
>>> myself with the madvise (ab)use) but if we in fact want minherit like
>>> interface why don't we simply add minherit and make the code which wants
>>> to use that interface easier to port? Is the only reason that hooking
>>> into madvise is less code? If yes is that a sufficient reason to justify
>>> the (ab)use of madvise? If there is a general consensus on that part I
>>> will shut up and won't object anymore. Arguably MADV_DONTFORK would fit
>>> into minherit API better as well.
>>
>> It does, OpenBSD calls it MAP_INHERIT_NONE.
>>
>> Could you implement MAP_INHERIT_COPY and MAP_INHERIT_SHARE as well?  Or
>> is changing from MAP_SHARED to MAP_PRIVATE and back impossible?
> 
> I haven't explored those two very much. Their semantic seems rather
> awkward, especially map_inherit_share one. I guess MAP_INHERIT_COPY
> would be doable. Do we have to support all modes or a missing support
> would disqualify the syscall completely?

I think it would be a bit awkward if we implemented MAP_INHERIT_ZERO and
it would not turn a shared mapping into a private mapping in the child,
or would not work on shared mappings at all, or deviate in any way from
the OpenBSD implementation.

MAP_INHERIT_SHARE for a MAP_PRIVATE mapping which has been modified is a
bit bizarre, and I don't know how OpenBSD implements any of this.  It
could well be that the exact behavior implemented in OpenBSD is a poor
fit for the Linux VM implementation.

Florian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK
  2017-08-11 15:24                         ` Florian Weimer
@ 2017-08-11 15:31                           ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-08-11 15:31 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Colm MacCárthaigh, Kees Cook, Mike Kravetz, Rik van Riel,
	Will Drewry, akpm, dave.hansen, kirill, linux-api, linux-kernel,
	linux-mm, luto, mingo

On Fri 11-08-17 17:24:29, Florian Weimer wrote:
> On 08/11/2017 04:24 PM, Michal Hocko wrote:
> > On Fri 11-08-17 16:11:44, Florian Weimer wrote:
> >> On 08/11/2017 04:06 PM, Michal Hocko wrote:
> >>
> >>> I am sorry to look too insisting here (I have still hard time to reconcile
> >>> myself with the madvise (ab)use) but if we in fact want minherit like
> >>> interface why don't we simply add minherit and make the code which wants
> >>> to use that interface easier to port? Is the only reason that hooking
> >>> into madvise is less code? If yes is that a sufficient reason to justify
> >>> the (ab)use of madvise? If there is a general consensus on that part I
> >>> will shut up and won't object anymore. Arguably MADV_DONTFORK would fit
> >>> into minherit API better as well.
> >>
> >> It does, OpenBSD calls it MAP_INHERIT_NONE.
> >>
> >> Could you implement MAP_INHERIT_COPY and MAP_INHERIT_SHARE as well?  Or
> >> is changing from MAP_SHARED to MAP_PRIVATE and back impossible?
> > 
> > I haven't explored those two very much. Their semantic seems rather
> > awkward, especially map_inherit_share one. I guess MAP_INHERIT_COPY
> > would be doable. Do we have to support all modes or a missing support
> > would disqualify the syscall completely?
> 
> I think it would be a bit awkward if we implemented MAP_INHERIT_ZERO and
> it would not turn a shared mapping into a private mapping in the child,
> or would not work on shared mappings at all, or deviate in any way from
> the OpenBSD implementation.

If we go with minherit API then I think we should adhere with the BSD
semantic and alloc MAP_INHERIT_ZERO for shared mappings as well

> MAP_INHERIT_SHARE for a MAP_PRIVATE mapping which has been modified is a
> bit bizarre, and I don't know how OpenBSD implements any of this.  It
> could well be that the exact behavior implemented in OpenBSD is a poor
> fit for the Linux VM implementation.

yeah, it would be MAP_INHERIT_SHARE that I would consider problematic
and rather go with ENOSUPP or even EINVAL.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 15:23     ` Rik van Riel
@ 2017-08-11 16:36       ` Mike Kravetz
  2017-08-11 16:59         ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Kravetz @ 2017-08-11 16:36 UTC (permalink / raw)
  To: Rik van Riel, Michal Hocko
  Cc: linux-kernel, linux-mm, fweimer, colm, akpm, keescook, luto, wad,
	mingo, kirill, dave.hansen

On 08/11/2017 08:23 AM, Rik van Riel wrote:
> On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
>> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
>> [...]
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 17921b0390b4..db1fb2802ecc 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -659,6 +659,13 @@ static __latent_entropy int dup_mmap(struct
>>> mm_struct *mm,
>>>  		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>>>  		tmp->vm_next = tmp->vm_prev = NULL;
>>>  		file = tmp->vm_file;
>>> +
>>> +		/* With VM_WIPEONFORK, the child gets an empty
>>> VMA. */
>>> +		if (tmp->vm_flags & VM_WIPEONFORK) {
>>> +			tmp->vm_file = file = NULL;
>>> +			tmp->vm_ops = NULL;
>>> +		}
>>
>> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the around?
>> At
>> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I miss
>> where those flags are cleared?
> 
> Huh, good spotting.  That makes me wonder why the test case that
> Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
> and returned zero-filled memory when read by the child process.

Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma on
your v2 patch.  Did not really want to start a discussion on the
implementation until the issue of exactly what VM_WIPEONFORK was supposed
to do was settled.

> 
> OK, I'll do a minimal implementation for now, which will return
> -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
> and/or an mmapped file.
> 
> It will work the way it is supposed to with anonymous MAP_PRIVATE
> memory, which is likely the only memory it will be used on, anyway.
> 

Seems reasonable.

You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 16:36       ` Mike Kravetz
@ 2017-08-11 16:59         ` Rik van Riel
  2017-08-11 17:07           ` Mike Kravetz
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2017-08-11 16:59 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko
  Cc: linux-kernel, linux-mm, fweimer, colm, akpm, keescook, luto, wad,
	mingo, kirill, dave.hansen

On Fri, 2017-08-11 at 09:36 -0700, Mike Kravetz wrote:
> On 08/11/2017 08:23 AM, Rik van Riel wrote:
> > On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
> > > On Sun 06-08-17 10:04:25, Rik van Riel wrote:
> > > [...]
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 17921b0390b4..db1fb2802ecc 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -659,6 +659,13 @@ static __latent_entropy int
> > > > dup_mmap(struct
> > > > mm_struct *mm,
> > > >  		tmp->vm_flags &= ~(VM_LOCKED |
> > > > VM_LOCKONFAULT);
> > > >  		tmp->vm_next = tmp->vm_prev = NULL;
> > > >  		file = tmp->vm_file;
> > > > +
> > > > +		/* With VM_WIPEONFORK, the child gets an empty
> > > > VMA. */
> > > > +		if (tmp->vm_flags & VM_WIPEONFORK) {
> > > > +			tmp->vm_file = file = NULL;
> > > > +			tmp->vm_ops = NULL;
> > > > +		}
> > > 
> > > What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the
> > > around?
> > > At
> > > least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I
> > > miss
> > > where those flags are cleared?
> > 
> > Huh, good spotting.  That makes me wonder why the test case that
> > Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
> > and returned zero-filled memory when read by the child process.
> 
> Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma
> on
> your v2 patch.  Did not really want to start a discussion on the
> implementation until the issue of exactly what VM_WIPEONFORK was
> supposed
> to do was settled.

It worked here, but now I don't understand why :)

> > 
> > OK, I'll do a minimal implementation for now, which will return
> > -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
> > and/or an mmapped file.
> > 
> > It will work the way it is supposed to with anonymous MAP_PRIVATE
> > memory, which is likely the only memory it will be used on, anyway.
> > 
> 
> Seems reasonable.
> 
> You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
> VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.

In other words (flags & MAP_SHARED || vma->vm_file) would catch
hugetlbfs, too?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 16:59         ` Rik van Riel
@ 2017-08-11 17:07           ` Mike Kravetz
  0 siblings, 0 replies; 58+ messages in thread
From: Mike Kravetz @ 2017-08-11 17:07 UTC (permalink / raw)
  To: Rik van Riel, Michal Hocko
  Cc: linux-kernel, linux-mm, fweimer, colm, akpm, keescook, luto, wad,
	mingo, kirill, dave.hansen

On 08/11/2017 09:59 AM, Rik van Riel wrote:
> On Fri, 2017-08-11 at 09:36 -0700, Mike Kravetz wrote:
>> On 08/11/2017 08:23 AM, Rik van Riel wrote:
>>> On Thu, 2017-08-10 at 17:23 +0200, Michal Hocko wrote:
>>>> On Sun 06-08-17 10:04:25, Rik van Riel wrote:
>>>> [...]
>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>> index 17921b0390b4..db1fb2802ecc 100644
>>>>> --- a/kernel/fork.c
>>>>> +++ b/kernel/fork.c
>>>>> @@ -659,6 +659,13 @@ static __latent_entropy int
>>>>> dup_mmap(struct
>>>>> mm_struct *mm,
>>>>>  		tmp->vm_flags &= ~(VM_LOCKED |
>>>>> VM_LOCKONFAULT);
>>>>>  		tmp->vm_next = tmp->vm_prev = NULL;
>>>>>  		file = tmp->vm_file;
>>>>> +
>>>>> +		/* With VM_WIPEONFORK, the child gets an empty
>>>>> VMA. */
>>>>> +		if (tmp->vm_flags & VM_WIPEONFORK) {
>>>>> +			tmp->vm_file = file = NULL;
>>>>> +			tmp->vm_ops = NULL;
>>>>> +		}
>>>>
>>>> What about VM_SHARED/|VM)MAYSHARE flags. Is it OK to keep the
>>>> around?
>>>> At
>>>> least do_anonymous_page SIGBUS on !vm_ops && VM_SHARED. Or do I
>>>> miss
>>>> where those flags are cleared?
>>>
>>> Huh, good spotting.  That makes me wonder why the test case that
>>> Mike and I ran worked just fine on a MAP_SHARED|MAP_ANONYMOUS VMA,
>>> and returned zero-filled memory when read by the child process.
>>
>> Well, I think I still got a BUG with a MAP_SHARED|MAP_ANONYMOUS vma
>> on
>> your v2 patch.  Did not really want to start a discussion on the
>> implementation until the issue of exactly what VM_WIPEONFORK was
>> supposed
>> to do was settled.
> 
> It worked here, but now I don't understand why :)
> 
>>>
>>> OK, I'll do a minimal implementation for now, which will return
>>> -EINVAL if MADV_WIPEONFORK is called on a VMA with MAP_SHARED
>>> and/or an mmapped file.
>>>
>>> It will work the way it is supposed to with anonymous MAP_PRIVATE
>>> memory, which is likely the only memory it will be used on, anyway.
>>>
>>
>> Seems reasonable.
>>
>> You should also add VM_HUGETLB to those returning -EINVAL.  IIRC, a
>> VM_HUGETLB vma even without VM_SHARED expects vm_file != NULL.
> 
> In other words (flags & MAP_SHARED || vma->vm_file) would catch
> hugetlbfs, too?

Yes, that should catch hugetlbfs.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-18 18:15           ` Andrew Morton
@ 2017-08-19  0:02             ` Rik van Riel
  0 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2017-08-19  0:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, mhocko, mike.kravetz, linux-mm, fweimer, colm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api,
	torvalds, willy

On Fri, 2017-08-18 at 11:15 -0700, Andrew Morton wrote:
> On Fri, 18 Aug 2017 12:28:29 -0400 Rik van Riel <riel@redhat.com>
> wrote:
> 
> > On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> > > On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel <riel@redhat.com>
> > > wrote:
> > > 
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > > > vm_area_struct *vma,
> > > > > > __		}
> > > > > > __		new_flags &= ~VM_DONTCOPY;
> > > > > > __		break;
> > > > > > +	case MADV_WIPEONFORK:
> > > > > > +		/* MADV_WIPEONFORK is only supported on
> > > > > > anonymous
> > > > > > memory. */
> > > > > > +		if (vma->vm_file || vma->vm_flags &
> > > > > > VM_SHARED)
> > > > > > {
> > > > > > +			error = -EINVAL;
> > > > > > +			goto out;
> > > > > > +		}
> > > > > > +		new_flags |= VM_WIPEONFORK;
> > > > > > +		break;
> > > > > > +	case MADV_KEEPONFORK:
> > > > > > +		new_flags &= ~VM_WIPEONFORK;
> > > > > > +		break;
> > > > > > __	case MADV_DONTDUMP:
> > > > > > __		new_flags |= VM_DONTDUMP;
> > > > > > __		break;
> > > > > 
> > > > > It seems odd to permit MADV_KEEPONFORK against other-than-
> > > > > anon
> > > > > vmas?
> > > > 
> > > > Given that the only way to set VM_WIPEONFORK is through
> > > > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > > > other-than-anon vma would be equivalent to a noop.
> > > > 
> > > > If new_flags == vma->vm_flags, madvise_behavior() will
> > > > immediately exit.
> > > 
> > > Yes, but calling MADV_WIPEONFORK against an other-than-anon vma
> > > is
> > > presumably a userspace bug.____A bug which will probably result
> > > in
> > > userspace having WIPEONFORK memory which it didn't want.____The
> > > kernel
> > > can trivially tell userspace that it has this bug so why not do
> > > so?
> > 
> > Uh, what?
> > 
> 
> Braino.  I meant MADV_KEEPONFORK.  Calling MADV_KEEPONFORK against an
> other-than-anon vma is a presumptive userspace bug and the kernel
> should report that.

All MADV_KEEPONFORK does is clear the flag set by
MADV_WIPEONFORK. Since there is no way to set the
WIPEONFORK flag on other-than-anon VMAs, that means
MADV_KEEPONFORK is always a noop for those VMAs.

You remind me that I should send in a man page
addition, though, when this code gets sent to
Linus.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-18 16:28         ` Rik van Riel
@ 2017-08-18 18:15           ` Andrew Morton
  2017-08-19  0:02             ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2017-08-18 18:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, mhocko, mike.kravetz, linux-mm, fweimer, colm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api,
	torvalds, willy

On Fri, 18 Aug 2017 12:28:29 -0400 Rik van Riel <riel@redhat.com> wrote:

> On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> > On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel <riel@redhat.com>
> > wrote:
> > 
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > > vm_area_struct *vma,
> > > > > __		}
> > > > > __		new_flags &= ~VM_DONTCOPY;
> > > > > __		break;
> > > > > +	case MADV_WIPEONFORK:
> > > > > +		/* MADV_WIPEONFORK is only supported on
> > > > > anonymous
> > > > > memory. */
> > > > > +		if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > > > {
> > > > > +			error = -EINVAL;
> > > > > +			goto out;
> > > > > +		}
> > > > > +		new_flags |= VM_WIPEONFORK;
> > > > > +		break;
> > > > > +	case MADV_KEEPONFORK:
> > > > > +		new_flags &= ~VM_WIPEONFORK;
> > > > > +		break;
> > > > > __	case MADV_DONTDUMP:
> > > > > __		new_flags |= VM_DONTDUMP;
> > > > > __		break;
> > > > 
> > > > It seems odd to permit MADV_KEEPONFORK against other-than-anon
> > > > vmas?
> > > 
> > > Given that the only way to set VM_WIPEONFORK is through
> > > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > > other-than-anon vma would be equivalent to a noop.
> > > 
> > > If new_flags == vma->vm_flags, madvise_behavior() will
> > > immediately exit.
> > 
> > Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
> > presumably a userspace bug.____A bug which will probably result in
> > userspace having WIPEONFORK memory which it didn't want.____The kernel
> > can trivially tell userspace that it has this bug so why not do so?
> 
> Uh, what?
> 

Braino.  I meant MADV_KEEPONFORK.  Calling MADV_KEEPONFORK against an
other-than-anon vma is a presumptive userspace bug and the kernel
should report that.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 21:28 ` [PATCH 2/2] mm,fork: " riel
  2017-08-15 22:51   ` Andrew Morton
@ 2017-08-18 17:25   ` Mike Kravetz
  1 sibling, 0 replies; 58+ messages in thread
From: Mike Kravetz @ 2017-08-18 17:25 UTC (permalink / raw)
  To: riel, linux-kernel
  Cc: mhocko, linux-mm, fweimer, colm, akpm, keescook, luto, wad,
	mingo, kirill, dave.hansen, linux-api, torvalds, willy

On 08/11/2017 02:28 PM, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> MADV_WIPEONFORK only works on private, anonymous VMAs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.
> 
> This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> 
>     https://man.openbsd.org/minherit.2
> 
> Reported-by: Florian Weimer <fweimer@redhat.com>
> Reported-by: Colm MacCártaigh <colm@allcosts.net>
> Signed-off-by: Rik van Riel <riel@redhat.com>

My primary concern with the first suggested patch was trying to define
semantics if MADV_WIPEONFORK was applied to a shared or file backed
mapping.  This is no longer allowed.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

> ---
>  arch/alpha/include/uapi/asm/mman.h     |  3 +++
>  arch/mips/include/uapi/asm/mman.h      |  3 +++
>  arch/parisc/include/uapi/asm/mman.h    |  3 +++
>  arch/xtensa/include/uapi/asm/mman.h    |  3 +++
>  fs/proc/task_mmu.c                     |  1 +
>  include/linux/mm.h                     |  2 +-
>  include/trace/events/mmflags.h         |  8 +-------
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  kernel/fork.c                          | 10 ++++++++--
>  mm/madvise.c                           | 13 +++++++++++++
>  10 files changed, 39 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 02760f6e6ca4..2a708a792882 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -64,6 +64,9 @@
>  					   overrides the coredump filter bits */
>  #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 655e2fb5395b..d59c57d60d7d 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -91,6 +91,9 @@
>  					   overrides the coredump filter bits */
>  #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 5979745815a5..e205e0179642 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -60,6 +60,9 @@
>  					   overrides the coredump filter bits */
>  #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  #define MAP_VARIABLE	0
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 24365b30aae9..ed23e0a1b30d 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -103,6 +103,9 @@
>  					   overrides the coredump filter bits */
>  #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
>  
> +#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b836fd61ed87..2591e70216ff 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  		[ilog2(VM_NORESERVE)]	= "nr",
>  		[ilog2(VM_HUGETLB)]	= "ht",
>  		[ilog2(VM_ARCH_1)]	= "ar",
> +		[ilog2(VM_WIPEONFORK)]	= "wf",
>  		[ilog2(VM_DONTDUMP)]	= "dd",
>  #ifdef CONFIG_MEM_SOFT_DIRTY
>  		[ilog2(VM_SOFTDIRTY)]	= "sd",
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7550eeb06ccf..58788c1b9e9d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
>  #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
>  #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
> -#define VM_ARCH_2	0x02000000
> +#define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
>  #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
>  
>  #ifdef CONFIG_MEM_SOFT_DIRTY
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 8e50d01c645f..4c2e4737d7bc 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -125,12 +125,6 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
>  #define __VM_ARCH_SPECIFIC_1 {VM_ARCH_1,	"arch_1"	}
>  #endif
>  
> -#if defined(CONFIG_X86)
> -#define __VM_ARCH_SPECIFIC_2 {VM_MPX,		"mpx"		}
> -#else
> -#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2,	"arch_2"	}
> -#endif
> -
>  #ifdef CONFIG_MEM_SOFT_DIRTY
>  #define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
>  #else
> @@ -162,7 +156,7 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
>  	{VM_NORESERVE,			"noreserve"	},		\
>  	{VM_HUGETLB,			"hugetlb"	},		\
>  	__VM_ARCH_SPECIFIC_1				,		\
> -	__VM_ARCH_SPECIFIC_2				,		\
> +	{VM_WIPEONFORK,			"wipeonfork"	},		\
>  	{VM_DONTDUMP,			"dontdump"	},		\
>  IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
>  	{VM_MIXEDMAP,			"mixedmap"	},		\
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 8c27db0c5c08..49e2b1d78093 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -58,6 +58,9 @@
>  					   overrides the coredump filter bits */
>  #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
>  
> +#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
> +#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 17921b0390b4..06376ae4877d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -654,7 +654,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		retval = dup_userfaultfd(tmp, &uf);
>  		if (retval)
>  			goto fail_nomem_anon_vma_fork;
> -		if (anon_vma_fork(tmp, mpnt))
> +		if (tmp->vm_flags & VM_WIPEONFORK) {
> +			/* VM_WIPEONFORK gets a clean slate in the child. */
> +			tmp->anon_vma = NULL;
> +			if (anon_vma_prepare(tmp))
> +				goto fail_nomem_anon_vma_fork;
> +		} else if (anon_vma_fork(tmp, mpnt))
>  			goto fail_nomem_anon_vma_fork;
>  		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>  		tmp->vm_next = tmp->vm_prev = NULL;
> @@ -698,7 +703,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, oldmm, mpnt);
> +		if (!(tmp->vm_flags & VM_WIPEONFORK))
> +			retval = copy_page_range(mm, oldmm, mpnt);
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9976852f1e1c..9b82cfa88ccf 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -80,6 +80,17 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  		}
>  		new_flags &= ~VM_DONTCOPY;
>  		break;
> +	case MADV_WIPEONFORK:
> +		/* MADV_WIPEONFORK is only supported on anonymous memory. */
> +		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags |= VM_WIPEONFORK;
> +		break;
> +	case MADV_KEEPONFORK:
> +		new_flags &= ~VM_WIPEONFORK;
> +		break;
>  	case MADV_DONTDUMP:
>  		new_flags |= VM_DONTDUMP;
>  		break;
> @@ -689,6 +700,8 @@ madvise_behavior_valid(int behavior)
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> +	case MADV_WIPEONFORK:
> +	case MADV_KEEPONFORK:
>  #ifdef CONFIG_MEMORY_FAILURE
>  	case MADV_SOFT_OFFLINE:
>  	case MADV_HWPOISON:
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-17 22:50       ` Andrew Morton
@ 2017-08-18 16:28         ` Rik van Riel
  2017-08-18 18:15           ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2017-08-18 16:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, mhocko, mike.kravetz, linux-mm, fweimer, colm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api,
	torvalds, willy

On Thu, 2017-08-17 at 15:50 -0700, Andrew Morton wrote:
> On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel <riel@redhat.com>
> wrote:
> 
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > > vm_area_struct *vma,
> > > > __		}
> > > > __		new_flags &= ~VM_DONTCOPY;
> > > > __		break;
> > > > +	case MADV_WIPEONFORK:
> > > > +		/* MADV_WIPEONFORK is only supported on
> > > > anonymous
> > > > memory. */
> > > > +		if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > > > {
> > > > +			error = -EINVAL;
> > > > +			goto out;
> > > > +		}
> > > > +		new_flags |= VM_WIPEONFORK;
> > > > +		break;
> > > > +	case MADV_KEEPONFORK:
> > > > +		new_flags &= ~VM_WIPEONFORK;
> > > > +		break;
> > > > __	case MADV_DONTDUMP:
> > > > __		new_flags |= VM_DONTDUMP;
> > > > __		break;
> > > 
> > > It seems odd to permit MADV_KEEPONFORK against other-than-anon
> > > vmas?
> > 
> > Given that the only way to set VM_WIPEONFORK is through
> > MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> > other-than-anon vma would be equivalent to a noop.
> > 
> > If new_flags == vma->vm_flags, madvise_behavior() will
> > immediately exit.
> 
> Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
> presumably a userspace bug.  A bug which will probably result in
> userspace having WIPEONFORK memory which it didn't want.  The kernel
> can trivially tell userspace that it has this bug so why not do so?

Uh, what?

Calling MADV_WIPEONFORK on an other-than-anon vma results
in NOT getting VM_WIPEONFORK semantics on that VMA.

The code you are commenting on is the bit that CLEARS
the VM_WIPEONFORK code, not the bit where it gets set.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-16  2:18     ` Rik van Riel
@ 2017-08-17 22:50       ` Andrew Morton
  2017-08-18 16:28         ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2017-08-17 22:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, mhocko, mike.kravetz, linux-mm, fweimer, colm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api,
	torvalds, willy

On Tue, 15 Aug 2017 22:18:19 -0400 Rik van Riel <riel@redhat.com> wrote:

> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > > vm_area_struct *vma,
> > > __		}
> > > __		new_flags &= ~VM_DONTCOPY;
> > > __		break;
> > > +	case MADV_WIPEONFORK:
> > > +		/* MADV_WIPEONFORK is only supported on anonymous
> > > memory. */
> > > +		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > > +			error = -EINVAL;
> > > +			goto out;
> > > +		}
> > > +		new_flags |= VM_WIPEONFORK;
> > > +		break;
> > > +	case MADV_KEEPONFORK:
> > > +		new_flags &= ~VM_WIPEONFORK;
> > > +		break;
> > > __	case MADV_DONTDUMP:
> > > __		new_flags |= VM_DONTDUMP;
> > > __		break;
> > 
> > It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?
> 
> Given that the only way to set VM_WIPEONFORK is through
> MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
> other-than-anon vma would be equivalent to a noop.
> 
> If new_flags == vma->vm_flags, madvise_behavior() will
> immediately exit.

Yes, but calling MADV_WIPEONFORK against an other-than-anon vma is
presumably a userspace bug.  A bug which will probably result in
userspace having WIPEONFORK memory which it didn't want.  The kernel
can trivially tell userspace that it has this bug so why not do so?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-15 22:51   ` Andrew Morton
@ 2017-08-16  2:18     ` Rik van Riel
  2017-08-17 22:50       ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2017-08-16  2:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, mhocko, mike.kravetz, linux-mm, fweimer, colm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api,
	torvalds, willy

On Tue, 2017-08-15 at 15:51 -0700, Andrew Morton wrote:
> On Fri, 11 Aug 2017 17:28:29 -0400 riel@redhat.com wrote:
> 
> > A further complication is the proliferation of clone flags,
> > programs bypassing glibc's functions to call clone directly,
> > and programs calling unshare, causing the glibc pthread_atfork
> > hook to not get called.
> > 
> > It would be better to have the kernel take care of this
> > automatically.
> 
> I'll add "The patch also adds MADV_KEEPONFORK, to undo the effects of
> a
> prior MADV_WIPEONFORK." here.
> 
> I guess it isn't worth mentioning that these things can cause VMA
> merges and splits. 

That's the same as every other Linux specific madvise operation.

> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -80,6 +80,17 @@ static long madvise_behavior(struct
> > vm_area_struct *vma,
> >  		}
> >  		new_flags &= ~VM_DONTCOPY;
> >  		break;
> > +	case MADV_WIPEONFORK:
> > +		/* MADV_WIPEONFORK is only supported on anonymous
> > memory. */
> > +		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > +			error = -EINVAL;
> > +			goto out;
> > +		}
> > +		new_flags |= VM_WIPEONFORK;
> > +		break;
> > +	case MADV_KEEPONFORK:
> > +		new_flags &= ~VM_WIPEONFORK;
> > +		break;
> >  	case MADV_DONTDUMP:
> >  		new_flags |= VM_DONTDUMP;
> >  		break;
> 
> It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?

Given that the only way to set VM_WIPEONFORK is through
MADV_WIPEONFORK, calling MADV_KEEPONFORK on an
other-than-anon vma would be equivalent to a noop.

If new_flags == vma->vm_flags, madvise_behavior() will
immediately exit.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 21:28 ` [PATCH 2/2] mm,fork: " riel
@ 2017-08-15 22:51   ` Andrew Morton
  2017-08-16  2:18     ` Rik van Riel
  2017-08-18 17:25   ` Mike Kravetz
  1 sibling, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2017-08-15 22:51 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, mhocko, mike.kravetz, linux-mm, fweimer, colm,
	keescook, luto, wad, mingo, kirill, dave.hansen, linux-api,
	torvalds, willy

On Fri, 11 Aug 2017 17:28:29 -0400 riel@redhat.com wrote:

> From: Rik van Riel <riel@redhat.com>
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> MADV_WIPEONFORK only works on private, anonymous VMAs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.

I'll add "The patch also adds MADV_KEEPONFORK, to undo the effects of a
prior MADV_WIPEONFORK." here.

I guess it isn't worth mentioning that these things can cause VMA
merges and splits. 

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -80,6 +80,17 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  		}
>  		new_flags &= ~VM_DONTCOPY;
>  		break;
> +	case MADV_WIPEONFORK:
> +		/* MADV_WIPEONFORK is only supported on anonymous memory. */
> +		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags |= VM_WIPEONFORK;
> +		break;
> +	case MADV_KEEPONFORK:
> +		new_flags &= ~VM_WIPEONFORK;
> +		break;
>  	case MADV_DONTDUMP:
>  		new_flags |= VM_DONTDUMP;
>  		break;

It seems odd to permit MADV_KEEPONFORK against other-than-anon vmas?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-04 19:07 ` [PATCH 2/2] mm,fork: " riel
  2017-08-04 23:09   ` Mike Kravetz
@ 2017-08-14 15:45   ` kbuild test robot
  1 sibling, 0 replies; 58+ messages in thread
From: kbuild test robot @ 2017-08-14 15:45 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, linux-mm, fweimer, colm, akpm, rppt,
	keescook, luto, wad, mingo

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/mm-fork-security-introduce-MADV_WIPEONFORK/20170806-035914
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   In file included from mm/debug.c:12:0:
>> include/trace/events/mmflags.h:131:31: error: 'VM_ARCH_2' undeclared here (not in a function)
    #define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
                                  ^
   include/trace/events/mmflags.h:165:2: note: in expansion of macro '__VM_ARCH_SPECIFIC_2'
     __VM_ARCH_SPECIFIC_2    ,  \
     ^
   mm/debug.c:39:2: note: in expansion of macro '__def_vmaflag_names'
     __def_vmaflag_names,
     ^

vim +/VM_ARCH_2 +131 include/trace/events/mmflags.h

bcf669179 Kirill A. Shutemov 2016-03-17  127  
bcf669179 Kirill A. Shutemov 2016-03-17  128  #if defined(CONFIG_X86)
bcf669179 Kirill A. Shutemov 2016-03-17  129  #define __VM_ARCH_SPECIFIC_2 {VM_MPX,		"mpx"		}
bcf669179 Kirill A. Shutemov 2016-03-17  130  #else
bcf669179 Kirill A. Shutemov 2016-03-17 @131  #define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2,	"arch_2"	}
420adbe9f Vlastimil Babka    2016-03-15  132  #endif
420adbe9f Vlastimil Babka    2016-03-15  133  

:::::: The code at line 131 was first introduced by commit
:::::: bcf6691797f425b301f629bb783b7ff2d0bcfa5a mm, tracing: refresh __def_vmaflag_names

:::::: TO: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 12050 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 21:28 [PATCH v4 " riel
@ 2017-08-11 21:28 ` riel
  2017-08-15 22:51   ` Andrew Morton
  2017-08-18 17:25   ` Mike Kravetz
  0 siblings, 2 replies; 58+ messages in thread
From: riel @ 2017-08-11 21:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: mhocko, mike.kravetz, linux-mm, fweimer, colm, akpm, keescook,
	luto, wad, mingo, kirill, dave.hansen, linux-api, torvalds,
	willy

From: Rik van Riel <riel@redhat.com>

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

MADV_WIPEONFORK only works on private, anonymous VMAs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 +++
 arch/mips/include/uapi/asm/mman.h      |  3 +++
 arch/parisc/include/uapi/asm/mman.h    |  3 +++
 arch/xtensa/include/uapi/asm/mman.h    |  3 +++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  2 +-
 include/trace/events/mmflags.h         |  8 +-------
 include/uapi/asm-generic/mman-common.h |  3 +++
 kernel/fork.c                          | 10 ++++++++--
 mm/madvise.c                           | 13 +++++++++++++
 10 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 02760f6e6ca4..2a708a792882 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -64,6 +64,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 655e2fb5395b..d59c57d60d7d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -91,6 +91,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 5979745815a5..e205e0179642 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 24365b30aae9..ed23e0a1b30d 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..2591e70216ff 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_NORESERVE)]	= "nr",
 		[ilog2(VM_HUGETLB)]	= "ht",
 		[ilog2(VM_ARCH_1)]	= "ar",
+		[ilog2(VM_WIPEONFORK)]	= "wf",
 		[ilog2(VM_DONTDUMP)]	= "dd",
 #ifdef CONFIG_MEM_SOFT_DIRTY
 		[ilog2(VM_SOFTDIRTY)]	= "sd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7550eeb06ccf..58788c1b9e9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
-#define VM_ARCH_2	0x02000000
+#define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 8e50d01c645f..4c2e4737d7bc 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,12 +125,6 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
 #define __VM_ARCH_SPECIFIC_1 {VM_ARCH_1,	"arch_1"	}
 #endif
 
-#if defined(CONFIG_X86)
-#define __VM_ARCH_SPECIFIC_2 {VM_MPX,		"mpx"		}
-#else
-#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2,	"arch_2"	}
-#endif
-
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
 #else
@@ -162,7 +156,7 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
 	{VM_NORESERVE,			"noreserve"	},		\
 	{VM_HUGETLB,			"hugetlb"	},		\
 	__VM_ARCH_SPECIFIC_1				,		\
-	__VM_ARCH_SPECIFIC_2				,		\
+	{VM_WIPEONFORK,			"wipeonfork"	},		\
 	{VM_DONTDUMP,			"dontdump"	},		\
 IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..49e2b1d78093 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -58,6 +58,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..06376ae4877d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -654,7 +654,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		retval = dup_userfaultfd(tmp, &uf);
 		if (retval)
 			goto fail_nomem_anon_vma_fork;
-		if (anon_vma_fork(tmp, mpnt))
+		if (tmp->vm_flags & VM_WIPEONFORK) {
+			/* VM_WIPEONFORK gets a clean slate in the child. */
+			tmp->anon_vma = NULL;
+			if (anon_vma_prepare(tmp))
+				goto fail_nomem_anon_vma_fork;
+		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		tmp->vm_next = tmp->vm_prev = NULL;
@@ -698,7 +703,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		if (!(tmp->vm_flags & VM_WIPEONFORK))
+			retval = copy_page_range(mm, oldmm, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..9b82cfa88ccf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -80,6 +80,17 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		}
 		new_flags &= ~VM_DONTCOPY;
 		break;
+	case MADV_WIPEONFORK:
+		/* MADV_WIPEONFORK is only supported on anonymous memory. */
+		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -689,6 +700,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_WIPEONFORK:
+	case MADV_KEEPONFORK:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 20:27     ` Rik van Riel
@ 2017-08-11 20:50       ` Linus Torvalds
  0 siblings, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2017-08-11 20:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linux Kernel Mailing List, Michal Hocko, Mike Kravetz, linux-mm,
	Florian Weimer, Colm MacCárthaigh, Andrew Morton, Kees Cook,
	Andy Lutomirski, Will Drewry, Ingo Molnar, Kirill A. Shutemov,
	Dave Hansen, Linux API, Matthew Wilcox

On Fri, Aug 11, 2017 at 1:27 PM, Rik van Riel <riel@redhat.com> wrote:
>>
>> Yes, you don't do the page table copies. Fine. But you leave vma with
>> the the anon_vma pointer - doesn't that mean that it's still
>> connected
>> to the original anonvma chain, and we might end up swapping something
>> in?
>
> Swapping something in would require there to be a swap entry in
> the page table entries, which we are not copying, so this should
> not be a correctness issue.

Yeah, I thought the rmap code just used the offset from the start to
avoid even doing swap entries, but I guess we don't actually ever
populate the page tables without the swap entry being there.

> There is another test in copy_page_range already which ends up
> skipping the page table copy when it should not be done.

Well, the VM_DONTCOPY test is in dup_mmap(), and I think I'd rather
match up the VM_WIPEONFORK logic with VM_DONTCOPY than with the
copy_page_range() tests.

Because I assume you are talking about the "if it's a shared mapping,
we don't need to copy the page tables and can just do it at page fault
time instead" part? That's a rather different thing, which isn't so
much about semantics, as about just a trade-off about when to touch
the page tables.

But yes, that one *might* make sense in dup_mmap() too. I just don't
think it's really analogous to the WIPEONFORK and DONTCOPY tests.

                      Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 19:42   ` Linus Torvalds
@ 2017-08-11 20:27     ` Rik van Riel
  2017-08-11 20:50       ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2017-08-11 20:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Michal Hocko, Mike Kravetz, linux-mm,
	Florian Weimer, colm, Andrew Morton, Kees Cook, Andy Lutomirski,
	Will Drewry, Ingo Molnar, Kirill A. Shutemov, Dave Hansen,
	Linux API, Matthew Wilcox

On Fri, 2017-08-11 at 12:42 -0700, Linus Torvalds wrote:
> On Fri, Aug 11, 2017 at 12:19 PM,  <riel@redhat.com> wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0e517be91a89..f9b0ad7feb57 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct
> > *dst_mm, struct mm_struct *src_mm,
> >                         !vma->anon_vma)
> >                 return 0;
> > 
> > +       /*
> > +        * With VM_WIPEONFORK, the child inherits the VMA from the
> > +        * parent, but not its contents.
> > +        *
> > +        * A child accessing VM_WIPEONFORK memory will see all
> > zeroes;
> > +        * a child accessing VM_DONTCOPY memory receives a
> > segfault.
> > +        */
> > +       if (vma->vm_flags & VM_WIPEONFORK)
> > +               return 0;
> > +
> 
> Is this right?
> 
> Yes, you don't do the page table copies. Fine. But you leave vma with
> the the anon_vma pointer - doesn't that mean that it's still
> connected
> to the original anonvma chain, and we might end up swapping something
> in?

Swapping something in would require there to be a swap entry in
the page table entries, which we are not copying, so this should
not be a correctness issue.

> And even if that ends up not being an issue, I'd expect that you'd
> want to break the anon_vma chain just to not make it grow
> unnecessarily.

This is a good point. I can send a v4 that skips the anon_vma_fork()
call if VM_WIPEONFORK, and calls anon_vma_prepare(), instead.

> So my gut feel is that doing this in "copy_page_range()" is wrong,
> and
> the logic should be moved up to dup_mmap(), where we can also
> short-circuit the anon_vma chain entirely.
> 
> No?

There is another test in copy_page_range already which ends up
skipping the page table copy when it should not be done.

If you want, I can move that test into a should_copy_page_range()
function, and call that from dup_mmap(), skipping the call to
copy_page_range() if should_copy_page_range() returns false.

Having only one of the two sets of tests in dup_mmap(), and
the other in copy_page_range() seems wrong.

Just let me know what you prefer, and I'll put that in v4.

> The madvice() interface looks fine to me.

That was the main reason for adding you to the thread :)

kind regards,

Rik

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 19:19 ` [PATCH 2/2] mm,fork: " riel
@ 2017-08-11 19:42   ` Linus Torvalds
  2017-08-11 20:27     ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2017-08-11 19:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linux Kernel Mailing List, Michal Hocko, Mike Kravetz, linux-mm,
	Florian Weimer, colm, Andrew Morton, Kees Cook, Andy Lutomirski,
	Will Drewry, Ingo Molnar, Kirill A. Shutemov, Dave Hansen,
	Linux API, Matthew Wilcox

On Fri, Aug 11, 2017 at 12:19 PM,  <riel@redhat.com> wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 0e517be91a89..f9b0ad7feb57 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>                         !vma->anon_vma)
>                 return 0;
>
> +       /*
> +        * With VM_WIPEONFORK, the child inherits the VMA from the
> +        * parent, but not its contents.
> +        *
> +        * A child accessing VM_WIPEONFORK memory will see all zeroes;
> +        * a child accessing VM_DONTCOPY memory receives a segfault.
> +        */
> +       if (vma->vm_flags & VM_WIPEONFORK)
> +               return 0;
> +

Is this right?

Yes, you don't do the page table copies. Fine. But you leave vma with
the the anon_vma pointer - doesn't that mean that it's still connected
to the original anonvma chain, and we might end up swapping something
in?

And even if that ends up not being an issue, I'd expect that you'd
want to break the anon_vma chain just to not make it grow
unnecessarily.

So my gut feel is that doing this in "copy_page_range()" is wrong, and
the logic should be moved up to dup_mmap(), where we can also
short-circuit the anon_vma chain entirely.

No?

The madvice() interface looks fine to me.

                  Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-11 19:19 [PATCH v3 0/2] mm,fork,security: " riel
@ 2017-08-11 19:19 ` riel
  2017-08-11 19:42   ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: riel @ 2017-08-11 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: mhocko, mike.kravetz, linux-mm, fweimer, colm, akpm, keescook,
	luto, wad, mingo, kirill, dave.hansen, linux-api, torvalds,
	willy

From: Rik van Riel <riel@redhat.com>

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

MADV_WIPEONFORK only works on private, anonymous VMAs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 +++
 arch/mips/include/uapi/asm/mman.h      |  3 +++
 arch/parisc/include/uapi/asm/mman.h    |  3 +++
 arch/xtensa/include/uapi/asm/mman.h    |  3 +++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  2 +-
 include/trace/events/mmflags.h         |  8 +-------
 include/uapi/asm-generic/mman-common.h |  3 +++
 kernel/fork.c                          |  1 +
 mm/madvise.c                           | 13 +++++++++++++
 mm/memory.c                            | 10 ++++++++++
 11 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 02760f6e6ca4..2a708a792882 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -64,6 +64,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 655e2fb5395b..d59c57d60d7d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -91,6 +91,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 5979745815a5..e205e0179642 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 24365b30aae9..ed23e0a1b30d 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..2591e70216ff 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_NORESERVE)]	= "nr",
 		[ilog2(VM_HUGETLB)]	= "ht",
 		[ilog2(VM_ARCH_1)]	= "ar",
+		[ilog2(VM_WIPEONFORK)]	= "wf",
 		[ilog2(VM_DONTDUMP)]	= "dd",
 #ifdef CONFIG_MEM_SOFT_DIRTY
 		[ilog2(VM_SOFTDIRTY)]	= "sd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7550eeb06ccf..58788c1b9e9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
-#define VM_ARCH_2	0x02000000
+#define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 8e50d01c645f..4c2e4737d7bc 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,12 +125,6 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
 #define __VM_ARCH_SPECIFIC_1 {VM_ARCH_1,	"arch_1"	}
 #endif
 
-#if defined(CONFIG_X86)
-#define __VM_ARCH_SPECIFIC_2 {VM_MPX,		"mpx"		}
-#else
-#define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2,	"arch_2"	}
-#endif
-
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
 #else
@@ -162,7 +156,7 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
 	{VM_NORESERVE,			"noreserve"	},		\
 	{VM_HUGETLB,			"hugetlb"	},		\
 	__VM_ARCH_SPECIFIC_1				,		\
-	__VM_ARCH_SPECIFIC_2				,		\
+	{VM_WIPEONFORK,			"wipeonfork"	},		\
 	{VM_DONTDUMP,			"dontdump"	},		\
 IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..49e2b1d78093 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -58,6 +58,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..74be75373ee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -659,6 +659,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
+
 		if (file) {
 			struct inode *inode = file_inode(file);
 			struct address_space *mapping = file->f_mapping;
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..9b82cfa88ccf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -80,6 +80,17 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		}
 		new_flags &= ~VM_DONTCOPY;
 		break;
+	case MADV_WIPEONFORK:
+		/* MADV_WIPEONFORK is only supported on anonymous memory. */
+		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -689,6 +700,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_WIPEONFORK:
+	case MADV_KEEPONFORK:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..f9b0ad7feb57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			!vma->anon_vma)
 		return 0;
 
+	/*
+	 * With VM_WIPEONFORK, the child inherits the VMA from the
+	 * parent, but not its contents.
+	 *
+	 * A child accessing VM_WIPEONFORK memory will see all zeroes;
+	 * a child accessing VM_DONTCOPY memory receives a segfault.
+	 */
+	if (vma->vm_flags & VM_WIPEONFORK)
+		return 0;
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-04 19:01 ` [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK riel
  2017-08-05 18:46   ` kbuild test robot
@ 2017-08-05 19:33   ` kbuild test robot
  1 sibling, 0 replies; 58+ messages in thread
From: kbuild test robot @ 2017-08-05 19:33 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, linux-mm, fweimer, colm, akpm, rppt,
	keescook, luto, wad, mingo

[-- Attachment #1: Type: text/plain, Size: 1569 bytes --]

Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=tile 

All errors (new ones prefixed by >>):

>> mm/debug.c:39:2: error: 'VM_ARCH_2' undeclared here (not in a function)

vim +/VM_ARCH_2 +39 mm/debug.c

420adbe9f Vlastimil Babka 2016-03-15  37  
edf14cdbf Vlastimil Babka 2016-03-15  38  const struct trace_print_flags vmaflag_names[] = {
edf14cdbf Vlastimil Babka 2016-03-15 @39  	__def_vmaflag_names,
edf14cdbf Vlastimil Babka 2016-03-15  40  	{0, NULL}
82742a3a5 Sasha Levin     2014-10-09  41  };
82742a3a5 Sasha Levin     2014-10-09  42  

:::::: The code at line 39 was first introduced by commit
:::::: edf14cdbf9a0e5ab52698ca66d07a76ade0d5c46 mm, printk: introduce new format string for flags

:::::: TO: Vlastimil Babka <vbabka@suse.cz>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 49780 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-04 19:01 ` [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK riel
@ 2017-08-05 18:46   ` kbuild test robot
  2017-08-05 19:33   ` kbuild test robot
  1 sibling, 0 replies; 58+ messages in thread
From: kbuild test robot @ 2017-08-05 18:46 UTC (permalink / raw)
  To: riel
  Cc: kbuild-all, linux-kernel, linux-mm, fweimer, colm, akpm, rppt,
	keescook, luto, wad, mingo

[-- Attachment #1: Type: text/plain, Size: 5134 bytes --]

Hi Rik,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc3 next-20170804]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/x86-mpx-make-mpx-depend-on-x86-64-to-free-up-VMA-flag/20170806-011851
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All error/warnings (new ones prefixed by >>):

   In file included from mm/debug.c:12:0:
>> include/trace/events/mmflags.h:131:31: error: 'VM_ARCH_2' undeclared here (not in a function)
    #define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" }
                                  ^
>> include/trace/events/mmflags.h:165:2: note: in expansion of macro '__VM_ARCH_SPECIFIC_2'
     __VM_ARCH_SPECIFIC_2    ,  \
     ^
>> mm/debug.c:39:2: note: in expansion of macro '__def_vmaflag_names'
     __def_vmaflag_names,
     ^

vim +/VM_ARCH_2 +131 include/trace/events/mmflags.h

bcf669179 Kirill A. Shutemov 2016-03-17  127  
bcf669179 Kirill A. Shutemov 2016-03-17  128  #if defined(CONFIG_X86)
bcf669179 Kirill A. Shutemov 2016-03-17  129  #define __VM_ARCH_SPECIFIC_2 {VM_MPX,		"mpx"		}
bcf669179 Kirill A. Shutemov 2016-03-17  130  #else
bcf669179 Kirill A. Shutemov 2016-03-17 @131  #define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2,	"arch_2"	}
420adbe9f Vlastimil Babka    2016-03-15  132  #endif
420adbe9f Vlastimil Babka    2016-03-15  133  
420adbe9f Vlastimil Babka    2016-03-15  134  #ifdef CONFIG_MEM_SOFT_DIRTY
420adbe9f Vlastimil Babka    2016-03-15  135  #define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name },
420adbe9f Vlastimil Babka    2016-03-15  136  #else
420adbe9f Vlastimil Babka    2016-03-15  137  #define IF_HAVE_VM_SOFTDIRTY(flag,name)
420adbe9f Vlastimil Babka    2016-03-15  138  #endif
420adbe9f Vlastimil Babka    2016-03-15  139  
420adbe9f Vlastimil Babka    2016-03-15  140  #define __def_vmaflag_names						\
420adbe9f Vlastimil Babka    2016-03-15  141  	{VM_READ,			"read"		},		\
420adbe9f Vlastimil Babka    2016-03-15  142  	{VM_WRITE,			"write"		},		\
420adbe9f Vlastimil Babka    2016-03-15  143  	{VM_EXEC,			"exec"		},		\
420adbe9f Vlastimil Babka    2016-03-15  144  	{VM_SHARED,			"shared"	},		\
420adbe9f Vlastimil Babka    2016-03-15  145  	{VM_MAYREAD,			"mayread"	},		\
420adbe9f Vlastimil Babka    2016-03-15  146  	{VM_MAYWRITE,			"maywrite"	},		\
420adbe9f Vlastimil Babka    2016-03-15  147  	{VM_MAYEXEC,			"mayexec"	},		\
420adbe9f Vlastimil Babka    2016-03-15  148  	{VM_MAYSHARE,			"mayshare"	},		\
420adbe9f Vlastimil Babka    2016-03-15  149  	{VM_GROWSDOWN,			"growsdown"	},		\
bcf669179 Kirill A. Shutemov 2016-03-17  150  	{VM_UFFD_MISSING,		"uffd_missing"	},		\
420adbe9f Vlastimil Babka    2016-03-15  151  	{VM_PFNMAP,			"pfnmap"	},		\
420adbe9f Vlastimil Babka    2016-03-15  152  	{VM_DENYWRITE,			"denywrite"	},		\
bcf669179 Kirill A. Shutemov 2016-03-17  153  	{VM_UFFD_WP,			"uffd_wp"	},		\
420adbe9f Vlastimil Babka    2016-03-15  154  	{VM_LOCKED,			"locked"	},		\
420adbe9f Vlastimil Babka    2016-03-15  155  	{VM_IO,				"io"		},		\
420adbe9f Vlastimil Babka    2016-03-15  156  	{VM_SEQ_READ,			"seqread"	},		\
420adbe9f Vlastimil Babka    2016-03-15  157  	{VM_RAND_READ,			"randread"	},		\
420adbe9f Vlastimil Babka    2016-03-15  158  	{VM_DONTCOPY,			"dontcopy"	},		\
420adbe9f Vlastimil Babka    2016-03-15  159  	{VM_DONTEXPAND,			"dontexpand"	},		\
bcf669179 Kirill A. Shutemov 2016-03-17  160  	{VM_LOCKONFAULT,		"lockonfault"	},		\
420adbe9f Vlastimil Babka    2016-03-15  161  	{VM_ACCOUNT,			"account"	},		\
420adbe9f Vlastimil Babka    2016-03-15  162  	{VM_NORESERVE,			"noreserve"	},		\
420adbe9f Vlastimil Babka    2016-03-15  163  	{VM_HUGETLB,			"hugetlb"	},		\
bcf669179 Kirill A. Shutemov 2016-03-17  164  	__VM_ARCH_SPECIFIC_1				,		\
bcf669179 Kirill A. Shutemov 2016-03-17 @165  	__VM_ARCH_SPECIFIC_2				,		\
420adbe9f Vlastimil Babka    2016-03-15  166  	{VM_DONTDUMP,			"dontdump"	},		\
420adbe9f Vlastimil Babka    2016-03-15  167  IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
420adbe9f Vlastimil Babka    2016-03-15  168  	{VM_MIXEDMAP,			"mixedmap"	},		\
420adbe9f Vlastimil Babka    2016-03-15  169  	{VM_HUGEPAGE,			"hugepage"	},		\
420adbe9f Vlastimil Babka    2016-03-15  170  	{VM_NOHUGEPAGE,			"nohugepage"	},		\
420adbe9f Vlastimil Babka    2016-03-15  171  	{VM_MERGEABLE,			"mergeable"	}		\
420adbe9f Vlastimil Babka    2016-03-15  172  

:::::: The code at line 131 was first introduced by commit
:::::: bcf6691797f425b301f629bb783b7ff2d0bcfa5a mm, tracing: refresh __def_vmaflag_names

:::::: TO: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50920 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-04 23:09   ` Mike Kravetz
@ 2017-08-05 14:05     ` Rik van Riel
  0 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2017-08-05 14:05 UTC (permalink / raw)
  To: Mike Kravetz, linux-kernel
  Cc: linux-mm, fweimer, colm, akpm, rppt, keescook, luto, wad, mingo

On Fri, 2017-08-04 at 16:09 -0700, Mike Kravetz wrote:
> On 08/04/2017 12:07 PM, riel@redhat.com wrote:
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> > empty in the child process after fork. This differs from
> > MADV_DONTFORK
> > in one important way.
> > 
> > If a child process accesses memory that was MADV_WIPEONFORK, it
> > will get zeroes. The address ranges are still valid, they are just
> > empty.
> > 
> This didn't seem 'quite right' to me for shared mappings and/or file
> backed mappings.  I wasn't exactly sure what it 'should' do in such
> cases.  So, I tried it with a mapping created as follows:
> 
> addr = mmap(ADDR, page_size,
>                         PROT_READ | PROT_WRITE,
>                         MAP_ANONYMOUS|MAP_SHARED, -1, 0);

Your test program is pretty much the same I used, except I
used MAP_PRIVATE instead of MAP_SHARED.

Let me see how the code paths differ for both cases...


> When setting MADV_WIPEONFORK on the vma/mapping, I got the following
> at task exit time:
> 
> [  694.558290] ------------[ cut here ]------------
> [  694.558978] kernel BUG at mm/filemap.c:212!
> [  694.559476] invalid opcode: 0000 [#1] SMP
> [  694.560023] Modules linked in: ip6t_REJECT nf_reject_ipv6
> ip6t_rpfilter xt_conntrack ebtable_broute bridge stp llc ebtable_nat
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
> ip6table_raw ip6table_mangle ip6table_security iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
> iptable_raw iptable_mangle 9p iptable_security ebtable_filter
> ebtables ip6table_filter ip6_tables snd_hda_codec_generic
> snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq ppdev
> snd_seq_device joydev crct10dif_pclmul crc32_pclmul crc32c_intel
> snd_pcm ghash_clmulni_intel 9pnet_virtio virtio_balloon snd_timer
> 9pnet parport_pc snd parport i2c_piix4 soundcore nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc virtio_net virtio_blk virtio_console
> 8139too qxl drm_kms_helper ttm drm serio_raw 8139cp
> [  694.571554]  mii virtio_pci ata_generic virtio_ring virtio
> pata_acpi
> [  694.572608] CPU: 3 PID: 1200 Comm: test_wipe2 Not tainted 4.13.0-
> rc3+ #8
> [  694.573778] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.1-1.fc24 04/01/2014
> [  694.574917] task: ffff880137178040 task.stack: ffffc900019d4000
> [  694.575650] RIP: 0010:__delete_from_page_cache+0x344/0x410
> [  694.576409] RSP: 0018:ffffc900019d7a88 EFLAGS: 00010082
> [  694.577238] RAX: 0000000000000021 RBX: ffffea00047d0e00 RCX:
> 0000000000000006
> [  694.578537] RDX: 0000000000000000 RSI: 0000000000000096 RDI:
> ffff88023fd0db90
> [  694.579774] RBP: ffffc900019d7ad8 R08: 00000000000882b6 R09:
> 000000000000028a
> [  694.580754] R10: ffffc900019d7da8 R11: ffffffff8211184d R12:
> ffffea00047d0e00
> [  694.582040] R13: 0000000000000000 R14: 0000000000000202 R15:
> ffff8801384439e8
> [  694.583236] FS:  0000000000000000(0000) GS:ffff88023fd00000(0000)
> knlGS:0000000000000000
> [  694.584607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  694.585409] CR2: 00007ff77a8da618 CR3: 0000000001e09000 CR4:
> 00000000001406e0
> [  694.586547] Call Trace:
> [  694.586996]  delete_from_page_cache+0x54/0x110
> [  694.587481]  truncate_inode_page+0xab/0x120
> [  694.588110]  shmem_undo_range+0x498/0xa50
> [  694.588813]  ? save_stack_trace+0x1b/0x20
> [  694.589529]  ? set_track+0x70/0x140
> [  694.590150]  ? init_object+0x69/0xa0
> [  694.590722]  ? __inode_wait_for_writeback+0x73/0xe0
> [  694.591525]  shmem_truncate_range+0x16/0x40
> [  694.592268]  shmem_evict_inode+0xb1/0x190
> [  694.592735]  evict+0xbb/0x1c0
> [  694.593147]  iput+0x1c0/0x210
> [  694.593497]  dentry_unlink_inode+0xb4/0x150
> [  694.593982]  __dentry_kill+0xc1/0x150
> [  694.594400]  dput+0x1c8/0x1e0
> [  694.594745]  __fput+0x172/0x1e0
> [  694.595103]  ____fput+0xe/0x10
> [  694.595463]  task_work_run+0x80/0xa0
> [  694.595886]  do_exit+0x2d6/0xb50
> [  694.596323]  ? __do_page_fault+0x288/0x4a0
> [  694.596818]  do_group_exit+0x47/0xb0
> [  694.597249]  SyS_exit_group+0x14/0x20
> [  694.597682]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [  694.598198] RIP: 0033:0x7ff77a5e78c8
> [  694.598612] RSP: 002b:00007ffc5aece318 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000e7
> [  694.599804] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff77a5e78c8
> [  694.600609] RDX: 0000000000000000 RSI: 000000000000003c RDI:
> 0000000000000000
> [  694.601424] RBP: 00007ff77a8da618 R08: 00000000000000e7 R09:
> ffffffffffffff98
> [  694.602224] R10: 0000000000000003 R11: 0000000000000246 R12:
> 0000000000000001
> [  694.603151] R13: 00007ff77a8dbc60 R14: 0000000000000000 R15:
> 0000000000000000
> [  694.603984] Code: 60 f3 c5 81 e8 2e 7e 03 00 0f 0b 48 c7 c6 60 f3
> c5 81 4c 89 e7 e8 1d 7e 03 00 0f 0b 48 c7 c6 00 f4 c5 81 4c 89 e7 e8
> 0c 7e 03 00 <0f> 0b 48 c7 c6 38 f3 c5 81 4c 89 e7 e8 fb 7d 03 00 0f
> 0b 48 c7 
> [  694.606500] RIP: __delete_from_page_cache+0x344/0x410 RSP:
> ffffc900019d7a88
> [  694.607426] ---[ end trace 55e6b04ae95d8ce3 ]---
> 
> BTW, this was on 4.13.0-rc3 + your patches.  Simple test program is
> below.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-04 19:07 ` [PATCH 2/2] mm,fork: " riel
@ 2017-08-04 23:09   ` Mike Kravetz
  2017-08-05 14:05     ` Rik van Riel
  2017-08-14 15:45   ` kbuild test robot
  1 sibling, 1 reply; 58+ messages in thread
From: Mike Kravetz @ 2017-08-04 23:09 UTC (permalink / raw)
  To: riel, linux-kernel
  Cc: linux-mm, fweimer, colm, akpm, rppt, keescook, luto, wad, mingo

On 08/04/2017 12:07 PM, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Introduce MADV_WIPEONFORK semantics, which result in a VMA being
> empty in the child process after fork. This differs from MADV_DONTFORK
> in one important way.
> 
> If a child process accesses memory that was MADV_WIPEONFORK, it
> will get zeroes. The address ranges are still valid, they are just empty.
> 
> If a child process accesses memory that was MADV_DONTFORK, it will
> get a segmentation fault, since those address ranges are no longer
> valid in the child after fork.
> 
> Since MADV_DONTFORK also seems to be used to allow very large
> programs to fork in systems with strict memory overcommit restrictions,
> changing the semantics of MADV_DONTFORK might break existing programs.
> 
> The use case is libraries that store or cache information, and
> want to know that they need to regenerate it in the child process
> after fork.
> 
> Examples of this would be:
> - systemd/pulseaudio API checks (fail after fork)
>   (replacing a getpid check, which is too slow without a PID cache)
> - PKCS#11 API reinitialization check (mandated by specification)
> - glibc's upcoming PRNG (reseed after fork)
> - OpenSSL PRNG (reseed after fork)
> 
> The security benefits of a forking server having a re-inialized
> PRNG in every child process are pretty obvious. However, due to
> libraries having all kinds of internal state, and programs getting
> compiled with many different versions of each library, it is
> unreasonable to expect calling programs to re-initialize everything
> manually after fork.
> 
> A further complication is the proliferation of clone flags,
> programs bypassing glibc's functions to call clone directly,
> and programs calling unshare, causing the glibc pthread_atfork
> hook to not get called.
> 
> It would be better to have the kernel take care of this automatically.
> 
> This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
> 
>     https://man.openbsd.org/minherit.2
> 
> Reported-by: Florian Weimer <fweimer@redhat.com>
> Reported-by: Colm MacCártaigh <colm@allcosts.net>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  arch/alpha/include/uapi/asm/mman.h     |  3 +++
>  arch/mips/include/uapi/asm/mman.h      |  3 +++
>  arch/parisc/include/uapi/asm/mman.h    |  3 +++
>  arch/xtensa/include/uapi/asm/mman.h    |  3 +++
>  fs/proc/task_mmu.c                     |  1 +
>  include/linux/mm.h                     |  2 +-
>  include/uapi/asm-generic/mman-common.h |  3 +++
>  kernel/fork.c                          |  8 ++++++--
>  mm/madvise.c                           |  8 ++++++++
>  mm/memory.c                            | 10 ++++++++++
>  10 files changed, 41 insertions(+), 3 deletions(-)
> 

This didn't seem 'quite right' to me for shared mappings and/or file
backed mappings.  I wasn't exactly sure what it 'should' do in such
cases.  So, I tried it with a mapping created as follows:

addr = mmap(ADDR, page_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS|MAP_SHARED, -1, 0);

When setting MADV_WIPEONFORK on the vma/mapping, I got the following
at task exit time:

[  694.558290] ------------[ cut here ]------------
[  694.558978] kernel BUG at mm/filemap.c:212!
[  694.559476] invalid opcode: 0000 [#1] SMP
[  694.560023] Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack ebtable_broute bridge stp llc ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_mangle ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle 9p iptable_security ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq ppdev snd_seq_device joydev crct10dif_pclmul crc32_pclmul crc32c_intel snd_pcm ghash_clmulni_intel 9pnet_virtio virtio_balloon snd_timer 9pnet parport_pc snd parport i2c_piix4 soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net virtio_blk virtio_console 8139too qxl drm_kms_helper ttm drm serio_raw 8139cp
[  694.571554]  mii virtio_pci ata_generic virtio_ring virtio pata_acpi
[  694.572608] CPU: 3 PID: 1200 Comm: test_wipe2 Not tainted 4.13.0-rc3+ #8
[  694.573778] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[  694.574917] task: ffff880137178040 task.stack: ffffc900019d4000
[  694.575650] RIP: 0010:__delete_from_page_cache+0x344/0x410
[  694.576409] RSP: 0018:ffffc900019d7a88 EFLAGS: 00010082
[  694.577238] RAX: 0000000000000021 RBX: ffffea00047d0e00 RCX: 0000000000000006
[  694.578537] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff88023fd0db90
[  694.579774] RBP: ffffc900019d7ad8 R08: 00000000000882b6 R09: 000000000000028a
[  694.580754] R10: ffffc900019d7da8 R11: ffffffff8211184d R12: ffffea00047d0e00
[  694.582040] R13: 0000000000000000 R14: 0000000000000202 R15: ffff8801384439e8
[  694.583236] FS:  0000000000000000(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[  694.584607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  694.585409] CR2: 00007ff77a8da618 CR3: 0000000001e09000 CR4: 00000000001406e0
[  694.586547] Call Trace:
[  694.586996]  delete_from_page_cache+0x54/0x110
[  694.587481]  truncate_inode_page+0xab/0x120
[  694.588110]  shmem_undo_range+0x498/0xa50
[  694.588813]  ? save_stack_trace+0x1b/0x20
[  694.589529]  ? set_track+0x70/0x140
[  694.590150]  ? init_object+0x69/0xa0
[  694.590722]  ? __inode_wait_for_writeback+0x73/0xe0
[  694.591525]  shmem_truncate_range+0x16/0x40
[  694.592268]  shmem_evict_inode+0xb1/0x190
[  694.592735]  evict+0xbb/0x1c0
[  694.593147]  iput+0x1c0/0x210
[  694.593497]  dentry_unlink_inode+0xb4/0x150
[  694.593982]  __dentry_kill+0xc1/0x150
[  694.594400]  dput+0x1c8/0x1e0
[  694.594745]  __fput+0x172/0x1e0
[  694.595103]  ____fput+0xe/0x10
[  694.595463]  task_work_run+0x80/0xa0
[  694.595886]  do_exit+0x2d6/0xb50
[  694.596323]  ? __do_page_fault+0x288/0x4a0
[  694.596818]  do_group_exit+0x47/0xb0
[  694.597249]  SyS_exit_group+0x14/0x20
[  694.597682]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[  694.598198] RIP: 0033:0x7ff77a5e78c8
[  694.598612] RSP: 002b:00007ffc5aece318 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  694.599804] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff77a5e78c8
[  694.600609] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  694.601424] RBP: 00007ff77a8da618 R08: 00000000000000e7 R09: ffffffffffffff98
[  694.602224] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
[  694.603151] R13: 00007ff77a8dbc60 R14: 0000000000000000 R15: 0000000000000000
[  694.603984] Code: 60 f3 c5 81 e8 2e 7e 03 00 0f 0b 48 c7 c6 60 f3 c5 81 4c 89 e7 e8 1d 7e 03 00 0f 0b 48 c7 c6 00 f4 c5 81 4c 89 e7 e8 0c 7e 03 00 <0f> 0b 48 c7 c6 38 f3 c5 81 4c 89 e7 e8 fb 7d 03 00 0f 0b 48 c7 
[  694.606500] RIP: __delete_from_page_cache+0x344/0x410 RSP: ffffc900019d7a88
[  694.607426] ---[ end trace 55e6b04ae95d8ce3 ]---

BTW, this was on 4.13.0-rc3 + your patches.  Simple test program is below.

-- 
Mike Kravetz


#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <errno.h>

#define MADV_WIPEONFORK 18
#define ADDR (void *)(0x0UL)

int main(int argc, char ** argv)
{
	unsigned long page_size;
	int ret;
	void *addr;
	char foo;

	page_size = sysconf(_SC_PAGE_SIZE);

	addr = mmap(ADDR, page_size,
			PROT_READ | PROT_WRITE,
			MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit (1);
	}

	printf("Parent writing 'a' to page\n");
	*((char *)addr) = 'a'; 

	ret = madvise(addr, page_size, MADV_WIPEONFORK);
	if (ret) {
		perror("madvise");
		exit (1);
	}

	if (fork()) {
		/* In parent */
		sleep(1);
	} else {
		/* In child */
		foo = *((char *)addr);
		printf("child read '%c' from page\n", foo);
	}

	return ret;
}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-04 19:07 [PATCH 0/2] mm,fork,security: " riel
@ 2017-08-04 19:07 ` riel
  2017-08-04 23:09   ` Mike Kravetz
  2017-08-14 15:45   ` kbuild test robot
  0 siblings, 2 replies; 58+ messages in thread
From: riel @ 2017-08-04 19:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, fweimer, colm, akpm, rppt, keescook, luto, wad, mingo

From: Rik van Riel <riel@redhat.com>

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 +++
 arch/mips/include/uapi/asm/mman.h      |  3 +++
 arch/parisc/include/uapi/asm/mman.h    |  3 +++
 arch/xtensa/include/uapi/asm/mman.h    |  3 +++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  2 +-
 include/uapi/asm-generic/mman-common.h |  3 +++
 kernel/fork.c                          |  8 ++++++--
 mm/madvise.c                           |  8 ++++++++
 mm/memory.c                            | 10 ++++++++++
 10 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 02760f6e6ca4..2a708a792882 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -64,6 +64,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 655e2fb5395b..d59c57d60d7d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -91,6 +91,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 5979745815a5..e205e0179642 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 24365b30aae9..ed23e0a1b30d 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..2591e70216ff 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_NORESERVE)]	= "nr",
 		[ilog2(VM_HUGETLB)]	= "ht",
 		[ilog2(VM_ARCH_1)]	= "ar",
+		[ilog2(VM_WIPEONFORK)]	= "wf",
 		[ilog2(VM_DONTDUMP)]	= "dd",
 #ifdef CONFIG_MEM_SOFT_DIRTY
 		[ilog2(VM_SOFTDIRTY)]	= "sd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7550eeb06ccf..58788c1b9e9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
-#define VM_ARCH_2	0x02000000
+#define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..49e2b1d78093 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -58,6 +58,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..2dd0d0cae3bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -628,7 +628,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 
 	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
-		struct file *file;
+		struct file *file = NULL;
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
@@ -658,7 +658,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		tmp->vm_next = tmp->vm_prev = NULL;
-		file = tmp->vm_file;
+
+		/* With VM_WIPEONFORK, the child gets an empty VMA. */
+		if (!(tmp->vm_flags & VM_WIPEONFORK))
+			file = tmp->vm_file;
+
 		if (file) {
 			struct inode *inode = file_inode(file);
 			struct address_space *mapping = file->f_mapping;
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..9e644c0ed4dc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -80,6 +80,12 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		}
 		new_flags &= ~VM_DONTCOPY;
 		break;
+	case MADV_WIPEONFORK:
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -689,6 +695,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_WIPEONFORK:
+	case MADV_KEEPONFORK:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..f9b0ad7feb57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			!vma->anon_vma)
 		return 0;
 
+	/*
+	 * With VM_WIPEONFORK, the child inherits the VMA from the
+	 * parent, but not its contents.
+	 *
+	 * A child accessing VM_WIPEONFORK memory will see all zeroes;
+	 * a child accessing VM_DONTCOPY memory receives a segfault.
+	 */
+	if (vma->vm_flags & VM_WIPEONFORK)
+		return 0;
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK
  2017-08-04 19:01 [PATCH 0/2] mm,fork: MADV_WIPEONFORK - an empty VMA in the child riel
@ 2017-08-04 19:01 ` riel
  2017-08-05 18:46   ` kbuild test robot
  2017-08-05 19:33   ` kbuild test robot
  0 siblings, 2 replies; 58+ messages in thread
From: riel @ 2017-08-04 19:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, fweimer, colm, akpm, rppt, keescook, luto, wad, mingo

From: Rik van Riel <riel@redhat.com>

Introduce MADV_WIPEONFORK semantics, which result in a VMA being
empty in the child process after fork. This differs from MADV_DONTFORK
in one important way.

If a child process accesses memory that was MADV_WIPEONFORK, it
will get zeroes. The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will
get a segmentation fault, since those address ranges are no longer
valid in the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large
programs to fork in systems with strict memory overcommit restrictions,
changing the semantics of MADV_DONTFORK might break existing programs.

The use case is libraries that store or cache information, and
want to know that they need to regenerate it in the child process
after fork.

Examples of this would be:
- systemd/pulseaudio API checks (fail after fork)
  (replacing a getpid check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized
PRNG in every child process are pretty obvious. However, due to
libraries having all kinds of internal state, and programs getting
compiled with many different versions of each library, it is
unreasonable to expect calling programs to re-initialize everything
manually after fork.

A further complication is the proliferation of clone flags,
programs bypassing glibc's functions to call clone directly,
and programs calling unshare, causing the glibc pthread_atfork
hook to not get called.

It would be better to have the kernel take care of this automatically.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 +++
 arch/mips/include/uapi/asm/mman.h      |  3 +++
 arch/parisc/include/uapi/asm/mman.h    |  3 +++
 arch/xtensa/include/uapi/asm/mman.h    |  3 +++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  2 +-
 include/uapi/asm-generic/mman-common.h |  3 +++
 kernel/fork.c                          |  8 ++++++--
 mm/madvise.c                           |  8 ++++++++
 mm/memory.c                            | 10 ++++++++++
 10 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 02760f6e6ca4..2a708a792882 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -64,6 +64,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 655e2fb5395b..d59c57d60d7d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -91,6 +91,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 5979745815a5..e205e0179642 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 24365b30aae9..ed23e0a1b30d 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -103,6 +103,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..2591e70216ff 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_NORESERVE)]	= "nr",
 		[ilog2(VM_HUGETLB)]	= "ht",
 		[ilog2(VM_ARCH_1)]	= "ar",
+		[ilog2(VM_WIPEONFORK)]	= "wf",
 		[ilog2(VM_DONTDUMP)]	= "dd",
 #ifdef CONFIG_MEM_SOFT_DIRTY
 		[ilog2(VM_SOFTDIRTY)]	= "sd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7550eeb06ccf..58788c1b9e9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,7 +189,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
-#define VM_ARCH_2	0x02000000
+#define VM_WIPEONFORK	0x02000000	/* Wipe VMA contents in child. */
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..49e2b1d78093 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -58,6 +58,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+#define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 17921b0390b4..2dd0d0cae3bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -628,7 +628,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 
 	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
-		struct file *file;
+		struct file *file = NULL;
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
@@ -658,7 +658,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		tmp->vm_next = tmp->vm_prev = NULL;
-		file = tmp->vm_file;
+
+		/* With VM_WIPEONFORK, the child gets an empty VMA. */
+		if (!(tmp->vm_flags & VM_WIPEONFORK))
+			file = tmp->vm_file;
+
 		if (file) {
 			struct inode *inode = file_inode(file);
 			struct address_space *mapping = file->f_mapping;
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..9e644c0ed4dc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -80,6 +80,12 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		}
 		new_flags &= ~VM_DONTCOPY;
 		break;
+	case MADV_WIPEONFORK:
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -689,6 +695,8 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_WIPEONFORK:
+	case MADV_KEEPONFORK:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..f9b0ad7feb57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1134,6 +1134,16 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			!vma->anon_vma)
 		return 0;
 
+	/*
+	 * With VM_WIPEONFORK, the child inherits the VMA from the
+	 * parent, but not its contents.
+	 *
+	 * A child accessing VM_WIPEONFORK memory will see all zeroes;
+	 * a child accessing VM_DONTCOPY memory receives a segfault.
+	 */
+	if (vma->vm_flags & VM_WIPEONFORK)
+		return 0;
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2017-08-19  0:02 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-06 14:04 [PATCH v2 0/2] mm,fork,security: introduce MADV_WIPEONFORK riel
2017-08-06 14:04 ` [PATCH 1/2] x86,mpx: make mpx depend on x86-64 to free up VMA flag riel
2017-08-06 14:04 ` [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK riel
2017-08-10 15:23   ` Michal Hocko
2017-08-11 15:23     ` Rik van Riel
2017-08-11 16:36       ` Mike Kravetz
2017-08-11 16:59         ` Rik van Riel
2017-08-11 17:07           ` Mike Kravetz
2017-08-07 13:22 ` [PATCH v2 0/2] mm,fork,security: " Michal Hocko
2017-08-07 13:46   ` Michal Hocko
2017-08-07 14:19     ` Florian Weimer
2017-08-10 13:06       ` Michal Hocko
2017-08-07 14:59     ` Rik van Riel
2017-08-09  9:59       ` Kirill A. Shutemov
2017-08-09 12:31         ` Rik van Riel
2017-08-09 12:42         ` Florian Weimer
2017-08-10 13:05       ` Michal Hocko
2017-08-10 13:23         ` Colm MacCárthaigh
2017-08-10 15:36           ` Michal Hocko
     [not found]             ` <CAAF6GDeno6RpHf1KORVSxUL7M-CQfbWFFdyKK8LAWd_6PcJ55Q@mail.gmail.com>
2017-08-10 17:01               ` Michal Hocko
2017-08-10 22:09                 ` Colm MacCárthaigh
2017-08-11 14:06                   ` Michal Hocko
2017-08-11 14:11                     ` Florian Weimer
2017-08-11 14:24                       ` Michal Hocko
2017-08-11 15:24                         ` Florian Weimer
2017-08-11 15:31                           ` Michal Hocko
     [not found]     ` <CAAF6GDcNoDUaDSxV6N12A_bOzo8phRUX5b8-OBteuN0AmeCv0g@mail.gmail.com>
2017-08-07 16:02       ` Colm MacCárthaigh
2017-08-10 13:21       ` Michal Hocko
2017-08-10 14:11         ` Michal Hocko
2017-08-07 18:23 ` Mike Kravetz
2017-08-08  9:58   ` Florian Weimer
2017-08-08 13:15     ` Rik van Riel
2017-08-08 15:19       ` Mike Kravetz
2017-08-08 15:22         ` Florian Weimer
2017-08-08 15:46         ` Rik van Riel
2017-08-08 16:48           ` Colm MacCárthaigh
2017-08-08 16:52           ` Matthew Wilcox
2017-08-08 18:45             ` Rik van Riel
2017-08-10 15:31               ` Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2017-08-11 21:28 [PATCH v4 " riel
2017-08-11 21:28 ` [PATCH 2/2] mm,fork: " riel
2017-08-15 22:51   ` Andrew Morton
2017-08-16  2:18     ` Rik van Riel
2017-08-17 22:50       ` Andrew Morton
2017-08-18 16:28         ` Rik van Riel
2017-08-18 18:15           ` Andrew Morton
2017-08-19  0:02             ` Rik van Riel
2017-08-18 17:25   ` Mike Kravetz
2017-08-11 19:19 [PATCH v3 0/2] mm,fork,security: " riel
2017-08-11 19:19 ` [PATCH 2/2] mm,fork: " riel
2017-08-11 19:42   ` Linus Torvalds
2017-08-11 20:27     ` Rik van Riel
2017-08-11 20:50       ` Linus Torvalds
2017-08-04 19:07 [PATCH 0/2] mm,fork,security: " riel
2017-08-04 19:07 ` [PATCH 2/2] mm,fork: " riel
2017-08-04 23:09   ` Mike Kravetz
2017-08-05 14:05     ` Rik van Riel
2017-08-14 15:45   ` kbuild test robot
2017-08-04 19:01 [PATCH 0/2] mm,fork: MADV_WIPEONFORK - an empty VMA in the child riel
2017-08-04 19:01 ` [PATCH 2/2] mm,fork: introduce MADV_WIPEONFORK riel
2017-08-05 18:46   ` kbuild test robot
2017-08-05 19:33   ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).