Linux-api Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
@ 2017-12-13  9:25 Michal Hocko
  2017-12-13  9:25 ` [PATCH 1/2] " Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-13  9:25 UTC (permalink / raw)
  To: linux-api
  Cc: Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, Michal Hocko


Hi,
I am resending with some minor updates based on Michael's review and
ask for inclusion. There haven't been any fundamental objections for
the RFC [1] nor the previous version [2].  The biggest discussion
revolved around the naming. There were many suggestions flowing
around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
MAP_FIXED_NOREPLACE etc...

I am afraid we can bikeshed this to death and there will still be
somebody finding yet another better name. Therefore I've decided to
stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
MAP_FIXED prefix which should be recognized by developers and _SAFE
suffix should also be clear that all dangerous side effects of the old
MAP_FIXED are gone.

If somebody _really_ hates this then feel free to nack and resubmit
with a different name you can find a consensus for. I am sorry to be
stubborn here but I would rather have this merged than go over few more
iterations changing the name just because it seems like a good idea
now. My experience tells me that chances are that the name will turn out
to be "suboptimal" anyway over time.

Some more background:
This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g. stack). The
reason for the failure is that some architectures enforce an alignment
for the given address hint without MAP_FIXED used (e.g. for shared or
file backed mappings).

One way around this would be excluding those archs which do alignment
tricks from the hardening [6]. The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution. We basically want a non-destructive MAP_FIXED.

The first patch introduced MAP_FIXED_SAFE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one. The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility. We really want a never-clobber semantic even on older
kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
flags evaluation because we do not EINVAL on unknown flags. On those
kernels we would simply use the traditional hint based semantic so the
caller can still get a different address (which sucks) but at least not
silently corrupt an existing mapping. I do not see a good way around
that. Except we won't export expose the new semantic to the userspace at
all. 

It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]

Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints.  This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable.  We would like to change that and supply a random address in a
: window of the address space.  If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.

John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space.  "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
: 
: This is of course subject to races with other threads in the process.
: 
: Let's say it finds a region starting at va.
: 
: b) Next it does: 
:     p = mmap(va, ...) 
: 
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
: 
:     IMPROVEMENT: --> if instead, we could call this:
: 
:             p = mmap(va, ... MAP_FIXED_SAFE ...)
: 
:         , then we could skip the munmap() call upon failure. This
:         is a small thing, but it is useful here. (Thanks to Piotr
:         Jaroszynski and Mark Hairgrove for helping me get that detail
:         exactly right, btw.)
: 
: c) After that, CUDA suballocates from p, via: 
:  
:      q = mmap(sub_region_start, ... MAP_FIXED ...)
: 
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.

Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.

The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
follow. Actually real MAP_FIXED usages should be docummented properly
and they should be more of an exception.

Diffstat says
 arch/alpha/include/uapi/asm/mman.h     |  1 +
 arch/metag/kernel/process.c            |  6 +++++-
 arch/mips/include/uapi/asm/mman.h      |  2 ++
 arch/parisc/include/uapi/asm/mman.h    |  2 ++
 arch/sparc/include/uapi/asm/mman.h     |  1 -
 arch/xtensa/include/uapi/asm/mman.h    |  2 ++
 fs/binfmt_elf.c                        | 12 ++++++++----
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/mmap.c                              | 11 +++++++++++
 9 files changed, 32 insertions(+), 6 deletions(-)

[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/2] mm: introduce MAP_FIXED_SAFE
  2017-12-13  9:25 [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Michal Hocko
@ 2017-12-13  9:25 ` Michal Hocko
       [not found]   ` <20171213092550.2774-2-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-12-13  9:25 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-12-13  9:25 UTC (permalink / raw)
  To: linux-api
  Cc: Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

MAP_FIXED is used quite often to enforce mapping at the particular
range. The main problem of this flag is, however, that it is inherently
dangerous because it unmaps existing mappings covered by the requested
range. This can cause silent memory corruptions. Some of them even with
serious security implications. While the current semantic might be
really desiderable in many cases there are others which would want to
enforce the given range but rather see a failure than a silent memory
corruption on a clashing range. Please note that there is no guarantee
that a given range is obeyed by the mmap even when it is free - e.g.
arch specific code is allowed to apply an alignment.

Introduce a new MAP_FIXED_SAFE flag for mmap to achieve this behavior.
It has the same semantic as MAP_FIXED wrt. the given address request
with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping. We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.

The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility. We really want a
never-clobber semantic even on older kernels which do not recognize
the flag. Unfortunately mmap sucks wrt. flags evaluation because we do
not EINVAL on unknown flags. On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping. I do not see a good way around that.

Changes since v1
- define MAP_FIXED_SAFE in asm-generic/mman-common.h as per Michael
  Ellerman because all architecture which use this header can share
  the same value. This will leave us with only 4 arches which need
  special handling.

[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  1 +
 arch/mips/include/uapi/asm/mman.h      |  2 ++
 arch/parisc/include/uapi/asm/mman.h    |  2 ++
 arch/sparc/include/uapi/asm/mman.h     |  1 -
 arch/xtensa/include/uapi/asm/mman.h    |  2 ++
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/mmap.c                              | 11 +++++++++++
 7 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 6bf730063e3f..7287dbf1e11b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -31,6 +31,7 @@
 #define MAP_NONBLOCK	0x40000		/* do not block on IO */
 #define MAP_STACK	0x80000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x100000	/* create a huge page mapping */
+#define MAP_FIXED_SAFE	0x200000	/* MAP_FIXED which doesn't unmap underlying mapping */
 
 #define MS_ASYNC	1		/* sync memory asynchronously */
 #define MS_SYNC		2		/* synchronous memory sync */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 20c3df7a8fdd..f1e15890345c 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -50,6 +50,8 @@
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 
+#define MAP_FIXED_SAFE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+
 /*
  * Flags for msync
  */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index d1af0d74a188..daf0282ac417 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -26,6 +26,8 @@
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 
+#define MAP_FIXED_SAFE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+
 #define MS_SYNC		1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */
 #define MS_INVALIDATE	4		/* invalidate the caches */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index 715a2c927e79..d21bffd5d3dc 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -25,5 +25,4 @@
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
 
-
 #endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 2bfe590694fc..0daf199caa57 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -56,6 +56,7 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
+#define MAP_FIXED_SAFE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
 # define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
@@ -63,6 +64,7 @@
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
 
+
 /*
  * Flags for msync
  */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6d319c46fd90..1eca2cb10d44 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -25,6 +25,7 @@
 #else
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
+#define MAP_FIXED_SAFE	0x80000		/* MAP_FIXED which doesn't unmap underlying mapping */
 
 /*
  * Flags for mlock
diff --git a/mm/mmap.c b/mm/mmap.c
index 0de87a376aaa..447223a2e469 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1342,6 +1342,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (!(file && path_noexec(&file->f_path)))
 			prot |= PROT_EXEC;
 
+	/* force arch specific MAP_FIXED handling in get_unmapped_area */
+	if (flags & MAP_FIXED_SAFE)
+		flags |= MAP_FIXED;
+
 	if (!(flags & MAP_FIXED))
 		addr = round_hint_to_min(addr);
 
@@ -1365,6 +1369,13 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (offset_in_page(addr))
 		return addr;
 
+	if (flags & MAP_FIXED_SAFE) {
+		struct vm_area_struct *vma = find_vma(mm, addr);
+
+		if (vma && vma->vm_start <= addr)
+			return -EEXIST;
+	}
+
 	if (prot == PROT_EXEC) {
 		pkey = execute_only_pkey(mm);
 		if (pkey < 0)
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-12-13  9:25 [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Michal Hocko
  2017-12-13  9:25 ` [PATCH 1/2] " Michal Hocko
@ 2017-12-13  9:25 ` Michal Hocko
  2017-12-16  0:49   ` [2/2] " Andrei Vagin
  2017-12-13  9:31 ` [PATCH 1/2] mmap.2: document new MAP_FIXED_SAFE flag Michal Hocko
       [not found] ` <20171213092550.2774-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  3 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-12-13  9:25 UTC (permalink / raw)
  To: linux-api
  Cc: Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Michal Hocko, Abdul Haleem, Joel Stanley, Kees Cook

From: Michal Hocko <mhocko@suse.com>

Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that. This is
however dangerous thing prone to silent data corruption which can be
even exploitable. Let's take CVE-2017-1000253 as an example. At the time
(before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
stack consumption early during execve fully stopped by da029c11e6b1
("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
safe and any attack should be impractical. On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous. Moreover it shouldn't be even needed. We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address
should succeed even without MAP_FIXED. Something is terribly wrong if
this is not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/metag/kernel/process.c |  6 +++++-
 fs/binfmt_elf.c             | 12 ++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
index 0909834c83a7..867c8d0a5fb4 100644
--- a/arch/metag/kernel/process.c
+++ b/arch/metag/kernel/process.c
@@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
 	tcm_tag = tcm_lookup_tag(addr);
 
 	if (tcm_tag != TCM_INVALID_TAG)
-		type &= ~MAP_FIXED;
+		type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
 
 	/*
 	* total_size is the size of the ELF (interpreter) image.
@@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
 	} else
 		map_addr = vm_mmap(filep, addr, size, prot, type, off);
 
+	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+				task_pid_nr(current), tsk->comm, (void*)addr);
+
 	if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
 		struct tcm_allocation *tcm;
 		unsigned long tcm_addr;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 73b01e474fdc..5916d45f64a7 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 	} else
 		map_addr = vm_mmap(filep, addr, size, prot, type, off);
 
+	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+				task_pid_nr(current), current->comm, (void*)addr);
+
 	return(map_addr);
 }
 
@@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 				elf_prot |= PROT_EXEC;
 			vaddr = eppnt->p_vaddr;
 			if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
-				elf_type |= MAP_FIXED;
+				elf_type |= MAP_FIXED_SAFE;
 			else if (no_base && interp_elf_ex->e_type == ET_DYN)
 				load_addr = -vaddr;
 
@@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		 * the ET_DYN load_addr calculations, proceed normally.
 		 */
 		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
-			elf_flags |= MAP_FIXED;
+			elf_flags |= MAP_FIXED_SAFE;
 		} else if (loc->elf_ex.e_type == ET_DYN) {
 			/*
 			 * This logic is run once for the first LOAD Program
@@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				load_bias = ELF_ET_DYN_BASE;
 				if (current->flags & PF_RANDOMIZE)
 					load_bias += arch_mmap_rnd();
-				elf_flags |= MAP_FIXED;
+				elf_flags |= MAP_FIXED_SAFE;
 			} else
 				load_bias = 0;
 
@@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
 			(eppnt->p_filesz +
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)),
 			PROT_READ | PROT_WRITE | PROT_EXEC,
-			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+			MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
 			(eppnt->p_offset -
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)));
 	if (error != ELF_PAGESTART(eppnt->p_vaddr))
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/2] mmap.2: document new MAP_FIXED_SAFE flag
  2017-12-13  9:25 [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Michal Hocko
  2017-12-13  9:25 ` [PATCH 1/2] " Michal Hocko
  2017-12-13  9:25 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
@ 2017-12-13  9:31 ` Michal Hocko
  2017-12-13  9:31   ` [PATCH 2/2] mmap.2: MAP_FIXED updated documentation Michal Hocko
       [not found] ` <20171213092550.2774-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  3 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-12-13  9:31 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-api, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

4.16+ kernels offer a new MAP_FIXED_SAFE flag which allows the caller to
atomicaly probe for a given address range.

[wording heavily updated by John Hubbard <jhubbard@nvidia.com>]
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 man2/mmap.2 | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/man2/mmap.2 b/man2/mmap.2
index a5a8eb47a263..02d391697ce6 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -227,6 +227,22 @@ in mind that the exact layout of a process' memory map is allowed to change
 significantly between kernel versions, C library versions, and operating system
 releases.
 .TP
+.BR MAP_FIXED_SAFE " (since Linux 4.16)"
+Similar to MAP_FIXED with respect to the
+.I
+addr
+enforcement, but different in that MAP_FIXED_SAFE never clobbers a pre-existing
+mapped range. If the requested range would collide with an existing
+mapping, then this call fails with
+.B EEXIST.
+This flag can therefore be used as a way to atomically (with respect to other
+threads) attempt to map an address range: one thread will succeed; all others
+will report failure. Please note that older kernels which do not recognize this
+flag will typically (upon detecting a collision with a pre-existing mapping)
+fall back to a "non-MAP_FIXED" type of behavior: they will return an address that
+is different than the requested one. Therefore, backward-compatible software
+should check the returned address against the requested address.
+.TP
 .B MAP_GROWSDOWN
 This flag is used for stacks.
 It indicates to the kernel virtual memory system that the mapping
@@ -451,6 +467,12 @@ is not a valid file descriptor (and
 .B MAP_ANONYMOUS
 was not set).
 .TP
+.B EEXIST
+range covered by
+.IR addr ,
+.IR length
+is clashing with an existing mapping.
+.TP
 .B EINVAL
 We don't like
 .IR addr ,
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13  9:31 ` [PATCH 1/2] mmap.2: document new MAP_FIXED_SAFE flag Michal Hocko
@ 2017-12-13  9:31   ` Michal Hocko
  2017-12-13 12:55     ` Pavel Machek
  2017-12-14  2:52     ` Jann Horn
  0 siblings, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-13  9:31 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-api, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Jann Horn, Mike Rapoport, Cyril Hrubis, Pavel Machek,
	Michal Hocko

From: John Hubbard <jhubbard@nvidia.com>

    -- Expand the documentation to discuss the hazards in
       enough detail to allow avoiding them.

    -- Mention the upcoming MAP_FIXED_SAFE flag.

    -- Enhance the alignment requirement slightly.

CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Jann Horn <jannh@google.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Cyril Hrubis <chrubis@suse.cz>
CC: Pavel Machek <pavel@ucw.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 man2/mmap.2 | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/man2/mmap.2 b/man2/mmap.2
index 02d391697ce6..cb8789daec2d 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -212,8 +212,9 @@ Don't interpret
 .I addr
 as a hint: place the mapping at exactly that address.
 .I addr
-must be a multiple of the page size.
-If the memory region specified by
+must be suitably aligned: for most architectures a multiple of page
+size is sufficient; however, some architectures may impose additional
+restrictions. If the memory region specified by
 .I addr
 and
 .I len
@@ -226,6 +227,33 @@ Software that aspires to be portable should use this option with care, keeping
 in mind that the exact layout of a process' memory map is allowed to change
 significantly between kernel versions, C library versions, and operating system
 releases.
+.IP
+Furthermore, this option is extremely hazardous (when used on its own), because
+it forcibly removes pre-existing mappings, making it easy for a multi-threaded
+process to corrupt its own address space.
+.IP
+For example, thread A looks through
+.I /proc/<pid>/maps
+and locates an available
+address range, while thread B simultaneously acquires part or all of that same
+address range. Thread A then calls mmap(MAP_FIXED), effectively overwriting
+the mapping that thread B created.
+.IP
+Thread B need not create a mapping directly; simply making a library call
+that, internally, uses
+.I dlopen(3)
+to load some other shared library, will
+suffice. The dlopen(3) call will map the library into the process's address
+space. Furthermore, almost any library call may be implemented using this
+technique.
+Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
+(http://www.linux-pam.org).
+.IP
+Newer kernels
+(Linux 4.16 and later) have a
+.B MAP_FIXED_SAFE
+option that avoids the corruption problem; if available, MAP_FIXED_SAFE
+should be preferred over MAP_FIXED.
 .TP
 .BR MAP_FIXED_SAFE " (since Linux 4.16)"
 Similar to MAP_FIXED with respect to the
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
       [not found] ` <20171213092550.2774-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2017-12-13 12:25   ` Matthew Wilcox
  2017-12-13 12:34     ` Michal Hocko
  2017-12-13 17:13   ` Kees Cook
  2017-12-14  0:32   ` Andrew Morton
  2 siblings, 1 reply; 48+ messages in thread
From: Matthew Wilcox @ 2017-12-13 12:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Florian Weimer, John Hubbard,
	Abdul Haleem, Joel Stanley, Kees Cook, Michal Hocko

On Wed, Dec 13, 2017 at 10:25:48AM +0100, Michal Hocko wrote:
> I am afraid we can bikeshed this to death and there will still be
> somebody finding yet another better name. Therefore I've decided to
> stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> MAP_FIXED prefix which should be recognized by developers and _SAFE
> suffix should also be clear that all dangerous side effects of the old
> MAP_FIXED are gone.

I liked basically every other name suggested more than MAP_FIXED_SAFE.
"Safe against what?" was an important question.

MAP_AT_ADDR was the best suggestion I saw that wasn't one of mine.  Of
my suggestions, I liked MAP_STATIC the best.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-13 12:25   ` [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Matthew Wilcox
@ 2017-12-13 12:34     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-13 12:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-api, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Abdul Haleem,
	Joel Stanley, Kees Cook

On Wed 13-12-17 04:25:33, Matthew Wilcox wrote:
> On Wed, Dec 13, 2017 at 10:25:48AM +0100, Michal Hocko wrote:
> > I am afraid we can bikeshed this to death and there will still be
> > somebody finding yet another better name. Therefore I've decided to
> > stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> > MAP_FIXED prefix which should be recognized by developers and _SAFE
> > suffix should also be clear that all dangerous side effects of the old
> > MAP_FIXED are gone.
> 
> I liked basically every other name suggested more than MAP_FIXED_SAFE.
> "Safe against what?" was an important question.
> 
> MAP_AT_ADDR was the best suggestion I saw that wasn't one of mine.  Of
> my suggestions, I liked MAP_STATIC the best.

The question is whether you care enough to pursue this further yourself.
Because as I've said I do not want to spend another round discussing the
name. The flag is documented and I believe that the name has some merit.
Disagreeing on naming is the easiest pitfall to block otherwise useful
functionality from being merged. And I am pretty sure there will be
always somebody objecting...
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/2] mm: introduce MAP_FIXED_SAFE
       [not found]   ` <20171213092550.2774-2-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2017-12-13 12:50     ` Matthew Wilcox
  2017-12-13 13:01       ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Wilcox @ 2017-12-13 12:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Florian Weimer, John Hubbard,
	Michal Hocko

On Wed, Dec 13, 2017 at 10:25:49AM +0100, Michal Hocko wrote:
> +++ b/mm/mmap.c
> @@ -1342,6 +1342,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		if (!(file && path_noexec(&file->f_path)))
>  			prot |= PROT_EXEC;
>  
> +	/* force arch specific MAP_FIXED handling in get_unmapped_area */
> +	if (flags & MAP_FIXED_SAFE)
> +		flags |= MAP_FIXED;
> +
>  	if (!(flags & MAP_FIXED))
>  		addr = round_hint_to_min(addr);
>  

We're up to 22 MAP_ flags now.  We'll run out soon.  Let's preserve half
of a flag by giving userspace the definition:

#define MAP_FIXED_SAFE	(MAP_FIXED | _MAP_NOT_HINT)

then in here:

	if ((flags & _MAP_NOT_HINT) && !(flags & MAP_FIXED))
		return -EINVAL;

Now we can use _MAP_NOT_HINT all by itself in the future to mean
something else.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13  9:31   ` [PATCH 2/2] mmap.2: MAP_FIXED updated documentation Michal Hocko
@ 2017-12-13 12:55     ` Pavel Machek
  2017-12-13 13:03       ` Cyril Hrubis
  2017-12-13 13:04       ` Michal Hocko
  2017-12-14  2:52     ` Jann Horn
  1 sibling, 2 replies; 48+ messages in thread
From: Pavel Machek @ 2017-12-13 12:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Kerrisk, linux-api, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport, Cyril Hrubis,
	Michal Hocko


[-- Attachment #1: Type: text/plain, Size: 539 bytes --]

On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
>     -- Expand the documentation to discuss the hazards in
>        enough detail to allow avoiding them.
> 
>     -- Mention the upcoming MAP_FIXED_SAFE flag.

Pretty map everyone agreed MAP_FIXED_SAFE was a bad
name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.

								Pavel
								
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/2] mm: introduce MAP_FIXED_SAFE
  2017-12-13 12:50     ` Matthew Wilcox
@ 2017-12-13 13:01       ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-13 13:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-api, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard

On Wed 13-12-17 04:50:53, Matthew Wilcox wrote:
> On Wed, Dec 13, 2017 at 10:25:49AM +0100, Michal Hocko wrote:
> > +++ b/mm/mmap.c
> > @@ -1342,6 +1342,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> >  		if (!(file && path_noexec(&file->f_path)))
> >  			prot |= PROT_EXEC;
> >  
> > +	/* force arch specific MAP_FIXED handling in get_unmapped_area */
> > +	if (flags & MAP_FIXED_SAFE)
> > +		flags |= MAP_FIXED;
> > +
> >  	if (!(flags & MAP_FIXED))
> >  		addr = round_hint_to_min(addr);
> >  
> 
> We're up to 22 MAP_ flags now.  We'll run out soon.  Let's preserve half
> of a flag by giving userspace the definition:
> 
> #define MAP_FIXED_SAFE	(MAP_FIXED | _MAP_NOT_HINT)

I've already tried to explain why this cannot be a modifier for
MAP_FIXED. Read about the backward compatibility note...
Or do I misunderstand what you are saying here?
 
> then in here:
> 
> 	if ((flags & _MAP_NOT_HINT) && !(flags & MAP_FIXED))
> 		return -EINVAL;
> 
> Now we can use _MAP_NOT_HINT all by itself in the future to mean
> something else.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 12:55     ` Pavel Machek
@ 2017-12-13 13:03       ` Cyril Hrubis
  2017-12-13 13:04       ` Michal Hocko
  1 sibling, 0 replies; 48+ messages in thread
From: Cyril Hrubis @ 2017-12-13 13:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michal Hocko, Michael Kerrisk, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport, Michal Hocko

Hi!
> Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.

For what it's worth I do agree here.

-- 
Cyril Hrubis
chrubis-AlSwsSmVLrQ@public.gmane.org

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 12:55     ` Pavel Machek
  2017-12-13 13:03       ` Cyril Hrubis
@ 2017-12-13 13:04       ` Michal Hocko
  2017-12-13 13:09         ` Pavel Machek
  1 sibling, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-12-13 13:04 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michael Kerrisk, linux-api, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport, Cyril Hrubis

On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > From: John Hubbard <jhubbard@nvidia.com>
> > 
> >     -- Expand the documentation to discuss the hazards in
> >        enough detail to allow avoiding them.
> > 
> >     -- Mention the upcoming MAP_FIXED_SAFE flag.
> 
> Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.

Please read http://lkml.kernel.org/r/20171213092550.2774-1-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 13:04       ` Michal Hocko
@ 2017-12-13 13:09         ` Pavel Machek
  2017-12-13 13:16           ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Pavel Machek @ 2017-12-13 13:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Kerrisk, linux-api, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport, Cyril Hrubis


[-- Attachment #1: Type: text/plain, Size: 851 bytes --]

On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > From: John Hubbard <jhubbard@nvidia.com>
> > > 
> > >     -- Expand the documentation to discuss the hazards in
> > >        enough detail to allow avoiding them.
> > > 
> > >     -- Mention the upcoming MAP_FIXED_SAFE flag.
> > 
> > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
> 
> Please read http://lkml.kernel.org/r/20171213092550.2774-1-mhocko@kernel.org

Please fix your patches according to the feedback...

NACCKED-by: Pavel Machek <pavel@ucw.cz>

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 13:09         ` Pavel Machek
@ 2017-12-13 13:16           ` Michal Hocko
       [not found]             ` <20171213131640.GJ25185-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-12-13 13:16 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michael Kerrisk, linux-api, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport, Cyril Hrubis

On Wed 13-12-17 14:09:00, Pavel Machek wrote:
> On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> > On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > > From: John Hubbard <jhubbard@nvidia.com>
> > > > 
> > > >     -- Expand the documentation to discuss the hazards in
> > > >        enough detail to allow avoiding them.
> > > > 
> > > >     -- Mention the upcoming MAP_FIXED_SAFE flag.
> > > 
> > > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
> > 
> > Please read http://lkml.kernel.org/r/20171213092550.2774-1-mhocko@kernel.org
> 
> Please fix your patches according to the feedback...
> 
> NACCKED-by: Pavel Machek <pavel@ucw.cz>

Good luck pursuing this further then. I am not going to spend time on
naming bikeheds. I have more pressing stuff to work on.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
       [not found]             ` <20171213131640.GJ25185-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-12-13 13:21               ` Pavel Machek
  2017-12-13 13:35                 ` Michal Hocko
  2017-12-13 14:40                 ` Cyril Hrubis
  0 siblings, 2 replies; 48+ messages in thread
From: Pavel Machek @ 2017-12-13 13:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Kerrisk, linux-api-u79uwXL29TY76Z2rM5mHXA, Khalid Aziz,
	Michael Ellerman, Andrew Morton, Russell King - ARM Linux,
	Andrea Arcangeli, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport, Cyril Hrubis


[-- Attachment #1: Type: text/plain, Size: 1516 bytes --]

On Wed 2017-12-13 14:16:40, Michal Hocko wrote:
> On Wed 13-12-17 14:09:00, Pavel Machek wrote:
> > On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> > > On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > > > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > > > From: John Hubbard <jhubbard-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
> > > > > 
> > > > >     -- Expand the documentation to discuss the hazards in
> > > > >        enough detail to allow avoiding them.
> > > > > 
> > > > >     -- Mention the upcoming MAP_FIXED_SAFE flag.
> > > > 
> > > > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > > > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
> > > 
> > > Please read http://lkml.kernel.org/r/20171213092550.2774-1-mhocko@kernel.org
> > 
> > Please fix your patches according to the feedback...
> > 
> > NACCKED-by: Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org>
> 
> Good luck pursuing this further then. I am not going to spend time on
> naming bikeheds. I have more pressing stuff to work on.

You selected stupid name for a flag. Everyone and their dog agrees
with that. There's even consensus on better name (and everyone agrees
it is better than .._SAFE). Of course, we could have debate if it is
NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
poor naming on your part.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 13:21               ` Pavel Machek
@ 2017-12-13 13:35                 ` Michal Hocko
  2017-12-13 14:40                 ` Cyril Hrubis
  1 sibling, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-13 13:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michael Kerrisk, linux-api, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport, Cyril Hrubis

On Wed 13-12-17 14:21:05, Pavel Machek wrote:
> On Wed 2017-12-13 14:16:40, Michal Hocko wrote:
> > On Wed 13-12-17 14:09:00, Pavel Machek wrote:
> > > On Wed 2017-12-13 14:04:58, Michal Hocko wrote:
> > > > On Wed 13-12-17 13:55:40, Pavel Machek wrote:
> > > > > On Wed 2017-12-13 10:31:10, Michal Hocko wrote:
> > > > > > From: John Hubbard <jhubbard@nvidia.com>
> > > > > > 
> > > > > >     -- Expand the documentation to discuss the hazards in
> > > > > >        enough detail to allow avoiding them.
> > > > > > 
> > > > > >     -- Mention the upcoming MAP_FIXED_SAFE flag.
> > > > > 
> > > > > Pretty map everyone agreed MAP_FIXED_SAFE was a bad
> > > > > name. MAP_FIXED_NOREPLACE (IIRC) was best replacement.
> > > > 
> > > > Please read http://lkml.kernel.org/r/20171213092550.2774-1-mhocko@kernel.org
> > > 
> > > Please fix your patches according to the feedback...
> > > 
> > > NACCKED-by: Pavel Machek <pavel@ucw.cz>
> > 
> > Good luck pursuing this further then. I am not going to spend time on
> > naming bikeheds. I have more pressing stuff to work on.
> 
> You selected stupid name for a flag. Everyone and their dog agrees
> with that.

Not sure about your dog but mine says that a flag which fixes an
_unsafe_ aspects of MAP_FIXED can be called MAP_FIXED_SAFE just fine.

Anyway, I am not going to argue about this further. I've implemented the
code, gathered uscases and fortified an in-kernel user which already led
to a security issue in the past. I consider my part done here. I do not
agree that MAP_FIXED_NOREPLACE would be so much better to respin and
then deal with what about this MAP_$FOO. If there are really stong
feelings about this then feel free to take these patches, do
s@MAP_FIXED_SAFE@MAP_$FOO@ and try to upstream them yourself.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 13:21               ` Pavel Machek
  2017-12-13 13:35                 ` Michal Hocko
@ 2017-12-13 14:40                 ` Cyril Hrubis
  2017-12-13 23:19                   ` Kees Cook
  1 sibling, 1 reply; 48+ messages in thread
From: Cyril Hrubis @ 2017-12-13 14:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michal Hocko, Michael Kerrisk, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport

Hi!
> You selected stupid name for a flag. Everyone and their dog agrees
> with that. There's even consensus on better name (and everyone agrees
> it is better than .._SAFE). Of course, we could have debate if it is
> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
> poor naming on your part.

Well while everybody agrees that the name is so bad that basically
anything else would be better, there does not seem to be consensus on
which one to pick. I do understand that this frustrating and fruitless.

So what do we do now, roll a dice to choose new name?

Or do we ask BFDL[1] to choose the name?

[1] https://en.wikipedia.org/wiki/Benevolent_dictator_for_life

-- 
Cyril Hrubis
chrubis-AlSwsSmVLrQ@public.gmane.org

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
       [not found] ` <20171213092550.2774-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-12-13 12:25   ` [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Matthew Wilcox
@ 2017-12-13 17:13   ` Kees Cook
  2017-12-15  9:02     ` Michael Ellerman
  2017-12-14  0:32   ` Andrew Morton
  2 siblings, 1 reply; 48+ messages in thread
From: Kees Cook @ 2017-12-13 17:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux API, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Michal Hocko

On Wed, Dec 13, 2017 at 1:25 AM, Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> Hi,
> I am resending with some minor updates based on Michael's review and
> ask for inclusion. There haven't been any fundamental objections for
> the RFC [1] nor the previous version [2].  The biggest discussion
> revolved around the naming. There were many suggestions flowing
> around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
> MAP_FIXED_NOREPLACE etc...

With this named MAP_FIXED_NOREPLACE (the best consensus we've got on a
name), please consider this series:

Acked-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>

-Kees

>
> I am afraid we can bikeshed this to death and there will still be
> somebody finding yet another better name. Therefore I've decided to
> stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> MAP_FIXED prefix which should be recognized by developers and _SAFE
> suffix should also be clear that all dangerous side effects of the old
> MAP_FIXED are gone.
>
> If somebody _really_ hates this then feel free to nack and resubmit
> with a different name you can find a consensus for. I am sorry to be
> stubborn here but I would rather have this merged than go over few more
> iterations changing the name just because it seems like a good idea
> now. My experience tells me that chances are that the name will turn out
> to be "suboptimal" anyway over time.
>
> Some more background:
> This has started as a follow up discussion [3][4] resulting in the
> runtime failure caused by hardening patch [5] which removes MAP_FIXED
> from the elf loader because MAP_FIXED is inherently dangerous as it
> might silently clobber an existing underlying mapping (e.g. stack). The
> reason for the failure is that some architectures enforce an alignment
> for the given address hint without MAP_FIXED used (e.g. for shared or
> file backed mappings).
>
> One way around this would be excluding those archs which do alignment
> tricks from the hardening [6]. The patch is really trivial but it has
> been objected, rightfully so, that this screams for a more generic
> solution. We basically want a non-destructive MAP_FIXED.
>
> The first patch introduced MAP_FIXED_SAFE which enforces the given
> address but unlike MAP_FIXED it fails with EEXIST if the given range
> conflicts with an existing one. The flag is introduced as a completely
> new one rather than a MAP_FIXED extension because of the backward
> compatibility. We really want a never-clobber semantic even on older
> kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
> flags evaluation because we do not EINVAL on unknown flags. On those
> kernels we would simply use the traditional hint based semantic so the
> caller can still get a different address (which sucks) but at least not
> silently corrupt an existing mapping. I do not see a good way around
> that. Except we won't export expose the new semantic to the userspace at
> all.
>
> It seems there are users who would like to have something like that.
> Jemalloc has been mentioned by Michael Ellerman [7]
>
> Florian Weimer has mentioned the following:
> : glibc ld.so currently maps DSOs without hints.  This means that the kernel
> : will map right next to each other, and the offsets between them a completely
> : predictable.  We would like to change that and supply a random address in a
> : window of the address space.  If there is a conflict, we do not want the
> : kernel to pick a non-random address. Instead, we would try again with a
> : random address.
>
> John Hubbard has mentioned CUDA example
> : a) Searches /proc/<pid>/maps for a "suitable" region of available
> : VA space.  "Suitable" generally means it has to have a base address
> : within a certain limited range (a particular device model might
> : have odd limitations, for example), it has to be large enough, and
> : alignment has to be large enough (again, various devices may have
> : constraints that lead us to do this).
> :
> : This is of course subject to races with other threads in the process.
> :
> : Let's say it finds a region starting at va.
> :
> : b) Next it does:
> :     p = mmap(va, ...)
> :
> : *without* setting MAP_FIXED, of course (so va is just a hint), to
> : attempt to safely reserve that region. If p != va, then in most cases,
> : this is a failure (almost certainly due to another thread getting a
> : mapping from that region before we did), and so this layer now has to
> : call munmap(), before returning a "failure: retry" to upper layers.
> :
> :     IMPROVEMENT: --> if instead, we could call this:
> :
> :             p = mmap(va, ... MAP_FIXED_SAFE ...)
> :
> :         , then we could skip the munmap() call upon failure. This
> :         is a small thing, but it is useful here. (Thanks to Piotr
> :         Jaroszynski and Mark Hairgrove for helping me get that detail
> :         exactly right, btw.)
> :
> : c) After that, CUDA suballocates from p, via:
> :
> :      q = mmap(sub_region_start, ... MAP_FIXED ...)
> :
> : Interestingly enough, "freeing" is also done via MAP_FIXED, and
> : setting PROT_NONE to the subregion. Anyway, I just included (c) for
> : general interest.
>
> Atomic address range probing in the multithreaded programs in general
> sounds like an interesting thing to me.
>
> The second patch simply replaces MAP_FIXED use in elf loader by
> MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
> follow. Actually real MAP_FIXED usages should be docummented properly
> and they should be more of an exception.
>
> Diffstat says
>  arch/alpha/include/uapi/asm/mman.h     |  1 +
>  arch/metag/kernel/process.c            |  6 +++++-
>  arch/mips/include/uapi/asm/mman.h      |  2 ++
>  arch/parisc/include/uapi/asm/mman.h    |  2 ++
>  arch/sparc/include/uapi/asm/mman.h     |  1 -
>  arch/xtensa/include/uapi/asm/mman.h    |  2 ++
>  fs/binfmt_elf.c                        | 12 ++++++++----
>  include/uapi/asm-generic/mman-common.h |  1 +
>  mm/mmap.c                              | 11 +++++++++++
>  9 files changed, 32 insertions(+), 6 deletions(-)
>
> [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> [3] http://lkml.kernel.org/r/20171107162217.382cd754-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org
> [4] http://lkml.kernel.org/r/1510048229.12079.7.camel-JKZ9t1WPFCv1ENwx4SLHqw@public.gmane.org
> [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org
> [7] http://lkml.kernel.org/r/87efp1w7vy.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D@public.gmane.org
>



-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 14:40                 ` Cyril Hrubis
@ 2017-12-13 23:19                   ` Kees Cook
       [not found]                     ` <CAGXu5jLqE6cUxk-Girx6PG7upEzz8jmu1OH_3LVC26iJc2vTxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-12-18 19:12                     ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 48+ messages in thread
From: Kees Cook @ 2017-12-13 23:19 UTC (permalink / raw)
  To: Cyril Hrubis
  Cc: Pavel Machek, Michal Hocko, Michael Kerrisk, Linux API,
	Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Jann Horn, Mike Rapoport

On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <chrubis@suse.cz> wrote:
> Hi!
>> You selected stupid name for a flag. Everyone and their dog agrees
>> with that. There's even consensus on better name (and everyone agrees
>> it is better than .._SAFE). Of course, we could have debate if it is
>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>> poor naming on your part.
>
> Well while everybody agrees that the name is so bad that basically
> anything else would be better, there does not seem to be consensus on
> which one to pick. I do understand that this frustrating and fruitless.

Based on the earlier threads where I tried to end the bikeshedding, it
seemed like MAP_FIXED_NOREPLACE was the least bad option.

> So what do we do now, roll a dice to choose new name?
>
> Or do we ask BFDL[1] to choose the name?

I'd like to hear feedback from Michael Kerrisk, as he's had to deal
with these kinds of choices in the past. I'm fine to ask Linus too. I
just want to get past the name since the feature is quite valuable.

And if Michal doesn't want to touch this patch any more, I'm happy to
do the search/replace/resend. :P

-Kees

-- 
Kees Cook
Pixel Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
       [not found] ` <20171213092550.2774-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-12-13 12:25   ` [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Matthew Wilcox
  2017-12-13 17:13   ` Kees Cook
@ 2017-12-14  0:32   ` Andrew Morton
  2017-12-14  1:35     ` David Goldblatt
  2017-12-14 12:44     ` Edward Napierala
  2 siblings, 2 replies; 48+ messages in thread
From: Andrew Morton @ 2017-12-14  0:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Florian Weimer, John Hubbard,
	Matthew Wilcox, Abdul Haleem, Joel Stanley, Kees Cook,
	Michal Hocko, jasone-hpIqsD4AKlfQT0dZR+AlfA,
	davidtgoldblatt-Re5JQEeQqe8AvxtiuMwx3w,
	trasz-HZy0K5TPuP5AfugRpC6u6w

On Wed, 13 Dec 2017 10:25:48 +0100 Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:

> 
> Hi,
> I am resending with some minor updates based on Michael's review and
> ask for inclusion. There haven't been any fundamental objections for
> the RFC [1] nor the previous version [2].  The biggest discussion
> revolved around the naming. There were many suggestions flowing
> around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
> MAP_FIXED_NOREPLACE etc...

I like MAP_FIXED_CAREFUL :)

> I am afraid we can bikeshed this to death and there will still be
> somebody finding yet another better name. Therefore I've decided to
> stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> MAP_FIXED prefix which should be recognized by developers and _SAFE
> suffix should also be clear that all dangerous side effects of the old
> MAP_FIXED are gone.
> 
> If somebody _really_ hates this then feel free to nack and resubmit
> with a different name you can find a consensus for. I am sorry to be
> stubborn here but I would rather have this merged than go over few more
> iterations changing the name just because it seems like a good idea
> now. My experience tells me that chances are that the name will turn out
> to be "suboptimal" anyway over time.
> 
> Some more background:
> This has started as a follow up discussion [3][4] resulting in the
> runtime failure caused by hardening patch [5] which removes MAP_FIXED
> from the elf loader because MAP_FIXED is inherently dangerous as it
> might silently clobber an existing underlying mapping (e.g. stack). The
> reason for the failure is that some architectures enforce an alignment
> for the given address hint without MAP_FIXED used (e.g. for shared or
> file backed mappings).
> 
> One way around this would be excluding those archs which do alignment
> tricks from the hardening [6]. The patch is really trivial but it has
> been objected, rightfully so, that this screams for a more generic
> solution. We basically want a non-destructive MAP_FIXED.
> 
> The first patch introduced MAP_FIXED_SAFE which enforces the given
> address but unlike MAP_FIXED it fails with EEXIST if the given range
> conflicts with an existing one. The flag is introduced as a completely
> new one rather than a MAP_FIXED extension because of the backward
> compatibility. We really want a never-clobber semantic even on older
> kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
> flags evaluation because we do not EINVAL on unknown flags. On those
> kernels we would simply use the traditional hint based semantic so the
> caller can still get a different address (which sucks) but at least not
> silently corrupt an existing mapping. I do not see a good way around
> that. Except we won't export expose the new semantic to the userspace at
> all. 
> 
> It seems there are users who would like to have something like that.
> Jemalloc has been mentioned by Michael Ellerman [7]

http://lkml.kernel.org/r/87efp1w7vy.fsf-W0DJWXSxmBNbyGPkN3NxC2scP1bn1w/D4b8TPpkIZ3Q@public.gmane.org

It would be useful to get feedback from jemalloc developers (please). 
I'll add some cc's.


> Florian Weimer has mentioned the following:
> : glibc ld.so currently maps DSOs without hints.  This means that the kernel
> : will map right next to each other, and the offsets between them a completely
> : predictable.  We would like to change that and supply a random address in a
> : window of the address space.  If there is a conflict, we do not want the
> : kernel to pick a non-random address. Instead, we would try again with a
> : random address.
> 
> John Hubbard has mentioned CUDA example
> : a) Searches /proc/<pid>/maps for a "suitable" region of available
> : VA space.  "Suitable" generally means it has to have a base address
> : within a certain limited range (a particular device model might
> : have odd limitations, for example), it has to be large enough, and
> : alignment has to be large enough (again, various devices may have
> : constraints that lead us to do this).
> : 
> : This is of course subject to races with other threads in the process.
> : 
> : Let's say it finds a region starting at va.
> : 
> : b) Next it does: 
> :     p = mmap(va, ...) 
> : 
> : *without* setting MAP_FIXED, of course (so va is just a hint), to
> : attempt to safely reserve that region. If p != va, then in most cases,
> : this is a failure (almost certainly due to another thread getting a
> : mapping from that region before we did), and so this layer now has to
> : call munmap(), before returning a "failure: retry" to upper layers.
> : 
> :     IMPROVEMENT: --> if instead, we could call this:
> : 
> :             p = mmap(va, ... MAP_FIXED_SAFE ...)
> : 
> :         , then we could skip the munmap() call upon failure. This
> :         is a small thing, but it is useful here. (Thanks to Piotr
> :         Jaroszynski and Mark Hairgrove for helping me get that detail
> :         exactly right, btw.)
> : 
> : c) After that, CUDA suballocates from p, via: 
> :  
> :      q = mmap(sub_region_start, ... MAP_FIXED ...)
> : 
> : Interestingly enough, "freeing" is also done via MAP_FIXED, and
> : setting PROT_NONE to the subregion. Anyway, I just included (c) for
> : general interest.
> 
> Atomic address range probing in the multithreaded programs in general
> sounds like an interesting thing to me.
> 
> The second patch simply replaces MAP_FIXED use in elf loader by
> MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
> follow. Actually real MAP_FIXED usages should be docummented properly
> and they should be more of an exception.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-14  0:32   ` Andrew Morton
@ 2017-12-14  1:35     ` David Goldblatt
  2017-12-14  1:42       ` David Goldblatt
  2017-12-14 12:44     ` Edward Napierala
  1 sibling, 1 reply; 48+ messages in thread
From: David Goldblatt @ 2017-12-14  1:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-api, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, Michal Hocko, trasz,
	Jason Evans


[-- Attachment #1: Type: text/plain, Size: 6278 bytes --]

(+cc the jemalloc jasone; -cc,+bcc the Google jasone).

The only time we would want MAP_FIXED (or rather, a non-broken variant) is
in the middle of trying to expand an allocation in place; "atomic address
range probing in the multithreaded programs" describes our use case pretty
well. That's in a pathway that usually fails; it's pretty far down on our
kernel mmap enhancements wish-list.

On Wed, Dec 13, 2017 at 4:32 PM, Andrew Morton <akpm@linux-foundation.org>
wrote:

> On Wed, 13 Dec 2017 10:25:48 +0100 Michal Hocko <mhocko@kernel.org> wrote:
>
> >
> > Hi,
> > I am resending with some minor updates based on Michael's review and
> > ask for inclusion. There haven't been any fundamental objections for
> > the RFC [1] nor the previous version [2].  The biggest discussion
> > revolved around the naming. There were many suggestions flowing
> > around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
> > MAP_FIXED_NOREPLACE etc...
>
> I like MAP_FIXED_CAREFUL :)
>
> > I am afraid we can bikeshed this to death and there will still be
> > somebody finding yet another better name. Therefore I've decided to
> > stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
> > MAP_FIXED prefix which should be recognized by developers and _SAFE
> > suffix should also be clear that all dangerous side effects of the old
> > MAP_FIXED are gone.
> >
> > If somebody _really_ hates this then feel free to nack and resubmit
> > with a different name you can find a consensus for. I am sorry to be
> > stubborn here but I would rather have this merged than go over few more
> > iterations changing the name just because it seems like a good idea
> > now. My experience tells me that chances are that the name will turn out
> > to be "suboptimal" anyway over time.
> >
> > Some more background:
> > This has started as a follow up discussion [3][4] resulting in the
> > runtime failure caused by hardening patch [5] which removes MAP_FIXED
> > from the elf loader because MAP_FIXED is inherently dangerous as it
> > might silently clobber an existing underlying mapping (e.g. stack). The
> > reason for the failure is that some architectures enforce an alignment
> > for the given address hint without MAP_FIXED used (e.g. for shared or
> > file backed mappings).
> >
> > One way around this would be excluding those archs which do alignment
> > tricks from the hardening [6]. The patch is really trivial but it has
> > been objected, rightfully so, that this screams for a more generic
> > solution. We basically want a non-destructive MAP_FIXED.
> >
> > The first patch introduced MAP_FIXED_SAFE which enforces the given
> > address but unlike MAP_FIXED it fails with EEXIST if the given range
> > conflicts with an existing one. The flag is introduced as a completely
> > new one rather than a MAP_FIXED extension because of the backward
> > compatibility. We really want a never-clobber semantic even on older
> > kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
> > flags evaluation because we do not EINVAL on unknown flags. On those
> > kernels we would simply use the traditional hint based semantic so the
> > caller can still get a different address (which sucks) but at least not
> > silently corrupt an existing mapping. I do not see a good way around
> > that. Except we won't export expose the new semantic to the userspace at
> > all.
> >
> > It seems there are users who would like to have something like that.
> > Jemalloc has been mentioned by Michael Ellerman [7]
>
> http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au.
>
> It would be useful to get feedback from jemalloc developers (please).
> I'll add some cc's.
>
>
> > Florian Weimer has mentioned the following:
> > : glibc ld.so currently maps DSOs without hints.  This means that the
> kernel
> > : will map right next to each other, and the offsets between them a
> completely
> > : predictable.  We would like to change that and supply a random address
> in a
> > : window of the address space.  If there is a conflict, we do not want
> the
> > : kernel to pick a non-random address. Instead, we would try again with a
> > : random address.
> >
> > John Hubbard has mentioned CUDA example
> > : a) Searches /proc/<pid>/maps for a "suitable" region of available
> > : VA space.  "Suitable" generally means it has to have a base address
> > : within a certain limited range (a particular device model might
> > : have odd limitations, for example), it has to be large enough, and
> > : alignment has to be large enough (again, various devices may have
> > : constraints that lead us to do this).
> > :
> > : This is of course subject to races with other threads in the process.
> > :
> > : Let's say it finds a region starting at va.
> > :
> > : b) Next it does:
> > :     p = mmap(va, ...)
> > :
> > : *without* setting MAP_FIXED, of course (so va is just a hint), to
> > : attempt to safely reserve that region. If p != va, then in most cases,
> > : this is a failure (almost certainly due to another thread getting a
> > : mapping from that region before we did), and so this layer now has to
> > : call munmap(), before returning a "failure: retry" to upper layers.
> > :
> > :     IMPROVEMENT: --> if instead, we could call this:
> > :
> > :             p = mmap(va, ... MAP_FIXED_SAFE ...)
> > :
> > :         , then we could skip the munmap() call upon failure. This
> > :         is a small thing, but it is useful here. (Thanks to Piotr
> > :         Jaroszynski and Mark Hairgrove for helping me get that detail
> > :         exactly right, btw.)
> > :
> > : c) After that, CUDA suballocates from p, via:
> > :
> > :      q = mmap(sub_region_start, ... MAP_FIXED ...)
> > :
> > : Interestingly enough, "freeing" is also done via MAP_FIXED, and
> > : setting PROT_NONE to the subregion. Anyway, I just included (c) for
> > : general interest.
> >
> > Atomic address range probing in the multithreaded programs in general
> > sounds like an interesting thing to me.
> >
> > The second patch simply replaces MAP_FIXED use in elf loader by
> > MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
> > follow. Actually real MAP_FIXED usages should be docummented properly
> > and they should be more of an exception.
>
>

[-- Attachment #2: Type: text/html, Size: 7614 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-14  1:35     ` David Goldblatt
@ 2017-12-14  1:42       ` David Goldblatt
  0 siblings, 0 replies; 48+ messages in thread
From: David Goldblatt @ 2017-12-14  1:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-api, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, Michal Hocko, trasz,
	Jason Evans

(+cc the jemalloc jasone; -cc,+bcc the Google jasone).

The only time we would want MAP_FIXED (or rather, a non-broken
variant) is in the middle of trying to expand an allocation in place;
"atomic address range probing in the multithreaded programs" describes
our use case pretty well. That's in a pathway that usually fails; it's
pretty far down on our kernel mmap enhancements wish-list.

(Sorry if you get this twice, an html reply bounced).


On Wed, Dec 13, 2017 at 5:35 PM, David Goldblatt
<davidtgoldblatt@gmail.com> wrote:
> (+cc the jemalloc jasone; -cc,+bcc the Google jasone).
>
> The only time we would want MAP_FIXED (or rather, a non-broken variant) is
> in the middle of trying to expand an allocation in place; "atomic address
> range probing in the multithreaded programs" describes our use case pretty
> well. That's in a pathway that usually fails; it's pretty far down on our
> kernel mmap enhancements wish-list.
>
> On Wed, Dec 13, 2017 at 4:32 PM, Andrew Morton <akpm@linux-foundation.org>
> wrote:
>>
>> On Wed, 13 Dec 2017 10:25:48 +0100 Michal Hocko <mhocko@kernel.org> wrote:
>>
>> >
>> > Hi,
>> > I am resending with some minor updates based on Michael's review and
>> > ask for inclusion. There haven't been any fundamental objections for
>> > the RFC [1] nor the previous version [2].  The biggest discussion
>> > revolved around the naming. There were many suggestions flowing
>> > around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
>> > MAP_FIXED_NOREPLACE etc...
>>
>> I like MAP_FIXED_CAREFUL :)
>>
>> > I am afraid we can bikeshed this to death and there will still be
>> > somebody finding yet another better name. Therefore I've decided to
>> > stick with my original MAP_FIXED_SAFE. Why? Well, because it keeps the
>> > MAP_FIXED prefix which should be recognized by developers and _SAFE
>> > suffix should also be clear that all dangerous side effects of the old
>> > MAP_FIXED are gone.
>> >
>> > If somebody _really_ hates this then feel free to nack and resubmit
>> > with a different name you can find a consensus for. I am sorry to be
>> > stubborn here but I would rather have this merged than go over few more
>> > iterations changing the name just because it seems like a good idea
>> > now. My experience tells me that chances are that the name will turn out
>> > to be "suboptimal" anyway over time.
>> >
>> > Some more background:
>> > This has started as a follow up discussion [3][4] resulting in the
>> > runtime failure caused by hardening patch [5] which removes MAP_FIXED
>> > from the elf loader because MAP_FIXED is inherently dangerous as it
>> > might silently clobber an existing underlying mapping (e.g. stack). The
>> > reason for the failure is that some architectures enforce an alignment
>> > for the given address hint without MAP_FIXED used (e.g. for shared or
>> > file backed mappings).
>> >
>> > One way around this would be excluding those archs which do alignment
>> > tricks from the hardening [6]. The patch is really trivial but it has
>> > been objected, rightfully so, that this screams for a more generic
>> > solution. We basically want a non-destructive MAP_FIXED.
>> >
>> > The first patch introduced MAP_FIXED_SAFE which enforces the given
>> > address but unlike MAP_FIXED it fails with EEXIST if the given range
>> > conflicts with an existing one. The flag is introduced as a completely
>> > new one rather than a MAP_FIXED extension because of the backward
>> > compatibility. We really want a never-clobber semantic even on older
>> > kernels which do not recognize the flag. Unfortunately mmap sucks wrt.
>> > flags evaluation because we do not EINVAL on unknown flags. On those
>> > kernels we would simply use the traditional hint based semantic so the
>> > caller can still get a different address (which sucks) but at least not
>> > silently corrupt an existing mapping. I do not see a good way around
>> > that. Except we won't export expose the new semantic to the userspace at
>> > all.
>> >
>> > It seems there are users who would like to have something like that.
>> > Jemalloc has been mentioned by Michael Ellerman [7]
>>
>> http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au.
>>
>> It would be useful to get feedback from jemalloc developers (please).
>> I'll add some cc's.
>>
>>
>> > Florian Weimer has mentioned the following:
>> > : glibc ld.so currently maps DSOs without hints.  This means that the
>> > kernel
>> > : will map right next to each other, and the offsets between them a
>> > completely
>> > : predictable.  We would like to change that and supply a random address
>> > in a
>> > : window of the address space.  If there is a conflict, we do not want
>> > the
>> > : kernel to pick a non-random address. Instead, we would try again with
>> > a
>> > : random address.
>> >
>> > John Hubbard has mentioned CUDA example
>> > : a) Searches /proc/<pid>/maps for a "suitable" region of available
>> > : VA space.  "Suitable" generally means it has to have a base address
>> > : within a certain limited range (a particular device model might
>> > : have odd limitations, for example), it has to be large enough, and
>> > : alignment has to be large enough (again, various devices may have
>> > : constraints that lead us to do this).
>> > :
>> > : This is of course subject to races with other threads in the process.
>> > :
>> > : Let's say it finds a region starting at va.
>> > :
>> > : b) Next it does:
>> > :     p = mmap(va, ...)
>> > :
>> > : *without* setting MAP_FIXED, of course (so va is just a hint), to
>> > : attempt to safely reserve that region. If p != va, then in most cases,
>> > : this is a failure (almost certainly due to another thread getting a
>> > : mapping from that region before we did), and so this layer now has to
>> > : call munmap(), before returning a "failure: retry" to upper layers.
>> > :
>> > :     IMPROVEMENT: --> if instead, we could call this:
>> > :
>> > :             p = mmap(va, ... MAP_FIXED_SAFE ...)
>> > :
>> > :         , then we could skip the munmap() call upon failure. This
>> > :         is a small thing, but it is useful here. (Thanks to Piotr
>> > :         Jaroszynski and Mark Hairgrove for helping me get that detail
>> > :         exactly right, btw.)
>> > :
>> > : c) After that, CUDA suballocates from p, via:
>> > :
>> > :      q = mmap(sub_region_start, ... MAP_FIXED ...)
>> > :
>> > : Interestingly enough, "freeing" is also done via MAP_FIXED, and
>> > : setting PROT_NONE to the subregion. Anyway, I just included (c) for
>> > : general interest.
>> >
>> > Atomic address range probing in the multithreaded programs in general
>> > sounds like an interesting thing to me.
>> >
>> > The second patch simply replaces MAP_FIXED use in elf loader by
>> > MAP_FIXED_SAFE. I believe other places which rely on MAP_FIXED should
>> > follow. Actually real MAP_FIXED usages should be docummented properly
>> > and they should be more of an exception.
>>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13  9:31   ` [PATCH 2/2] mmap.2: MAP_FIXED updated documentation Michal Hocko
  2017-12-13 12:55     ` Pavel Machek
@ 2017-12-14  2:52     ` Jann Horn
       [not found]       ` <CAG48ez0JZ3PVW3vgSXDmDijS+a_5bSX9qNuyggnsB6JTSkKngA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-12-14 23:06       ` John Hubbard
  1 sibling, 2 replies; 48+ messages in thread
From: Jann Horn @ 2017-12-14  2:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Kerrisk, Linux API, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Mike Rapoport, Cyril Hrubis, Pavel Machek,
	Michal Hocko

On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
> From: John Hubbard <jhubbard@nvidia.com>
>
>     -- Expand the documentation to discuss the hazards in
>        enough detail to allow avoiding them.
>
>     -- Mention the upcoming MAP_FIXED_SAFE flag.
>
>     -- Enhance the alignment requirement slightly.
>
> CC: Michael Ellerman <mpe@ellerman.id.au>
> CC: Jann Horn <jannh@google.com>
> CC: Matthew Wilcox <willy@infradead.org>
> CC: Michal Hocko <mhocko@kernel.org>
> CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
> CC: Cyril Hrubis <chrubis@suse.cz>
> CC: Pavel Machek <pavel@ucw.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  man2/mmap.2 | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 02d391697ce6..cb8789daec2d 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
[...]
> @@ -226,6 +227,33 @@ Software that aspires to be portable should use this option with care, keeping
>  in mind that the exact layout of a process' memory map is allowed to change
>  significantly between kernel versions, C library versions, and operating system
>  releases.
> +.IP
> +Furthermore, this option is extremely hazardous (when used on its own), because
> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
> +process to corrupt its own address space.

I think this is worded unfortunately. It is dangerous if used
incorrectly, and it's a good tool when used correctly.

[...]
> +Thread B need not create a mapping directly; simply making a library call
> +that, internally, uses
> +.I dlopen(3)
> +to load some other shared library, will
> +suffice. The dlopen(3) call will map the library into the process's address
> +space. Furthermore, almost any library call may be implemented using this
> +technique.
> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
> +(http://www.linux-pam.org).

This is arkward. This first mentions dlopen(), which is a very niche
case, and then just very casually mentions the much bigger
problem that tons of library functions can allocate memory through
malloc(), causing mmap() calls, sometimes without that even being
a documented property of the function.

> +.IP
> +Newer kernels
> +(Linux 4.16 and later) have a
> +.B MAP_FIXED_SAFE
> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
> +should be preferred over MAP_FIXED.

This is bad advice. MAP_FIXED is completely safe if you use it on an address
range you've allocated, and it is used in this way by core system libraries to
place multiple VMAs in virtually contiguous memory, for example:

ld.so (from glibc) uses it to load dynamic libraries:

$ strace -e trace=open,mmap,close /usr/bin/id 2>&1 >/dev/null | head -n20
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f35811c0000
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 161237, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3581198000
close(3)                                = 0
open("/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 2259664, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f3580d78000
mmap(0x7f3580f9c000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x24000) = 0x7f3580f9c000
mmap(0x7f3580f9e000, 6864, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580f9e000
close(3)                                = 0
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 3795360, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f35809d9000
mmap(0x7f3580d6e000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f3580d6e000
mmap(0x7f3580d74000, 14752, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580d74000
close(3)                                = 0
[...]

As a comment in dl-map-segments.h in glibc explains:
      /* This is a position-independent shared object.  We can let the
         kernel map it anywhere it likes, but we must have space for all
         the segments in their specified positions relative to the first.
         So we map the first segment without MAP_FIXED, but with its
         extent increased to cover all the segments.  Then we remove
         access from excess portion, and there is known sufficient space
         there to remap from the later segments.


And AFAIK anything that allocates thread stacks uses MAP_FIXED to
create the guard page at the bottom.


MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
or whatever it ends up being called. Please remove this advice or, better,
clarify what MAP_FIXED should be used for (creation of virtually contiguous
VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
allocate memory at a fixed address for some reason, with a failure instead of
the normal fallback to using a different address).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
       [not found]       ` <CAG48ez0JZ3PVW3vgSXDmDijS+a_5bSX9qNuyggnsB6JTSkKngA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-14  5:28         ` John Hubbard
  0 siblings, 0 replies; 48+ messages in thread
From: John Hubbard @ 2017-12-14  5:28 UTC (permalink / raw)
  To: Jann Horn, Michal Hocko
  Cc: Michael Kerrisk, Linux API, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML, linux-arch,
	Florian Weimer, Matthew Wilcox, Mike Rapoport, Cyril Hrubis,
	Pavel Machek, Michal Hocko

On 12/13/2017 06:52 PM, Jann Horn wrote:
> On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> From: John Hubbard <jhubbard-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>>
>>     -- Expand the documentation to discuss the hazards in
>>        enough detail to allow avoiding them.
>>
>>     -- Mention the upcoming MAP_FIXED_SAFE flag.
>>
>>     -- Enhance the alignment requirement slightly.
>>
>> CC: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
>> CC: Jann Horn <jannh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> CC: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
>> CC: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>> CC: Mike Rapoport <rppt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>> CC: Cyril Hrubis <chrubis-AlSwsSmVLrQ@public.gmane.org>
>> CC: Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org>
>> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
>> Signed-off-by: John Hubbard <jhubbard-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
>> ---
>>  man2/mmap.2 | 32 ++++++++++++++++++++++++++++++--
>>  1 file changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/man2/mmap.2 b/man2/mmap.2
>> index 02d391697ce6..cb8789daec2d 100644
>> --- a/man2/mmap.2
>> +++ b/man2/mmap.2
> [...]
>> @@ -226,6 +227,33 @@ Software that aspires to be portable should use this option with care, keeping
>>  in mind that the exact layout of a process' memory map is allowed to change
>>  significantly between kernel versions, C library versions, and operating system
>>  releases.
>> +.IP
>> +Furthermore, this option is extremely hazardous (when used on its own), because
>> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
>> +process to corrupt its own address space.
> 
> I think this is worded unfortunately. It is dangerous if used
> incorrectly, and it's a good tool when used correctly.
> 

Hi Jann,

Hey, thanks for reviewing this again. I think I can accomodate all of your requests,
without contradicting other reviewers' earlier feedback...approximately. :)  I'll 
have a go at rewording this, and addressing your additional comments below, tomorrow
afternoon, so please look for an updated version later that day.

thanks,
-- 
John Hubbard
NVIDIA

> [...]
>> +Thread B need not create a mapping directly; simply making a library call
>> +that, internally, uses
>> +.I dlopen(3)
>> +to load some other shared library, will
>> +suffice. The dlopen(3) call will map the library into the process's address
>> +space. Furthermore, almost any library call may be implemented using this
>> +technique.
>> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
>> +(http://www.linux-pam.org).
> 
> This is arkward. This first mentions dlopen(), which is a very niche
> case, and then just very casually mentions the much bigger
> problem that tons of library functions can allocate memory through
> malloc(), causing mmap() calls, sometimes without that even being
> a documented property of the function.
> 
>> +.IP
>> +Newer kernels
>> +(Linux 4.16 and later) have a
>> +.B MAP_FIXED_SAFE
>> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
>> +should be preferred over MAP_FIXED.
> 
> This is bad advice. MAP_FIXED is completely safe if you use it on an address
> range you've allocated, and it is used in this way by core system libraries to
> place multiple VMAs in virtually contiguous memory, for example:
> 
> ld.so (from glibc) uses it to load dynamic libraries:
> 
> $ strace -e trace=open,mmap,close /usr/bin/id 2>&1 >/dev/null | head -n20
> mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x7f35811c0000
> open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> mmap(NULL, 161237, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3581198000
> close(3)                                = 0
> open("/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
> mmap(NULL, 2259664, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f3580d78000
> mmap(0x7f3580f9c000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x24000) = 0x7f3580f9c000
> mmap(0x7f3580f9e000, 6864, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580f9e000
> close(3)                                = 0
> open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> mmap(NULL, 3795360, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f35809d9000
> mmap(0x7f3580d6e000, 24576, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f3580d6e000
> mmap(0x7f3580d74000, 14752, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f3580d74000
> close(3)                                = 0
> [...]
> 
> As a comment in dl-map-segments.h in glibc explains:
>       /* This is a position-independent shared object.  We can let the
>          kernel map it anywhere it likes, but we must have space for all
>          the segments in their specified positions relative to the first.
>          So we map the first segment without MAP_FIXED, but with its
>          extent increased to cover all the segments.  Then we remove
>          access from excess portion, and there is known sufficient space
>          there to remap from the later segments.
> 
> 
> And AFAIK anything that allocates thread stacks uses MAP_FIXED to
> create the guard page at the bottom.
> 
> 
> MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
> or whatever it ends up being called. Please remove this advice or, better,
> clarify what MAP_FIXED should be used for (creation of virtually contiguous
> VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
> allocate memory at a fixed address for some reason, with a failure instead of
> the normal fallback to using a different address).
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
       [not found]                     ` <CAGXu5jLqE6cUxk-Girx6PG7upEzz8jmu1OH_3LVC26iJc2vTxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-14  7:07                       ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-14  7:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: Cyril Hrubis, Pavel Machek, Michael Kerrisk, Linux API,
	Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Jann Horn, Mike Rapoport

On Wed 13-12-17 15:19:00, Kees Cook wrote:
> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <chrubis-AlSwsSmVLrQ@public.gmane.org> wrote:
> > Hi!
> >> You selected stupid name for a flag. Everyone and their dog agrees
> >> with that. There's even consensus on better name (and everyone agrees
> >> it is better than .._SAFE). Of course, we could have debate if it is
> >> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
> >> poor naming on your part.
> >
> > Well while everybody agrees that the name is so bad that basically
> > anything else would be better, there does not seem to be consensus on
> > which one to pick. I do understand that this frustrating and fruitless.
> 
> Based on the earlier threads where I tried to end the bikeshedding, it
> seemed like MAP_FIXED_NOREPLACE was the least bad option.
> 
> > So what do we do now, roll a dice to choose new name?
> >
> > Or do we ask BFDL[1] to choose the name?
> 
> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
> with these kinds of choices in the past. I'm fine to ask Linus too. I
> just want to get past the name since the feature is quite valuable.
> 
> And if Michal doesn't want to touch this patch any more, I'm happy to
> do the search/replace/resend. :P

I think Andrew can do the s@MAP_FIXED_SAFE@MAP_$FOO@ when adding the
patch to the mmotm tree. The reason why I refuse to repost is that a)
functionality doesn't really need a further rework (at least not based
on the review feedback) and b) I do not really see any large consensus
here. People claim to like this or that more but nobody (except of you
Kees) was willing to put their name under their preference in a form of
Acked-by. And that worries me, because generating "better" names sounds
too easy to allow a forward progress.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-14  0:32   ` Andrew Morton
  2017-12-14  1:35     ` David Goldblatt
@ 2017-12-14 12:44     ` Edward Napierala
  2017-12-14 13:15       ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Edward Napierala @ 2017-12-14 12:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-api, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, Michal Hocko, jasone,
	davidtgoldblatt


[-- Attachment #1: Type: text/plain, Size: 827 bytes --]

Regarding the name - how about adopting MAP_EXCL?  It was introduced in
FreeBSD,
and seems to do exactly this; quoting mmap(2):

MAP_FIXED    Do not permit the system to select a different address
                        than the one specified.  If the specified address
                        cannot be used, mmap() will fail.  If MAP_FIXED is
                        specified, addr must be a multiple of the page size.
                        If MAP_EXCL is not specified, a successful MAP_FIXED
                        request replaces any previous mappings for the
                        process' pages in the range from addr to addr + len.
                        In contrast, if MAP_EXCL is specified, the request
                        will fail if a mapping already exists within the
                        range.

[-- Attachment #2: Type: text/html, Size: 1120 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-14 12:44     ` Edward Napierala
@ 2017-12-14 13:15       ` Michal Hocko
  2017-12-14 14:54         ` Edward Napierala
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-12-14 13:15 UTC (permalink / raw)
  To: Edward Napierala
  Cc: Andrew Morton, linux-api, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, jasone, davidtgoldblatt

On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> Regarding the name - how about adopting MAP_EXCL?  It was introduced in
> FreeBSD,
> and seems to do exactly this; quoting mmap(2):
> 
> MAP_FIXED    Do not permit the system to select a different address
>                         than the one specified.  If the specified address
>                         cannot be used, mmap() will fail.  If MAP_FIXED is
>                         specified, addr must be a multiple of the page size.
>                         If MAP_EXCL is not specified, a successful MAP_FIXED
>                         request replaces any previous mappings for the
>                         process' pages in the range from addr to addr + len.
>                         In contrast, if MAP_EXCL is specified, the request
>                         will fail if a mapping already exists within the
>                         range.

I am not familiar with the FreeBSD implementation but from the above it
looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
going to implement it in linux due to reasons mentioned in this cover
letter. Using the same name would be more confusing than helpful I am
afraid.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-14 13:15       ` Michal Hocko
@ 2017-12-14 14:54         ` Edward Napierala
  2017-12-19 12:40           ` David Laight
  0 siblings, 1 reply; 48+ messages in thread
From: Edward Napierala @ 2017-12-14 14:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-api, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, jasone, davidtgoldblatt

On 1214T1415, Michal Hocko wrote:
> On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> > Regarding the name - how about adopting MAP_EXCL?  It was introduced in
> > FreeBSD,
> > and seems to do exactly this; quoting mmap(2):
> > 
> > MAP_FIXED    Do not permit the system to select a different address
> >                         than the one specified.  If the specified address
> >                         cannot be used, mmap() will fail.  If MAP_FIXED is
> >                         specified, addr must be a multiple of the page size.
> >                         If MAP_EXCL is not specified, a successful MAP_FIXED
> >                         request replaces any previous mappings for the
> >                         process' pages in the range from addr to addr + len.
> >                         In contrast, if MAP_EXCL is specified, the request
> >                         will fail if a mapping already exists within the
> >                         range.
> 
> I am not familiar with the FreeBSD implementation but from the above it
> looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
> going to implement it in linux due to reasons mentioned in this cover
> letter. Using the same name would be more confusing than helpful I am
> afraid.

Sorry, missed that.  Indeed, reusing a name with a different semantics
would be a bad idea.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-14  2:52     ` Jann Horn
       [not found]       ` <CAG48ez0JZ3PVW3vgSXDmDijS+a_5bSX9qNuyggnsB6JTSkKngA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-14 23:06       ` John Hubbard
       [not found]         ` <b4fb7b3a-e53e-bf87-53c5-186751a14f4e-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: John Hubbard @ 2017-12-14 23:06 UTC (permalink / raw)
  To: Jann Horn, Michal Hocko
  Cc: Michael Kerrisk, Linux API, Khalid Aziz, Michael Ellerman,
	Andrew Morton, Russell King - ARM Linux, Andrea Arcangeli,
	linux-mm, LKML, linux-arch, Florian Weimer, Matthew Wilcox,
	Mike Rapoport, Cyril Hrubis, Pavel Machek, Michal Hocko

On 12/13/2017 06:52 PM, Jann Horn wrote:
> On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> +.IP
>> +Furthermore, this option is extremely hazardous (when used on its own), because
>> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
>> +process to corrupt its own address space.
> 
> I think this is worded unfortunately. It is dangerous if used
> incorrectly, and it's a good tool when used correctly.
> 
> [...]
>> +Thread B need not create a mapping directly; simply making a library call
>> +that, internally, uses
>> +.I dlopen(3)
>> +to load some other shared library, will
>> +suffice. The dlopen(3) call will map the library into the process's address
>> +space. Furthermore, almost any library call may be implemented using this
>> +technique.
>> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
>> +(http://www.linux-pam.org).
> 
> This is arkward. This first mentions dlopen(), which is a very niche
> case, and then just very casually mentions the much bigger
> problem that tons of library functions can allocate memory through
> malloc(), causing mmap() calls, sometimes without that even being
> a documented property of the function.
> 

Hi Jann,

Here is some proposed new wording, to address your two comments above. What do 
you think of this:

NOTE:  this  option  can  be hazardous (when used on its own), because it
forcibly removes pre-existing mappings,  making  it  easy  for  a  multi-
threaded  process to corrupt its own address space. For example, thread A
looks through /proc/<pid>/maps and locates an  available  address  range,
while  thread  B simultaneously acquires part or all of that same address
range. Thread A then calls mmap(MAP_FIXED), effectively  overwriting  the
mapping that thread B created.

Thread B need not create a mapping directly; simply making a library call
whose implementation calls malloc(3), mmap(), or dlopen(3) will  suffice,
because those calls all create new mappings.

>> +.IP
>> +Newer kernels
>> +(Linux 4.16 and later) have a
>> +.B MAP_FIXED_SAFE
>> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
>> +should be preferred over MAP_FIXED.
> 
> This is bad advice. MAP_FIXED is completely safe if you use it on an address
> range you've allocated, and it is used in this way by core system libraries to
> place multiple VMAs in virtually contiguous memory, for example:
[...] 
> MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
> or whatever it ends up being called. Please remove this advice or, better,
> clarify what MAP_FIXED should be used for (creation of virtually contiguous
> VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
> allocate memory at a fixed address for some reason, with a failure instead of
> the normal fallback to using a different address).
> 

Rather than risk another back-and-forth with Michal (who doesn't want any advice
on how to use this safely, in the man page), I've simply removed this advice
entirely.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
       [not found]         ` <b4fb7b3a-e53e-bf87-53c5-186751a14f4e-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
@ 2017-12-14 23:10           ` Jann Horn
  0 siblings, 0 replies; 48+ messages in thread
From: Jann Horn @ 2017-12-14 23:10 UTC (permalink / raw)
  To: John Hubbard
  Cc: Michal Hocko, Michael Kerrisk, Linux API, Khalid Aziz,
	Michael Ellerman, Andrew Morton, Russell King - ARM Linux,
	Andrea Arcangeli, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, LKML,
	linux-arch, Florian Weimer, Matthew Wilcox, Mike Rapoport,
	Cyril Hrubis, Pavel Machek, Michal Hocko

On Fri, Dec 15, 2017 at 12:06 AM, John Hubbard <jhubbard-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
> On 12/13/2017 06:52 PM, Jann Horn wrote:
>> On Wed, Dec 13, 2017 at 10:31 AM, Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>> From: John Hubbard <jhubbard-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
> [...]
>>> +.IP
>>> +Furthermore, this option is extremely hazardous (when used on its own), because
>>> +it forcibly removes pre-existing mappings, making it easy for a multi-threaded
>>> +process to corrupt its own address space.
>>
>> I think this is worded unfortunately. It is dangerous if used
>> incorrectly, and it's a good tool when used correctly.
>>
>> [...]
>>> +Thread B need not create a mapping directly; simply making a library call
>>> +that, internally, uses
>>> +.I dlopen(3)
>>> +to load some other shared library, will
>>> +suffice. The dlopen(3) call will map the library into the process's address
>>> +space. Furthermore, almost any library call may be implemented using this
>>> +technique.
>>> +Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries
>>> +(http://www.linux-pam.org).
>>
>> This is arkward. This first mentions dlopen(), which is a very niche
>> case, and then just very casually mentions the much bigger
>> problem that tons of library functions can allocate memory through
>> malloc(), causing mmap() calls, sometimes without that even being
>> a documented property of the function.
>>
>
> Hi Jann,
>
> Here is some proposed new wording, to address your two comments above. What do
> you think of this:
>
> NOTE:  this  option  can  be hazardous (when used on its own), because it
> forcibly removes pre-existing mappings,  making  it  easy  for  a  multi-
> threaded  process to corrupt its own address space. For example, thread A
> looks through /proc/<pid>/maps and locates an  available  address  range,
> while  thread  B simultaneously acquires part or all of that same address
> range. Thread A then calls mmap(MAP_FIXED), effectively  overwriting  the
> mapping that thread B created.
>
> Thread B need not create a mapping directly; simply making a library call
> whose implementation calls malloc(3), mmap(), or dlopen(3) will  suffice,
> because those calls all create new mappings.

Thanks! That sounds better to me.

>>> +.IP
>>> +Newer kernels
>>> +(Linux 4.16 and later) have a
>>> +.B MAP_FIXED_SAFE
>>> +option that avoids the corruption problem; if available, MAP_FIXED_SAFE
>>> +should be preferred over MAP_FIXED.
>>
>> This is bad advice. MAP_FIXED is completely safe if you use it on an address
>> range you've allocated, and it is used in this way by core system libraries to
>> place multiple VMAs in virtually contiguous memory, for example:
> [...]
>> MAP_FIXED is a better solution for these usecases than MAP_FIXED_SAFE,
>> or whatever it ends up being called. Please remove this advice or, better,
>> clarify what MAP_FIXED should be used for (creation of virtually contiguous
>> VMAs) and what MAP_FIXED_SAFE should be used for (attempting to
>> allocate memory at a fixed address for some reason, with a failure instead of
>> the normal fallback to using a different address).
>>
>
> Rather than risk another back-and-forth with Michal (who doesn't want any advice
> on how to use this safely, in the man page), I've simply removed this advice
> entirely.

Makes sense.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-13 17:13   ` Kees Cook
@ 2017-12-15  9:02     ` Michael Ellerman
  0 siblings, 0 replies; 48+ messages in thread
From: Michael Ellerman @ 2017-12-15  9:02 UTC (permalink / raw)
  To: Kees Cook, Michal Hocko
  Cc: Linux API, Khalid Aziz, Andrew Morton, Russell King - ARM Linux,
	Andrea Arcangeli, Linux-MM, LKML, linux-arch, Florian Weimer,
	John Hubbard, Matthew Wilcox, Abdul Haleem, Joel Stanley,
	Michal Hocko

Kees Cook <keescook@chromium.org> writes:

> On Wed, Dec 13, 2017 at 1:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>
>> Hi,
>> I am resending with some minor updates based on Michael's review and
>> ask for inclusion. There haven't been any fundamental objections for
>> the RFC [1] nor the previous version [2].  The biggest discussion
>> revolved around the naming. There were many suggestions flowing
>> around MAP_REQUIRED, MAP_EXACT, MAP_FIXED_NOCLOBBER, MAP_AT_ADDR,
>> MAP_FIXED_NOREPLACE etc...
>
> With this named MAP_FIXED_NOREPLACE (the best consensus we've got on a
> name), please consider this series:
>
> Acked-by: Kees Cook <keescook@chromium.org>

I don't feel like I'm actually qualified to ack the mm and binfmt
changes, but everything *looks* correct to me, and you've fixed the flag
numbering such that it can go in mman-common.h as I suggested.

So if the name was MAP_FIXED_NOREPLACE I would also be happy with it.

Acked-by: Michael Ellerman <mpe@ellerman.id.au>

I can resubmit with the name changed if that's what it takes.

cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-12-13  9:25 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
@ 2017-12-16  0:49   ` Andrei Vagin
  2017-12-18  9:13     ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Andrei Vagin @ 2017-12-16  0:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Michal Hocko, Abdul Haleem, Joel Stanley, Kees Cook


[-- Attachment #1: Type: text/plain, Size: 6926 bytes --]

Hi Michal,

We run CRIU tests for linux-next and the 4.15.0-rc3-next-20171215 kernel
doesn't boot:

[    3.492549] Freeing unused kernel memory: 1640K
[    3.494547] Write protecting the kernel read-only data: 18432k
[    3.498781] Freeing unused kernel memory: 2016K
[    3.503330] Freeing unused kernel memory: 512K
[    3.505232] rodata_test: all tests were successful
[    3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
[    3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
[    3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
[    3.532127] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
[    3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[    3.546081] Call Trace:
[    3.547221]  dump_stack+0x5c/0x79
[    3.548768]  ? rest_init+0x30/0xb0
[    3.550320]  panic+0xe4/0x232
[    3.551669]  ? rest_init+0xb0/0xb0
[    3.553110]  kernel_init+0xeb/0x100
[    3.554701]  ret_from_fork+0x1f/0x30
[    3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    3.564160] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.

If I revert this patch, it boots normally.

Thanks,
Andrei

On Wed, Dec 13, 2017 at 10:25:50AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> on a controlled address and they use MAP_FIXED to enforce that. This is
> however dangerous thing prone to silent data corruption which can be
> even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> we could end up mapping over the existing stack with some luck.
> 
> The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> stack consumption early during execve fully stopped by da029c11e6b1
> ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> safe and any attack should be impractical. On the other hand this is
> just too subtle assumption so it can break quite easily and hard to
> spot.
> 
> I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> fundamentally dangerous. Moreover it shouldn't be even needed. We are
> at the early process stage and so there shouldn't be unrelated mappings
> (except for stack and loader) existing so mmap for a given address
> should succeed even without MAP_FIXED. Something is terribly wrong if
> this is not the case and we should rather fail than silently corrupt the
> underlying mapping.
> 
> Address this issue by changing MAP_FIXED to the newly added
> MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> existing mapping clashing with the requested one without clobbering it.
> 
> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> Cc: Joel Stanley <joel@jms.id.au>
> Acked-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/metag/kernel/process.c |  6 +++++-
>  fs/binfmt_elf.c             | 12 ++++++++----
>  2 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> index 0909834c83a7..867c8d0a5fb4 100644
> --- a/arch/metag/kernel/process.c
> +++ b/arch/metag/kernel/process.c
> @@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
>  	tcm_tag = tcm_lookup_tag(addr);
>  
>  	if (tcm_tag != TCM_INVALID_TAG)
> -		type &= ~MAP_FIXED;
> +		type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
>  
>  	/*
>  	* total_size is the size of the ELF (interpreter) image.
> @@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
>  	} else
>  		map_addr = vm_mmap(filep, addr, size, prot, type, off);
>  
> +	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> +		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> +				task_pid_nr(current), tsk->comm, (void*)addr);
> +
>  	if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
>  		struct tcm_allocation *tcm;
>  		unsigned long tcm_addr;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 73b01e474fdc..5916d45f64a7 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
>  	} else
>  		map_addr = vm_mmap(filep, addr, size, prot, type, off);
>  
> +	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> +		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> +				task_pid_nr(current), current->comm, (void*)addr);
> +
>  	return(map_addr);
>  }
>  
> @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>  				elf_prot |= PROT_EXEC;
>  			vaddr = eppnt->p_vaddr;
>  			if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> -				elf_type |= MAP_FIXED;
> +				elf_type |= MAP_FIXED_SAFE;
>  			else if (no_base && interp_elf_ex->e_type == ET_DYN)
>  				load_addr = -vaddr;
>  
> @@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  		 * the ET_DYN load_addr calculations, proceed normally.
>  		 */
>  		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> -			elf_flags |= MAP_FIXED;
> +			elf_flags |= MAP_FIXED_SAFE;
>  		} else if (loc->elf_ex.e_type == ET_DYN) {
>  			/*
>  			 * This logic is run once for the first LOAD Program
> @@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  				load_bias = ELF_ET_DYN_BASE;
>  				if (current->flags & PF_RANDOMIZE)
>  					load_bias += arch_mmap_rnd();
> -				elf_flags |= MAP_FIXED;
> +				elf_flags |= MAP_FIXED_SAFE;
>  			} else
>  				load_bias = 0;
>  
> @@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
>  			(eppnt->p_filesz +
>  			 ELF_PAGEOFFSET(eppnt->p_vaddr)),
>  			PROT_READ | PROT_WRITE | PROT_EXEC,
> -			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> +			MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
>  			(eppnt->p_offset -
>  			 ELF_PAGEOFFSET(eppnt->p_vaddr)));
>  	if (error != ELF_PAGESTART(eppnt->p_vaddr))

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 21900 bytes --]

[-- Attachment #3: dmesg.txt --]
[-- Type: text/plain, Size: 24417 bytes --]

[    0.000000] Linux version 4.15.0-rc3-next-20171215-00001-g6d6aea478fce (avagin@laptop) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #11 SMP Fri Dec 15 16:39:11 PST 2017
[    0.000000] Command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
[    0.000000] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
[    0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffd8fff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffd9000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] random: fast init done
[    0.000000] SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[    0.000000] Hypervisor detected: KVM
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] e820: last_pfn = 0x7ffd9 max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: write-back
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-BFFFF uncachable
[    0.000000]   C0000-FFFFF write-protect
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 0080000000 mask FF80000000 uncachable
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[    0.000000] found SMP MP-table at [mem 0x000f6bd0-0x000f6bdf] mapped at [        (ptrval)]
[    0.000000] Base memory trampoline at [        (ptrval)] 99000 size 24576
[    0.000000] Using GB pages for direct mapping
[    0.000000] BRK [0x2c984000, 0x2c984fff] PGTABLE
[    0.000000] BRK [0x2c985000, 0x2c985fff] PGTABLE
[    0.000000] BRK [0x2c986000, 0x2c986fff] PGTABLE
[    0.000000] BRK [0x2c987000, 0x2c987fff] PGTABLE
[    0.000000] BRK [0x2c988000, 0x2c988fff] PGTABLE
[    0.000000] BRK [0x2c989000, 0x2c989fff] PGTABLE
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x00000000000F69C0 000014 (v00 BOCHS )
[    0.000000] ACPI: RSDT 0x000000007FFE12FF 00002C (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: FACP 0x000000007FFE120B 000074 (v01 BOCHS  BXPCFACP 00000001 BXPC 00000001)
[    0.000000] ACPI: DSDT 0x000000007FFE0040 0011CB (v01 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: FACS 0x000000007FFE0000 000040
[    0.000000] ACPI: APIC 0x000000007FFE127F 000080 (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007ffd8fff]
[    0.000000] NODE_DATA(0) allocated [mem 0x7ffc2000-0x7ffd8fff]
[    0.000000] kvm-clock: cpu 0, msr 0:7ffc0001, primary cpu clock
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: using sched offset of 1076013277 cycles
[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffd8fff]
[    0.000000]   Normal   empty
[    0.000000]   Device   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x000000007ffd8fff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffd8fff]
[    0.000000] On node 0 totalpages: 524151
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3998 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 8128 pages used for memmap
[    0.000000]   DMA32 zone: 520153 pages, LIFO batch:31
[    0.000000] Reserved but unavailable: 98 pages
[    0.000000] ACPI: PM-Timer IO Port: 0x608
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.000000] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ5 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] ACPI: IRQ10 used by override.
[    0.000000] ACPI: IRQ11 used by override.
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
[    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000effff]
[    0.000000] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff]
[    0.000000] e820: [mem 0x80000000-0xfeffbfff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on KVM
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
[    0.000000] percpu: Embedded 44 pages/cpu @        (ptrval) s142296 r8192 d29736 u1048576
[    0.000000] pcpu-alloc: s142296 r8192 d29736 u1048576 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0 1 
[    0.000000] KVM setup async PF for cpu 0
[    0.000000] kvm-stealtime: cpu 0, msr 7fc122c0
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 515938
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
[    0.000000] Memory: 2037056K/2096604K available (12300K kernel code, 1554K rwdata, 3584K rodata, 1640K init, 912K bss, 59548K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[    0.000000] ftrace: allocating 36554 entries in 143 pages
[    0.001000] Hierarchical RCU implementation.
[    0.001000] 	RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
[    0.001000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
[    0.001000] NR_IRQS: 4352, nr_irqs: 440, preallocated irqs: 16
[    0.001000] 	Offload RCU callbacks from CPUs: (none).
[    0.001000] Console: colour dummy device 80x25
[    0.001000] console [ttyS0] enabled
[    0.001000] ACPI: Core revision 20171110
[    0.001000] ACPI: 1 ACPI AML tables successfully acquired and loaded
[    0.001009] APIC: Switch to symmetric I/O mode setup
[    0.001571] x2apic enabled
[    0.002003] Switched APIC routing to physical x2apic.
[    0.003538] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.004000] tsc: Detected 2496.000 MHz processor
[    0.004014] Calibrating delay loop (skipped) preset value.. 4992.00 BogoMIPS (lpj=2496000)
[    0.005014] pid_max: default: 32768 minimum: 301
[    0.006057] Security Framework initialized
[    0.006548] Yama: becoming mindful.
[    0.007019] SELinux:  Disabled at boot.
[    0.008206] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.009164] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
[    0.009816] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
[    0.010009] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
[    0.011322] mce: CPU supports 10 MCE banks
[    0.011740] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.012002] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[    0.012610] Freeing SMP alternatives memory: 36K
[    0.013467] TSC deadline timer enabled
[    0.013820] smpboot: CPU0: Intel Core Processor (Skylake) (family: 0x6, model: 0x5e, stepping: 0x3)
[    0.014000] Performance Events: unsupported p6 CPU model 94 no PMU driver, software events only.
[    0.014041] Hierarchical SRCU implementation.
[    0.015133] NMI watchdog: Perf event create on CPU 0 failed with -2
[    0.015725] NMI watchdog: Perf NMI watchdog permanently disabled
[    0.016077] smp: Bringing up secondary CPUs ...
[    0.016654] x86: Booting SMP configuration:
[    0.017005] .... node  #0, CPUs:      #1
[    0.001000] kvm-clock: cpu 1, msr 0:7ffc0041, secondary cpu clock
[    0.019051] KVM setup async PF for cpu 1
[    0.019599] kvm-stealtime: cpu 1, msr 7fd122c0
[    0.020009] smp: Brought up 1 node, 2 CPUs
[    0.020531] smpboot: Max logical packages: 2
[    0.021009] smpboot: Total of 2 processors activated (9984.00 BogoMIPS)
[    0.023160] devtmpfs: initialized
[    0.023513] x86/mm: Memory block size: 128MB
[    0.024811] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.025015] futex hash table entries: 512 (order: 3, 32768 bytes)
[    0.026185] RTC time:  0:42:06, date: 12/16/17
[    0.026790] NET: Registered protocol family 16
[    0.027204] audit: initializing netlink subsys (disabled)
[    0.027914] audit: type=2000 audit(1513384927.133:1): state=initialized audit_enabled=0 res=1
[    0.028185] cpuidle: using governor menu
[    0.029118] ACPI: bus type PCI registered
[    0.029872] PCI: Using configuration type 1 for base access
[    0.034355] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.035011] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.036066] cryptd: max_cpu_qlen set to 1000
[    0.036579] ACPI: Added _OSI(Module Device)
[    0.037007] ACPI: Added _OSI(Processor Device)
[    0.037426] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.037857] ACPI: Added _OSI(Processor Aggregator Device)
[    0.041356] ACPI: Interpreter enabled
[    0.041764] ACPI: (supports S0 S5)
[    0.042005] ACPI: Using IOAPIC for interrupt routing
[    0.042655] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.043625] ACPI: Enabled 2 GPEs in block 00 to 0F
[    0.059248] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    0.059953] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI]
[    0.060045] acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
[    0.061180] PCI host bridge to bus 0000:00
[    0.061874] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    0.062013] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    0.063016] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    0.064015] pci_bus 0000:00: root bus resource [mem 0x80000000-0xfebfffff window]
[    0.065014] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.065753] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
[    0.066487] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
[    0.067537] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
[    0.071700] pci 0000:00:01.1: reg 0x20: [io  0xc100-0xc10f]
[    0.074032] pci 0000:00:01.1: legacy IDE quirk: reg 0x10: [io  0x01f0-0x01f7]
[    0.074908] pci 0000:00:01.1: legacy IDE quirk: reg 0x14: [io  0x03f6]
[    0.075011] pci 0000:00:01.1: legacy IDE quirk: reg 0x18: [io  0x0170-0x0177]
[    0.076010] pci 0000:00:01.1: legacy IDE quirk: reg 0x1c: [io  0x0376]
[    0.077121] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000
[    0.078148] pci 0000:00:01.3: quirk: [io  0x0600-0x063f] claimed by PIIX4 ACPI
[    0.079014] pci 0000:00:01.3: quirk: [io  0x0700-0x070f] claimed by PIIX4 SMB
[    0.080224] pci 0000:00:03.0: [1af4:1000] type 00 class 0x020000
[    0.082007] pci 0000:00:03.0: reg 0x10: [io  0xc040-0xc05f]
[    0.083814] pci 0000:00:03.0: reg 0x14: [mem 0xfebc0000-0xfebc0fff]
[    0.089891] pci 0000:00:03.0: reg 0x30: [mem 0xfeb80000-0xfebbffff pref]
[    0.090545] pci 0000:00:05.0: [1af4:1003] type 00 class 0x078000
[    0.092708] pci 0000:00:05.0: reg 0x10: [io  0xc060-0xc07f]
[    0.094009] pci 0000:00:05.0: reg 0x14: [mem 0xfebc1000-0xfebc1fff]
[    0.102484] pci 0000:00:06.0: [8086:2934] type 00 class 0x0c0300
[    0.108028] pci 0000:00:06.0: reg 0x20: [io  0xc080-0xc09f]
[    0.110738] pci 0000:00:06.1: [8086:2935] type 00 class 0x0c0300
[    0.114388] pci 0000:00:06.1: reg 0x20: [io  0xc0a0-0xc0bf]
[    0.117339] pci 0000:00:06.2: [8086:2936] type 00 class 0x0c0300
[    0.122770] pci 0000:00:06.2: reg 0x20: [io  0xc0c0-0xc0df]
[    0.124738] pci 0000:00:06.7: [8086:293a] type 00 class 0x0c0320
[    0.125825] pci 0000:00:06.7: reg 0x10: [mem 0xfebc2000-0xfebc2fff]
[    0.130347] pci 0000:00:07.0: [1af4:1001] type 00 class 0x010000
[    0.133007] pci 0000:00:07.0: reg 0x10: [io  0xc000-0xc03f]
[    0.134793] pci 0000:00:07.0: reg 0x14: [mem 0xfebc3000-0xfebc3fff]
[    0.141808] pci 0000:00:08.0: [1af4:1002] type 00 class 0x00ff00
[    0.142914] pci 0000:00:08.0: reg 0x10: [io  0xc0e0-0xc0ff]
[    0.148977] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
[    0.149455] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
[    0.150390] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
[    0.151382] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
[    0.152380] ACPI: PCI Interrupt Link [LNKS] (IRQs *9)
[    0.154508] vgaarb: loaded
[    0.155271] SCSI subsystem initialized
[    0.155887] EDAC MC: Ver: 3.0.0
[    0.156255] PCI: Using ACPI for IRQ routing
[    0.156566] PCI: pci_cache_line_size set to 64 bytes
[    0.157161] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
[    0.157914] e820: reserve RAM buffer [mem 0x7ffd9000-0x7fffffff]
[    0.158253] NetLabel: Initializing
[    0.158765] NetLabel:  domain hash size = 128
[    0.159005] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[    0.159775] NetLabel:  unlabeled traffic allowed by default
[    0.160073] clocksource: Switched to clocksource kvm-clock
[    0.186764] VFS: Disk quotas dquot_6.6.0
[    0.187277] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.188251] FS-Cache: Loaded
[    0.188725] pnp: PnP ACPI init
[    0.189271] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
[    0.190231] pnp 00:01: Plug and Play ACPI device, IDs PNP0303 (active)
[    0.191229] pnp 00:02: Plug and Play ACPI device, IDs PNP0f13 (active)
[    0.192096] pnp 00:03: [dma 2]
[    0.192514] pnp 00:03: Plug and Play ACPI device, IDs PNP0700 (active)
[    0.193631] pnp 00:04: Plug and Play ACPI device, IDs PNP0501 (active)
[    0.195416] pnp: PnP ACPI: found 5 devices
[    0.206055] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.207127] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7 window]
[    0.207832] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff window]
[    0.208594] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[    0.209469] pci_bus 0000:00: resource 7 [mem 0x80000000-0xfebfffff window]
[    0.210493] NET: Registered protocol family 2
[    0.211244] tcp_listen_portaddr_hash hash table entries: 1024 (order: 2, 16384 bytes)
[    0.212283] TCP established hash table entries: 16384 (order: 5, 131072 bytes)
[    0.213285] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
[    0.214306] TCP: Hash tables configured (established 16384 bind 16384)
[    0.215065] UDP hash table entries: 1024 (order: 3, 32768 bytes)
[    0.215797] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
[    0.217934] NET: Registered protocol family 1
[    0.219126] RPC: Registered named UNIX socket transport module.
[    0.219676] RPC: Registered udp transport module.
[    0.220130] RPC: Registered tcp transport module.
[    0.220552] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.221153] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    0.221701] pci 0000:00:01.0: PIIX3: Enabling Passive Release
[    0.222319] pci 0000:00:01.0: Activating ISA DMA hang workarounds
[    0.444214] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[    0.880141] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
[    1.311493] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[    1.748829] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
[    1.962124] PCI: CLS 0 bytes, default 64
[    1.964749] Initialise system trusted keyrings
[    1.965289] workingset: timestamp_bits=37 max_order=19 bucket_order=0
[    1.969600] zbud: loaded
[    1.971287] SGI XFS with security attributes, no debug enabled
[    2.106071] NET: Registered protocol family 38
[    2.106556] Key type asymmetric registered
[    2.106931] Asymmetric key parser 'x509' registered
[    2.107514] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
[    2.108327] io scheduler noop registered
[    2.108813] io scheduler deadline registered
[    2.109608] io scheduler cfq registered (default)
[    2.110258] io scheduler mq-deadline registered
[    2.110796] io scheduler kyber registered
[    2.111688] intel_idle: Please enable MWAIT in BIOS SETUP
[    2.112310] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[    2.113037] ACPI: Power Button [PWRF]
[    2.331642] virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver
[    2.554093] virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver
[    2.775938] virtio-pci 0000:00:07.0: virtio_pci: leaving for legacy driver
[    2.975053] tsc: Refined TSC clocksource calibration: 2495.981 MHz
[    2.975641] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa6529869, max_idle_ns: 440795218057 ns
[    3.029409] virtio-pci 0000:00:08.0: virtio_pci: leaving for legacy driver
[    3.032925] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
[    3.056849] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[    3.064748] Non-volatile memory driver v1.3
[    3.065925] ppdev: user-space parallel port driver
[    3.071816] loop: module loaded
[    3.075337]  vda: vda1 vda2 vda3
[    3.076659] Rounding down aligned max_sectors from 4294967295 to 4294967288
[    3.077996] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[    3.079790] libphy: Fixed MDIO Bus: probed
[    3.080257] tun: Universal TUN/TAP device driver, 1.6
[    3.082222] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[    3.083675] serio: i8042 KBD port at 0x60,0x64 irq 1
[    3.084160] serio: i8042 AUX port at 0x60,0x64 irq 12
[    3.084816] mousedev: PS/2 mouse device common for all mice
[    3.086603] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
[    3.089192] rtc_cmos 00:00: RTC can wake from S4
[    3.090116] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
[    3.092829] rtc_cmos 00:00: alarms up to one day, y3k, 114 bytes nvram
[    3.093937] IR NEC protocol handler initialized
[    3.094408] IR RC5(x/sz) protocol handler initialized
[    3.094901] IR RC6 protocol handler initialized
[    3.095510] IR JVC protocol handler initialized
[    3.095952] IR Sony protocol handler initialized
[    3.096399] IR SANYO protocol handler initialized
[    3.096862] IR Sharp protocol handler initialized
[    3.097342] IR MCE Keyboard/mouse protocol handler initialized
[    3.097919] IR XMP protocol handler initialized
[    3.098530] device-mapper: uevent: version 1.0.3
[    3.099209] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: dm-devel@redhat.com
[    3.100398] device-mapper: multipath round-robin: version 1.2.0 loaded
[    3.101883] drop_monitor: Initializing network drop monitor service
[    3.102553] Netfilter messages via NETLINK v0.30.
[    3.103090] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[    3.103738] ctnetlink v0.93: registering with nfnetlink.
[    3.104494] ip_tables: (C) 2000-2006 Netfilter Core Team
[    3.105734] Initializing XFRM netlink socket
[    3.106885] NET: Registered protocol family 10
[    3.109341] Segment Routing with IPv6
[    3.109976] mip6: Mobile IPv6
[    3.111987] ip6_tables: (C) 2000-2006 Netfilter Core Team
[    3.114230] NET: Registered protocol family 17
[    3.115047] Bridge firewalling registered
[    3.115824] Ebtables v2.0 registered
[    3.117996] 8021q: 802.1Q VLAN Support v1.8
[    3.119429] AVX2 version of gcm_enc/dec engaged.
[    3.119886] AES CTR mode by8 optimization enabled
[    3.128818] sched_clock: Marking stable (3128714579, 0)->(3404180881, -275466302)
[    3.129945] registered taskstats version 1
[    3.130427] Loading compiled-in X.509 certificates
[    3.163216] Loaded X.509 cert 'Build time autogenerated kernel key: 38e0adea1af8bd8a23b02436d4acf2f8c7408d23'
[    3.166359] zswap: loaded using pool lzo/zbud
[    3.167943] Key type big_key registered
[    3.168778]   Magic number: 13:918:708
[    3.169255] rtc_cmos 00:00: setting system clock to 2017-12-16 00:42:09 UTC (1513384929)
[    3.170604] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
[    3.171932] EXT4-fs (vda2): couldn't mount as ext3 due to feature incompatibilities
[    3.173871] EXT4-fs (vda2): couldn't mount as ext2 due to feature incompatibilities
[    3.175306] EXT4-fs (vda2): INFO: recovery required on readonly filesystem
[    3.176212] EXT4-fs (vda2): write access will be enabled during recovery
[    3.397187] EXT4-fs (vda2): orphan cleanup on readonly fs
[    3.399412] EXT4-fs (vda2): 5 orphan inodes deleted
[    3.402759] EXT4-fs (vda2): recovery complete
[    3.466647] EXT4-fs (vda2): mounted filesystem with ordered data mode. Opts: (null)
[    3.469401] VFS: Mounted root (ext4 filesystem) readonly on device 253:2.
[    3.473719] devtmpfs: mounted
[    3.492549] Freeing unused kernel memory: 1640K
[    3.494547] Write protecting the kernel read-only data: 18432k
[    3.498781] Freeing unused kernel memory: 2016K
[    3.503330] Freeing unused kernel memory: 512K
[    3.505232] rodata_test: all tests were successful
[    3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
[    3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
[    3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
[    3.532127] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
[    3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[    3.546081] Call Trace:
[    3.547221]  dump_stack+0x5c/0x79
[    3.548768]  ? rest_init+0x30/0xb0
[    3.550320]  panic+0xe4/0x232
[    3.551669]  ? rest_init+0xb0/0xb0
[    3.553110]  kernel_init+0xeb/0x100
[    3.554701]  ret_from_fork+0x1f/0x30
[    3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    3.564160] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-12-16  0:49   ` [2/2] " Andrei Vagin
@ 2017-12-18  9:13     ` Michal Hocko
  2017-12-18 18:12       ` Andrei Vagin
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-12-18  9:13 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: linux-api, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook

On Fri 15-12-17 16:49:28, Andrei Vagin wrote:
> Hi Michal,
> 
> We run CRIU tests for linux-next and the 4.15.0-rc3-next-20171215 kernel
> doesn't boot:
> 
> [    3.492549] Freeing unused kernel memory: 1640K
> [    3.494547] Write protecting the kernel read-only data: 18432k
> [    3.498781] Freeing unused kernel memory: 2016K
> [    3.503330] Freeing unused kernel memory: 512K
> [    3.505232] rodata_test: all tests were successful
> [    3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already

Hmm, this interesting. What does the test actualy do? Could you add some
instrumentation to see what is actually mapped there? Something like

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 0e50230ce53d..1b68ddc34043 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -372,10 +372,28 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 	} else
 		map_addr = vm_mmap(filep, addr, size, prot, type, off);
 
-	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr)) {
+		struct vm_area_struct *vma;
+
 		pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
 				task_pid_nr(current), current->comm,
 				(void *)addr);
+		vma = find_vma(current->mm, map_addr);
+		if (vma && vma->vm_start < addr) {
+			pr_info("requested [%lx, %lx] mapped [%lx, %lx] %lx ", addr, addr + total_size,
+					vma->vm_start, vma->vm_end, vma->vm_flags);
+			if (!vma->vm_file) {
+				pr_cont("anon\n");
+			} else {
+				char path[512];
+				char *p = file_path(vma->vm_file, path, sizeof(path));
+				if (IS_ERR(p))
+					p = "?";
+				pr_cont("\"%s\"\n", kbasename(p));
+			}
+			dump_stack();
+		}
+	}
 
 	return(map_addr);
 }

> [    3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> [    3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> [    3.532127] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> [    3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> [    3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [    3.546081] Call Trace:
> [    3.547221]  dump_stack+0x5c/0x79
> [    3.548768]  ? rest_init+0x30/0xb0
> [    3.550320]  panic+0xe4/0x232
> [    3.551669]  ? rest_init+0xb0/0xb0
> [    3.553110]  kernel_init+0xeb/0x100
> [    3.554701]  ret_from_fork+0x1f/0x30
> [    3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [    3.564160] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> 
> If I revert this patch, it boots normally.
> 
> Thanks,
> Andrei
> 
> On Wed, Dec 13, 2017 at 10:25:50AM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> > on a controlled address and they use MAP_FIXED to enforce that. This is
> > however dangerous thing prone to silent data corruption which can be
> > even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> > (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> > ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> > the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> > we could end up mapping over the existing stack with some luck.
> > 
> > The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> > fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> > further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> > revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> > stack consumption early during execve fully stopped by da029c11e6b1
> > ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> > safe and any attack should be impractical. On the other hand this is
> > just too subtle assumption so it can break quite easily and hard to
> > spot.
> > 
> > I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> > fundamentally dangerous. Moreover it shouldn't be even needed. We are
> > at the early process stage and so there shouldn't be unrelated mappings
> > (except for stack and loader) existing so mmap for a given address
> > should succeed even without MAP_FIXED. Something is terribly wrong if
> > this is not the case and we should rather fail than silently corrupt the
> > underlying mapping.
> > 
> > Address this issue by changing MAP_FIXED to the newly added
> > MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> > existing mapping clashing with the requested one without clobbering it.
> > 
> > Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> > Cc: Joel Stanley <joel@jms.id.au>
> > Acked-by: Kees Cook <keescook@chromium.org>
> > Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  arch/metag/kernel/process.c |  6 +++++-
> >  fs/binfmt_elf.c             | 12 ++++++++----
> >  2 files changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> > index 0909834c83a7..867c8d0a5fb4 100644
> > --- a/arch/metag/kernel/process.c
> > +++ b/arch/metag/kernel/process.c
> > @@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> >  	tcm_tag = tcm_lookup_tag(addr);
> >  
> >  	if (tcm_tag != TCM_INVALID_TAG)
> > -		type &= ~MAP_FIXED;
> > +		type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
> >  
> >  	/*
> >  	* total_size is the size of the ELF (interpreter) image.
> > @@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> >  	} else
> >  		map_addr = vm_mmap(filep, addr, size, prot, type, off);
> >  
> > +	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > +		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > +				task_pid_nr(current), tsk->comm, (void*)addr);
> > +
> >  	if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
> >  		struct tcm_allocation *tcm;
> >  		unsigned long tcm_addr;
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 73b01e474fdc..5916d45f64a7 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> >  	} else
> >  		map_addr = vm_mmap(filep, addr, size, prot, type, off);
> >  
> > +	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > +		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > +				task_pid_nr(current), current->comm, (void*)addr);
> > +
> >  	return(map_addr);
> >  }
> >  
> > @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> >  				elf_prot |= PROT_EXEC;
> >  			vaddr = eppnt->p_vaddr;
> >  			if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> > -				elf_type |= MAP_FIXED;
> > +				elf_type |= MAP_FIXED_SAFE;
> >  			else if (no_base && interp_elf_ex->e_type == ET_DYN)
> >  				load_addr = -vaddr;
> >  
> > @@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> >  		 * the ET_DYN load_addr calculations, proceed normally.
> >  		 */
> >  		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> > -			elf_flags |= MAP_FIXED;
> > +			elf_flags |= MAP_FIXED_SAFE;
> >  		} else if (loc->elf_ex.e_type == ET_DYN) {
> >  			/*
> >  			 * This logic is run once for the first LOAD Program
> > @@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> >  				load_bias = ELF_ET_DYN_BASE;
> >  				if (current->flags & PF_RANDOMIZE)
> >  					load_bias += arch_mmap_rnd();
> > -				elf_flags |= MAP_FIXED;
> > +				elf_flags |= MAP_FIXED_SAFE;
> >  			} else
> >  				load_bias = 0;
> >  
> > @@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
> >  			(eppnt->p_filesz +
> >  			 ELF_PAGEOFFSET(eppnt->p_vaddr)),
> >  			PROT_READ | PROT_WRITE | PROT_EXEC,
> > -			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> > +			MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
> >  			(eppnt->p_offset -
> >  			 ELF_PAGEOFFSET(eppnt->p_vaddr)));
> >  	if (error != ELF_PAGESTART(eppnt->p_vaddr))


> [    0.000000] Linux version 4.15.0-rc3-next-20171215-00001-g6d6aea478fce (avagin@laptop) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #11 SMP Fri Dec 15 16:39:11 PST 2017
> [    0.000000] Command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> [    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> [    0.000000] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
> [    0.000000] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
> [    0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
> [    0.000000] e820: BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffd8fff] usable
> [    0.000000] BIOS-e820: [mem 0x000000007ffd9000-0x000000007fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] random: fast init done
> [    0.000000] SMBIOS 2.8 present.
> [    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [    0.000000] Hypervisor detected: KVM
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] e820: last_pfn = 0x7ffd9 max_arch_pfn = 0x400000000
> [    0.000000] MTRR default type: write-back
> [    0.000000] MTRR fixed ranges enabled:
> [    0.000000]   00000-9FFFF write-back
> [    0.000000]   A0000-BFFFF uncachable
> [    0.000000]   C0000-FFFFF write-protect
> [    0.000000] MTRR variable ranges enabled:
> [    0.000000]   0 base 0080000000 mask FF80000000 uncachable
> [    0.000000]   1 disabled
> [    0.000000]   2 disabled
> [    0.000000]   3 disabled
> [    0.000000]   4 disabled
> [    0.000000]   5 disabled
> [    0.000000]   6 disabled
> [    0.000000]   7 disabled
> [    0.000000] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
> [    0.000000] found SMP MP-table at [mem 0x000f6bd0-0x000f6bdf] mapped at [        (ptrval)]
> [    0.000000] Base memory trampoline at [        (ptrval)] 99000 size 24576
> [    0.000000] Using GB pages for direct mapping
> [    0.000000] BRK [0x2c984000, 0x2c984fff] PGTABLE
> [    0.000000] BRK [0x2c985000, 0x2c985fff] PGTABLE
> [    0.000000] BRK [0x2c986000, 0x2c986fff] PGTABLE
> [    0.000000] BRK [0x2c987000, 0x2c987fff] PGTABLE
> [    0.000000] BRK [0x2c988000, 0x2c988fff] PGTABLE
> [    0.000000] BRK [0x2c989000, 0x2c989fff] PGTABLE
> [    0.000000] ACPI: Early table checksum verification disabled
> [    0.000000] ACPI: RSDP 0x00000000000F69C0 000014 (v00 BOCHS )
> [    0.000000] ACPI: RSDT 0x000000007FFE12FF 00002C (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
> [    0.000000] ACPI: FACP 0x000000007FFE120B 000074 (v01 BOCHS  BXPCFACP 00000001 BXPC 00000001)
> [    0.000000] ACPI: DSDT 0x000000007FFE0040 0011CB (v01 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
> [    0.000000] ACPI: FACS 0x000000007FFE0000 000040
> [    0.000000] ACPI: APIC 0x000000007FFE127F 000080 (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
> [    0.000000] ACPI: Local APIC address 0xfee00000
> [    0.000000] No NUMA configuration found
> [    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007ffd8fff]
> [    0.000000] NODE_DATA(0) allocated [mem 0x7ffc2000-0x7ffd8fff]
> [    0.000000] kvm-clock: cpu 0, msr 0:7ffc0001, primary cpu clock
> [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [    0.000000] kvm-clock: using sched offset of 1076013277 cycles
> [    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffd8fff]
> [    0.000000]   Normal   empty
> [    0.000000]   Device   empty
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> [    0.000000]   node   0: [mem 0x0000000000100000-0x000000007ffd8fff]
> [    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffd8fff]
> [    0.000000] On node 0 totalpages: 524151
> [    0.000000]   DMA zone: 64 pages used for memmap
> [    0.000000]   DMA zone: 21 pages reserved
> [    0.000000]   DMA zone: 3998 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 8128 pages used for memmap
> [    0.000000]   DMA32 zone: 520153 pages, LIFO batch:31
> [    0.000000] Reserved but unavailable: 98 pages
> [    0.000000] ACPI: PM-Timer IO Port: 0x608
> [    0.000000] ACPI: Local APIC address 0xfee00000
> [    0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
> [    0.000000] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
> [    0.000000] ACPI: IRQ0 used by override.
> [    0.000000] ACPI: IRQ5 used by override.
> [    0.000000] ACPI: IRQ9 used by override.
> [    0.000000] ACPI: IRQ10 used by override.
> [    0.000000] ACPI: IRQ11 used by override.
> [    0.000000] Using ACPI (MADT) for SMP configuration information
> [    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
> [    0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
> [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000effff]
> [    0.000000] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff]
> [    0.000000] e820: [mem 0x80000000-0xfeffbfff] available for PCI devices
> [    0.000000] Booting paravirtualized kernel on KVM
> [    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
> [    0.000000] percpu: Embedded 44 pages/cpu @        (ptrval) s142296 r8192 d29736 u1048576
> [    0.000000] pcpu-alloc: s142296 r8192 d29736 u1048576 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 1 
> [    0.000000] KVM setup async PF for cpu 0
> [    0.000000] kvm-stealtime: cpu 0, msr 7fc122c0
> [    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 515938
> [    0.000000] Policy zone: DMA32
> [    0.000000] Kernel command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> [    0.000000] Memory: 2037056K/2096604K available (12300K kernel code, 1554K rwdata, 3584K rodata, 1640K init, 912K bss, 59548K reserved, 0K cma-reserved)
> [    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
> [    0.000000] ftrace: allocating 36554 entries in 143 pages
> [    0.001000] Hierarchical RCU implementation.
> [    0.001000] 	RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
> [    0.001000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
> [    0.001000] NR_IRQS: 4352, nr_irqs: 440, preallocated irqs: 16
> [    0.001000] 	Offload RCU callbacks from CPUs: (none).
> [    0.001000] Console: colour dummy device 80x25
> [    0.001000] console [ttyS0] enabled
> [    0.001000] ACPI: Core revision 20171110
> [    0.001000] ACPI: 1 ACPI AML tables successfully acquired and loaded
> [    0.001009] APIC: Switch to symmetric I/O mode setup
> [    0.001571] x2apic enabled
> [    0.002003] Switched APIC routing to physical x2apic.
> [    0.003538] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.004000] tsc: Detected 2496.000 MHz processor
> [    0.004014] Calibrating delay loop (skipped) preset value.. 4992.00 BogoMIPS (lpj=2496000)
> [    0.005014] pid_max: default: 32768 minimum: 301
> [    0.006057] Security Framework initialized
> [    0.006548] Yama: becoming mindful.
> [    0.007019] SELinux:  Disabled at boot.
> [    0.008206] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [    0.009164] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> [    0.009816] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.010009] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.011322] mce: CPU supports 10 MCE banks
> [    0.011740] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> [    0.012002] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> [    0.012610] Freeing SMP alternatives memory: 36K
> [    0.013467] TSC deadline timer enabled
> [    0.013820] smpboot: CPU0: Intel Core Processor (Skylake) (family: 0x6, model: 0x5e, stepping: 0x3)
> [    0.014000] Performance Events: unsupported p6 CPU model 94 no PMU driver, software events only.
> [    0.014041] Hierarchical SRCU implementation.
> [    0.015133] NMI watchdog: Perf event create on CPU 0 failed with -2
> [    0.015725] NMI watchdog: Perf NMI watchdog permanently disabled
> [    0.016077] smp: Bringing up secondary CPUs ...
> [    0.016654] x86: Booting SMP configuration:
> [    0.017005] .... node  #0, CPUs:      #1
> [    0.001000] kvm-clock: cpu 1, msr 0:7ffc0041, secondary cpu clock
> [    0.019051] KVM setup async PF for cpu 1
> [    0.019599] kvm-stealtime: cpu 1, msr 7fd122c0
> [    0.020009] smp: Brought up 1 node, 2 CPUs
> [    0.020531] smpboot: Max logical packages: 2
> [    0.021009] smpboot: Total of 2 processors activated (9984.00 BogoMIPS)
> [    0.023160] devtmpfs: initialized
> [    0.023513] x86/mm: Memory block size: 128MB
> [    0.024811] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [    0.025015] futex hash table entries: 512 (order: 3, 32768 bytes)
> [    0.026185] RTC time:  0:42:06, date: 12/16/17
> [    0.026790] NET: Registered protocol family 16
> [    0.027204] audit: initializing netlink subsys (disabled)
> [    0.027914] audit: type=2000 audit(1513384927.133:1): state=initialized audit_enabled=0 res=1
> [    0.028185] cpuidle: using governor menu
> [    0.029118] ACPI: bus type PCI registered
> [    0.029872] PCI: Using configuration type 1 for base access
> [    0.034355] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [    0.035011] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
> [    0.036066] cryptd: max_cpu_qlen set to 1000
> [    0.036579] ACPI: Added _OSI(Module Device)
> [    0.037007] ACPI: Added _OSI(Processor Device)
> [    0.037426] ACPI: Added _OSI(3.0 _SCP Extensions)
> [    0.037857] ACPI: Added _OSI(Processor Aggregator Device)
> [    0.041356] ACPI: Interpreter enabled
> [    0.041764] ACPI: (supports S0 S5)
> [    0.042005] ACPI: Using IOAPIC for interrupt routing
> [    0.042655] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> [    0.043625] ACPI: Enabled 2 GPEs in block 00 to 0F
> [    0.059248] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
> [    0.059953] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI]
> [    0.060045] acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
> [    0.061180] PCI host bridge to bus 0000:00
> [    0.061874] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> [    0.062013] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> [    0.063016] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> [    0.064015] pci_bus 0000:00: root bus resource [mem 0x80000000-0xfebfffff window]
> [    0.065014] pci_bus 0000:00: root bus resource [bus 00-ff]
> [    0.065753] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
> [    0.066487] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
> [    0.067537] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
> [    0.071700] pci 0000:00:01.1: reg 0x20: [io  0xc100-0xc10f]
> [    0.074032] pci 0000:00:01.1: legacy IDE quirk: reg 0x10: [io  0x01f0-0x01f7]
> [    0.074908] pci 0000:00:01.1: legacy IDE quirk: reg 0x14: [io  0x03f6]
> [    0.075011] pci 0000:00:01.1: legacy IDE quirk: reg 0x18: [io  0x0170-0x0177]
> [    0.076010] pci 0000:00:01.1: legacy IDE quirk: reg 0x1c: [io  0x0376]
> [    0.077121] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000
> [    0.078148] pci 0000:00:01.3: quirk: [io  0x0600-0x063f] claimed by PIIX4 ACPI
> [    0.079014] pci 0000:00:01.3: quirk: [io  0x0700-0x070f] claimed by PIIX4 SMB
> [    0.080224] pci 0000:00:03.0: [1af4:1000] type 00 class 0x020000
> [    0.082007] pci 0000:00:03.0: reg 0x10: [io  0xc040-0xc05f]
> [    0.083814] pci 0000:00:03.0: reg 0x14: [mem 0xfebc0000-0xfebc0fff]
> [    0.089891] pci 0000:00:03.0: reg 0x30: [mem 0xfeb80000-0xfebbffff pref]
> [    0.090545] pci 0000:00:05.0: [1af4:1003] type 00 class 0x078000
> [    0.092708] pci 0000:00:05.0: reg 0x10: [io  0xc060-0xc07f]
> [    0.094009] pci 0000:00:05.0: reg 0x14: [mem 0xfebc1000-0xfebc1fff]
> [    0.102484] pci 0000:00:06.0: [8086:2934] type 00 class 0x0c0300
> [    0.108028] pci 0000:00:06.0: reg 0x20: [io  0xc080-0xc09f]
> [    0.110738] pci 0000:00:06.1: [8086:2935] type 00 class 0x0c0300
> [    0.114388] pci 0000:00:06.1: reg 0x20: [io  0xc0a0-0xc0bf]
> [    0.117339] pci 0000:00:06.2: [8086:2936] type 00 class 0x0c0300
> [    0.122770] pci 0000:00:06.2: reg 0x20: [io  0xc0c0-0xc0df]
> [    0.124738] pci 0000:00:06.7: [8086:293a] type 00 class 0x0c0320
> [    0.125825] pci 0000:00:06.7: reg 0x10: [mem 0xfebc2000-0xfebc2fff]
> [    0.130347] pci 0000:00:07.0: [1af4:1001] type 00 class 0x010000
> [    0.133007] pci 0000:00:07.0: reg 0x10: [io  0xc000-0xc03f]
> [    0.134793] pci 0000:00:07.0: reg 0x14: [mem 0xfebc3000-0xfebc3fff]
> [    0.141808] pci 0000:00:08.0: [1af4:1002] type 00 class 0x00ff00
> [    0.142914] pci 0000:00:08.0: reg 0x10: [io  0xc0e0-0xc0ff]
> [    0.148977] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
> [    0.149455] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
> [    0.150390] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
> [    0.151382] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
> [    0.152380] ACPI: PCI Interrupt Link [LNKS] (IRQs *9)
> [    0.154508] vgaarb: loaded
> [    0.155271] SCSI subsystem initialized
> [    0.155887] EDAC MC: Ver: 3.0.0
> [    0.156255] PCI: Using ACPI for IRQ routing
> [    0.156566] PCI: pci_cache_line_size set to 64 bytes
> [    0.157161] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
> [    0.157914] e820: reserve RAM buffer [mem 0x7ffd9000-0x7fffffff]
> [    0.158253] NetLabel: Initializing
> [    0.158765] NetLabel:  domain hash size = 128
> [    0.159005] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
> [    0.159775] NetLabel:  unlabeled traffic allowed by default
> [    0.160073] clocksource: Switched to clocksource kvm-clock
> [    0.186764] VFS: Disk quotas dquot_6.6.0
> [    0.187277] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> [    0.188251] FS-Cache: Loaded
> [    0.188725] pnp: PnP ACPI init
> [    0.189271] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
> [    0.190231] pnp 00:01: Plug and Play ACPI device, IDs PNP0303 (active)
> [    0.191229] pnp 00:02: Plug and Play ACPI device, IDs PNP0f13 (active)
> [    0.192096] pnp 00:03: [dma 2]
> [    0.192514] pnp 00:03: Plug and Play ACPI device, IDs PNP0700 (active)
> [    0.193631] pnp 00:04: Plug and Play ACPI device, IDs PNP0501 (active)
> [    0.195416] pnp: PnP ACPI: found 5 devices
> [    0.206055] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [    0.207127] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7 window]
> [    0.207832] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff window]
> [    0.208594] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
> [    0.209469] pci_bus 0000:00: resource 7 [mem 0x80000000-0xfebfffff window]
> [    0.210493] NET: Registered protocol family 2
> [    0.211244] tcp_listen_portaddr_hash hash table entries: 1024 (order: 2, 16384 bytes)
> [    0.212283] TCP established hash table entries: 16384 (order: 5, 131072 bytes)
> [    0.213285] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
> [    0.214306] TCP: Hash tables configured (established 16384 bind 16384)
> [    0.215065] UDP hash table entries: 1024 (order: 3, 32768 bytes)
> [    0.215797] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
> [    0.217934] NET: Registered protocol family 1
> [    0.219126] RPC: Registered named UNIX socket transport module.
> [    0.219676] RPC: Registered udp transport module.
> [    0.220130] RPC: Registered tcp transport module.
> [    0.220552] RPC: Registered tcp NFSv4.1 backchannel transport module.
> [    0.221153] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
> [    0.221701] pci 0000:00:01.0: PIIX3: Enabling Passive Release
> [    0.222319] pci 0000:00:01.0: Activating ISA DMA hang workarounds
> [    0.444214] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
> [    0.880141] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
> [    1.311493] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
> [    1.748829] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
> [    1.962124] PCI: CLS 0 bytes, default 64
> [    1.964749] Initialise system trusted keyrings
> [    1.965289] workingset: timestamp_bits=37 max_order=19 bucket_order=0
> [    1.969600] zbud: loaded
> [    1.971287] SGI XFS with security attributes, no debug enabled
> [    2.106071] NET: Registered protocol family 38
> [    2.106556] Key type asymmetric registered
> [    2.106931] Asymmetric key parser 'x509' registered
> [    2.107514] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
> [    2.108327] io scheduler noop registered
> [    2.108813] io scheduler deadline registered
> [    2.109608] io scheduler cfq registered (default)
> [    2.110258] io scheduler mq-deadline registered
> [    2.110796] io scheduler kyber registered
> [    2.111688] intel_idle: Please enable MWAIT in BIOS SETUP
> [    2.112310] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> [    2.113037] ACPI: Power Button [PWRF]
> [    2.331642] virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver
> [    2.554093] virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver
> [    2.775938] virtio-pci 0000:00:07.0: virtio_pci: leaving for legacy driver
> [    2.975053] tsc: Refined TSC clocksource calibration: 2495.981 MHz
> [    2.975641] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa6529869, max_idle_ns: 440795218057 ns
> [    3.029409] virtio-pci 0000:00:08.0: virtio_pci: leaving for legacy driver
> [    3.032925] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
> [    3.056849] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> [    3.064748] Non-volatile memory driver v1.3
> [    3.065925] ppdev: user-space parallel port driver
> [    3.071816] loop: module loaded
> [    3.075337]  vda: vda1 vda2 vda3
> [    3.076659] Rounding down aligned max_sectors from 4294967295 to 4294967288
> [    3.077996] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
> [    3.079790] libphy: Fixed MDIO Bus: probed
> [    3.080257] tun: Universal TUN/TAP device driver, 1.6
> [    3.082222] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
> [    3.083675] serio: i8042 KBD port at 0x60,0x64 irq 1
> [    3.084160] serio: i8042 AUX port at 0x60,0x64 irq 12
> [    3.084816] mousedev: PS/2 mouse device common for all mice
> [    3.086603] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
> [    3.089192] rtc_cmos 00:00: RTC can wake from S4
> [    3.090116] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
> [    3.092829] rtc_cmos 00:00: alarms up to one day, y3k, 114 bytes nvram
> [    3.093937] IR NEC protocol handler initialized
> [    3.094408] IR RC5(x/sz) protocol handler initialized
> [    3.094901] IR RC6 protocol handler initialized
> [    3.095510] IR JVC protocol handler initialized
> [    3.095952] IR Sony protocol handler initialized
> [    3.096399] IR SANYO protocol handler initialized
> [    3.096862] IR Sharp protocol handler initialized
> [    3.097342] IR MCE Keyboard/mouse protocol handler initialized
> [    3.097919] IR XMP protocol handler initialized
> [    3.098530] device-mapper: uevent: version 1.0.3
> [    3.099209] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: dm-devel@redhat.com
> [    3.100398] device-mapper: multipath round-robin: version 1.2.0 loaded
> [    3.101883] drop_monitor: Initializing network drop monitor service
> [    3.102553] Netfilter messages via NETLINK v0.30.
> [    3.103090] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
> [    3.103738] ctnetlink v0.93: registering with nfnetlink.
> [    3.104494] ip_tables: (C) 2000-2006 Netfilter Core Team
> [    3.105734] Initializing XFRM netlink socket
> [    3.106885] NET: Registered protocol family 10
> [    3.109341] Segment Routing with IPv6
> [    3.109976] mip6: Mobile IPv6
> [    3.111987] ip6_tables: (C) 2000-2006 Netfilter Core Team
> [    3.114230] NET: Registered protocol family 17
> [    3.115047] Bridge firewalling registered
> [    3.115824] Ebtables v2.0 registered
> [    3.117996] 8021q: 802.1Q VLAN Support v1.8
> [    3.119429] AVX2 version of gcm_enc/dec engaged.
> [    3.119886] AES CTR mode by8 optimization enabled
> [    3.128818] sched_clock: Marking stable (3128714579, 0)->(3404180881, -275466302)
> [    3.129945] registered taskstats version 1
> [    3.130427] Loading compiled-in X.509 certificates
> [    3.163216] Loaded X.509 cert 'Build time autogenerated kernel key: 38e0adea1af8bd8a23b02436d4acf2f8c7408d23'
> [    3.166359] zswap: loaded using pool lzo/zbud
> [    3.167943] Key type big_key registered
> [    3.168778]   Magic number: 13:918:708
> [    3.169255] rtc_cmos 00:00: setting system clock to 2017-12-16 00:42:09 UTC (1513384929)
> [    3.170604] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
> [    3.171932] EXT4-fs (vda2): couldn't mount as ext3 due to feature incompatibilities
> [    3.173871] EXT4-fs (vda2): couldn't mount as ext2 due to feature incompatibilities
> [    3.175306] EXT4-fs (vda2): INFO: recovery required on readonly filesystem
> [    3.176212] EXT4-fs (vda2): write access will be enabled during recovery
> [    3.397187] EXT4-fs (vda2): orphan cleanup on readonly fs
> [    3.399412] EXT4-fs (vda2): 5 orphan inodes deleted
> [    3.402759] EXT4-fs (vda2): recovery complete
> [    3.466647] EXT4-fs (vda2): mounted filesystem with ordered data mode. Opts: (null)
> [    3.469401] VFS: Mounted root (ext4 filesystem) readonly on device 253:2.
> [    3.473719] devtmpfs: mounted
> [    3.492549] Freeing unused kernel memory: 1640K
> [    3.494547] Write protecting the kernel read-only data: 18432k
> [    3.498781] Freeing unused kernel memory: 2016K
> [    3.503330] Freeing unused kernel memory: 512K
> [    3.505232] rodata_test: all tests were successful
> [    3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
> [    3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> [    3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> [    3.532127] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> [    3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> [    3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> [    3.546081] Call Trace:
> [    3.547221]  dump_stack+0x5c/0x79
> [    3.548768]  ? rest_init+0x30/0xb0
> [    3.550320]  panic+0xe4/0x232
> [    3.551669]  ? rest_init+0xb0/0xb0
> [    3.553110]  kernel_init+0xeb/0x100
> [    3.554701]  ret_from_fork+0x1f/0x30
> [    3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [    3.564160] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.


-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-12-18  9:13     ` Michal Hocko
@ 2017-12-18 18:12       ` Andrei Vagin
  0 siblings, 0 replies; 48+ messages in thread
From: Andrei Vagin @ 2017-12-18 18:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook

On Mon, Dec 18, 2017 at 10:13:02AM +0100, Michal Hocko wrote:
> On Fri 15-12-17 16:49:28, Andrei Vagin wrote:
> > Hi Michal,
> > 
> > We run CRIU tests for linux-next and the 4.15.0-rc3-next-20171215 kernel
> > doesn't boot:
> > 
> > [    3.492549] Freeing unused kernel memory: 1640K
> > [    3.494547] Write protecting the kernel read-only data: 18432k
> > [    3.498781] Freeing unused kernel memory: 2016K
> > [    3.503330] Freeing unused kernel memory: 512K
> > [    3.505232] rodata_test: all tests were successful
> > [    3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
> 
> Hmm, this interesting. What does the test actualy do? Could you add some
> instrumentation to see what is actually mapped there? Something like

There is nothing mapped there. It returns -95 (ENOSUPP)

The kernel is booted with this patch:

+       int ttype = type & ~MAP_FIXED_SAFE;
        if (total_size) {
                total_size = ELF_PAGEALIGN(total_size);
-               map_addr = vm_mmap(filep, addr, total_size, prot, type,
                off);
+               map_addr = vm_mmap(filep, addr, total_size, prot, ttype, off);
                if (!BAD_ADDR(map_addr))
                        vm_munmap(map_addr+size, total_size-size);
        } else
-               map_addr = vm_mmap(filep, addr, size, prot, type, off);
+               map_addr = vm_mmap(filep, addr, size, prot, ttype, off);


> 
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 0e50230ce53d..1b68ddc34043 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -372,10 +372,28 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
>  	} else
>  		map_addr = vm_mmap(filep, addr, size, prot, type, off);
>  
> -	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> +	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr)) {
> +		struct vm_area_struct *vma;
> +
>  		pr_info("%d (%s): Uhuuh, elf segment at %p requested but the memory is mapped already\n",
>  				task_pid_nr(current), current->comm,
>  				(void *)addr);
> +		vma = find_vma(current->mm, map_addr);
> +		if (vma && vma->vm_start < addr) {
> +			pr_info("requested [%lx, %lx] mapped [%lx, %lx] %lx ", addr, addr + total_size,
> +					vma->vm_start, vma->vm_end, vma->vm_flags);
> +			if (!vma->vm_file) {
> +				pr_cont("anon\n");
> +			} else {
> +				char path[512];
> +				char *p = file_path(vma->vm_file, path, sizeof(path));
> +				if (IS_ERR(p))
> +					p = "?";
> +				pr_cont("\"%s\"\n", kbasename(p));
> +			}
> +			dump_stack();
> +		}
> +	}
>  
>  	return(map_addr);
>  }
> 
> > [    3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> > [    3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> > [    3.532127] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > [    3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> > [    3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > [    3.546081] Call Trace:
> > [    3.547221]  dump_stack+0x5c/0x79
> > [    3.548768]  ? rest_init+0x30/0xb0
> > [    3.550320]  panic+0xe4/0x232
> > [    3.551669]  ? rest_init+0xb0/0xb0
> > [    3.553110]  kernel_init+0xeb/0x100
> > [    3.554701]  ret_from_fork+0x1f/0x30
> > [    3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [    3.564160] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > 
> > If I revert this patch, it boots normally.
> > 
> > Thanks,
> > Andrei
> > 
> > On Wed, Dec 13, 2017 at 10:25:50AM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> > > on a controlled address and they use MAP_FIXED to enforce that. This is
> > > however dangerous thing prone to silent data corruption which can be
> > > even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> > > (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> > > ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> > > the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> > > we could end up mapping over the existing stack with some luck.
> > > 
> > > The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> > > fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> > > further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> > > revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> > > stack consumption early during execve fully stopped by da029c11e6b1
> > > ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> > > safe and any attack should be impractical. On the other hand this is
> > > just too subtle assumption so it can break quite easily and hard to
> > > spot.
> > > 
> > > I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> > > fundamentally dangerous. Moreover it shouldn't be even needed. We are
> > > at the early process stage and so there shouldn't be unrelated mappings
> > > (except for stack and loader) existing so mmap for a given address
> > > should succeed even without MAP_FIXED. Something is terribly wrong if
> > > this is not the case and we should rather fail than silently corrupt the
> > > underlying mapping.
> > > 
> > > Address this issue by changing MAP_FIXED to the newly added
> > > MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> > > existing mapping clashing with the requested one without clobbering it.
> > > 
> > > Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> > > Cc: Joel Stanley <joel@jms.id.au>
> > > Acked-by: Kees Cook <keescook@chromium.org>
> > > Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  arch/metag/kernel/process.c |  6 +++++-
> > >  fs/binfmt_elf.c             | 12 ++++++++----
> > >  2 files changed, 13 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> > > index 0909834c83a7..867c8d0a5fb4 100644
> > > --- a/arch/metag/kernel/process.c
> > > +++ b/arch/metag/kernel/process.c
> > > @@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> > >  	tcm_tag = tcm_lookup_tag(addr);
> > >  
> > >  	if (tcm_tag != TCM_INVALID_TAG)
> > > -		type &= ~MAP_FIXED;
> > > +		type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
> > >  
> > >  	/*
> > >  	* total_size is the size of the ELF (interpreter) image.
> > > @@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
> > >  	} else
> > >  		map_addr = vm_mmap(filep, addr, size, prot, type, off);
> > >  
> > > +	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > > +		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > > +				task_pid_nr(current), tsk->comm, (void*)addr);
> > > +
> > >  	if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
> > >  		struct tcm_allocation *tcm;
> > >  		unsigned long tcm_addr;
> > > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > > index 73b01e474fdc..5916d45f64a7 100644
> > > --- a/fs/binfmt_elf.c
> > > +++ b/fs/binfmt_elf.c
> > > @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> > >  	} else
> > >  		map_addr = vm_mmap(filep, addr, size, prot, type, off);
> > >  
> > > +	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> > > +		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> > > +				task_pid_nr(current), current->comm, (void*)addr);
> > > +
> > >  	return(map_addr);
> > >  }
> > >  
> > > @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> > >  				elf_prot |= PROT_EXEC;
> > >  			vaddr = eppnt->p_vaddr;
> > >  			if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> > > -				elf_type |= MAP_FIXED;
> > > +				elf_type |= MAP_FIXED_SAFE;
> > >  			else if (no_base && interp_elf_ex->e_type == ET_DYN)
> > >  				load_addr = -vaddr;
> > >  
> > > @@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > >  		 * the ET_DYN load_addr calculations, proceed normally.
> > >  		 */
> > >  		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> > > -			elf_flags |= MAP_FIXED;
> > > +			elf_flags |= MAP_FIXED_SAFE;
> > >  		} else if (loc->elf_ex.e_type == ET_DYN) {
> > >  			/*
> > >  			 * This logic is run once for the first LOAD Program
> > > @@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > >  				load_bias = ELF_ET_DYN_BASE;
> > >  				if (current->flags & PF_RANDOMIZE)
> > >  					load_bias += arch_mmap_rnd();
> > > -				elf_flags |= MAP_FIXED;
> > > +				elf_flags |= MAP_FIXED_SAFE;
> > >  			} else
> > >  				load_bias = 0;
> > >  
> > > @@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
> > >  			(eppnt->p_filesz +
> > >  			 ELF_PAGEOFFSET(eppnt->p_vaddr)),
> > >  			PROT_READ | PROT_WRITE | PROT_EXEC,
> > > -			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> > > +			MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
> > >  			(eppnt->p_offset -
> > >  			 ELF_PAGEOFFSET(eppnt->p_vaddr)));
> > >  	if (error != ELF_PAGESTART(eppnt->p_vaddr))
> 
> 
> > [    0.000000] Linux version 4.15.0-rc3-next-20171215-00001-g6d6aea478fce (avagin@laptop) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #11 SMP Fri Dec 15 16:39:11 PST 2017
> > [    0.000000] Command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> > [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> > [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> > [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> > [    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> > [    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> > [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> > [    0.000000] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
> > [    0.000000] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
> > [    0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
> > [    0.000000] e820: BIOS-provided physical RAM map:
> > [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> > [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffd8fff] usable
> > [    0.000000] BIOS-e820: [mem 0x000000007ffd9000-0x000000007fffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > [    0.000000] NX (Execute Disable) protection: active
> > [    0.000000] random: fast init done
> > [    0.000000] SMBIOS 2.8 present.
> > [    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > [    0.000000] Hypervisor detected: KVM
> > [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> > [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> > [    0.000000] e820: last_pfn = 0x7ffd9 max_arch_pfn = 0x400000000
> > [    0.000000] MTRR default type: write-back
> > [    0.000000] MTRR fixed ranges enabled:
> > [    0.000000]   00000-9FFFF write-back
> > [    0.000000]   A0000-BFFFF uncachable
> > [    0.000000]   C0000-FFFFF write-protect
> > [    0.000000] MTRR variable ranges enabled:
> > [    0.000000]   0 base 0080000000 mask FF80000000 uncachable
> > [    0.000000]   1 disabled
> > [    0.000000]   2 disabled
> > [    0.000000]   3 disabled
> > [    0.000000]   4 disabled
> > [    0.000000]   5 disabled
> > [    0.000000]   6 disabled
> > [    0.000000]   7 disabled
> > [    0.000000] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
> > [    0.000000] found SMP MP-table at [mem 0x000f6bd0-0x000f6bdf] mapped at [        (ptrval)]
> > [    0.000000] Base memory trampoline at [        (ptrval)] 99000 size 24576
> > [    0.000000] Using GB pages for direct mapping
> > [    0.000000] BRK [0x2c984000, 0x2c984fff] PGTABLE
> > [    0.000000] BRK [0x2c985000, 0x2c985fff] PGTABLE
> > [    0.000000] BRK [0x2c986000, 0x2c986fff] PGTABLE
> > [    0.000000] BRK [0x2c987000, 0x2c987fff] PGTABLE
> > [    0.000000] BRK [0x2c988000, 0x2c988fff] PGTABLE
> > [    0.000000] BRK [0x2c989000, 0x2c989fff] PGTABLE
> > [    0.000000] ACPI: Early table checksum verification disabled
> > [    0.000000] ACPI: RSDP 0x00000000000F69C0 000014 (v00 BOCHS )
> > [    0.000000] ACPI: RSDT 0x000000007FFE12FF 00002C (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
> > [    0.000000] ACPI: FACP 0x000000007FFE120B 000074 (v01 BOCHS  BXPCFACP 00000001 BXPC 00000001)
> > [    0.000000] ACPI: DSDT 0x000000007FFE0040 0011CB (v01 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
> > [    0.000000] ACPI: FACS 0x000000007FFE0000 000040
> > [    0.000000] ACPI: APIC 0x000000007FFE127F 000080 (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
> > [    0.000000] ACPI: Local APIC address 0xfee00000
> > [    0.000000] No NUMA configuration found
> > [    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007ffd8fff]
> > [    0.000000] NODE_DATA(0) allocated [mem 0x7ffc2000-0x7ffd8fff]
> > [    0.000000] kvm-clock: cpu 0, msr 0:7ffc0001, primary cpu clock
> > [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > [    0.000000] kvm-clock: using sched offset of 1076013277 cycles
> > [    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > [    0.000000] Zone ranges:
> > [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > [    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffd8fff]
> > [    0.000000]   Normal   empty
> > [    0.000000]   Device   empty
> > [    0.000000] Movable zone start for each node
> > [    0.000000] Early memory node ranges
> > [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > [    0.000000]   node   0: [mem 0x0000000000100000-0x000000007ffd8fff]
> > [    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffd8fff]
> > [    0.000000] On node 0 totalpages: 524151
> > [    0.000000]   DMA zone: 64 pages used for memmap
> > [    0.000000]   DMA zone: 21 pages reserved
> > [    0.000000]   DMA zone: 3998 pages, LIFO batch:0
> > [    0.000000]   DMA32 zone: 8128 pages used for memmap
> > [    0.000000]   DMA32 zone: 520153 pages, LIFO batch:31
> > [    0.000000] Reserved but unavailable: 98 pages
> > [    0.000000] ACPI: PM-Timer IO Port: 0x608
> > [    0.000000] ACPI: Local APIC address 0xfee00000
> > [    0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
> > [    0.000000] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
> > [    0.000000] ACPI: IRQ0 used by override.
> > [    0.000000] ACPI: IRQ5 used by override.
> > [    0.000000] ACPI: IRQ9 used by override.
> > [    0.000000] ACPI: IRQ10 used by override.
> > [    0.000000] ACPI: IRQ11 used by override.
> > [    0.000000] Using ACPI (MADT) for SMP configuration information
> > [    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
> > [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
> > [    0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
> > [    0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000effff]
> > [    0.000000] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff]
> > [    0.000000] e820: [mem 0x80000000-0xfeffbfff] available for PCI devices
> > [    0.000000] Booting paravirtualized kernel on KVM
> > [    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
> > [    0.000000] percpu: Embedded 44 pages/cpu @        (ptrval) s142296 r8192 d29736 u1048576
> > [    0.000000] pcpu-alloc: s142296 r8192 d29736 u1048576 alloc=1*2097152
> > [    0.000000] pcpu-alloc: [0] 0 1 
> > [    0.000000] KVM setup async PF for cpu 0
> > [    0.000000] kvm-stealtime: cpu 0, msr 7fc122c0
> > [    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 515938
> > [    0.000000] Policy zone: DMA32
> > [    0.000000] Kernel command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0
> > [    0.000000] Memory: 2037056K/2096604K available (12300K kernel code, 1554K rwdata, 3584K rodata, 1640K init, 912K bss, 59548K reserved, 0K cma-reserved)
> > [    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
> > [    0.000000] ftrace: allocating 36554 entries in 143 pages
> > [    0.001000] Hierarchical RCU implementation.
> > [    0.001000] 	RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
> > [    0.001000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
> > [    0.001000] NR_IRQS: 4352, nr_irqs: 440, preallocated irqs: 16
> > [    0.001000] 	Offload RCU callbacks from CPUs: (none).
> > [    0.001000] Console: colour dummy device 80x25
> > [    0.001000] console [ttyS0] enabled
> > [    0.001000] ACPI: Core revision 20171110
> > [    0.001000] ACPI: 1 ACPI AML tables successfully acquired and loaded
> > [    0.001009] APIC: Switch to symmetric I/O mode setup
> > [    0.001571] x2apic enabled
> > [    0.002003] Switched APIC routing to physical x2apic.
> > [    0.003538] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [    0.004000] tsc: Detected 2496.000 MHz processor
> > [    0.004014] Calibrating delay loop (skipped) preset value.. 4992.00 BogoMIPS (lpj=2496000)
> > [    0.005014] pid_max: default: 32768 minimum: 301
> > [    0.006057] Security Framework initialized
> > [    0.006548] Yama: becoming mindful.
> > [    0.007019] SELinux:  Disabled at boot.
> > [    0.008206] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> > [    0.009164] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> > [    0.009816] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes)
> > [    0.010009] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes)
> > [    0.011322] mce: CPU supports 10 MCE banks
> > [    0.011740] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> > [    0.012002] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> > [    0.012610] Freeing SMP alternatives memory: 36K
> > [    0.013467] TSC deadline timer enabled
> > [    0.013820] smpboot: CPU0: Intel Core Processor (Skylake) (family: 0x6, model: 0x5e, stepping: 0x3)
> > [    0.014000] Performance Events: unsupported p6 CPU model 94 no PMU driver, software events only.
> > [    0.014041] Hierarchical SRCU implementation.
> > [    0.015133] NMI watchdog: Perf event create on CPU 0 failed with -2
> > [    0.015725] NMI watchdog: Perf NMI watchdog permanently disabled
> > [    0.016077] smp: Bringing up secondary CPUs ...
> > [    0.016654] x86: Booting SMP configuration:
> > [    0.017005] .... node  #0, CPUs:      #1
> > [    0.001000] kvm-clock: cpu 1, msr 0:7ffc0041, secondary cpu clock
> > [    0.019051] KVM setup async PF for cpu 1
> > [    0.019599] kvm-stealtime: cpu 1, msr 7fd122c0
> > [    0.020009] smp: Brought up 1 node, 2 CPUs
> > [    0.020531] smpboot: Max logical packages: 2
> > [    0.021009] smpboot: Total of 2 processors activated (9984.00 BogoMIPS)
> > [    0.023160] devtmpfs: initialized
> > [    0.023513] x86/mm: Memory block size: 128MB
> > [    0.024811] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [    0.025015] futex hash table entries: 512 (order: 3, 32768 bytes)
> > [    0.026185] RTC time:  0:42:06, date: 12/16/17
> > [    0.026790] NET: Registered protocol family 16
> > [    0.027204] audit: initializing netlink subsys (disabled)
> > [    0.027914] audit: type=2000 audit(1513384927.133:1): state=initialized audit_enabled=0 res=1
> > [    0.028185] cpuidle: using governor menu
> > [    0.029118] ACPI: bus type PCI registered
> > [    0.029872] PCI: Using configuration type 1 for base access
> > [    0.034355] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> > [    0.035011] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
> > [    0.036066] cryptd: max_cpu_qlen set to 1000
> > [    0.036579] ACPI: Added _OSI(Module Device)
> > [    0.037007] ACPI: Added _OSI(Processor Device)
> > [    0.037426] ACPI: Added _OSI(3.0 _SCP Extensions)
> > [    0.037857] ACPI: Added _OSI(Processor Aggregator Device)
> > [    0.041356] ACPI: Interpreter enabled
> > [    0.041764] ACPI: (supports S0 S5)
> > [    0.042005] ACPI: Using IOAPIC for interrupt routing
> > [    0.042655] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> > [    0.043625] ACPI: Enabled 2 GPEs in block 00 to 0F
> > [    0.059248] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
> > [    0.059953] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI]
> > [    0.060045] acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
> > [    0.061180] PCI host bridge to bus 0000:00
> > [    0.061874] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> > [    0.062013] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> > [    0.063016] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> > [    0.064015] pci_bus 0000:00: root bus resource [mem 0x80000000-0xfebfffff window]
> > [    0.065014] pci_bus 0000:00: root bus resource [bus 00-ff]
> > [    0.065753] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000
> > [    0.066487] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100
> > [    0.067537] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180
> > [    0.071700] pci 0000:00:01.1: reg 0x20: [io  0xc100-0xc10f]
> > [    0.074032] pci 0000:00:01.1: legacy IDE quirk: reg 0x10: [io  0x01f0-0x01f7]
> > [    0.074908] pci 0000:00:01.1: legacy IDE quirk: reg 0x14: [io  0x03f6]
> > [    0.075011] pci 0000:00:01.1: legacy IDE quirk: reg 0x18: [io  0x0170-0x0177]
> > [    0.076010] pci 0000:00:01.1: legacy IDE quirk: reg 0x1c: [io  0x0376]
> > [    0.077121] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000
> > [    0.078148] pci 0000:00:01.3: quirk: [io  0x0600-0x063f] claimed by PIIX4 ACPI
> > [    0.079014] pci 0000:00:01.3: quirk: [io  0x0700-0x070f] claimed by PIIX4 SMB
> > [    0.080224] pci 0000:00:03.0: [1af4:1000] type 00 class 0x020000
> > [    0.082007] pci 0000:00:03.0: reg 0x10: [io  0xc040-0xc05f]
> > [    0.083814] pci 0000:00:03.0: reg 0x14: [mem 0xfebc0000-0xfebc0fff]
> > [    0.089891] pci 0000:00:03.0: reg 0x30: [mem 0xfeb80000-0xfebbffff pref]
> > [    0.090545] pci 0000:00:05.0: [1af4:1003] type 00 class 0x078000
> > [    0.092708] pci 0000:00:05.0: reg 0x10: [io  0xc060-0xc07f]
> > [    0.094009] pci 0000:00:05.0: reg 0x14: [mem 0xfebc1000-0xfebc1fff]
> > [    0.102484] pci 0000:00:06.0: [8086:2934] type 00 class 0x0c0300
> > [    0.108028] pci 0000:00:06.0: reg 0x20: [io  0xc080-0xc09f]
> > [    0.110738] pci 0000:00:06.1: [8086:2935] type 00 class 0x0c0300
> > [    0.114388] pci 0000:00:06.1: reg 0x20: [io  0xc0a0-0xc0bf]
> > [    0.117339] pci 0000:00:06.2: [8086:2936] type 00 class 0x0c0300
> > [    0.122770] pci 0000:00:06.2: reg 0x20: [io  0xc0c0-0xc0df]
> > [    0.124738] pci 0000:00:06.7: [8086:293a] type 00 class 0x0c0320
> > [    0.125825] pci 0000:00:06.7: reg 0x10: [mem 0xfebc2000-0xfebc2fff]
> > [    0.130347] pci 0000:00:07.0: [1af4:1001] type 00 class 0x010000
> > [    0.133007] pci 0000:00:07.0: reg 0x10: [io  0xc000-0xc03f]
> > [    0.134793] pci 0000:00:07.0: reg 0x14: [mem 0xfebc3000-0xfebc3fff]
> > [    0.141808] pci 0000:00:08.0: [1af4:1002] type 00 class 0x00ff00
> > [    0.142914] pci 0000:00:08.0: reg 0x10: [io  0xc0e0-0xc0ff]
> > [    0.148977] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
> > [    0.149455] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
> > [    0.150390] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
> > [    0.151382] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
> > [    0.152380] ACPI: PCI Interrupt Link [LNKS] (IRQs *9)
> > [    0.154508] vgaarb: loaded
> > [    0.155271] SCSI subsystem initialized
> > [    0.155887] EDAC MC: Ver: 3.0.0
> > [    0.156255] PCI: Using ACPI for IRQ routing
> > [    0.156566] PCI: pci_cache_line_size set to 64 bytes
> > [    0.157161] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
> > [    0.157914] e820: reserve RAM buffer [mem 0x7ffd9000-0x7fffffff]
> > [    0.158253] NetLabel: Initializing
> > [    0.158765] NetLabel:  domain hash size = 128
> > [    0.159005] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
> > [    0.159775] NetLabel:  unlabeled traffic allowed by default
> > [    0.160073] clocksource: Switched to clocksource kvm-clock
> > [    0.186764] VFS: Disk quotas dquot_6.6.0
> > [    0.187277] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> > [    0.188251] FS-Cache: Loaded
> > [    0.188725] pnp: PnP ACPI init
> > [    0.189271] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
> > [    0.190231] pnp 00:01: Plug and Play ACPI device, IDs PNP0303 (active)
> > [    0.191229] pnp 00:02: Plug and Play ACPI device, IDs PNP0f13 (active)
> > [    0.192096] pnp 00:03: [dma 2]
> > [    0.192514] pnp 00:03: Plug and Play ACPI device, IDs PNP0700 (active)
> > [    0.193631] pnp 00:04: Plug and Play ACPI device, IDs PNP0501 (active)
> > [    0.195416] pnp: PnP ACPI: found 5 devices
> > [    0.206055] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [    0.207127] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7 window]
> > [    0.207832] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff window]
> > [    0.208594] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
> > [    0.209469] pci_bus 0000:00: resource 7 [mem 0x80000000-0xfebfffff window]
> > [    0.210493] NET: Registered protocol family 2
> > [    0.211244] tcp_listen_portaddr_hash hash table entries: 1024 (order: 2, 16384 bytes)
> > [    0.212283] TCP established hash table entries: 16384 (order: 5, 131072 bytes)
> > [    0.213285] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
> > [    0.214306] TCP: Hash tables configured (established 16384 bind 16384)
> > [    0.215065] UDP hash table entries: 1024 (order: 3, 32768 bytes)
> > [    0.215797] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
> > [    0.217934] NET: Registered protocol family 1
> > [    0.219126] RPC: Registered named UNIX socket transport module.
> > [    0.219676] RPC: Registered udp transport module.
> > [    0.220130] RPC: Registered tcp transport module.
> > [    0.220552] RPC: Registered tcp NFSv4.1 backchannel transport module.
> > [    0.221153] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
> > [    0.221701] pci 0000:00:01.0: PIIX3: Enabling Passive Release
> > [    0.222319] pci 0000:00:01.0: Activating ISA DMA hang workarounds
> > [    0.444214] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
> > [    0.880141] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
> > [    1.311493] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
> > [    1.748829] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
> > [    1.962124] PCI: CLS 0 bytes, default 64
> > [    1.964749] Initialise system trusted keyrings
> > [    1.965289] workingset: timestamp_bits=37 max_order=19 bucket_order=0
> > [    1.969600] zbud: loaded
> > [    1.971287] SGI XFS with security attributes, no debug enabled
> > [    2.106071] NET: Registered protocol family 38
> > [    2.106556] Key type asymmetric registered
> > [    2.106931] Asymmetric key parser 'x509' registered
> > [    2.107514] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
> > [    2.108327] io scheduler noop registered
> > [    2.108813] io scheduler deadline registered
> > [    2.109608] io scheduler cfq registered (default)
> > [    2.110258] io scheduler mq-deadline registered
> > [    2.110796] io scheduler kyber registered
> > [    2.111688] intel_idle: Please enable MWAIT in BIOS SETUP
> > [    2.112310] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > [    2.113037] ACPI: Power Button [PWRF]
> > [    2.331642] virtio-pci 0000:00:03.0: virtio_pci: leaving for legacy driver
> > [    2.554093] virtio-pci 0000:00:05.0: virtio_pci: leaving for legacy driver
> > [    2.775938] virtio-pci 0000:00:07.0: virtio_pci: leaving for legacy driver
> > [    2.975053] tsc: Refined TSC clocksource calibration: 2495.981 MHz
> > [    2.975641] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa6529869, max_idle_ns: 440795218057 ns
> > [    3.029409] virtio-pci 0000:00:08.0: virtio_pci: leaving for legacy driver
> > [    3.032925] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
> > [    3.056849] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > [    3.064748] Non-volatile memory driver v1.3
> > [    3.065925] ppdev: user-space parallel port driver
> > [    3.071816] loop: module loaded
> > [    3.075337]  vda: vda1 vda2 vda3
> > [    3.076659] Rounding down aligned max_sectors from 4294967295 to 4294967288
> > [    3.077996] Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
> > [    3.079790] libphy: Fixed MDIO Bus: probed
> > [    3.080257] tun: Universal TUN/TAP device driver, 1.6
> > [    3.082222] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
> > [    3.083675] serio: i8042 KBD port at 0x60,0x64 irq 1
> > [    3.084160] serio: i8042 AUX port at 0x60,0x64 irq 12
> > [    3.084816] mousedev: PS/2 mouse device common for all mice
> > [    3.086603] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
> > [    3.089192] rtc_cmos 00:00: RTC can wake from S4
> > [    3.090116] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
> > [    3.092829] rtc_cmos 00:00: alarms up to one day, y3k, 114 bytes nvram
> > [    3.093937] IR NEC protocol handler initialized
> > [    3.094408] IR RC5(x/sz) protocol handler initialized
> > [    3.094901] IR RC6 protocol handler initialized
> > [    3.095510] IR JVC protocol handler initialized
> > [    3.095952] IR Sony protocol handler initialized
> > [    3.096399] IR SANYO protocol handler initialized
> > [    3.096862] IR Sharp protocol handler initialized
> > [    3.097342] IR MCE Keyboard/mouse protocol handler initialized
> > [    3.097919] IR XMP protocol handler initialized
> > [    3.098530] device-mapper: uevent: version 1.0.3
> > [    3.099209] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: dm-devel@redhat.com
> > [    3.100398] device-mapper: multipath round-robin: version 1.2.0 loaded
> > [    3.101883] drop_monitor: Initializing network drop monitor service
> > [    3.102553] Netfilter messages via NETLINK v0.30.
> > [    3.103090] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
> > [    3.103738] ctnetlink v0.93: registering with nfnetlink.
> > [    3.104494] ip_tables: (C) 2000-2006 Netfilter Core Team
> > [    3.105734] Initializing XFRM netlink socket
> > [    3.106885] NET: Registered protocol family 10
> > [    3.109341] Segment Routing with IPv6
> > [    3.109976] mip6: Mobile IPv6
> > [    3.111987] ip6_tables: (C) 2000-2006 Netfilter Core Team
> > [    3.114230] NET: Registered protocol family 17
> > [    3.115047] Bridge firewalling registered
> > [    3.115824] Ebtables v2.0 registered
> > [    3.117996] 8021q: 802.1Q VLAN Support v1.8
> > [    3.119429] AVX2 version of gcm_enc/dec engaged.
> > [    3.119886] AES CTR mode by8 optimization enabled
> > [    3.128818] sched_clock: Marking stable (3128714579, 0)->(3404180881, -275466302)
> > [    3.129945] registered taskstats version 1
> > [    3.130427] Loading compiled-in X.509 certificates
> > [    3.163216] Loaded X.509 cert 'Build time autogenerated kernel key: 38e0adea1af8bd8a23b02436d4acf2f8c7408d23'
> > [    3.166359] zswap: loaded using pool lzo/zbud
> > [    3.167943] Key type big_key registered
> > [    3.168778]   Magic number: 13:918:708
> > [    3.169255] rtc_cmos 00:00: setting system clock to 2017-12-16 00:42:09 UTC (1513384929)
> > [    3.170604] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
> > [    3.171932] EXT4-fs (vda2): couldn't mount as ext3 due to feature incompatibilities
> > [    3.173871] EXT4-fs (vda2): couldn't mount as ext2 due to feature incompatibilities
> > [    3.175306] EXT4-fs (vda2): INFO: recovery required on readonly filesystem
> > [    3.176212] EXT4-fs (vda2): write access will be enabled during recovery
> > [    3.397187] EXT4-fs (vda2): orphan cleanup on readonly fs
> > [    3.399412] EXT4-fs (vda2): 5 orphan inodes deleted
> > [    3.402759] EXT4-fs (vda2): recovery complete
> > [    3.466647] EXT4-fs (vda2): mounted filesystem with ordered data mode. Opts: (null)
> > [    3.469401] VFS: Mounted root (ext4 filesystem) readonly on device 253:2.
> > [    3.473719] devtmpfs: mounted
> > [    3.492549] Freeing unused kernel memory: 1640K
> > [    3.494547] Write protecting the kernel read-only data: 18432k
> > [    3.498781] Freeing unused kernel memory: 2016K
> > [    3.503330] Freeing unused kernel memory: 512K
> > [    3.505232] rodata_test: all tests were successful
> > [    3.515355] 1 (init): Uhuuh, elf segement at 00000000928fda3e requested but the memory is mapped already
> > [    3.519533] Starting init: /sbin/init exists but couldn't execute it (error -95)
> > [    3.528993] Starting init: /bin/sh exists but couldn't execute it (error -14)
> > [    3.532127] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> > [    3.538328] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc3-next-20171215-00001-g6d6aea478fce #11
> > [    3.542201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
> > [    3.546081] Call Trace:
> > [    3.547221]  dump_stack+0x5c/0x79
> > [    3.548768]  ? rest_init+0x30/0xb0
> > [    3.550320]  panic+0xe4/0x232
> > [    3.551669]  ? rest_init+0xb0/0xb0
> > [    3.553110]  kernel_init+0xeb/0x100
> > [    3.554701]  ret_from_fork+0x1f/0x30
> > [    3.558964] Kernel Offset: 0x2000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [    3.564160] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-13 23:19                   ` Kees Cook
       [not found]                     ` <CAGXu5jLqE6cUxk-Girx6PG7upEzz8jmu1OH_3LVC26iJc2vTxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-18 19:12                     ` Michael Kerrisk (man-pages)
  2017-12-18 20:19                       ` Kees Cook
  1 sibling, 1 reply; 48+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-12-18 19:12 UTC (permalink / raw)
  To: Kees Cook, Cyril Hrubis
  Cc: mtk.manpages, Pavel Machek, Michal Hocko, Linux API, Khalid Aziz,
	Michael Ellerman, Andrew Morton, Russell King - ARM Linux,
	Andrea Arcangeli, Linux-MM, LKML, linux-arch, Florian Weimer,
	John Hubbard, Matthew Wilcox, Jann Horn, Mike Rapoport

Hello Kees,

I'm late to the party, and only just caught up with the fuss :-).

On 12/14/2017 12:19 AM, Kees Cook wrote:
> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <chrubis@suse.cz> wrote:
>> Hi!
>>> You selected stupid name for a flag. Everyone and their dog agrees
>>> with that. There's even consensus on better name (and everyone agrees
>>> it is better than .._SAFE). Of course, we could have debate if it is
>>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>>> poor naming on your part.
>>
>> Well while everybody agrees that the name is so bad that basically
>> anything else would be better, there does not seem to be consensus on
>> which one to pick. I do understand that this frustrating and fruitless.
> 
> Based on the earlier threads where I tried to end the bikeshedding, it
> seemed like MAP_FIXED_NOREPLACE was the least bad option.
> 
>> So what do we do now, roll a dice to choose new name?
>>
>> Or do we ask BFDL[1] to choose the name?
> 
> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
> with these kinds of choices in the past. I'm fine to ask Linus too. I
> just want to get past the name since the feature is quite valuable.
> 
> And if Michal doesn't want to touch this patch any more, I'm happy to
> do the search/replace/resend. :P

Something with the prefix MAP_FIXED_ seems to me obviously desirable,
both to suggest that the function is similar, and also for easy
grepping of the source code to look for instances of both.
MAP_FIXED_SAFE didn't really bother me as a name, but 
MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more 
descriptive of what the flag actually does, so a little better.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-18 19:12                     ` Michael Kerrisk (man-pages)
@ 2017-12-18 20:19                       ` Kees Cook
  2017-12-18 20:33                         ` Matthew Wilcox
       [not found]                         ` <CAGXu5jJ289R9koVoHmxcvUWr6XHSZR2p0qq3WtpNyN-iNSvrNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 48+ messages in thread
From: Kees Cook @ 2017-12-18 20:19 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Andrew Morton
  Cc: Cyril Hrubis, Pavel Machek, Michal Hocko, Linux API, Khalid Aziz,
	Michael Ellerman, Russell King - ARM Linux, Andrea Arcangeli,
	Linux-MM, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport

On Mon, Dec 18, 2017 at 11:12 AM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hello Kees,
>
> I'm late to the party, and only just caught up with the fuss :-).

No worries!

> On 12/14/2017 12:19 AM, Kees Cook wrote:
>> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <chrubis@suse.cz> wrote:
>>> Hi!
>>>> You selected stupid name for a flag. Everyone and their dog agrees
>>>> with that. There's even consensus on better name (and everyone agrees
>>>> it is better than .._SAFE). Of course, we could have debate if it is
>>>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>>>> poor naming on your part.
>>>
>>> Well while everybody agrees that the name is so bad that basically
>>> anything else would be better, there does not seem to be consensus on
>>> which one to pick. I do understand that this frustrating and fruitless.
>>
>> Based on the earlier threads where I tried to end the bikeshedding, it
>> seemed like MAP_FIXED_NOREPLACE was the least bad option.
>>
>>> So what do we do now, roll a dice to choose new name?
>>>
>>> Or do we ask BFDL[1] to choose the name?
>>
>> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
>> with these kinds of choices in the past. I'm fine to ask Linus too. I
>> just want to get past the name since the feature is quite valuable.
>>
>> And if Michal doesn't want to touch this patch any more, I'm happy to
>> do the search/replace/resend. :P
>
> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
> both to suggest that the function is similar, and also for easy
> grepping of the source code to look for instances of both.
> MAP_FIXED_SAFE didn't really bother me as a name, but
> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
> descriptive of what the flag actually does, so a little better.

Great, thanks!

Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

-Kees

-- 
Kees Cook
Pixel Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-18 20:19                       ` Kees Cook
@ 2017-12-18 20:33                         ` Matthew Wilcox
       [not found]                         ` <CAGXu5jJ289R9koVoHmxcvUWr6XHSZR2p0qq3WtpNyN-iNSvrNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 48+ messages in thread
From: Matthew Wilcox @ 2017-12-18 20:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Michael Kerrisk (man-pages),
	Andrew Morton, Cyril Hrubis, Pavel Machek, Michal Hocko,
	Linux API, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Jann Horn,
	Mike Rapoport

On Mon, Dec 18, 2017 at 12:19:21PM -0800, Kees Cook wrote:
> Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

+1

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-14 14:54         ` Edward Napierala
@ 2017-12-19 12:40           ` David Laight
  2017-12-19 12:46             ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: David Laight @ 2017-12-19 12:40 UTC (permalink / raw)
  To: 'Edward Napierala', Michal Hocko
  Cc: Andrew Morton, linux-api, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, jasone, davidtgoldblatt

From: Edward Napierala
> Sent: 14 December 2017 14:55
>
> On 1214T1415, Michal Hocko wrote:
> > On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> > > Regarding the name - how about adopting MAP_EXCL?  It was introduced in
> > > FreeBSD,
> > > and seems to do exactly this; quoting mmap(2):
> > >
> > > MAP_FIXED    Do not permit the system to select a different address
> > >                         than the one specified.  If the specified address
> > >                         cannot be used, mmap() will fail.  If MAP_FIXED is
> > >                         specified, addr must be a multiple of the page size.
> > >                         If MAP_EXCL is not specified, a successful MAP_FIXED
> > >                         request replaces any previous mappings for the
> > >                         process' pages in the range from addr to addr + len.
> > >                         In contrast, if MAP_EXCL is specified, the request
> > >                         will fail if a mapping already exists within the
> > >                         range.
> >
> > I am not familiar with the FreeBSD implementation but from the above it
> > looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
> > going to implement it in linux due to reasons mentioned in this cover
> > letter. Using the same name would be more confusing than helpful I am
> > afraid.
> 
> Sorry, missed that.  Indeed, reusing a name with a different semantics
> would be a bad idea.

I don't remember any discussion about using MAP_FIXED | MAP_EXCL ?

Why not match the prior art??

	David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE
  2017-12-19 12:40           ` David Laight
@ 2017-12-19 12:46             ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-19 12:46 UTC (permalink / raw)
  To: David Laight
  Cc: 'Edward Napierala',
	Andrew Morton, linux-api, Khalid Aziz, Michael Ellerman,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Abdul Haleem, Joel Stanley, Kees Cook, jasone, davidtgoldblatt

On Tue 19-12-17 12:40:16, David Laight wrote:
> From: Edward Napierala
> > Sent: 14 December 2017 14:55
> >
> > On 1214T1415, Michal Hocko wrote:
> > > On Thu 14-12-17 12:44:17, Edward Napierala wrote:
> > > > Regarding the name - how about adopting MAP_EXCL?  It was introduced in
> > > > FreeBSD,
> > > > and seems to do exactly this; quoting mmap(2):
> > > >
> > > > MAP_FIXED    Do not permit the system to select a different address
> > > >                         than the one specified.  If the specified address
> > > >                         cannot be used, mmap() will fail.  If MAP_FIXED is
> > > >                         specified, addr must be a multiple of the page size.
> > > >                         If MAP_EXCL is not specified, a successful MAP_FIXED
> > > >                         request replaces any previous mappings for the
> > > >                         process' pages in the range from addr to addr + len.
> > > >                         In contrast, if MAP_EXCL is specified, the request
> > > >                         will fail if a mapping already exists within the
> > > >                         range.
> > >
> > > I am not familiar with the FreeBSD implementation but from the above it
> > > looks like MAP_EXCL is a MAP_FIXED mofifier which is not how we are
> > > going to implement it in linux due to reasons mentioned in this cover
> > > letter. Using the same name would be more confusing than helpful I am
> > > afraid.
> > 
> > Sorry, missed that.  Indeed, reusing a name with a different semantics
> > would be a bad idea.
> 
> I don't remember any discussion about using MAP_FIXED | MAP_EXCL ?
> 
> Why not match the prior art??

See the cover letter which explains why an extension is not backward
compatible.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
       [not found]                         ` <CAGXu5jJ289R9koVoHmxcvUWr6XHSZR2p0qq3WtpNyN-iNSvrNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-21 12:38                           ` Michael Ellerman
  2017-12-21 14:59                             ` known bad patch in -mm tree was " Pavel Machek
  2017-12-21 22:24                             ` Andrew Morton
  0 siblings, 2 replies; 48+ messages in thread
From: Michael Ellerman @ 2017-12-21 12:38 UTC (permalink / raw)
  To: Kees Cook, Michael Kerrisk (man-pages), Andrew Morton
  Cc: Cyril Hrubis, Pavel Machek, Michal Hocko, Linux API, Khalid Aziz,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Jann Horn, Mike Rapoport

Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> writes:

> On Mon, Dec 18, 2017 at 11:12 AM, Michael Kerrisk (man-pages)
> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Hello Kees,
>>
>> I'm late to the party, and only just caught up with the fuss :-).
>
> No worries!
>
>> On 12/14/2017 12:19 AM, Kees Cook wrote:
>>> On Wed, Dec 13, 2017 at 6:40 AM, Cyril Hrubis <chrubis-AlSwsSmVLrQ@public.gmane.org> wrote:
>>>> Hi!
>>>>> You selected stupid name for a flag. Everyone and their dog agrees
>>>>> with that. There's even consensus on better name (and everyone agrees
>>>>> it is better than .._SAFE). Of course, we could have debate if it is
>>>>> NOREPLACE or NOREMOVE or ... and that would be bikeshed. This was just
>>>>> poor naming on your part.
>>>>
>>>> Well while everybody agrees that the name is so bad that basically
>>>> anything else would be better, there does not seem to be consensus on
>>>> which one to pick. I do understand that this frustrating and fruitless.
>>>
>>> Based on the earlier threads where I tried to end the bikeshedding, it
>>> seemed like MAP_FIXED_NOREPLACE was the least bad option.
>>>
>>>> So what do we do now, roll a dice to choose new name?
>>>>
>>>> Or do we ask BFDL[1] to choose the name?
>>>
>>> I'd like to hear feedback from Michael Kerrisk, as he's had to deal
>>> with these kinds of choices in the past. I'm fine to ask Linus too. I
>>> just want to get past the name since the feature is quite valuable.
>>>
>>> And if Michal doesn't want to touch this patch any more, I'm happy to
>>> do the search/replace/resend. :P
>>
>> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
>> both to suggest that the function is similar, and also for easy
>> grepping of the source code to look for instances of both.
>> MAP_FIXED_SAFE didn't really bother me as a name, but
>> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
>> descriptive of what the flag actually does, so a little better.
>
> Great, thanks!
>
> Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

This seems to have not happened. Presumably Andrew just missed the mail
in the flood. And will probably miss this one too ... :)

cheers

^ permalink raw reply	[flat|nested] 48+ messages in thread

* known bad patch in -mm tree was Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-21 12:38                           ` Michael Ellerman
@ 2017-12-21 14:59                             ` Pavel Machek
  2017-12-21 15:08                               ` Michal Hocko
  2017-12-21 22:24                             ` Andrew Morton
  1 sibling, 1 reply; 48+ messages in thread
From: Pavel Machek @ 2017-12-21 14:59 UTC (permalink / raw)
  To: Michael Ellerman, vojtech, jikos
  Cc: Kees Cook, Michael Kerrisk (man-pages),
	Andrew Morton, Cyril Hrubis, Michal Hocko, Linux API,
	Khalid Aziz, Russell King - ARM Linux, Andrea Arcangeli,
	Linux-MM, LKML, linux-arch, Florian Weimer, John Hubbard,
	Matthew Wilcox, Jann Horn, Mike Rapoport


[-- Attachment #1: Type: text/plain, Size: 1227 bytes --]

Hi!

> >>> And if Michal doesn't want to touch this patch any more, I'm happy to
> >>> do the search/replace/resend. :P
> >>
> >> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
> >> both to suggest that the function is similar, and also for easy
> >> grepping of the source code to look for instances of both.
> >> MAP_FIXED_SAFE didn't really bother me as a name, but
> >> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
> >> descriptive of what the flag actually does, so a little better.
> >
> > Great, thanks!
> >
> > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?
> 
> This seems to have not happened. Presumably Andrew just missed the mail
> in the flood. And will probably miss this one too ... :)

Nice way to mess up kernel development, Michal. Thank you! :-(.

Andrew, everyone and their dog agrees MAP_FIXED_SAFE is stupid name,
but Michal decided to just go ahead, ignoring feedback...

Can you either s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g or drop the patches?

Thanks,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: known bad patch in -mm tree was Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-21 14:59                             ` known bad patch in -mm tree was " Pavel Machek
@ 2017-12-21 15:08                               ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-12-21 15:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michael Ellerman, vojtech, jikos, Kees Cook,
	Michael Kerrisk (man-pages),
	Andrew Morton, Cyril Hrubis, Linux API, Khalid Aziz,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Jann Horn, Mike Rapoport

On Thu 21-12-17 15:59:07, Pavel Machek wrote:
> Hi!
> 
> > >>> And if Michal doesn't want to touch this patch any more, I'm happy to
> > >>> do the search/replace/resend. :P
> > >>
> > >> Something with the prefix MAP_FIXED_ seems to me obviously desirable,
> > >> both to suggest that the function is similar, and also for easy
> > >> grepping of the source code to look for instances of both.
> > >> MAP_FIXED_SAFE didn't really bother me as a name, but
> > >> MAP_FIXED_NOREPLACE (or MAP_FIXED_NOCLOBBER) seem slightly more
> > >> descriptive of what the flag actually does, so a little better.
> > >
> > > Great, thanks!
> > >
> > > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?
> > 
> > This seems to have not happened. Presumably Andrew just missed the mail
> > in the flood. And will probably miss this one too ... :)
> 
> Nice way to mess up kernel development, Michal. Thank you! :-(.

Thank you for your valuable feedback! Maybe you have noticed that I
haven't enforced the patch and led others to decide the final name
(either by resubmitting patches or a simple replace in mmotm tree). Or
maybe you haven't because you are so busy bikesheding that you can
hardly see anything else.
 
> Andrew, everyone and their dog agrees MAP_FIXED_SAFE is stupid name,
> but Michal decided to just go ahead, ignoring feedback...
>
> Can you either s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g or drop the patches?

You have surely saved the world today and I hardly find words to thank
you (and your dog of course).

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-21 12:38                           ` Michael Ellerman
  2017-12-21 14:59                             ` known bad patch in -mm tree was " Pavel Machek
@ 2017-12-21 22:24                             ` Andrew Morton
  2017-12-22  0:06                               ` Michael Ellerman
  1 sibling, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2017-12-21 22:24 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Kees Cook, Michael Kerrisk (man-pages),
	Cyril Hrubis, Pavel Machek, Michal Hocko, Linux API, Khalid Aziz,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Jann Horn, Mike Rapoport

On Thu, 21 Dec 2017 23:38:37 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:

> > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?

Done.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] mmap.2: MAP_FIXED updated documentation
  2017-12-21 22:24                             ` Andrew Morton
@ 2017-12-22  0:06                               ` Michael Ellerman
  0 siblings, 0 replies; 48+ messages in thread
From: Michael Ellerman @ 2017-12-22  0:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Michael Kerrisk (man-pages),
	Cyril Hrubis, Pavel Machek, Michal Hocko, Linux API, Khalid Aziz,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Florian Weimer, John Hubbard, Matthew Wilcox,
	Jann Horn, Mike Rapoport

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 21 Dec 2017 23:38:37 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
>
>> > Andrew, can you s/MAP_FIXED_SAFE/MAP_FIXED_NOREPLACE/g in the series?
>
> Done.

Thanks.

I sent an ack at some point, here's another if you like:

  Acked-by: Michael Ellerman <mpe@ellerman.id.au>

There's also a couple of stray whitespace changes in the version in
linux-next, and some inconsistent whitespace between the various mman.h
changes. Patch below to fix them up if you haven't already.

cheers


diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index bb9ccb5ff3ed..5e362db59780 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -50,7 +50,6 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
-
 #define MAP_FIXED_SAFE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 
 /*
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index dedc09ead4cb..a0702506d7c6 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -26,7 +26,6 @@
 #define MAP_NONBLOCK	0x20000		/* do not block on IO */
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
-
 #define MAP_FIXED_SAFE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 
 #define MS_SYNC		1		/* synchronous memory sync */
diff --git a/arch/sparc/include/uapi/asm/mman.h b/arch/sparc/include/uapi/asm/mman.h
index d21bffd5d3dc..715a2c927e79 100644
--- a/arch/sparc/include/uapi/asm/mman.h
+++ b/arch/sparc/include/uapi/asm/mman.h
@@ -25,4 +25,5 @@
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
 
+
 #endif /* _UAPI__SPARC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index da73b6d5dbcd..52f4d21923b3 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -65,7 +65,6 @@
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
 
-
 /*
  * Flags for msync
  */

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-11-29 14:42 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
@ 2017-11-29 17:45   ` Khalid Aziz
  0 siblings, 0 replies; 48+ messages in thread
From: Khalid Aziz @ 2017-11-29 17:45 UTC (permalink / raw)
  To: Michal Hocko, linux-api
  Cc: Michael Ellerman, Andrew Morton, Russell King - ARM Linux,
	Andrea Arcangeli, linux-mm, LKML, linux-arch, Florian Weimer,
	John Hubbard, Michal Hocko, Abdul Haleem, Joel Stanley,
	Kees Cook

On 11/29/2017 07:42 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> on a controlled address and they use MAP_FIXED to enforce that. This is
> however dangerous thing prone to silent data corruption which can be
> even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> we could end up mapping over the existing stack with some luck.
> 
> The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> stack consumption early during execve fully stopped by da029c11e6b1
> ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> safe and any attack should be impractical. On the other hand this is
> just too subtle assumption so it can break quite easily and hard to
> spot.
> 
> I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> fundamentally dangerous. Moreover it shouldn't be even needed. We are
> at the early process stage and so there shouldn't be unrelated mappings
> (except for stack and loader) existing so mmap for a given address
> should succeed even without MAP_FIXED. Something is terribly wrong if
> this is not the case and we should rather fail than silently corrupt the
> underlying mapping.
> 
> Address this issue by changing MAP_FIXED to the newly added
> MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> existing mapping clashing with the requested one without clobbering it.
> 
> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> Cc: Joel Stanley <joel@jms.id.au>
> Acked-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-11-29 14:42 [PATCH " Michal Hocko
@ 2017-11-29 14:42 ` Michal Hocko
  2017-11-29 17:45   ` Khalid Aziz
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-11-29 14:42 UTC (permalink / raw)
  To: linux-api
  Cc: Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Florian Weimer, John Hubbard, Michal Hocko,
	Abdul Haleem, Joel Stanley, Kees Cook

From: Michal Hocko <mhocko@suse.com>

Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that. This is
however dangerous thing prone to silent data corruption which can be
even exploitable. Let's take CVE-2017-1000253 as an example. At the time
(before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
stack consumption early during execve fully stopped by da029c11e6b1
("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
safe and any attack should be impractical. On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous. Moreover it shouldn't be even needed. We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address
should succeed even without MAP_FIXED. Something is terribly wrong if
this is not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/metag/kernel/process.c |  6 +++++-
 fs/binfmt_elf.c             | 12 ++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
index 0909834c83a7..867c8d0a5fb4 100644
--- a/arch/metag/kernel/process.c
+++ b/arch/metag/kernel/process.c
@@ -399,7 +399,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
 	tcm_tag = tcm_lookup_tag(addr);
 
 	if (tcm_tag != TCM_INVALID_TAG)
-		type &= ~MAP_FIXED;
+		type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
 
 	/*
 	* total_size is the size of the ELF (interpreter) image.
@@ -417,6 +417,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
 	} else
 		map_addr = vm_mmap(filep, addr, size, prot, type, off);
 
+	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+				task_pid_nr(current), tsk->comm, (void*)addr);
+
 	if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
 		struct tcm_allocation *tcm;
 		unsigned long tcm_addr;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 73b01e474fdc..5916d45f64a7 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 	} else
 		map_addr = vm_mmap(filep, addr, size, prot, type, off);
 
+	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+				task_pid_nr(current), current->comm, (void*)addr);
+
 	return(map_addr);
 }
 
@@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 				elf_prot |= PROT_EXEC;
 			vaddr = eppnt->p_vaddr;
 			if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
-				elf_type |= MAP_FIXED;
+				elf_type |= MAP_FIXED_SAFE;
 			else if (no_base && interp_elf_ex->e_type == ET_DYN)
 				load_addr = -vaddr;
 
@@ -930,7 +934,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		 * the ET_DYN load_addr calculations, proceed normally.
 		 */
 		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
-			elf_flags |= MAP_FIXED;
+			elf_flags |= MAP_FIXED_SAFE;
 		} else if (loc->elf_ex.e_type == ET_DYN) {
 			/*
 			 * This logic is run once for the first LOAD Program
@@ -966,7 +970,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				load_bias = ELF_ET_DYN_BASE;
 				if (current->flags & PF_RANDOMIZE)
 					load_bias += arch_mmap_rnd();
-				elf_flags |= MAP_FIXED;
+				elf_flags |= MAP_FIXED_SAFE;
 			} else
 				load_bias = 0;
 
@@ -1223,7 +1227,7 @@ static int load_elf_library(struct file *file)
 			(eppnt->p_filesz +
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)),
 			PROT_READ | PROT_WRITE | PROT_EXEC,
-			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+			MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
 			(eppnt->p_offset -
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)));
 	if (error != ELF_PAGESTART(eppnt->p_vaddr))
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-11-16 10:19 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
@ 2017-11-17  0:30   ` Kees Cook
  0 siblings, 0 replies; 48+ messages in thread
From: Kees Cook @ 2017-11-17  0:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux API, Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, Linux-MM, LKML,
	linux-arch, Michal Hocko, Abdul Haleem, Joel Stanley

On Thu, Nov 16, 2017 at 2:19 AM, Michal Hocko <mhocko@kernel.org> wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Both load_elf_interp and load_elf_binary rely on elf_map to map segments
> on a controlled address and they use MAP_FIXED to enforce that. This is
> however dangerous thing prone to silent data corruption which can be
> even exploitable. Let's take CVE-2017-1000253 as an example. At the time
> (before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
> ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
> the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
> we could end up mapping over the existing stack with some luck.
>
> The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
> fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
> further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
> revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
> stack consumption early during execve fully stopped by da029c11e6b1
> ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
> safe and any attack should be impractical. On the other hand this is
> just too subtle assumption so it can break quite easily and hard to
> spot.
>
> I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
> fundamentally dangerous. Moreover it shouldn't be even needed. We are
> at the early process stage and so there shouldn't be unrelated mappings
> (except for stack and loader) existing so mmap for a given address
> should succeed even without MAP_FIXED. Something is terribly wrong if
> this is not the case and we should rather fail than silently corrupt the
> underlying mapping.
>
> Address this issue by changing MAP_FIXED to the newly added
> MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
> existing mapping clashing with the requested one without clobbering it.
>
> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> Cc: Joel Stanley <joel@jms.id.au>
> Cc: Kees Cook <keescook@chromium.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Once (if?) the name gets settled, this looks good to me:

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/metag/kernel/process.c |  6 +++++-
>  fs/binfmt_elf.c             | 12 ++++++++----
>  2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
> index c4606ce743d2..2286140e54e0 100644
> --- a/arch/metag/kernel/process.c
> +++ b/arch/metag/kernel/process.c
> @@ -398,7 +398,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
>         tcm_tag = tcm_lookup_tag(addr);
>
>         if (tcm_tag != TCM_INVALID_TAG)
> -               type &= ~MAP_FIXED;
> +               type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
>
>         /*
>         * total_size is the size of the ELF (interpreter) image.
> @@ -416,6 +416,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
>         } else
>                 map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> +       if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> +               pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> +                               task_pid_nr(current), tsk->comm, (void*)addr);
> +
>         if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
>                 struct tcm_allocation *tcm;
>                 unsigned long tcm_addr;
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 6466153f2bf0..12b21942ccde 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
>         } else
>                 map_addr = vm_mmap(filep, addr, size, prot, type, off);
>
> +       if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
> +               pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
> +                               task_pid_nr(current), current->comm, (void*)addr);
> +
>         return(map_addr);
>  }
>
> @@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>                                 elf_prot |= PROT_EXEC;
>                         vaddr = eppnt->p_vaddr;
>                         if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
> -                               elf_type |= MAP_FIXED;
> +                               elf_type |= MAP_FIXED_SAFE;
>                         else if (no_base && interp_elf_ex->e_type == ET_DYN)
>                                 load_addr = -vaddr;
>
> @@ -929,7 +933,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>                  * the ET_DYN load_addr calculations, proceed normally.
>                  */
>                 if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> -                       elf_flags |= MAP_FIXED;
> +                       elf_flags |= MAP_FIXED_SAFE;
>                 } else if (loc->elf_ex.e_type == ET_DYN) {
>                         /*
>                          * This logic is run once for the first LOAD Program
> @@ -965,7 +969,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>                                 load_bias = ELF_ET_DYN_BASE;
>                                 if (current->flags & PF_RANDOMIZE)
>                                         load_bias += arch_mmap_rnd();
> -                               elf_flags |= MAP_FIXED;
> +                               elf_flags |= MAP_FIXED_SAFE;
>                         } else
>                                 load_bias = 0;
>
> @@ -1220,7 +1224,7 @@ static int load_elf_library(struct file *file)
>                         (eppnt->p_filesz +
>                          ELF_PAGEOFFSET(eppnt->p_vaddr)),
>                         PROT_READ | PROT_WRITE | PROT_EXEC,
> -                       MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
> +                       MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
>                         (eppnt->p_offset -
>                          ELF_PAGEOFFSET(eppnt->p_vaddr)));
>         if (error != ELF_PAGESTART(eppnt->p_vaddr))
> --
> 2.15.0
>



-- 
Kees Cook
Pixel Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map
  2017-11-16 10:18 (unknown), Michal Hocko
@ 2017-11-16 10:19 ` Michal Hocko
  2017-11-17  0:30   ` Kees Cook
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2017-11-16 10:19 UTC (permalink / raw)
  To: linux-api
  Cc: Khalid Aziz, Michael Ellerman, Andrew Morton,
	Russell King - ARM Linux, Andrea Arcangeli, linux-mm, LKML,
	linux-arch, Michal Hocko, Abdul Haleem, Joel Stanley, Kees Cook

From: Michal Hocko <mhocko@suse.com>

Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that. This is
however dangerous thing prone to silent data corruption which can be
even exploitable. Let's take CVE-2017-1000253 as an example. At the time
(before eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"))
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away). Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e246 ("fs/binfmt_elf.c:
fix bug in loading of PIE binaries")), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d400 and later by c715b72c1ba4 ("mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes")) and excessive
stack consumption early during execve fully stopped by da029c11e6b1
("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be
safe and any attack should be impractical. On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous. Moreover it shouldn't be even needed. We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address
should succeed even without MAP_FIXED. Something is terribly wrong if
this is not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_SAFE. This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/metag/kernel/process.c |  6 +++++-
 fs/binfmt_elf.c             | 12 ++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
index c4606ce743d2..2286140e54e0 100644
--- a/arch/metag/kernel/process.c
+++ b/arch/metag/kernel/process.c
@@ -398,7 +398,7 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
 	tcm_tag = tcm_lookup_tag(addr);
 
 	if (tcm_tag != TCM_INVALID_TAG)
-		type &= ~MAP_FIXED;
+		type &= ~(MAP_FIXED | MAP_FIXED_SAFE);
 
 	/*
 	* total_size is the size of the ELF (interpreter) image.
@@ -416,6 +416,10 @@ unsigned long __metag_elf_map(struct file *filep, unsigned long addr,
 	} else
 		map_addr = vm_mmap(filep, addr, size, prot, type, off);
 
+	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+				task_pid_nr(current), tsk->comm, (void*)addr);
+
 	if (!BAD_ADDR(map_addr) && tcm_tag != TCM_INVALID_TAG) {
 		struct tcm_allocation *tcm;
 		unsigned long tcm_addr;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6466153f2bf0..12b21942ccde 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -372,6 +372,10 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 	} else
 		map_addr = vm_mmap(filep, addr, size, prot, type, off);
 
+	if ((type & MAP_FIXED_SAFE) && BAD_ADDR(map_addr))
+		pr_info("%d (%s): Uhuuh, elf segement at %p requested but the memory is mapped already\n",
+				task_pid_nr(current), current->comm, (void*)addr);
+
 	return(map_addr);
 }
 
@@ -569,7 +573,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 				elf_prot |= PROT_EXEC;
 			vaddr = eppnt->p_vaddr;
 			if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
-				elf_type |= MAP_FIXED;
+				elf_type |= MAP_FIXED_SAFE;
 			else if (no_base && interp_elf_ex->e_type == ET_DYN)
 				load_addr = -vaddr;
 
@@ -929,7 +933,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		 * the ET_DYN load_addr calculations, proceed normally.
 		 */
 		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
-			elf_flags |= MAP_FIXED;
+			elf_flags |= MAP_FIXED_SAFE;
 		} else if (loc->elf_ex.e_type == ET_DYN) {
 			/*
 			 * This logic is run once for the first LOAD Program
@@ -965,7 +969,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				load_bias = ELF_ET_DYN_BASE;
 				if (current->flags & PF_RANDOMIZE)
 					load_bias += arch_mmap_rnd();
-				elf_flags |= MAP_FIXED;
+				elf_flags |= MAP_FIXED_SAFE;
 			} else
 				load_bias = 0;
 
@@ -1220,7 +1224,7 @@ static int load_elf_library(struct file *file)
 			(eppnt->p_filesz +
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)),
 			PROT_READ | PROT_WRITE | PROT_EXEC,
-			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+			MAP_FIXED_SAFE | MAP_PRIVATE | MAP_DENYWRITE,
 			(eppnt->p_offset -
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)));
 	if (error != ELF_PAGESTART(eppnt->p_vaddr))
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, back to index

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-13  9:25 [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Michal Hocko
2017-12-13  9:25 ` [PATCH 1/2] " Michal Hocko
     [not found]   ` <20171213092550.2774-2-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-12-13 12:50     ` Matthew Wilcox
2017-12-13 13:01       ` Michal Hocko
2017-12-13  9:25 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
2017-12-16  0:49   ` [2/2] " Andrei Vagin
2017-12-18  9:13     ` Michal Hocko
2017-12-18 18:12       ` Andrei Vagin
2017-12-13  9:31 ` [PATCH 1/2] mmap.2: document new MAP_FIXED_SAFE flag Michal Hocko
2017-12-13  9:31   ` [PATCH 2/2] mmap.2: MAP_FIXED updated documentation Michal Hocko
2017-12-13 12:55     ` Pavel Machek
2017-12-13 13:03       ` Cyril Hrubis
2017-12-13 13:04       ` Michal Hocko
2017-12-13 13:09         ` Pavel Machek
2017-12-13 13:16           ` Michal Hocko
     [not found]             ` <20171213131640.GJ25185-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-12-13 13:21               ` Pavel Machek
2017-12-13 13:35                 ` Michal Hocko
2017-12-13 14:40                 ` Cyril Hrubis
2017-12-13 23:19                   ` Kees Cook
     [not found]                     ` <CAGXu5jLqE6cUxk-Girx6PG7upEzz8jmu1OH_3LVC26iJc2vTxQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-14  7:07                       ` Michal Hocko
2017-12-18 19:12                     ` Michael Kerrisk (man-pages)
2017-12-18 20:19                       ` Kees Cook
2017-12-18 20:33                         ` Matthew Wilcox
     [not found]                         ` <CAGXu5jJ289R9koVoHmxcvUWr6XHSZR2p0qq3WtpNyN-iNSvrNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-21 12:38                           ` Michael Ellerman
2017-12-21 14:59                             ` known bad patch in -mm tree was " Pavel Machek
2017-12-21 15:08                               ` Michal Hocko
2017-12-21 22:24                             ` Andrew Morton
2017-12-22  0:06                               ` Michael Ellerman
2017-12-14  2:52     ` Jann Horn
     [not found]       ` <CAG48ez0JZ3PVW3vgSXDmDijS+a_5bSX9qNuyggnsB6JTSkKngA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-14  5:28         ` John Hubbard
2017-12-14 23:06       ` John Hubbard
     [not found]         ` <b4fb7b3a-e53e-bf87-53c5-186751a14f4e-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2017-12-14 23:10           ` Jann Horn
     [not found] ` <20171213092550.2774-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-12-13 12:25   ` [PATCH v2 0/2] mm: introduce MAP_FIXED_SAFE Matthew Wilcox
2017-12-13 12:34     ` Michal Hocko
2017-12-13 17:13   ` Kees Cook
2017-12-15  9:02     ` Michael Ellerman
2017-12-14  0:32   ` Andrew Morton
2017-12-14  1:35     ` David Goldblatt
2017-12-14  1:42       ` David Goldblatt
2017-12-14 12:44     ` Edward Napierala
2017-12-14 13:15       ` Michal Hocko
2017-12-14 14:54         ` Edward Napierala
2017-12-19 12:40           ` David Laight
2017-12-19 12:46             ` Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2017-11-29 14:42 [PATCH " Michal Hocko
2017-11-29 14:42 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
2017-11-29 17:45   ` Khalid Aziz
2017-11-16 10:18 (unknown), Michal Hocko
2017-11-16 10:19 ` [PATCH 2/2] fs, elf: drop MAP_FIXED usage from elf_map Michal Hocko
2017-11-17  0:30   ` Kees Cook

Linux-api Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-api/0 linux-api/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-api linux-api/ https://lore.kernel.org/linux-api \
		linux-api@vger.kernel.org
	public-inbox-index linux-api

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-api


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git