All of lore.kernel.org
 help / color / mirror / Atom feed
* Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-01-14  9:50 ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-01-14  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Hugh Dickins, Michel Lespinasse, Andrew Morton,
	Linus Torvalds, Rik van Riel, Michael Kerrisk (man-pages),
	LKML, linux-api

Hi,
Cyril has encountered one of the LTP tests failing after 3.12 kernel.
To quote him:
"
What the test does is to set memory limit inside of memcg to PAGESIZE by
writing to memory.limit_in_bytes, then runs a subprocess that uses
mmap() with MAP_LOCKED which allocates 2 * PAGESIZE and expects that
it's killed by OOM. This does not happen and the call to mmap() returns
a correct pointer to a memory region, that when accessed finally causes
the OOM.
"

The difference came from the memcg OOM killer rework because OOM killer
is triggered only from the page fault path since 519e52473ebe (mm:
memcg: enable memcg OOM killer only for user faults). The rationale is
described in 3812c8c8f395 (mm: memcg: do not trap chargers with full
callstack on OOM).

This is _not_ the primary _issue_, though. It has just made a long
standing issue more visible, the same is possible even without memcg but
it is much less likely (it might get more visible once we start failing
GFP_KERNEL allocations more often). The primary issue is that mmap
doesn't report a failure if MAP_LOCKED fails to populate the area. Is
this the correct/expected behavior?

The man page says
"
MAP_LOCKED (since Linux 2.5.37)
      Lock the pages of the mapped region into memory in the manner of
      mlock(2).  This flag is ignored in older kernels.
"

and mlock is required to fail if the population fails.
"
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when
       the call returns successfully; the pages are guaranteed to stay
       in RAM until later unlocked.
"

I have checked the history and it seems we never reported an error, at
least not during git era.

FWIW mlock behaves correctly and reports the error to the userspace.

I am not sure this is something to be fixed or rather documented in the
man page. I can imagine users who would prefer ENOMEM rather than seeing
a page fault later on - I would expect RT - but do those run inside memcg
controller or on heavily overcommited systems?

On the other hand the fix sound quite easy, we just have to use
__mm_populate and unmap the area on failure for VM_LOCKED vmas. Maybe
there are some historical reason for not doing that though.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-01-14  9:50 ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-01-14  9:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Hugh Dickins, Michel Lespinasse, Andrew Morton,
	Linus Torvalds, Rik van Riel, Michael Kerrisk (man-pages),
	LKML, linux-api

Hi,
Cyril has encountered one of the LTP tests failing after 3.12 kernel.
To quote him:
"
What the test does is to set memory limit inside of memcg to PAGESIZE by
writing to memory.limit_in_bytes, then runs a subprocess that uses
mmap() with MAP_LOCKED which allocates 2 * PAGESIZE and expects that
it's killed by OOM. This does not happen and the call to mmap() returns
a correct pointer to a memory region, that when accessed finally causes
the OOM.
"

The difference came from the memcg OOM killer rework because OOM killer
is triggered only from the page fault path since 519e52473ebe (mm:
memcg: enable memcg OOM killer only for user faults). The rationale is
described in 3812c8c8f395 (mm: memcg: do not trap chargers with full
callstack on OOM).

This is _not_ the primary _issue_, though. It has just made a long
standing issue more visible, the same is possible even without memcg but
it is much less likely (it might get more visible once we start failing
GFP_KERNEL allocations more often). The primary issue is that mmap
doesn't report a failure if MAP_LOCKED fails to populate the area. Is
this the correct/expected behavior?

The man page says
"
MAP_LOCKED (since Linux 2.5.37)
      Lock the pages of the mapped region into memory in the manner of
      mlock(2).  This flag is ignored in older kernels.
"

and mlock is required to fail if the population fails.
"
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when
       the call returns successfully; the pages are guaranteed to stay
       in RAM until later unlocked.
"

I have checked the history and it seems we never reported an error, at
least not during git era.

FWIW mlock behaves correctly and reports the error to the userspace.

I am not sure this is something to be fixed or rather documented in the
man page. I can imagine users who would prefer ENOMEM rather than seeing
a page fault later on - I would expect RT - but do those run inside memcg
controller or on heavily overcommited systems?

On the other hand the fix sound quite easy, we just have to use
__mm_populate and unmap the area on failure for VM_LOCKED vmas. Maybe
there are some historical reason for not doing that though.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 12:11   ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

Hi,
it seems that the initial email got lost (or ignored). I would like to
revive it again. I've cooked up a potential fix to this issue which will
follow as a reply to this email.

The first patch is dumb and straightforward. It should be safe as is and
also good without the follow up 2 patches which try to handle potential
allocation failures in the do_munmap path more gracefully. As we still
do not fail small allocations even the first patch could be simplified
a bit and the retry loop replaced by a BUG_ON right away. But I felt this
would better be done robust.

An obvious alternative would be patching the man pages to mention the
subtle difference between mlock and MAP_LOCKED semantic. I have checked
debian code search and it shown some applications relying on MAP_LOCKED
but I have no idea whether they really require the mlock all-or-nothing
fault in semantic.

Any thoughts, ideas?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 12:11   ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

Hi,
it seems that the initial email got lost (or ignored). I would like to
revive it again. I've cooked up a potential fix to this issue which will
follow as a reply to this email.

The first patch is dumb and straightforward. It should be safe as is and
also good without the follow up 2 patches which try to handle potential
allocation failures in the do_munmap path more gracefully. As we still
do not fail small allocations even the first patch could be simplified
a bit and the retry loop replaced by a BUG_ON right away. But I felt this
would better be done robust.

An obvious alternative would be patching the man pages to mention the
subtle difference between mlock and MAP_LOCKED semantic. I have checked
debian code search and it shown some applications relying on MAP_LOCKED
but I have no idea whether they really require the mlock all-or-nothing
fault in semantic.

Any thoughts, ideas?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 12:11   ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

Hi,
it seems that the initial email got lost (or ignored). I would like to
revive it again. I've cooked up a potential fix to this issue which will
follow as a reply to this email.

The first patch is dumb and straightforward. It should be safe as is and
also good without the follow up 2 patches which try to handle potential
allocation failures in the do_munmap path more gracefully. As we still
do not fail small allocations even the first patch could be simplified
a bit and the retry loop replaced by a BUG_ON right away. But I felt this
would better be done robust.

An obvious alternative would be patching the man pages to mention the
subtle difference between mlock and MAP_LOCKED semantic. I have checked
debian code search and it shown some applications relying on MAP_LOCKED
but I have no idea whether they really require the mlock all-or-nothing
fault in semantic.

Any thoughts, ideas?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic
  2015-04-28 12:11   ` Michal Hocko
@ 2015-04-28 12:11     ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

Cyril has encountered one of the LTP tests failing after 3.12 kernel.
To quote him:
"
What the test does is to set memory limit inside of memcg to PAGESIZE by
writing to memory.limit_in_bytes, then runs a subprocess that uses
mmap() with MAP_LOCKED which allocates 2 * PAGESIZE and expects that
it's killed by OOM. This does not happen and the call to mmap() returns
a correct pointer to a memory region, that when accessed finally causes
the OOM.
"

The difference came from the memcg OOM killer rework because OOM killer
is triggered only from the page fault path since 519e52473ebe (mm:
memcg: enable memcg OOM killer only for user faults). The rationale is
described in 3812c8c8f395 (mm: memcg: do not trap chargers with full
callstack on OOM).

This is _not_ the primary _issue_, though. It has just made a long
standing issue visible. The same is possible even without memcg but it
is much less likely (it might get more visible once we start failing
GFP_KERNEL small allocations). The primary issue is that mmap doesn't
report a failure if MAP_LOCKED fails to populate the area.

The man page however says
"
MAP_LOCKED (since Linux 2.5.37)
      Lock the pages of the mapped region into memory in the manner of
      mlock(2).  This flag is ignored in older kernels.
"

and mlock is required to fail if the population fails.
"
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when
       the call returns successfully; the pages are guaranteed to stay
       in RAM until later unlocked.
"

According to the git history this has alaways been the case so it
doesn't look like anything new. Most applications probably even do not
care because they do not explicitly require the population at the mmap
call time. If the application cannot tolerate later pagefault this would
be an unexpected and potentially silent failure though.

This patch fixes the behavior to really mimic mlock so mmap fails
if the population is not successful. The only issue here is that
we cannot leave the already created VMA behind and so it has to be
unmapped which as an operation which might fail.

 There are basically two potential reasons for a failure. Either the
map count limit could have been reached after we have dropped mmap_sem
for write when doing do_mmap_pgoff or any of the allocations during vma
splitting fails.

The first one is easy to solve because we can elevate map_count while we
are still holding mmap_sem before calling do_mmap_pgoff when MAP_LOCKED
is specified. In the worst case do_munmap would need to split VMA in the
middle and we would simply consume a cached map_count.

The allocation failure down the do_munmap path is the tricky one, albeit
only theoretical one right now because small allocations do not fail yet
(this sounds like something that might change in the future though). There
are more allocations places (e.g. in __split_vma) and those are allowed to fail.
Let's keep retrying do_munmap, drop the semaphore each round to allow other
threads to make a progress (e.g. madvise to free some memory) and hope we will
be able to do it sooner or later.

An alternative would be making all of do_munmap allocations non failing
by using __GFP_NOFAIL or killing the task but that sounds too harsh.

Reported-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/util.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 7 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 0c7f65e7ef5e..fbffefa3b812 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -290,16 +290,68 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long ret;
 	struct mm_struct *mm = current->mm;
 	unsigned long populate;
+	bool need_map_count_fix = false;
 
 	ret = security_mmap_file(file, prot, flag);
-	if (!ret) {
-		down_write(&mm->mmap_sem);
-		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
-				    &populate);
-		up_write(&mm->mmap_sem);
-		if (populate)
-			mm_populate(ret, populate);
+	if (ret)
+		return ret;
+
+	down_write(&mm->mmap_sem);
+	/*
+	 * Reserve one slot for a cleanup should __mm_populate fail
+	 * and we would need to split VMA in the middle.
+	 */
+	if (flag & MAP_LOCKED) {
+		mm->map_count++;
+		need_map_count_fix = true;
+	}
+	ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
+			    &populate);
+	up_write(&mm->mmap_sem);
+
+	if (populate) {
+		int error;
+
+		error = __mm_populate(ret, populate, 0);
+		if (!error)
+			return ret;
+
+		/*
+		 * MAP_LOCKED has a mlock semantic so we have to
+		 * fail mmap call if the population fails.
+		 * Regular MAP_POPULATE can tolerate the failure
+		 * though.
+		 */
+		if (flag & MAP_LOCKED) {
+			down_write(&mm->mmap_sem);
+			while (!fatal_signal_pending(current)) {
+				mm->map_count--;
+				need_map_count_fix = false;
+				if (!do_munmap(mm, ret, populate))
+					break;
+
+				/*
+				 * Do not block other threads to make a progress
+				 * e.g. madvise
+				 */
+				mm->map_count++;
+				need_map_count_fix = true;
+				up_write(&mm->mmap_sem);
+				cond_resched();
+				down_write(&mm->mmap_sem);
+			}
+			up_write(&mm->mmap_sem);
+
+			ret = -ENOMEM;
+		}
+	}
+
+	if (need_map_count_fix) {
+		down_read(&mm->mmap_sem);
+		mm->map_count--;
+		up_read(&mm->mmap_sem);
 	}
+
 	return ret;
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic
@ 2015-04-28 12:11     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

Cyril has encountered one of the LTP tests failing after 3.12 kernel.
To quote him:
"
What the test does is to set memory limit inside of memcg to PAGESIZE by
writing to memory.limit_in_bytes, then runs a subprocess that uses
mmap() with MAP_LOCKED which allocates 2 * PAGESIZE and expects that
it's killed by OOM. This does not happen and the call to mmap() returns
a correct pointer to a memory region, that when accessed finally causes
the OOM.
"

The difference came from the memcg OOM killer rework because OOM killer
is triggered only from the page fault path since 519e52473ebe (mm:
memcg: enable memcg OOM killer only for user faults). The rationale is
described in 3812c8c8f395 (mm: memcg: do not trap chargers with full
callstack on OOM).

This is _not_ the primary _issue_, though. It has just made a long
standing issue visible. The same is possible even without memcg but it
is much less likely (it might get more visible once we start failing
GFP_KERNEL small allocations). The primary issue is that mmap doesn't
report a failure if MAP_LOCKED fails to populate the area.

The man page however says
"
MAP_LOCKED (since Linux 2.5.37)
      Lock the pages of the mapped region into memory in the manner of
      mlock(2).  This flag is ignored in older kernels.
"

and mlock is required to fail if the population fails.
"
       mlock() locks pages in the address range starting at addr and
       continuing for len bytes.  All pages that contain a part of the
       specified address range are guaranteed to be resident in RAM when
       the call returns successfully; the pages are guaranteed to stay
       in RAM until later unlocked.
"

According to the git history this has alaways been the case so it
doesn't look like anything new. Most applications probably even do not
care because they do not explicitly require the population at the mmap
call time. If the application cannot tolerate later pagefault this would
be an unexpected and potentially silent failure though.

This patch fixes the behavior to really mimic mlock so mmap fails
if the population is not successful. The only issue here is that
we cannot leave the already created VMA behind and so it has to be
unmapped which as an operation which might fail.

 There are basically two potential reasons for a failure. Either the
map count limit could have been reached after we have dropped mmap_sem
for write when doing do_mmap_pgoff or any of the allocations during vma
splitting fails.

The first one is easy to solve because we can elevate map_count while we
are still holding mmap_sem before calling do_mmap_pgoff when MAP_LOCKED
is specified. In the worst case do_munmap would need to split VMA in the
middle and we would simply consume a cached map_count.

The allocation failure down the do_munmap path is the tricky one, albeit
only theoretical one right now because small allocations do not fail yet
(this sounds like something that might change in the future though). There
are more allocations places (e.g. in __split_vma) and those are allowed to fail.
Let's keep retrying do_munmap, drop the semaphore each round to allow other
threads to make a progress (e.g. madvise to free some memory) and hope we will
be able to do it sooner or later.

An alternative would be making all of do_munmap allocations non failing
by using __GFP_NOFAIL or killing the task but that sounds too harsh.

Reported-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/util.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 7 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 0c7f65e7ef5e..fbffefa3b812 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -290,16 +290,68 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long ret;
 	struct mm_struct *mm = current->mm;
 	unsigned long populate;
+	bool need_map_count_fix = false;
 
 	ret = security_mmap_file(file, prot, flag);
-	if (!ret) {
-		down_write(&mm->mmap_sem);
-		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
-				    &populate);
-		up_write(&mm->mmap_sem);
-		if (populate)
-			mm_populate(ret, populate);
+	if (ret)
+		return ret;
+
+	down_write(&mm->mmap_sem);
+	/*
+	 * Reserve one slot for a cleanup should __mm_populate fail
+	 * and we would need to split VMA in the middle.
+	 */
+	if (flag & MAP_LOCKED) {
+		mm->map_count++;
+		need_map_count_fix = true;
+	}
+	ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
+			    &populate);
+	up_write(&mm->mmap_sem);
+
+	if (populate) {
+		int error;
+
+		error = __mm_populate(ret, populate, 0);
+		if (!error)
+			return ret;
+
+		/*
+		 * MAP_LOCKED has a mlock semantic so we have to
+		 * fail mmap call if the population fails.
+		 * Regular MAP_POPULATE can tolerate the failure
+		 * though.
+		 */
+		if (flag & MAP_LOCKED) {
+			down_write(&mm->mmap_sem);
+			while (!fatal_signal_pending(current)) {
+				mm->map_count--;
+				need_map_count_fix = false;
+				if (!do_munmap(mm, ret, populate))
+					break;
+
+				/*
+				 * Do not block other threads to make a progress
+				 * e.g. madvise
+				 */
+				mm->map_count++;
+				need_map_count_fix = true;
+				up_write(&mm->mmap_sem);
+				cond_resched();
+				down_write(&mm->mmap_sem);
+			}
+			up_write(&mm->mmap_sem);
+
+			ret = -ENOMEM;
+		}
+	}
+
+	if (need_map_count_fix) {
+		down_read(&mm->mmap_sem);
+		mm->map_count--;
+		up_read(&mm->mmap_sem);
 	}
+
 	return ret;
 }
 
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 2/3] mm: allow munmap related functions to understand gfp_mask
  2015-04-28 12:11   ` Michal Hocko
@ 2015-04-28 12:11     ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

__split_vma path requires to allocate a memory. Later patch in the
series will require to change standard GFP_KERNEL allocation to use
__GFP_NOFAIL. In order to do that all the allocation paths down this
path should understand gfp requirements of the caller.

This involves vma_dup_policy and vma_adjust which have _gfp variant now
and anon_vma_clone got just a new parameter because it doesn't have
many callers.

The patch doesn't have any runtime effects but it makes the code
slightly larger:
   text    data     bss     dec     hex filename
 511480   74147   44440  630067   99d33 mm/built-in.o.before
 511560   74147   44440  630147   99d83 mm/built-in.o.after

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/mempolicy.h | 17 ++++++++++++++---
 include/linux/mm.h        | 10 ++++++++--
 include/linux/rmap.h      |  2 +-
 mm/mempolicy.c            |  9 +++++----
 mm/mmap.c                 | 43 +++++++++++++++++++++++++------------------
 mm/nommu.c                | 24 ++++++++++++++++++------
 mm/rmap.c                 |  7 ++++---
 7 files changed, 75 insertions(+), 37 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 3d385c81c153..a42585c09d4e 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -82,11 +82,12 @@ static inline void mpol_cond_put(struct mempolicy *pol)
 		__mpol_put(pol);
 }
 
-extern struct mempolicy *__mpol_dup(struct mempolicy *pol);
+extern struct mempolicy *mpol_dup_gfp(struct mempolicy *pol,
+		gfp_t gfp_mask);
 static inline struct mempolicy *mpol_dup(struct mempolicy *pol)
 {
 	if (pol)
-		pol = __mpol_dup(pol);
+		pol = mpol_dup_gfp(pol, GFP_KERNEL);
 	return pol;
 }
 
@@ -125,7 +126,12 @@ struct shared_policy {
 	spinlock_t lock;
 };
 
-int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst);
+int vma_dup_policy_gfp(struct vm_area_struct *src,
+		struct vm_area_struct *dst, gfp_t gfp_mask);
+static inline int
+vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) {
+	return vma_dup_policy_gfp(src, dst, GFP_KERNEL);
+}
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
@@ -235,6 +241,11 @@ vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
 {
 	return 0;
 }
+static inline int
+vma_dup_policy_gfp(struct vm_area_struct *src,
+		struct vm_area_struct *dst, gfp_t gfp_mask) {
+	return 0;
+}
 
 static inline void numa_policy_init(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d20fba62081..723032a2273f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1752,8 +1752,14 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
-extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
+extern int vma_adjust_gfp(struct vm_area_struct *vma, unsigned long start,
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
+	gfp_t gfp_mask);
+static inline int
+vma_adjust(struct vm_area_struct *vma, unsigned long start,
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert) {
+	return vma_adjust_gfp(vma, start, end, pgoff, insert, GFP_KERNEL);
+}
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bf36b6e644c4..23d210b84431 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -147,7 +147,7 @@ static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
 void anon_vma_init(void);	/* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
-int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *, gfp_t);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 
 static inline void anon_vma_merge(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ede26291d4aa..9002d0a15d74 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2060,9 +2060,10 @@ retry_cpuset:
 }
 EXPORT_SYMBOL(alloc_pages_current);
 
-int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
+int vma_dup_policy_gfp(struct vm_area_struct *src, struct vm_area_struct *dst,
+		gfp_t gfp_mask)
 {
-	struct mempolicy *pol = mpol_dup(vma_policy(src));
+	struct mempolicy *pol = mpol_dup_gfp(vma_policy(src), gfp_mask);
 
 	if (IS_ERR(pol))
 		return PTR_ERR(pol);
@@ -2082,9 +2083,9 @@ int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
  */
 
 /* Slow path of a mempolicy duplicate */
-struct mempolicy *__mpol_dup(struct mempolicy *old)
+struct mempolicy *mpol_dup_gfp(struct mempolicy *old, gfp_t gfp_mask)
 {
-	struct mempolicy *new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	struct mempolicy *new = kmem_cache_alloc(policy_cache, gfp_mask);
 
 	if (!new)
 		return ERR_PTR(-ENOMEM);
diff --git a/mm/mmap.c b/mm/mmap.c
index bb50cacc3ea5..4882008dac83 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -719,8 +719,9 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
  * are necessary.  The "insert" vma (if any) is to be inserted
  * before we drop the necessary locks.
  */
-int vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
+int vma_adjust_gfp(struct vm_area_struct *vma, unsigned long start,
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
+	gfp_t gfp_mask)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *next = vma->vm_next;
@@ -773,7 +774,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 			int error;
 
 			importer->anon_vma = exporter->anon_vma;
-			error = anon_vma_clone(importer, exporter);
+			error = anon_vma_clone(importer, exporter, gfp_mask);
 			if (error)
 				return error;
 		}
@@ -2435,11 +2436,11 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /*
- * __split_vma() bypasses sysctl_max_map_count checking.  We use this on the
+ * __split_vma_gfp() bypasses sysctl_max_map_count checking.  We use this on the
  * munmap path where it doesn't make sense to fail.
  */
-static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
-	      unsigned long addr, int new_below)
+static int __split_vma_gfp(struct mm_struct *mm, struct vm_area_struct *vma,
+	      unsigned long addr, int new_below, gfp_t gfp_mask)
 {
 	struct vm_area_struct *new;
 	int err = -ENOMEM;
@@ -2448,7 +2449,7 @@ static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 					~(huge_page_mask(hstate_vma(vma)))))
 		return -EINVAL;
 
-	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	new = kmem_cache_alloc(vm_area_cachep, gfp_mask);
 	if (!new)
 		goto out_err;
 
@@ -2464,11 +2465,11 @@ static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 		new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
 	}
 
-	err = vma_dup_policy(vma, new);
+	err = vma_dup_policy_gfp(vma, new, gfp_mask);
 	if (err)
 		goto out_free_vma;
 
-	err = anon_vma_clone(new, vma);
+	err = anon_vma_clone(new, vma, gfp_mask);
 	if (err)
 		goto out_free_mpol;
 
@@ -2479,10 +2480,10 @@ static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 		new->vm_ops->open(new);
 
 	if (new_below)
-		err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
-			((addr - new->vm_start) >> PAGE_SHIFT), new);
+		err = vma_adjust_gfp(vma, addr, vma->vm_end, vma->vm_pgoff +
+			((addr - new->vm_start) >> PAGE_SHIFT), new, gfp_mask);
 	else
-		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
+		err = vma_adjust_gfp(vma, vma->vm_start, addr, vma->vm_pgoff, new, gfp_mask);
 
 	/* Success. */
 	if (!err)
@@ -2512,7 +2513,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
-	return __split_vma(mm, vma, addr, new_below);
+	return __split_vma_gfp(mm, vma, addr, new_below, GFP_KERNEL);
 }
 
 /* Munmap is split into 2 main parts -- this part which finds
@@ -2520,7 +2521,8 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge <jeremy@goop.org>
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+static int __do_munmap_gfp(struct mm_struct *mm, unsigned long start, size_t len,
+		gfp_t gfp_mask)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
@@ -2562,7 +2564,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
 			return -ENOMEM;
 
-		error = __split_vma(mm, vma, start, 0);
+		error = __split_vma_gfp(mm, vma, start, 0, gfp_mask);
 		if (error)
 			return error;
 		prev = vma;
@@ -2571,7 +2573,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	/* Does it split the last one? */
 	last = find_vma(mm, end);
 	if (last && end > last->vm_start) {
-		int error = __split_vma(mm, last, end, 1);
+		int error = __split_vma_gfp(mm, last, end, 1, gfp_mask);
 		if (error)
 			return error;
 	}
@@ -2605,6 +2607,11 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	return __do_munmap_gfp(mm, start, len, GFP_KERNEL);
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2943,10 +2950,10 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			new_vma->vm_start = addr;
 			new_vma->vm_end = addr + len;
 			new_vma->vm_pgoff = pgoff;
-			if (vma_dup_policy(vma, new_vma))
+			if (vma_dup_policy_gfp(vma, new_vma, GFP_KERNEL))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma, GFP_KERNEL))
 				goto out_free_mempol;
 			if (new_vma->vm_file)
 				get_file(new_vma->vm_file);
diff --git a/mm/nommu.c b/mm/nommu.c
index 3fba2dc97c44..f1e7b41a2031 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1556,8 +1556,8 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
  * split a vma into two pieces at address 'addr', a new vma is allocated either
  * for the first part or the tail.
  */
-int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
-	      unsigned long addr, int new_below)
+static int __split_vma_gfp(struct mm_struct *mm, struct vm_area_struct *vma,
+	      unsigned long addr, int new_below, gfp_t gfp_mask)
 {
 	struct vm_area_struct *new;
 	struct vm_region *region;
@@ -1573,11 +1573,11 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
-	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
+	region = kmem_cache_alloc(vm_region_jar, gfp_mask);
 	if (!region)
 		return -ENOMEM;
 
-	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	new = kmem_cache_alloc(vm_area_cachep, gfp_mask);
 	if (!new) {
 		kmem_cache_free(vm_region_jar, region);
 		return -ENOMEM;
@@ -1618,6 +1618,12 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
+	      unsigned long addr, int new_below)
+{
+	return __split_vma_gfp(mm, vma, addr, new_below, GFP_KERNEL);
+}
+
 /*
  * shrink a VMA by removing the specified chunk from either the beginning or
  * the end
@@ -1663,7 +1669,8 @@ static int shrink_vma(struct mm_struct *mm,
  * - under NOMMU conditions the chunk to be unmapped must be backed by a single
  *   VMA, though it need not cover the whole VMA
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+static int __do_munmap_gfp(struct mm_struct *mm, unsigned long start, size_t len
+		gfp_t gfp_mask)
 {
 	struct vm_area_struct *vma;
 	unsigned long end;
@@ -1722,7 +1729,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 			return -EINVAL;
 		}
 		if (start != vma->vm_start && end != vma->vm_end) {
-			ret = split_vma(mm, vma, start, 1);
+			ret = __split_vma_gfp(mm, vma, start, 1, gfp_mask);
 			if (ret < 0) {
 				kleave(" = %d [split]", ret);
 				return ret;
@@ -1737,6 +1744,11 @@ erase_whole_vma:
 	kleave(" = 0");
 	return 0;
 }
+
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	return __do_munmap_gfp(mm, start, len, GFP_KERNEL);
+}
 EXPORT_SYMBOL(do_munmap);
 
 int vm_munmap(unsigned long addr, size_t len)
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a43e42c..e10101940031 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -249,7 +249,8 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
  * good chance of avoiding scanning the whole hierarchy when it searches where
  * page is mapped.
  */
-int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
+int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src,
+		gfp_t gfp_mask)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
@@ -261,7 +262,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		if (unlikely(!avc)) {
 			unlock_anon_vma_root(root);
 			root = NULL;
-			avc = anon_vma_chain_alloc(GFP_KERNEL);
+			avc = anon_vma_chain_alloc(gfp_mask);
 			if (!avc)
 				goto enomem_failure;
 		}
@@ -320,7 +321,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	error = anon_vma_clone(vma, pvma);
+	error = anon_vma_clone(vma, pvma, GFP_KERNEL);
 	if (error)
 		return error;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 2/3] mm: allow munmap related functions to understand gfp_mask
@ 2015-04-28 12:11     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

__split_vma path requires to allocate a memory. Later patch in the
series will require to change standard GFP_KERNEL allocation to use
__GFP_NOFAIL. In order to do that all the allocation paths down this
path should understand gfp requirements of the caller.

This involves vma_dup_policy and vma_adjust which have _gfp variant now
and anon_vma_clone got just a new parameter because it doesn't have
many callers.

The patch doesn't have any runtime effects but it makes the code
slightly larger:
   text    data     bss     dec     hex filename
 511480   74147   44440  630067   99d33 mm/built-in.o.before
 511560   74147   44440  630147   99d83 mm/built-in.o.after

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/mempolicy.h | 17 ++++++++++++++---
 include/linux/mm.h        | 10 ++++++++--
 include/linux/rmap.h      |  2 +-
 mm/mempolicy.c            |  9 +++++----
 mm/mmap.c                 | 43 +++++++++++++++++++++++++------------------
 mm/nommu.c                | 24 ++++++++++++++++++------
 mm/rmap.c                 |  7 ++++---
 7 files changed, 75 insertions(+), 37 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 3d385c81c153..a42585c09d4e 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -82,11 +82,12 @@ static inline void mpol_cond_put(struct mempolicy *pol)
 		__mpol_put(pol);
 }
 
-extern struct mempolicy *__mpol_dup(struct mempolicy *pol);
+extern struct mempolicy *mpol_dup_gfp(struct mempolicy *pol,
+		gfp_t gfp_mask);
 static inline struct mempolicy *mpol_dup(struct mempolicy *pol)
 {
 	if (pol)
-		pol = __mpol_dup(pol);
+		pol = mpol_dup_gfp(pol, GFP_KERNEL);
 	return pol;
 }
 
@@ -125,7 +126,12 @@ struct shared_policy {
 	spinlock_t lock;
 };
 
-int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst);
+int vma_dup_policy_gfp(struct vm_area_struct *src,
+		struct vm_area_struct *dst, gfp_t gfp_mask);
+static inline int
+vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) {
+	return vma_dup_policy_gfp(src, dst, GFP_KERNEL);
+}
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
@@ -235,6 +241,11 @@ vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
 {
 	return 0;
 }
+static inline int
+vma_dup_policy_gfp(struct vm_area_struct *src,
+		struct vm_area_struct *dst, gfp_t gfp_mask) {
+	return 0;
+}
 
 static inline void numa_policy_init(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d20fba62081..723032a2273f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1752,8 +1752,14 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
-extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
+extern int vma_adjust_gfp(struct vm_area_struct *vma, unsigned long start,
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
+	gfp_t gfp_mask);
+static inline int
+vma_adjust(struct vm_area_struct *vma, unsigned long start,
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert) {
+	return vma_adjust_gfp(vma, start, end, pgoff, insert, GFP_KERNEL);
+}
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bf36b6e644c4..23d210b84431 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -147,7 +147,7 @@ static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
 void anon_vma_init(void);	/* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
-int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *, gfp_t);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 
 static inline void anon_vma_merge(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ede26291d4aa..9002d0a15d74 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2060,9 +2060,10 @@ retry_cpuset:
 }
 EXPORT_SYMBOL(alloc_pages_current);
 
-int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
+int vma_dup_policy_gfp(struct vm_area_struct *src, struct vm_area_struct *dst,
+		gfp_t gfp_mask)
 {
-	struct mempolicy *pol = mpol_dup(vma_policy(src));
+	struct mempolicy *pol = mpol_dup_gfp(vma_policy(src), gfp_mask);
 
 	if (IS_ERR(pol))
 		return PTR_ERR(pol);
@@ -2082,9 +2083,9 @@ int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
  */
 
 /* Slow path of a mempolicy duplicate */
-struct mempolicy *__mpol_dup(struct mempolicy *old)
+struct mempolicy *mpol_dup_gfp(struct mempolicy *old, gfp_t gfp_mask)
 {
-	struct mempolicy *new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	struct mempolicy *new = kmem_cache_alloc(policy_cache, gfp_mask);
 
 	if (!new)
 		return ERR_PTR(-ENOMEM);
diff --git a/mm/mmap.c b/mm/mmap.c
index bb50cacc3ea5..4882008dac83 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -719,8 +719,9 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
  * are necessary.  The "insert" vma (if any) is to be inserted
  * before we drop the necessary locks.
  */
-int vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
+int vma_adjust_gfp(struct vm_area_struct *vma, unsigned long start,
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
+	gfp_t gfp_mask)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *next = vma->vm_next;
@@ -773,7 +774,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 			int error;
 
 			importer->anon_vma = exporter->anon_vma;
-			error = anon_vma_clone(importer, exporter);
+			error = anon_vma_clone(importer, exporter, gfp_mask);
 			if (error)
 				return error;
 		}
@@ -2435,11 +2436,11 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /*
- * __split_vma() bypasses sysctl_max_map_count checking.  We use this on the
+ * __split_vma_gfp() bypasses sysctl_max_map_count checking.  We use this on the
  * munmap path where it doesn't make sense to fail.
  */
-static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
-	      unsigned long addr, int new_below)
+static int __split_vma_gfp(struct mm_struct *mm, struct vm_area_struct *vma,
+	      unsigned long addr, int new_below, gfp_t gfp_mask)
 {
 	struct vm_area_struct *new;
 	int err = -ENOMEM;
@@ -2448,7 +2449,7 @@ static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 					~(huge_page_mask(hstate_vma(vma)))))
 		return -EINVAL;
 
-	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	new = kmem_cache_alloc(vm_area_cachep, gfp_mask);
 	if (!new)
 		goto out_err;
 
@@ -2464,11 +2465,11 @@ static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 		new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
 	}
 
-	err = vma_dup_policy(vma, new);
+	err = vma_dup_policy_gfp(vma, new, gfp_mask);
 	if (err)
 		goto out_free_vma;
 
-	err = anon_vma_clone(new, vma);
+	err = anon_vma_clone(new, vma, gfp_mask);
 	if (err)
 		goto out_free_mpol;
 
@@ -2479,10 +2480,10 @@ static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 		new->vm_ops->open(new);
 
 	if (new_below)
-		err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
-			((addr - new->vm_start) >> PAGE_SHIFT), new);
+		err = vma_adjust_gfp(vma, addr, vma->vm_end, vma->vm_pgoff +
+			((addr - new->vm_start) >> PAGE_SHIFT), new, gfp_mask);
 	else
-		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
+		err = vma_adjust_gfp(vma, vma->vm_start, addr, vma->vm_pgoff, new, gfp_mask);
 
 	/* Success. */
 	if (!err)
@@ -2512,7 +2513,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
-	return __split_vma(mm, vma, addr, new_below);
+	return __split_vma_gfp(mm, vma, addr, new_below, GFP_KERNEL);
 }
 
 /* Munmap is split into 2 main parts -- this part which finds
@@ -2520,7 +2521,8 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge <jeremy@goop.org>
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+static int __do_munmap_gfp(struct mm_struct *mm, unsigned long start, size_t len,
+		gfp_t gfp_mask)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
@@ -2562,7 +2564,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
 			return -ENOMEM;
 
-		error = __split_vma(mm, vma, start, 0);
+		error = __split_vma_gfp(mm, vma, start, 0, gfp_mask);
 		if (error)
 			return error;
 		prev = vma;
@@ -2571,7 +2573,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	/* Does it split the last one? */
 	last = find_vma(mm, end);
 	if (last && end > last->vm_start) {
-		int error = __split_vma(mm, last, end, 1);
+		int error = __split_vma_gfp(mm, last, end, 1, gfp_mask);
 		if (error)
 			return error;
 	}
@@ -2605,6 +2607,11 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	return __do_munmap_gfp(mm, start, len, GFP_KERNEL);
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2943,10 +2950,10 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			new_vma->vm_start = addr;
 			new_vma->vm_end = addr + len;
 			new_vma->vm_pgoff = pgoff;
-			if (vma_dup_policy(vma, new_vma))
+			if (vma_dup_policy_gfp(vma, new_vma, GFP_KERNEL))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma, GFP_KERNEL))
 				goto out_free_mempol;
 			if (new_vma->vm_file)
 				get_file(new_vma->vm_file);
diff --git a/mm/nommu.c b/mm/nommu.c
index 3fba2dc97c44..f1e7b41a2031 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1556,8 +1556,8 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
  * split a vma into two pieces at address 'addr', a new vma is allocated either
  * for the first part or the tail.
  */
-int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
-	      unsigned long addr, int new_below)
+static int __split_vma_gfp(struct mm_struct *mm, struct vm_area_struct *vma,
+	      unsigned long addr, int new_below, gfp_t gfp_mask)
 {
 	struct vm_area_struct *new;
 	struct vm_region *region;
@@ -1573,11 +1573,11 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
-	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
+	region = kmem_cache_alloc(vm_region_jar, gfp_mask);
 	if (!region)
 		return -ENOMEM;
 
-	new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	new = kmem_cache_alloc(vm_area_cachep, gfp_mask);
 	if (!new) {
 		kmem_cache_free(vm_region_jar, region);
 		return -ENOMEM;
@@ -1618,6 +1618,12 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
+	      unsigned long addr, int new_below)
+{
+	return __split_vma_gfp(mm, vma, addr, new_below, GFP_KERNEL);
+}
+
 /*
  * shrink a VMA by removing the specified chunk from either the beginning or
  * the end
@@ -1663,7 +1669,8 @@ static int shrink_vma(struct mm_struct *mm,
  * - under NOMMU conditions the chunk to be unmapped must be backed by a single
  *   VMA, though it need not cover the whole VMA
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+static int __do_munmap_gfp(struct mm_struct *mm, unsigned long start, size_t len
+		gfp_t gfp_mask)
 {
 	struct vm_area_struct *vma;
 	unsigned long end;
@@ -1722,7 +1729,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 			return -EINVAL;
 		}
 		if (start != vma->vm_start && end != vma->vm_end) {
-			ret = split_vma(mm, vma, start, 1);
+			ret = __split_vma_gfp(mm, vma, start, 1, gfp_mask);
 			if (ret < 0) {
 				kleave(" = %d [split]", ret);
 				return ret;
@@ -1737,6 +1744,11 @@ erase_whole_vma:
 	kleave(" = 0");
 	return 0;
 }
+
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	return __do_munmap_gfp(mm, start, len, GFP_KERNEL);
+}
 EXPORT_SYMBOL(do_munmap);
 
 int vm_munmap(unsigned long addr, size_t len)
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a43e42c..e10101940031 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -249,7 +249,8 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
  * good chance of avoiding scanning the whole hierarchy when it searches where
  * page is mapped.
  */
-int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
+int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src,
+		gfp_t gfp_mask)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
@@ -261,7 +262,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		if (unlikely(!avc)) {
 			unlock_anon_vma_root(root);
 			root = NULL;
-			avc = anon_vma_chain_alloc(GFP_KERNEL);
+			avc = anon_vma_chain_alloc(gfp_mask);
 			if (!avc)
 				goto enomem_failure;
 		}
@@ -320,7 +321,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	error = anon_vma_clone(vma, pvma);
+	error = anon_vma_clone(vma, pvma, GFP_KERNEL);
 	if (error)
 		return error;
 
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 3/3] mm: introduce do_munmap_nofail
@ 2015-04-28 12:11     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

vm_mmap_pgoff with MAP_LOCKED need to call do_munmap in case the
population of the area fails. The operation cannot fail for obvious
reasons. The current code simply retries in the loop which is not
very nice.

This patch introduces do_munmap_nofail() which uses __GFP_NOFAIL
for allocations required down the unmap path. It is always better
to loop inside the allocator rather than outside if there is no
sensible way handle the allocation failure.
Allocator can perform additional steps to help the allocation to
succeed (e.g. can get access to memory reserves).

The caller of the function has to make sure that the mapping is
initialized properly - namely [start, len] correspond to an existing VMA
and that the split doesn't exceed sysctl_max_map_count. This is true for
vm_mmap_pgoff so it is safe to be used in this path. The function is for
internal use only so it is not exported to the rest of the kernel.

While we are at it, let's make nommu shrink_vma return void because it
doesn't have any failing path.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/internal.h |  4 ++++
 mm/mmap.c     |  6 ++++++
 mm/nommu.c    | 11 ++++++++---
 mm/util.c     | 15 ++-------------
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..7f9d1f112d3b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -415,6 +415,10 @@ extern unsigned long vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
         unsigned long, unsigned long);
 
+/* Caller has to make sure the [addr, len] corresponds to a valid VMA */
+extern void do_munmap_nofail(struct mm_struct * mm,
+			     unsigned long addr, size_t len);
+
 extern void set_pageblock_order(void);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *page_list);
diff --git a/mm/mmap.c b/mm/mmap.c
index 4882008dac83..d54544c7b2ba 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2612,6 +2612,12 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	return __do_munmap_gfp(mm, start, len, GFP_KERNEL);
 }
 
+void do_munmap_nofail(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	BUG_ON(__do_munmap_gfp(mm, start, len, GFP_KERNEL|__GFP_NOFAIL));
+}
+
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
diff --git a/mm/nommu.c b/mm/nommu.c
index f1e7b41a2031..9bdd1dedb4cd 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1628,7 +1628,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * shrink a VMA by removing the specified chunk from either the beginning or
  * the end
  */
-static int shrink_vma(struct mm_struct *mm,
+static void shrink_vma(struct mm_struct *mm,
 		      struct vm_area_struct *vma,
 		      unsigned long from, unsigned long to)
 {
@@ -1661,7 +1661,6 @@ static int shrink_vma(struct mm_struct *mm,
 	up_write(&nommu_region_sem);
 
 	free_page_series(from, to);
-	return 0;
 }
 
 /*
@@ -1735,7 +1734,8 @@ static int __do_munmap_gfp(struct mm_struct *mm, unsigned long start, size_t len
 				return ret;
 			}
 		}
-		return shrink_vma(mm, vma, start, end);
+		shrink_vma(mm, vma, start, end);
+		return;
 	}
 
 erase_whole_vma:
@@ -1751,6 +1751,11 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 }
 EXPORT_SYMBOL(do_munmap);
 
+static void do_munmap_nofail(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	BUG_ON(do_munmap(mm, start, len, GFP_KERNEL|__GFP_NOFAIL));
+}
+
 int vm_munmap(unsigned long addr, size_t len)
 {
 	struct mm_struct *mm = current->mm;
diff --git a/mm/util.c b/mm/util.c
index fbffefa3b812..ddac3ea918c2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -324,21 +324,10 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 		 */
 		if (flag & MAP_LOCKED) {
 			down_write(&mm->mmap_sem);
-			while (!fatal_signal_pending(current)) {
+			if (!fatal_signal_pending(current)) {
 				mm->map_count--;
 				need_map_count_fix = false;
-				if (!do_munmap(mm, ret, populate))
-					break;
-
-				/*
-				 * Do not block other threads to make a progress
-				 * e.g. madvise
-				 */
-				mm->map_count++;
-				need_map_count_fix = true;
-				up_write(&mm->mmap_sem);
-				cond_resched();
-				down_write(&mm->mmap_sem);
+				do_munmap_nofail(mm, ret, populate);
 			}
 			up_write(&mm->mmap_sem);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 3/3] mm: introduce do_munmap_nofail
@ 2015-04-28 12:11     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

vm_mmap_pgoff with MAP_LOCKED need to call do_munmap in case the
population of the area fails. The operation cannot fail for obvious
reasons. The current code simply retries in the loop which is not
very nice.

This patch introduces do_munmap_nofail() which uses __GFP_NOFAIL
for allocations required down the unmap path. It is always better
to loop inside the allocator rather than outside if there is no
sensible way handle the allocation failure.
Allocator can perform additional steps to help the allocation to
succeed (e.g. can get access to memory reserves).

The caller of the function has to make sure that the mapping is
initialized properly - namely [start, len] correspond to an existing VMA
and that the split doesn't exceed sysctl_max_map_count. This is true for
vm_mmap_pgoff so it is safe to be used in this path. The function is for
internal use only so it is not exported to the rest of the kernel.

While we are at it, let's make nommu shrink_vma return void because it
doesn't have any failing path.

Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
 mm/internal.h |  4 ++++
 mm/mmap.c     |  6 ++++++
 mm/nommu.c    | 11 ++++++++---
 mm/util.c     | 15 ++-------------
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..7f9d1f112d3b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -415,6 +415,10 @@ extern unsigned long vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
         unsigned long, unsigned long);
 
+/* Caller has to make sure the [addr, len] corresponds to a valid VMA */
+extern void do_munmap_nofail(struct mm_struct * mm,
+			     unsigned long addr, size_t len);
+
 extern void set_pageblock_order(void);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *page_list);
diff --git a/mm/mmap.c b/mm/mmap.c
index 4882008dac83..d54544c7b2ba 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2612,6 +2612,12 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	return __do_munmap_gfp(mm, start, len, GFP_KERNEL);
 }
 
+void do_munmap_nofail(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	BUG_ON(__do_munmap_gfp(mm, start, len, GFP_KERNEL|__GFP_NOFAIL));
+}
+
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
diff --git a/mm/nommu.c b/mm/nommu.c
index f1e7b41a2031..9bdd1dedb4cd 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1628,7 +1628,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * shrink a VMA by removing the specified chunk from either the beginning or
  * the end
  */
-static int shrink_vma(struct mm_struct *mm,
+static void shrink_vma(struct mm_struct *mm,
 		      struct vm_area_struct *vma,
 		      unsigned long from, unsigned long to)
 {
@@ -1661,7 +1661,6 @@ static int shrink_vma(struct mm_struct *mm,
 	up_write(&nommu_region_sem);
 
 	free_page_series(from, to);
-	return 0;
 }
 
 /*
@@ -1735,7 +1734,8 @@ static int __do_munmap_gfp(struct mm_struct *mm, unsigned long start, size_t len
 				return ret;
 			}
 		}
-		return shrink_vma(mm, vma, start, end);
+		shrink_vma(mm, vma, start, end);
+		return;
 	}
 
 erase_whole_vma:
@@ -1751,6 +1751,11 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 }
 EXPORT_SYMBOL(do_munmap);
 
+static void do_munmap_nofail(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	BUG_ON(do_munmap(mm, start, len, GFP_KERNEL|__GFP_NOFAIL));
+}
+
 int vm_munmap(unsigned long addr, size_t len)
 {
 	struct mm_struct *mm = current->mm;
diff --git a/mm/util.c b/mm/util.c
index fbffefa3b812..ddac3ea918c2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -324,21 +324,10 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 		 */
 		if (flag & MAP_LOCKED) {
 			down_write(&mm->mmap_sem);
-			while (!fatal_signal_pending(current)) {
+			if (!fatal_signal_pending(current)) {
 				mm->map_count--;
 				need_map_count_fix = false;
-				if (!do_munmap(mm, ret, populate))
-					break;
-
-				/*
-				 * Do not block other threads to make a progress
-				 * e.g. madvise
-				 */
-				mm->map_count++;
-				need_map_count_fix = true;
-				up_write(&mm->mmap_sem);
-				cond_resched();
-				down_write(&mm->mmap_sem);
+				do_munmap_nofail(mm, ret, populate);
 			}
 			up_write(&mm->mmap_sem);
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 3/3] mm: introduce do_munmap_nofail
@ 2015-04-28 12:11     ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 12:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Cyril Hrubis, Andrew Morton, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

vm_mmap_pgoff with MAP_LOCKED need to call do_munmap in case the
population of the area fails. The operation cannot fail for obvious
reasons. The current code simply retries in the loop which is not
very nice.

This patch introduces do_munmap_nofail() which uses __GFP_NOFAIL
for allocations required down the unmap path. It is always better
to loop inside the allocator rather than outside if there is no
sensible way handle the allocation failure.
Allocator can perform additional steps to help the allocation to
succeed (e.g. can get access to memory reserves).

The caller of the function has to make sure that the mapping is
initialized properly - namely [start, len] correspond to an existing VMA
and that the split doesn't exceed sysctl_max_map_count. This is true for
vm_mmap_pgoff so it is safe to be used in this path. The function is for
internal use only so it is not exported to the rest of the kernel.

While we are at it, let's make nommu shrink_vma return void because it
doesn't have any failing path.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/internal.h |  4 ++++
 mm/mmap.c     |  6 ++++++
 mm/nommu.c    | 11 ++++++++---
 mm/util.c     | 15 ++-------------
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..7f9d1f112d3b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -415,6 +415,10 @@ extern unsigned long vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
         unsigned long, unsigned long);
 
+/* Caller has to make sure the [addr, len] corresponds to a valid VMA */
+extern void do_munmap_nofail(struct mm_struct * mm,
+			     unsigned long addr, size_t len);
+
 extern void set_pageblock_order(void);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *page_list);
diff --git a/mm/mmap.c b/mm/mmap.c
index 4882008dac83..d54544c7b2ba 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2612,6 +2612,12 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	return __do_munmap_gfp(mm, start, len, GFP_KERNEL);
 }
 
+void do_munmap_nofail(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	BUG_ON(__do_munmap_gfp(mm, start, len, GFP_KERNEL|__GFP_NOFAIL));
+}
+
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
diff --git a/mm/nommu.c b/mm/nommu.c
index f1e7b41a2031..9bdd1dedb4cd 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1628,7 +1628,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * shrink a VMA by removing the specified chunk from either the beginning or
  * the end
  */
-static int shrink_vma(struct mm_struct *mm,
+static void shrink_vma(struct mm_struct *mm,
 		      struct vm_area_struct *vma,
 		      unsigned long from, unsigned long to)
 {
@@ -1661,7 +1661,6 @@ static int shrink_vma(struct mm_struct *mm,
 	up_write(&nommu_region_sem);
 
 	free_page_series(from, to);
-	return 0;
 }
 
 /*
@@ -1735,7 +1734,8 @@ static int __do_munmap_gfp(struct mm_struct *mm, unsigned long start, size_t len
 				return ret;
 			}
 		}
-		return shrink_vma(mm, vma, start, end);
+		shrink_vma(mm, vma, start, end);
+		return;
 	}
 
 erase_whole_vma:
@@ -1751,6 +1751,11 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 }
 EXPORT_SYMBOL(do_munmap);
 
+static void do_munmap_nofail(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	BUG_ON(do_munmap(mm, start, len, GFP_KERNEL|__GFP_NOFAIL));
+}
+
 int vm_munmap(unsigned long addr, size_t len)
 {
 	struct mm_struct *mm = current->mm;
diff --git a/mm/util.c b/mm/util.c
index fbffefa3b812..ddac3ea918c2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -324,21 +324,10 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 		 */
 		if (flag & MAP_LOCKED) {
 			down_write(&mm->mmap_sem);
-			while (!fatal_signal_pending(current)) {
+			if (!fatal_signal_pending(current)) {
 				mm->map_count--;
 				need_map_count_fix = false;
-				if (!do_munmap(mm, ret, populate))
-					break;
-
-				/*
-				 * Do not block other threads to make a progress
-				 * e.g. madvise
-				 */
-				mm->map_count++;
-				need_map_count_fix = true;
-				up_write(&mm->mmap_sem);
-				cond_resched();
-				down_write(&mm->mmap_sem);
+				do_munmap_nofail(mm, ret, populate);
 			}
 			up_write(&mm->mmap_sem);
 
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
  2015-04-28 12:11   ` Michal Hocko
@ 2015-04-28 16:01     ` Linus Torvalds
  -1 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-04-28 16:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue, Apr 28, 2015 at 5:11 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> The first patch is dumb and straightforward. It should be safe as is and
> also good without the follow up 2 patches which try to handle potential
> allocation failures in the do_munmap path more gracefully. As we still
> do not fail small allocations even the first patch could be simplified
> a bit and the retry loop replaced by a BUG_ON right away.

I think the BUG_ON() is a bad idea in the first place, and is in fact
a good reason to ignore the patch series entirely.

What is the point of that BUG_ON()?

Hell, people add too many of those things. There is *no* excuse for
killing the kernel for things like this (and in certain setups,
BUG_ON() *will* cause the machine to be rebooted). None. It's
completely inexcusable.

Thinking like this must go. BUG_ON() is for things where our internal
data structures are so corrupted that we don't know what to do, and
there's no way to continue. Not for "I want to sprinkle these things
around and this should not happen".

I also think that the whole complex "do_munmap_nofail()" is broken to
begin with, along with the crazy "!fatal_signal_pending()" thing.
There is absolutely no excuse for any of this.

Your code is also fundamentally buggy in that it tries to do unmap()
after it has dropped all locks, and things went wrong. So you may nto
be unmapping some other threads data.

There is no way in hell any of these patches can ever be applied.

There's a reason we don't handle populate failures - it's exactly
because we've dropped the locks etc. After dropping the locks, we
*cannot* clean up any more, because there's no way to know whather
we're cleaning up the right thing.  You'd have to hold the write lock
over the whole populate, which has serious problems of its own.

So NAK on this series. I think just documenting the man-page might be
better. I don't think MAP_LOCKED is sanely fixable.

We might improve on MAP_LOCKED by having a heuristic up-front
(*before* actually doing any mmap) to verify that it's *likely* that
it will work. So we could return ENOMEM early if it looks like the
user would hit the resource limits, for example. That wouldn't be any
guarantee (another process might eat up the resource limit anyway),
and in fact it might be overly eager to fail (maybe the
mmap(MAP_LOCKED ends up unmapping an older locked mapping and we deny
it too eagerly), but it would probably work well enough in practice.

That, together with a warning in the man-page about mmap(MAP_LOCKED)
not being able to return "I only locked part of the mapping", if you
want full error handling you need to do mmap()+mlock() and check the
two errors separately.

Hmm? But I really dislike your patch-series as-is.

                       Linus

                      Linus

                           Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 16:01     ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-04-28 16:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue, Apr 28, 2015 at 5:11 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> The first patch is dumb and straightforward. It should be safe as is and
> also good without the follow up 2 patches which try to handle potential
> allocation failures in the do_munmap path more gracefully. As we still
> do not fail small allocations even the first patch could be simplified
> a bit and the retry loop replaced by a BUG_ON right away.

I think the BUG_ON() is a bad idea in the first place, and is in fact
a good reason to ignore the patch series entirely.

What is the point of that BUG_ON()?

Hell, people add too many of those things. There is *no* excuse for
killing the kernel for things like this (and in certain setups,
BUG_ON() *will* cause the machine to be rebooted). None. It's
completely inexcusable.

Thinking like this must go. BUG_ON() is for things where our internal
data structures are so corrupted that we don't know what to do, and
there's no way to continue. Not for "I want to sprinkle these things
around and this should not happen".

I also think that the whole complex "do_munmap_nofail()" is broken to
begin with, along with the crazy "!fatal_signal_pending()" thing.
There is absolutely no excuse for any of this.

Your code is also fundamentally buggy in that it tries to do unmap()
after it has dropped all locks, and things went wrong. So you may nto
be unmapping some other threads data.

There is no way in hell any of these patches can ever be applied.

There's a reason we don't handle populate failures - it's exactly
because we've dropped the locks etc. After dropping the locks, we
*cannot* clean up any more, because there's no way to know whather
we're cleaning up the right thing.  You'd have to hold the write lock
over the whole populate, which has serious problems of its own.

So NAK on this series. I think just documenting the man-page might be
better. I don't think MAP_LOCKED is sanely fixable.

We might improve on MAP_LOCKED by having a heuristic up-front
(*before* actually doing any mmap) to verify that it's *likely* that
it will work. So we could return ENOMEM early if it looks like the
user would hit the resource limits, for example. That wouldn't be any
guarantee (another process might eat up the resource limit anyway),
and in fact it might be overly eager to fail (maybe the
mmap(MAP_LOCKED ends up unmapping an older locked mapping and we deny
it too eagerly), but it would probably work well enough in practice.

That, together with a warning in the man-page about mmap(MAP_LOCKED)
not being able to return "I only locked part of the mapping", if you
want full error handling you need to do mmap()+mlock() and check the
two errors separately.

Hmm? But I really dislike your patch-series as-is.

                       Linus

                      Linus

                           Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
  2015-04-28 16:01     ` Linus Torvalds
@ 2015-04-28 16:43       ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 16:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:01:59, Linus Torvalds wrote:
[...]
> Your code is also fundamentally buggy in that it tries to do unmap()
> after it has dropped all locks, and things went wrong. So you may nto
> be unmapping some other threads data.

Hmm, no other thread has the address from the current mmap call except
for MAP_FIXED (more on that below).

Well I can imagine userspace doing nasty things like watching
/proc/self/maps and using the address from there or using an address as
an mmap hint and then using it before mmap returns by other threads. But
would those be valid usecases? They sound crazy and buggy to me.

Another nasty case would be MAP_FIXED from a different thread destroying
the mmap I am trying to poppulate but that is not so interesting because
nothing protects from that even now.
Or this being MAP_FIXED|MAP_LOCKED which has already destroyed a part
of somebody's else mapping and the cleanup would lead to an unexpected
SIGSEGV for the other thread. Is this the case you are worried about?

Or am I missing other cases?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 16:43       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 16:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:01:59, Linus Torvalds wrote:
[...]
> Your code is also fundamentally buggy in that it tries to do unmap()
> after it has dropped all locks, and things went wrong. So you may nto
> be unmapping some other threads data.

Hmm, no other thread has the address from the current mmap call except
for MAP_FIXED (more on that below).

Well I can imagine userspace doing nasty things like watching
/proc/self/maps and using the address from there or using an address as
an mmap hint and then using it before mmap returns by other threads. But
would those be valid usecases? They sound crazy and buggy to me.

Another nasty case would be MAP_FIXED from a different thread destroying
the mmap I am trying to poppulate but that is not so interesting because
nothing protects from that even now.
Or this being MAP_FIXED|MAP_LOCKED which has already destroyed a part
of somebody's else mapping and the cleanup would lead to an unexpected
SIGSEGV for the other thread. Is this the case you are worried about?

Or am I missing other cases?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
  2015-04-28 16:43       ` Michal Hocko
@ 2015-04-28 16:57         ` Linus Torvalds
  -1 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-04-28 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue, Apr 28, 2015 at 9:43 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> Hmm, no other thread has the address from the current mmap call except
> for MAP_FIXED (more on that below).

With things like opportunistic SIGSEGV handlers that map/unmap things
as the user takes faults, that's actually not at all guaranteed.

Yeah, it's unusual, but I've seen it, with threaded applications where
people play games with user-space memory management, and do "demand
allocation" with mmap() in response to signals.

Admittedly we already do bad things in mmap(MAP_FIXED) for that case,
since we dropped the vm lock. But at least it shouldn't be any worse
than a thread speculatively touching the pages..

                      Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 16:57         ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-04-28 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue, Apr 28, 2015 at 9:43 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> Hmm, no other thread has the address from the current mmap call except
> for MAP_FIXED (more on that below).

With things like opportunistic SIGSEGV handlers that map/unmap things
as the user takes faults, that's actually not at all guaranteed.

Yeah, it's unusual, but I've seen it, with threaded applications where
people play games with user-space memory management, and do "demand
allocation" with mmap() in response to signals.

Admittedly we already do bad things in mmap(MAP_FIXED) for that case,
since we dropped the vm lock. But at least it shouldn't be any worse
than a thread speculatively touching the pages..

                      Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 18:35           ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:57:11, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 9:43 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > Hmm, no other thread has the address from the current mmap call except
> > for MAP_FIXED (more on that below).
> 
> With things like opportunistic SIGSEGV handlers that map/unmap things
> as the user takes faults, that's actually not at all guaranteed.
> 
> Yeah, it's unusual, but I've seen it, with threaded applications where
> people play games with user-space memory management, and do "demand
> allocation" with mmap() in response to signals.

I am still not sure I see the problem here. Let's say we have a
userspace page fault handler which would do mmap(fault_addr, MAP_FIXED),
right?

If we had a racy mmap(NULL, MAP_LOCKED) that could have mapped
fault_addr by the time handler does its work then this is buggy wrt. to
MAP_LOCKED semantic because the fault handler would discard the locked
part. This wouldn't lead to a data loss but still makes MAP_LOCKED usage
buggy IMO.

If the racing thread did mmap(around_fault_addr, MAP_FIXED|MAP_LOCKED)
then it would be broken as well, and even worse I would say, because the
original fault could have been discarded and data lost.

I would expect that user fault handlers would be synchronized with
other mmap activity otherwise I have hard time to see how this can all
have a well defined behavior. Especially when MAP_FIXED is involved.

> Admittedly we already do bad things in mmap(MAP_FIXED) for that case,
> since we dropped the vm lock. But at least it shouldn't be any worse
> than a thread speculatively touching the pages..

Actually we already allow to mmap(MAP_FIXED) to fail after
discarding an existing mmaped area (see mmap_region and e.g.
security_vm_enough_memory_mm or other failure cases).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 18:35           ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:57:11, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 9:43 AM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> >
> > Hmm, no other thread has the address from the current mmap call except
> > for MAP_FIXED (more on that below).
> 
> With things like opportunistic SIGSEGV handlers that map/unmap things
> as the user takes faults, that's actually not at all guaranteed.
> 
> Yeah, it's unusual, but I've seen it, with threaded applications where
> people play games with user-space memory management, and do "demand
> allocation" with mmap() in response to signals.

I am still not sure I see the problem here. Let's say we have a
userspace page fault handler which would do mmap(fault_addr, MAP_FIXED),
right?

If we had a racy mmap(NULL, MAP_LOCKED) that could have mapped
fault_addr by the time handler does its work then this is buggy wrt. to
MAP_LOCKED semantic because the fault handler would discard the locked
part. This wouldn't lead to a data loss but still makes MAP_LOCKED usage
buggy IMO.

If the racing thread did mmap(around_fault_addr, MAP_FIXED|MAP_LOCKED)
then it would be broken as well, and even worse I would say, because the
original fault could have been discarded and data lost.

I would expect that user fault handlers would be synchronized with
other mmap activity otherwise I have hard time to see how this can all
have a well defined behavior. Especially when MAP_FIXED is involved.

> Admittedly we already do bad things in mmap(MAP_FIXED) for that case,
> since we dropped the vm lock. But at least it shouldn't be any worse
> than a thread speculatively touching the pages..

Actually we already allow to mmap(MAP_FIXED) to fail after
discarding an existing mmaped area (see mmap_region and e.g.
security_vm_enough_memory_mm or other failure cases).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 18:35           ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:57:11, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 9:43 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > Hmm, no other thread has the address from the current mmap call except
> > for MAP_FIXED (more on that below).
> 
> With things like opportunistic SIGSEGV handlers that map/unmap things
> as the user takes faults, that's actually not at all guaranteed.
> 
> Yeah, it's unusual, but I've seen it, with threaded applications where
> people play games with user-space memory management, and do "demand
> allocation" with mmap() in response to signals.

I am still not sure I see the problem here. Let's say we have a
userspace page fault handler which would do mmap(fault_addr, MAP_FIXED),
right?

If we had a racy mmap(NULL, MAP_LOCKED) that could have mapped
fault_addr by the time handler does its work then this is buggy wrt. to
MAP_LOCKED semantic because the fault handler would discard the locked
part. This wouldn't lead to a data loss but still makes MAP_LOCKED usage
buggy IMO.

If the racing thread did mmap(around_fault_addr, MAP_FIXED|MAP_LOCKED)
then it would be broken as well, and even worse I would say, because the
original fault could have been discarded and data lost.

I would expect that user fault handlers would be synchronized with
other mmap activity otherwise I have hard time to see how this can all
have a well defined behavior. Especially when MAP_FIXED is involved.

> Admittedly we already do bad things in mmap(MAP_FIXED) for that case,
> since we dropped the vm lock. But at least it shouldn't be any worse
> than a thread speculatively touching the pages..

Actually we already allow to mmap(MAP_FIXED) to fail after
discarding an existing mmaped area (see mmap_region and e.g.
security_vm_enough_memory_mm or other failure cases).
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 18:38             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-04-28 18:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> I am still not sure I see the problem here.

Basically, I absolutely hate the notion of us doing something
unsynchronized, when I can see us undoing a mmap that another thread
is doing. It's wrong.

You also didn't react to all the *other* things that were wrong in
that patch-set. The games you play with !fatal_signal_pending() etc
are just crazy.

End result: I absolutely detest the whole thing. I told you what I
consider an acceptable solution instead, that is much simpler and
doesn't have any of the problems of your patchset.

                      Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 18:38             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-04-28 18:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
>
> I am still not sure I see the problem here.

Basically, I absolutely hate the notion of us doing something
unsynchronized, when I can see us undoing a mmap that another thread
is doing. It's wrong.

You also didn't react to all the *other* things that were wrong in
that patch-set. The games you play with !fatal_signal_pending() etc
are just crazy.

End result: I absolutely detest the whole thing. I told you what I
consider an acceptable solution instead, that is much simpler and
doesn't have any of the problems of your patchset.

                      Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 18:38             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2015-04-28 18:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
>
> I am still not sure I see the problem here.

Basically, I absolutely hate the notion of us doing something
unsynchronized, when I can see us undoing a mmap that another thread
is doing. It's wrong.

You also didn't react to all the *other* things that were wrong in
that patch-set. The games you play with !fatal_signal_pending() etc
are just crazy.

End result: I absolutely detest the whole thing. I told you what I
consider an acceptable solution instead, that is much simpler and
doesn't have any of the problems of your patchset.

                      Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 20:21       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:01:59, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 5:11 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > The first patch is dumb and straightforward. It should be safe as is and
> > also good without the follow up 2 patches which try to handle potential
> > allocation failures in the do_munmap path more gracefully. As we still
> > do not fail small allocations even the first patch could be simplified
> > a bit and the retry loop replaced by a BUG_ON right away.
> 
> I think the BUG_ON() is a bad idea in the first place, and is in fact
> a good reason to ignore the patch series entirely.

> What is the point of that BUG_ON()?
> 
> Hell, people add too many of those things. There is *no* excuse for
> killing the kernel for things like this (and in certain setups,
> BUG_ON() *will* cause the machine to be rebooted). None. It's
> completely inexcusable.
> 
> Thinking like this must go. BUG_ON() is for things where our internal
> data structures are so corrupted that we don't know what to do, and
> there's no way to continue. Not for "I want to sprinkle these things
> around and this should not happen".

The BUG_ON in do_munmap_nofail was to catch an unexpected failure
mode which would be caused by later changes. So it was a way to express
an invariant.
Anyway I understand your point above.

> I also think that the whole complex "do_munmap_nofail()" is broken to
> begin with,

Could you be more specific please?

> along with the crazy "!fatal_signal_pending()" thing.

The primary motivation was to back out when we know that the whole
thread group will go and we will cleanup the whole state anyway. As
the only real reason to fail do_munmap is an allocation failure (the
sysctl_max_map_count one is pro-actively avoided) then this basically
means that we have been OOM killed.
On the other hand the allocating thread will get TIF_MEMDIE and access
to memory reserves sooner or later if we are really OOM so the explicit
check is not really needed and it can be dropped.

> There is absolutely no excuse for any of this.
> 
> Your code is also fundamentally buggy in that it tries to do unmap()
> after it has dropped all locks, and things went wrong. So you may nto
> be unmapping some other threads data.
> 
> There is no way in hell any of these patches can ever be applied.
> 
> There's a reason we don't handle populate failures - it's exactly
> because we've dropped the locks etc. After dropping the locks, we
> *cannot* clean up any more, because there's no way to know whather
> we're cleaning up the right thing.  You'd have to hold the write lock
> over the whole populate, which has serious problems of its own.
> 
> So NAK on this series. I think just documenting the man-page might be
> better. I don't think MAP_LOCKED is sanely fixable.

I am OK with this answer as well. Users who really need no-later faults
behavior should use mlock(). I will cook up a patch for man pages and
post it tomorrow.

> We might improve on MAP_LOCKED by having a heuristic up-front
> (*before* actually doing any mmap) to verify that it's *likely* that
> it will work. So we could return ENOMEM early if it looks like the
> user would hit the resource limits, for example. That wouldn't be any
> guarantee (another process might eat up the resource limit anyway),
> and in fact it might be overly eager to fail (maybe the
> mmap(MAP_LOCKED ends up unmapping an older locked mapping and we deny
> it too eagerly), but it would probably work well enough in practice.

As you've said. This would be inherently racy. Some of those checks are
already done before any destructive actions but this doesn't cover all
of them and certainly cannot cover the area fault in by definition.

> That, together with a warning in the man-page about mmap(MAP_LOCKED)
> not being able to return "I only locked part of the mapping", if you
> want full error handling you need to do mmap()+mlock() and check the
> two errors separately.
> 
> Hmm? But I really dislike your patch-series as-is.
> 
>                        Linus
> 
>                       Linus
> 
>                            Linus

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 20:21       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:01:59, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 5:11 AM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> >
> > The first patch is dumb and straightforward. It should be safe as is and
> > also good without the follow up 2 patches which try to handle potential
> > allocation failures in the do_munmap path more gracefully. As we still
> > do not fail small allocations even the first patch could be simplified
> > a bit and the retry loop replaced by a BUG_ON right away.
> 
> I think the BUG_ON() is a bad idea in the first place, and is in fact
> a good reason to ignore the patch series entirely.

> What is the point of that BUG_ON()?
> 
> Hell, people add too many of those things. There is *no* excuse for
> killing the kernel for things like this (and in certain setups,
> BUG_ON() *will* cause the machine to be rebooted). None. It's
> completely inexcusable.
> 
> Thinking like this must go. BUG_ON() is for things where our internal
> data structures are so corrupted that we don't know what to do, and
> there's no way to continue. Not for "I want to sprinkle these things
> around and this should not happen".

The BUG_ON in do_munmap_nofail was to catch an unexpected failure
mode which would be caused by later changes. So it was a way to express
an invariant.
Anyway I understand your point above.

> I also think that the whole complex "do_munmap_nofail()" is broken to
> begin with,

Could you be more specific please?

> along with the crazy "!fatal_signal_pending()" thing.

The primary motivation was to back out when we know that the whole
thread group will go and we will cleanup the whole state anyway. As
the only real reason to fail do_munmap is an allocation failure (the
sysctl_max_map_count one is pro-actively avoided) then this basically
means that we have been OOM killed.
On the other hand the allocating thread will get TIF_MEMDIE and access
to memory reserves sooner or later if we are really OOM so the explicit
check is not really needed and it can be dropped.

> There is absolutely no excuse for any of this.
> 
> Your code is also fundamentally buggy in that it tries to do unmap()
> after it has dropped all locks, and things went wrong. So you may nto
> be unmapping some other threads data.
> 
> There is no way in hell any of these patches can ever be applied.
> 
> There's a reason we don't handle populate failures - it's exactly
> because we've dropped the locks etc. After dropping the locks, we
> *cannot* clean up any more, because there's no way to know whather
> we're cleaning up the right thing.  You'd have to hold the write lock
> over the whole populate, which has serious problems of its own.
> 
> So NAK on this series. I think just documenting the man-page might be
> better. I don't think MAP_LOCKED is sanely fixable.

I am OK with this answer as well. Users who really need no-later faults
behavior should use mlock(). I will cook up a patch for man pages and
post it tomorrow.

> We might improve on MAP_LOCKED by having a heuristic up-front
> (*before* actually doing any mmap) to verify that it's *likely* that
> it will work. So we could return ENOMEM early if it looks like the
> user would hit the resource limits, for example. That wouldn't be any
> guarantee (another process might eat up the resource limit anyway),
> and in fact it might be overly eager to fail (maybe the
> mmap(MAP_LOCKED ends up unmapping an older locked mapping and we deny
> it too eagerly), but it would probably work well enough in practice.

As you've said. This would be inherently racy. Some of those checks are
already done before any destructive actions but this doesn't cover all
of them and certainly cannot cover the area fault in by definition.

> That, together with a warning in the man-page about mmap(MAP_LOCKED)
> not being able to return "I only locked part of the mapping", if you
> want full error handling you need to do mmap()+mlock() and check the
> two errors separately.
> 
> Hmm? But I really dislike your patch-series as-is.
> 
>                        Linus
> 
>                       Linus
> 
>                            Linus

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 20:21       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 09:01:59, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 5:11 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > The first patch is dumb and straightforward. It should be safe as is and
> > also good without the follow up 2 patches which try to handle potential
> > allocation failures in the do_munmap path more gracefully. As we still
> > do not fail small allocations even the first patch could be simplified
> > a bit and the retry loop replaced by a BUG_ON right away.
> 
> I think the BUG_ON() is a bad idea in the first place, and is in fact
> a good reason to ignore the patch series entirely.

> What is the point of that BUG_ON()?
> 
> Hell, people add too many of those things. There is *no* excuse for
> killing the kernel for things like this (and in certain setups,
> BUG_ON() *will* cause the machine to be rebooted). None. It's
> completely inexcusable.
> 
> Thinking like this must go. BUG_ON() is for things where our internal
> data structures are so corrupted that we don't know what to do, and
> there's no way to continue. Not for "I want to sprinkle these things
> around and this should not happen".

The BUG_ON in do_munmap_nofail was to catch an unexpected failure
mode which would be caused by later changes. So it was a way to express
an invariant.
Anyway I understand your point above.

> I also think that the whole complex "do_munmap_nofail()" is broken to
> begin with,

Could you be more specific please?

> along with the crazy "!fatal_signal_pending()" thing.

The primary motivation was to back out when we know that the whole
thread group will go and we will cleanup the whole state anyway. As
the only real reason to fail do_munmap is an allocation failure (the
sysctl_max_map_count one is pro-actively avoided) then this basically
means that we have been OOM killed.
On the other hand the allocating thread will get TIF_MEMDIE and access
to memory reserves sooner or later if we are really OOM so the explicit
check is not really needed and it can be dropped.

> There is absolutely no excuse for any of this.
> 
> Your code is also fundamentally buggy in that it tries to do unmap()
> after it has dropped all locks, and things went wrong. So you may nto
> be unmapping some other threads data.
> 
> There is no way in hell any of these patches can ever be applied.
> 
> There's a reason we don't handle populate failures - it's exactly
> because we've dropped the locks etc. After dropping the locks, we
> *cannot* clean up any more, because there's no way to know whather
> we're cleaning up the right thing.  You'd have to hold the write lock
> over the whole populate, which has serious problems of its own.
> 
> So NAK on this series. I think just documenting the man-page might be
> better. I don't think MAP_LOCKED is sanely fixable.

I am OK with this answer as well. Users who really need no-later faults
behavior should use mlock(). I will cook up a patch for man pages and
post it tomorrow.

> We might improve on MAP_LOCKED by having a heuristic up-front
> (*before* actually doing any mmap) to verify that it's *likely* that
> it will work. So we could return ENOMEM early if it looks like the
> user would hit the resource limits, for example. That wouldn't be any
> guarantee (another process might eat up the resource limit anyway),
> and in fact it might be overly eager to fail (maybe the
> mmap(MAP_LOCKED ends up unmapping an older locked mapping and we deny
> it too eagerly), but it would probably work well enough in practice.

As you've said. This would be inherently racy. Some of those checks are
already done before any destructive actions but this doesn't cover all
of them and certainly cannot cover the area fault in by definition.

> That, together with a warning in the man-page about mmap(MAP_LOCKED)
> not being able to return "I only locked part of the mapping", if you
> want full error handling you need to do mmap()+mlock() and check the
> two errors separately.
> 
> Hmm? But I really dislike your patch-series as-is.
> 
>                        Linus
> 
>                       Linus
> 
>                            Linus

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
  2015-04-28 18:38             ` Linus Torvalds
@ 2015-04-28 20:36               ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 11:38:35, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > I am still not sure I see the problem here.
> 
> Basically, I absolutely hate the notion of us doing something
> unsynchronized, when I can see us undoing a mmap that another thread
> is doing. It's wrong.
> 
> You also didn't react to all the *other* things that were wrong in
> that patch-set. The games you play with !fatal_signal_pending() etc
> are just crazy.

I planed to get to those later, because I felt the locks vs. racing
mmaps argument was the most important objection.

> End result: I absolutely detest the whole thing. I told you what I
> consider an acceptable solution instead, that is much simpler and
> doesn't have any of the problems of your patchset.

I will surely think about those. As I've written in the cover email
already, I am fine with patching the man page and be clear about a long
term behavior. The primary motivation for this RFC was to start the
discussion.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?
@ 2015-04-28 20:36               ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-28 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, Michael Kerrisk, LKML,
	Linux API

On Tue 28-04-15 11:38:35, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > I am still not sure I see the problem here.
> 
> Basically, I absolutely hate the notion of us doing something
> unsynchronized, when I can see us undoing a mmap that another thread
> is doing. It's wrong.
> 
> You also didn't react to all the *other* things that were wrong in
> that patch-set. The games you play with !fatal_signal_pending() etc
> are just crazy.

I planed to get to those later, because I felt the locks vs. racing
mmaps argument was the most important objection.

> End result: I absolutely detest the whole thing. I told you what I
> consider an acceptable solution instead, that is much simpler and
> doesn't have any of the problems of your patchset.

I will surely think about those. As I've written in the cover email
already, I am fine with patching the man page and be clear about a long
term behavior. The primary motivation for this RFC was to start the
discussion.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic
@ 2015-04-28 23:10       ` Andrew Morton
  0 siblings, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2015-04-28 23:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

On Tue, 28 Apr 2015 14:11:49 +0200 Michal Hocko <mhocko@suse.cz> wrote:

> The man page however says
> "
> MAP_LOCKED (since Linux 2.5.37)
>       Lock the pages of the mapped region into memory in the manner of
>       mlock(2).  This flag is ignored in older kernels.
> "

I'm trying to remember why we implemented MAP_LOCKED in the first
place.  Was it better than mmap+mlock in some fashion?

afaict we had a #define MAP_LOCKED in the header file but it wasn't
implemented, so we went and wired it up.  13 years ago:
https://lkml.org/lkml/2002/9/18/108


Anyway...  the third way of doing this is to use plain old mmap() while
mlockall(MCL_FUTURE) is in force.  Has anyone looked at that, checked
that the behaviour is sane and compared it with the mmap+mlock
behaviour, the MAP_LOCKED behaviour and the manpages?



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic
@ 2015-04-28 23:10       ` Andrew Morton
  0 siblings, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2015-04-28 23:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Cyril Hrubis, Hugh Dickins,
	Michel Lespinasse, Linus Torvalds, Rik van Riel, Michael Kerrisk,
	LKML, Linux API

On Tue, 28 Apr 2015 14:11:49 +0200 Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:

> The man page however says
> "
> MAP_LOCKED (since Linux 2.5.37)
>       Lock the pages of the mapped region into memory in the manner of
>       mlock(2).  This flag is ignored in older kernels.
> "

I'm trying to remember why we implemented MAP_LOCKED in the first
place.  Was it better than mmap+mlock in some fashion?

afaict we had a #define MAP_LOCKED in the header file but it wasn't
implemented, so we went and wired it up.  13 years ago:
https://lkml.org/lkml/2002/9/18/108


Anyway...  the third way of doing this is to use plain old mmap() while
mlockall(MCL_FUTURE) is in force.  Has anyone looked at that, checked
that the behaviour is sane and compared it with the mmap+mlock
behaviour, the MAP_LOCKED behaviour and the manpages?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic
@ 2015-04-28 23:10       ` Andrew Morton
  0 siblings, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2015-04-28 23:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cyril Hrubis, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

On Tue, 28 Apr 2015 14:11:49 +0200 Michal Hocko <mhocko@suse.cz> wrote:

> The man page however says
> "
> MAP_LOCKED (since Linux 2.5.37)
>       Lock the pages of the mapped region into memory in the manner of
>       mlock(2).  This flag is ignored in older kernels.
> "

I'm trying to remember why we implemented MAP_LOCKED in the first
place.  Was it better than mmap+mlock in some fashion?

afaict we had a #define MAP_LOCKED in the header file but it wasn't
implemented, so we went and wired it up.  13 years ago:
https://lkml.org/lkml/2002/9/18/108


Anyway...  the third way of doing this is to use plain old mmap() while
mlockall(MCL_FUTURE) is in force.  Has anyone looked at that, checked
that the behaviour is sane and compared it with the mmap+mlock
behaviour, the MAP_LOCKED behaviour and the manpages?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic
  2015-04-28 23:10       ` Andrew Morton
@ 2015-04-29  7:52         ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-29  7:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Cyril Hrubis, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

On Tue 28-04-15 16:10:01, Andrew Morton wrote:
> On Tue, 28 Apr 2015 14:11:49 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > The man page however says
> > "
> > MAP_LOCKED (since Linux 2.5.37)
> >       Lock the pages of the mapped region into memory in the manner of
> >       mlock(2).  This flag is ignored in older kernels.
> > "
> 
> I'm trying to remember why we implemented MAP_LOCKED in the first
> place.  Was it better than mmap+mlock in some fashion?
> 
> afaict we had a #define MAP_LOCKED in the header file but it wasn't
> implemented, so we went and wired it up.  13 years ago:
> https://lkml.org/lkml/2002/9/18/108

Yeah I have encountered this one while digging though the history as
well but there was no real usecase described - except "it doesn't work
currently".

The only sensible usecase I was able to come up with was a userspace
fault handling when we need to mmap and lock the faulting address in an
atomic way so that other threads cannot possibly leak data to the swap.
These guys can live with the current implementation, though.

I do not really believe that 2 instead of 1 syscall really justifies the
complexity.

> Anyway...  the third way of doing this is to use plain old mmap() while
> mlockall(MCL_FUTURE) is in force.  Has anyone looked at that, checked
> that the behaviour is sane and compared it with the mmap+mlock
> behaviour, the MAP_LOCKED behaviour and the manpages?

AFAICS this will behave the same way as mmap(MAP_LOCKED). VMA will be
marked VM_LOCKED but the popullation might fail for the very same
reason.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic
@ 2015-04-29  7:52         ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-29  7:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Cyril Hrubis, Hugh Dickins, Michel Lespinasse,
	Linus Torvalds, Rik van Riel, Michael Kerrisk, LKML, Linux API

On Tue 28-04-15 16:10:01, Andrew Morton wrote:
> On Tue, 28 Apr 2015 14:11:49 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> 
> > The man page however says
> > "
> > MAP_LOCKED (since Linux 2.5.37)
> >       Lock the pages of the mapped region into memory in the manner of
> >       mlock(2).  This flag is ignored in older kernels.
> > "
> 
> I'm trying to remember why we implemented MAP_LOCKED in the first
> place.  Was it better than mmap+mlock in some fashion?
> 
> afaict we had a #define MAP_LOCKED in the header file but it wasn't
> implemented, so we went and wired it up.  13 years ago:
> https://lkml.org/lkml/2002/9/18/108

Yeah I have encountered this one while digging though the history as
well but there was no real usecase described - except "it doesn't work
currently".

The only sensible usecase I was able to come up with was a userspace
fault handling when we need to mmap and lock the faulting address in an
atomic way so that other threads cannot possibly leak data to the swap.
These guys can live with the current implementation, though.

I do not really believe that 2 instead of 1 syscall really justifies the
complexity.

> Anyway...  the third way of doing this is to use plain old mmap() while
> mlockall(MCL_FUTURE) is in force.  Has anyone looked at that, checked
> that the behaviour is sane and compared it with the mmap+mlock
> behaviour, the MAP_LOCKED behaviour and the manpages?

AFAICS this will behave the same way as mmap(MAP_LOCKED). VMA will be
marked VM_LOCKED but the popullation might fail for the very same
reason.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
@ 2015-04-29 11:38               ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-29 11:38 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, LKML, Linux API

On Tue 28-04-15 11:38:35, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > I am still not sure I see the problem here.
> 
> Basically, I absolutely hate the notion of us doing something
> unsynchronized, when I can see us undoing a mmap that another thread
> is doing. It's wrong.

OK, I have checked the mmap(2) man page and there is no single mention
about multi-threaded usage. So even though I personally think that
user fault handlers which do mmap(MAP_FIXED) without synchronization
to parallel mmaps are broken by definition we cannot simply rule them
out and it is not the kernel job to make them broken even more or in a
subtly different way.
So here is an RFC for the man page patch. I am not very good in the
format but man doesn't complain about any formating issues.
---
>From 903ed733187afaa4d27fef3c24f413304494411c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 29 Apr 2015 11:02:19 +0200
Subject: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic

MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
it has been introduced.
mlock(2) fails if the memory range cannot get populated to guarantee
that no future major faults will happen on the range. mmap(MAP_LOCKED) on
the other hand silently succeeds even if the range was populated only
partially.

Fixing this subtle difference in the kernel is rather awkward because
the memory population happens after mm locks have been dropped and so
the cleanup before returning failure (munlock) could operate on something
else than the originally mapped area.

E.g. speculative userspace page fault handler catching SEGV and doing
mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
mmap and lead to lost data. Although it is not clear whether such a
usage would be valid, mmap page doesn't explicitly describe requirements
for threaded applications so we cannot exclude this possibility.

This patch makes the semantic of MAP_LOCKED explicit and suggest using
mmap + mlock as the only way to guarantee no later major page faults.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 man2/mmap.2 | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/man2/mmap.2 b/man2/mmap.2
index 54d68cf87e9e..1486be2e96b3 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -235,8 +235,19 @@ See the Linux kernel source file
 for further information.
 .TP
 .BR MAP_LOCKED " (since Linux 2.5.37)"
-Lock the pages of the mapped region into memory in the manner of
+Mark the mmaped region to be locked in the same way as
 .BR mlock (2).
+This implementation will try to populate (prefault) the whole range but
+the mmap call doesn't fail with
+.B ENOMEM
+if this fails. Therefore major faults might happen later on. So the semantic
+is not as strong as
+.BR mlock (2).
+.BR mmap (2)
++
+.BR mlock (2)
+should be used when major faults are not acceptable after the initialization
+of the mapping.
 This flag is ignored in older kernels.
 .\" If set, the mapped pages will not be swapped out.
 .TP
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
@ 2015-04-29 11:38               ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-29 11:38 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, LKML, Linux API

On Tue 28-04-15 11:38:35, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> >
> > I am still not sure I see the problem here.
> 
> Basically, I absolutely hate the notion of us doing something
> unsynchronized, when I can see us undoing a mmap that another thread
> is doing. It's wrong.

OK, I have checked the mmap(2) man page and there is no single mention
about multi-threaded usage. So even though I personally think that
user fault handlers which do mmap(MAP_FIXED) without synchronization
to parallel mmaps are broken by definition we cannot simply rule them
out and it is not the kernel job to make them broken even more or in a
subtly different way.
So here is an RFC for the man page patch. I am not very good in the
format but man doesn't complain about any formating issues.
---
>From 903ed733187afaa4d27fef3c24f413304494411c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Date: Wed, 29 Apr 2015 11:02:19 +0200
Subject: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic

MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
it has been introduced.
mlock(2) fails if the memory range cannot get populated to guarantee
that no future major faults will happen on the range. mmap(MAP_LOCKED) on
the other hand silently succeeds even if the range was populated only
partially.

Fixing this subtle difference in the kernel is rather awkward because
the memory population happens after mm locks have been dropped and so
the cleanup before returning failure (munlock) could operate on something
else than the originally mapped area.

E.g. speculative userspace page fault handler catching SEGV and doing
mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
mmap and lead to lost data. Although it is not clear whether such a
usage would be valid, mmap page doesn't explicitly describe requirements
for threaded applications so we cannot exclude this possibility.

This patch makes the semantic of MAP_LOCKED explicit and suggest using
mmap + mlock as the only way to guarantee no later major page faults.

Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
 man2/mmap.2 | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/man2/mmap.2 b/man2/mmap.2
index 54d68cf87e9e..1486be2e96b3 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -235,8 +235,19 @@ See the Linux kernel source file
 for further information.
 .TP
 .BR MAP_LOCKED " (since Linux 2.5.37)"
-Lock the pages of the mapped region into memory in the manner of
+Mark the mmaped region to be locked in the same way as
 .BR mlock (2).
+This implementation will try to populate (prefault) the whole range but
+the mmap call doesn't fail with
+.B ENOMEM
+if this fails. Therefore major faults might happen later on. So the semantic
+is not as strong as
+.BR mlock (2).
+.BR mmap (2)
++
+.BR mlock (2)
+should be used when major faults are not acceptable after the initialization
+of the mapping.
 This flag is ignored in older kernels.
 .\" If set, the mapped pages will not be swapped out.
 .TP
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
@ 2015-04-29 11:38               ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-29 11:38 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, LKML, Linux API

On Tue 28-04-15 11:38:35, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > I am still not sure I see the problem here.
> 
> Basically, I absolutely hate the notion of us doing something
> unsynchronized, when I can see us undoing a mmap that another thread
> is doing. It's wrong.

OK, I have checked the mmap(2) man page and there is no single mention
about multi-threaded usage. So even though I personally think that
user fault handlers which do mmap(MAP_FIXED) without synchronization
to parallel mmaps are broken by definition we cannot simply rule them
out and it is not the kernel job to make them broken even more or in a
subtly different way.
So here is an RFC for the man page patch. I am not very good in the
format but man doesn't complain about any formating issues.
---

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
  2015-04-29 11:38               ` Michal Hocko
@ 2015-04-30  0:28                 ` David Rientjes
  -1 siblings, 0 replies; 43+ messages in thread
From: David Rientjes @ 2015-04-30  0:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Michael Kerrisk, linux-mm, Cyril Hrubis,
	Andrew Morton, Hugh Dickins, Michel Lespinasse, Rik van Riel,
	LKML, Linux API

On Wed, 29 Apr 2015, Michal Hocko wrote:

> MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> it has been introduced.
> mlock(2) fails if the memory range cannot get populated to guarantee
> that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> the other hand silently succeeds even if the range was populated only
> partially.
> 
> Fixing this subtle difference in the kernel is rather awkward because
> the memory population happens after mm locks have been dropped and so
> the cleanup before returning failure (munlock) could operate on something
> else than the originally mapped area.
> 
> E.g. speculative userspace page fault handler catching SEGV and doing
> mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> mmap and lead to lost data. Although it is not clear whether such a
> usage would be valid, mmap page doesn't explicitly describe requirements
> for threaded applications so we cannot exclude this possibility.
> 
> This patch makes the semantic of MAP_LOCKED explicit and suggest using
> mmap + mlock as the only way to guarantee no later major page faults.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  man2/mmap.2 | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 54d68cf87e9e..1486be2e96b3 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
> @@ -235,8 +235,19 @@ See the Linux kernel source file
>  for further information.
>  .TP
>  .BR MAP_LOCKED " (since Linux 2.5.37)"
> -Lock the pages of the mapped region into memory in the manner of
> +Mark the mmaped region to be locked in the same way as
>  .BR mlock (2).
> +This implementation will try to populate (prefault) the whole range but
> +the mmap call doesn't fail with
> +.B ENOMEM
> +if this fails. Therefore major faults might happen later on. So the semantic
> +is not as strong as
> +.BR mlock (2).
> +.BR mmap (2)
> ++
> +.BR mlock (2)
> +should be used when major faults are not acceptable after the initialization
> +of the mapping.
>  This flag is ignored in older kernels.
>  .\" If set, the mapped pages will not be swapped out.
>  .TP

The wording of this begs the question on the behavior of 
MAP_LOCKED | MAP_POPULATE since this same man page specifies that 
accesses to memory mapped with MAP_POPULATE will not block on page faults 
later.

I think Documentation/vm/unevictable-lru.txt would benefit from an update 
under the mmap(MAP_LOCKED) section where all this can be laid out and 
perhaps reference it from the man page?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
@ 2015-04-30  0:28                 ` David Rientjes
  0 siblings, 0 replies; 43+ messages in thread
From: David Rientjes @ 2015-04-30  0:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Michael Kerrisk, linux-mm, Cyril Hrubis,
	Andrew Morton, Hugh Dickins, Michel Lespinasse, Rik van Riel,
	LKML, Linux API

On Wed, 29 Apr 2015, Michal Hocko wrote:

> MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> it has been introduced.
> mlock(2) fails if the memory range cannot get populated to guarantee
> that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> the other hand silently succeeds even if the range was populated only
> partially.
> 
> Fixing this subtle difference in the kernel is rather awkward because
> the memory population happens after mm locks have been dropped and so
> the cleanup before returning failure (munlock) could operate on something
> else than the originally mapped area.
> 
> E.g. speculative userspace page fault handler catching SEGV and doing
> mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> mmap and lead to lost data. Although it is not clear whether such a
> usage would be valid, mmap page doesn't explicitly describe requirements
> for threaded applications so we cannot exclude this possibility.
> 
> This patch makes the semantic of MAP_LOCKED explicit and suggest using
> mmap + mlock as the only way to guarantee no later major page faults.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  man2/mmap.2 | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 54d68cf87e9e..1486be2e96b3 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
> @@ -235,8 +235,19 @@ See the Linux kernel source file
>  for further information.
>  .TP
>  .BR MAP_LOCKED " (since Linux 2.5.37)"
> -Lock the pages of the mapped region into memory in the manner of
> +Mark the mmaped region to be locked in the same way as
>  .BR mlock (2).
> +This implementation will try to populate (prefault) the whole range but
> +the mmap call doesn't fail with
> +.B ENOMEM
> +if this fails. Therefore major faults might happen later on. So the semantic
> +is not as strong as
> +.BR mlock (2).
> +.BR mmap (2)
> ++
> +.BR mlock (2)
> +should be used when major faults are not acceptable after the initialization
> +of the mapping.
>  This flag is ignored in older kernels.
>  .\" If set, the mapped pages will not be swapped out.
>  .TP

The wording of this begs the question on the behavior of 
MAP_LOCKED | MAP_POPULATE since this same man page specifies that 
accesses to memory mapped with MAP_POPULATE will not block on page faults 
later.

I think Documentation/vm/unevictable-lru.txt would benefit from an update 
under the mmap(MAP_LOCKED) section where all this can be laid out and 
perhaps reference it from the man page?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
  2015-04-30  0:28                 ` David Rientjes
@ 2015-04-30 14:52                   ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-30 14:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Michael Kerrisk, linux-mm, Cyril Hrubis,
	Andrew Morton, Hugh Dickins, Michel Lespinasse, Rik van Riel,
	LKML, Linux API

On Wed 29-04-15 17:28:54, David Rientjes wrote:
[...]
> The wording of this begs the question on the behavior of 
> MAP_LOCKED | MAP_POPULATE since this same man page specifies that 
> accesses to memory mapped with MAP_POPULATE will not block on page faults 
> later.

Interesting. I haven't thought of this combination. The wording of
MAP_POPULATE is too strong and it really might suggest that no future
major faults will happen. And that is simply not true.
---
diff --git a/man2/mmap.2 b/man2/mmap.2
index 1486be2e96b3..c51d3f241ff9 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -284,7 +284,7 @@ private writable mappings.
 .BR MAP_POPULATE " (since Linux 2.5.46)"
 Populate (prefault) page tables for a mapping.
 For a file mapping, this causes read-ahead on the file.
-Later accesses to the mapping will not be blocked by page faults.
+This will help to reduce blocking on the page faults later.
 .BR MAP_POPULATE
 is supported for private mappings only since Linux 2.6.23.
 .TP
 
> I think Documentation/vm/unevictable-lru.txt would benefit from an update 
> under the mmap(MAP_LOCKED) section where all this can be laid out and 
> perhaps reference it from the man page?

Sure, what about the following:
---
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 3be0bfc4738d..9106f50781ac 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -467,7 +467,13 @@ mmap(MAP_LOCKED) SYSTEM CALL HANDLING
 
 In addition the mlock()/mlockall() system calls, an application can request
 that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
-call.  Furthermore, any mmap() call or brk() call that expands the heap by a
+call. There is one important and subtle difference here, though. mmap() + mlock()
+will fail if the range cannot be faulted in (e.g. because mm_populate fails)
+and returns with ENOMEM while mmap(MAP_LOCKED) will not fail. The mmaped are
+will still have properties of the locked area - aka. pages will not get
+swapped out - but major page faults to fault memory in might still happen.
+
+Furthermore, any mmap() call or brk() call that expands the heap by a
 task that has previously called mlockall() with the MCL_FUTURE flag will result
 in the newly mapped memory being mlocked.  Before the unevictable/mlock
 changes, the kernel simply called make_pages_present() to allocate pages and

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
@ 2015-04-30 14:52                   ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-04-30 14:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Michael Kerrisk, linux-mm, Cyril Hrubis,
	Andrew Morton, Hugh Dickins, Michel Lespinasse, Rik van Riel,
	LKML, Linux API

On Wed 29-04-15 17:28:54, David Rientjes wrote:
[...]
> The wording of this begs the question on the behavior of 
> MAP_LOCKED | MAP_POPULATE since this same man page specifies that 
> accesses to memory mapped with MAP_POPULATE will not block on page faults 
> later.

Interesting. I haven't thought of this combination. The wording of
MAP_POPULATE is too strong and it really might suggest that no future
major faults will happen. And that is simply not true.
---
diff --git a/man2/mmap.2 b/man2/mmap.2
index 1486be2e96b3..c51d3f241ff9 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -284,7 +284,7 @@ private writable mappings.
 .BR MAP_POPULATE " (since Linux 2.5.46)"
 Populate (prefault) page tables for a mapping.
 For a file mapping, this causes read-ahead on the file.
-Later accesses to the mapping will not be blocked by page faults.
+This will help to reduce blocking on the page faults later.
 .BR MAP_POPULATE
 is supported for private mappings only since Linux 2.6.23.
 .TP
 
> I think Documentation/vm/unevictable-lru.txt would benefit from an update 
> under the mmap(MAP_LOCKED) section where all this can be laid out and 
> perhaps reference it from the man page?

Sure, what about the following:
---
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 3be0bfc4738d..9106f50781ac 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -467,7 +467,13 @@ mmap(MAP_LOCKED) SYSTEM CALL HANDLING
 
 In addition the mlock()/mlockall() system calls, an application can request
 that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
-call.  Furthermore, any mmap() call or brk() call that expands the heap by a
+call. There is one important and subtle difference here, though. mmap() + mlock()
+will fail if the range cannot be faulted in (e.g. because mm_populate fails)
+and returns with ENOMEM while mmap(MAP_LOCKED) will not fail. The mmaped are
+will still have properties of the locked area - aka. pages will not get
+swapped out - but major page faults to fault memory in might still happen.
+
+Furthermore, any mmap() call or brk() call that expands the heap by a
 task that has previously called mlockall() with the MCL_FUTURE flag will result
 in the newly mapped memory being mlocked.  Before the unevictable/mlock
 changes, the kernel simply called make_pages_present() to allocate pages and

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
  2015-04-29 11:38               ` Michal Hocko
@ 2015-05-06 12:21                 ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-05-06 12:21 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, LKML, Linux API

If there are no objections here I will resubmit this and MAP_POPULATE
patches in few days.

On Wed 29-04-15 13:38:18, Michal Hocko wrote:
> On Tue 28-04-15 11:38:35, Linus Torvalds wrote:
> > On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > >
> > > I am still not sure I see the problem here.
> > 
> > Basically, I absolutely hate the notion of us doing something
> > unsynchronized, when I can see us undoing a mmap that another thread
> > is doing. It's wrong.
> 
> OK, I have checked the mmap(2) man page and there is no single mention
> about multi-threaded usage. So even though I personally think that
> user fault handlers which do mmap(MAP_FIXED) without synchronization
> to parallel mmaps are broken by definition we cannot simply rule them
> out and it is not the kernel job to make them broken even more or in a
> subtly different way.
> So here is an RFC for the man page patch. I am not very good in the
> format but man doesn't complain about any formating issues.
> ---
> From 903ed733187afaa4d27fef3c24f413304494411c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 29 Apr 2015 11:02:19 +0200
> Subject: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic
> 
> MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> it has been introduced.
> mlock(2) fails if the memory range cannot get populated to guarantee
> that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> the other hand silently succeeds even if the range was populated only
> partially.
> 
> Fixing this subtle difference in the kernel is rather awkward because
> the memory population happens after mm locks have been dropped and so
> the cleanup before returning failure (munlock) could operate on something
> else than the originally mapped area.
> 
> E.g. speculative userspace page fault handler catching SEGV and doing
> mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> mmap and lead to lost data. Although it is not clear whether such a
> usage would be valid, mmap page doesn't explicitly describe requirements
> for threaded applications so we cannot exclude this possibility.
> 
> This patch makes the semantic of MAP_LOCKED explicit and suggest using
> mmap + mlock as the only way to guarantee no later major page faults.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  man2/mmap.2 | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 54d68cf87e9e..1486be2e96b3 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
> @@ -235,8 +235,19 @@ See the Linux kernel source file
>  for further information.
>  .TP
>  .BR MAP_LOCKED " (since Linux 2.5.37)"
> -Lock the pages of the mapped region into memory in the manner of
> +Mark the mmaped region to be locked in the same way as
>  .BR mlock (2).
> +This implementation will try to populate (prefault) the whole range but
> +the mmap call doesn't fail with
> +.B ENOMEM
> +if this fails. Therefore major faults might happen later on. So the semantic
> +is not as strong as
> +.BR mlock (2).
> +.BR mmap (2)
> ++
> +.BR mlock (2)
> +should be used when major faults are not acceptable after the initialization
> +of the mapping.
>  This flag is ignored in older kernels.
>  .\" If set, the mapped pages will not be swapped out.
>  .TP
> -- 
> 2.1.4
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?)
@ 2015-05-06 12:21                 ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2015-05-06 12:21 UTC (permalink / raw)
  To: Linus Torvalds, Michael Kerrisk
  Cc: linux-mm, Cyril Hrubis, Andrew Morton, Hugh Dickins,
	Michel Lespinasse, Rik van Riel, LKML, Linux API

If there are no objections here I will resubmit this and MAP_POPULATE
patches in few days.

On Wed 29-04-15 13:38:18, Michal Hocko wrote:
> On Tue 28-04-15 11:38:35, Linus Torvalds wrote:
> > On Tue, Apr 28, 2015 at 11:35 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > >
> > > I am still not sure I see the problem here.
> > 
> > Basically, I absolutely hate the notion of us doing something
> > unsynchronized, when I can see us undoing a mmap that another thread
> > is doing. It's wrong.
> 
> OK, I have checked the mmap(2) man page and there is no single mention
> about multi-threaded usage. So even though I personally think that
> user fault handlers which do mmap(MAP_FIXED) without synchronization
> to parallel mmaps are broken by definition we cannot simply rule them
> out and it is not the kernel job to make them broken even more or in a
> subtly different way.
> So here is an RFC for the man page patch. I am not very good in the
> format but man doesn't complain about any formating issues.
> ---
> From 903ed733187afaa4d27fef3c24f413304494411c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 29 Apr 2015 11:02:19 +0200
> Subject: [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic
> 
> MAP_LOCKED had a subtly different semantic from mmap(2)+mlock(2) since
> it has been introduced.
> mlock(2) fails if the memory range cannot get populated to guarantee
> that no future major faults will happen on the range. mmap(MAP_LOCKED) on
> the other hand silently succeeds even if the range was populated only
> partially.
> 
> Fixing this subtle difference in the kernel is rather awkward because
> the memory population happens after mm locks have been dropped and so
> the cleanup before returning failure (munlock) could operate on something
> else than the originally mapped area.
> 
> E.g. speculative userspace page fault handler catching SEGV and doing
> mmap(fault_addr, MAP_FIXED|MAP_LOCKED) might discard portion of a racing
> mmap and lead to lost data. Although it is not clear whether such a
> usage would be valid, mmap page doesn't explicitly describe requirements
> for threaded applications so we cannot exclude this possibility.
> 
> This patch makes the semantic of MAP_LOCKED explicit and suggest using
> mmap + mlock as the only way to guarantee no later major page faults.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  man2/mmap.2 | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/man2/mmap.2 b/man2/mmap.2
> index 54d68cf87e9e..1486be2e96b3 100644
> --- a/man2/mmap.2
> +++ b/man2/mmap.2
> @@ -235,8 +235,19 @@ See the Linux kernel source file
>  for further information.
>  .TP
>  .BR MAP_LOCKED " (since Linux 2.5.37)"
> -Lock the pages of the mapped region into memory in the manner of
> +Mark the mmaped region to be locked in the same way as
>  .BR mlock (2).
> +This implementation will try to populate (prefault) the whole range but
> +the mmap call doesn't fail with
> +.B ENOMEM
> +if this fails. Therefore major faults might happen later on. So the semantic
> +is not as strong as
> +.BR mlock (2).
> +.BR mmap (2)
> ++
> +.BR mlock (2)
> +should be used when major faults are not acceptable after the initialization
> +of the mapping.
>  This flag is ignored in older kernels.
>  .\" If set, the mapped pages will not be swapped out.
>  .TP
> -- 
> 2.1.4
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-05-06 12:21 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-14  9:50 Should mmap MAP_LOCKED fail if mm_poppulate fails? Michal Hocko
2015-01-14  9:50 ` Michal Hocko
2015-04-28 12:11 ` Michal Hocko
2015-04-28 12:11   ` Michal Hocko
2015-04-28 12:11   ` Michal Hocko
2015-04-28 12:11   ` [RFC 1/3] mm: mmap make MAP_LOCKED really mlock semantic Michal Hocko
2015-04-28 12:11     ` Michal Hocko
2015-04-28 23:10     ` Andrew Morton
2015-04-28 23:10       ` Andrew Morton
2015-04-28 23:10       ` Andrew Morton
2015-04-29  7:52       ` Michal Hocko
2015-04-29  7:52         ` Michal Hocko
2015-04-28 12:11   ` [RFC 2/3] mm: allow munmap related functions to understand gfp_mask Michal Hocko
2015-04-28 12:11     ` Michal Hocko
2015-04-28 12:11   ` [RFC 3/3] mm: introduce do_munmap_nofail Michal Hocko
2015-04-28 12:11     ` Michal Hocko
2015-04-28 12:11     ` Michal Hocko
2015-04-28 16:01   ` Should mmap MAP_LOCKED fail if mm_poppulate fails? Linus Torvalds
2015-04-28 16:01     ` Linus Torvalds
2015-04-28 16:43     ` Michal Hocko
2015-04-28 16:43       ` Michal Hocko
2015-04-28 16:57       ` Linus Torvalds
2015-04-28 16:57         ` Linus Torvalds
2015-04-28 18:35         ` Michal Hocko
2015-04-28 18:35           ` Michal Hocko
2015-04-28 18:35           ` Michal Hocko
2015-04-28 18:38           ` Linus Torvalds
2015-04-28 18:38             ` Linus Torvalds
2015-04-28 18:38             ` Linus Torvalds
2015-04-28 20:36             ` Michal Hocko
2015-04-28 20:36               ` Michal Hocko
2015-04-29 11:38             ` [RFC PATCH] mmap.2: clarify MAP_LOCKED semantic (was: Re: Should mmap MAP_LOCKED fail if mm_poppulate fails?) Michal Hocko
2015-04-29 11:38               ` Michal Hocko
2015-04-29 11:38               ` Michal Hocko
2015-04-30  0:28               ` David Rientjes
2015-04-30  0:28                 ` David Rientjes
2015-04-30 14:52                 ` Michal Hocko
2015-04-30 14:52                   ` Michal Hocko
2015-05-06 12:21               ` Michal Hocko
2015-05-06 12:21                 ` Michal Hocko
2015-04-28 20:21     ` Should mmap MAP_LOCKED fail if mm_poppulate fails? Michal Hocko
2015-04-28 20:21       ` Michal Hocko
2015-04-28 20:21       ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.