All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
@ 2018-03-29 11:27 Tetsuo Handa
  2018-03-29 21:30 ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2018-03-29 11:27 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Tetsuo Handa, Alexander Viro, Andrew Morton, Kirill A. Shutemov,
	Michal Hocko, Rik van Riel

Theoretically it is possible that an mm_struct with 60000+ vmas loops
with potentially allocating memory, with mm->mmap_sem held for write by
the current thread. Unless I overlooked that fatal_signal_pending() is
somewhere in the loop, this is bad if current thread was selected as an
OOM victim, for the current thread will continue allocations using memory
reserves while the OOM reaper is unable to reclaim memory.

But there is no point with continuing the loop from the beginning if
current thread is killed. If there were __GFP_KILLABLE (or something
like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
to all allocations inside the loop. But since we don't have such flag,
this patch uses fatal_signal_pending() check inside the loop.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 kernel/fork.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 1e8c9a7..38d5baa 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			continue;
 		}
 		charge = 0;
+		if (fatal_signal_pending(current)) {
+			retval = -EINTR;
+			goto out;
+		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-03-29 11:27 [PATCH] mm: Check for SIGKILL inside dup_mmap() loop Tetsuo Handa
@ 2018-03-29 21:30 ` Andrew Morton
  2018-03-30 10:34   ` Tetsuo Handa
                     ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Andrew Morton @ 2018-03-29 21:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, linux-fsdevel, Alexander Viro, Kirill A. Shutemov,
	Michal Hocko, Rik van Riel

On Thu, 29 Mar 2018 20:27:50 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> Theoretically it is possible that an mm_struct with 60000+ vmas loops
> with potentially allocating memory, with mm->mmap_sem held for write by
> the current thread. Unless I overlooked that fatal_signal_pending() is
> somewhere in the loop, this is bad if current thread was selected as an
> OOM victim, for the current thread will continue allocations using memory
> reserves while the OOM reaper is unable to reclaim memory.

All of which implies to me that this patch fixes a problem which is not
known to exist!  

> But there is no point with continuing the loop from the beginning if
> current thread is killed. If there were __GFP_KILLABLE (or something
> like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
> to all allocations inside the loop. But since we don't have such flag,
> this patch uses fatal_signal_pending() check inside the loop.

Dumb question: if a thread has been oom-killed and then tries to
allocate memory, should the page allocator just fail the allocation
attempt?  I suppose there are all sorts of reasons why not :(

In which case, yes, setting a new
PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
tidy enough solution.  It would be a bit sad to add another test in the
hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
already.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			continue;
>  		}
>  		charge = 0;
> +		if (fatal_signal_pending(current)) {
> +			retval = -EINTR;
> +			goto out;
> +		}
>  		if (mpnt->vm_flags & VM_ACCOUNT) {
>  			unsigned long len = vma_pages(mpnt);

I think a comment explaining why we're doing this would help.

Better would be to add a new function "current_is_oom_killed()" or
such, which becomes self-documenting.  Because there are other reasons
why a task may have a fatal signal pending.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-03-29 21:30 ` Andrew Morton
@ 2018-03-30 10:34   ` Tetsuo Handa
  2018-04-03 12:14     ` Matthew Wilcox
  2018-04-03 11:16   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2018-03-30 10:34 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-fsdevel, viro, kirill.shutemov, mhocko, riel

Andrew Morton wrote:
> On Thu, 29 Mar 2018 20:27:50 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > Theoretically it is possible that an mm_struct with 60000+ vmas loops
> > with potentially allocating memory, with mm->mmap_sem held for write by
> > the current thread. Unless I overlooked that fatal_signal_pending() is
> > somewhere in the loop, this is bad if current thread was selected as an
> > OOM victim, for the current thread will continue allocations using memory
> > reserves while the OOM reaper is unable to reclaim memory.
> 
> All of which implies to me that this patch fixes a problem which is not
> known to exist!

Yes.

The trigger which made me to check this loop is that it is not difficult to hit
"oom_reaper: unable to reap pid:" messages if the victim thread is blocked at
i_mmap_lock_write() in this loop.

Checking for SIGKILL in this loop will be nice. Doing so should to some degree help
faster termination by reducing possibility of being blocked at i_mmap_sem_write().
That is, if i_mmap_lock_write() against N'th vma would block, we can avoid needless
delay by escaping the loop via fatal_signal_pending() test before reaching
i_mmap_lock_write() against N'th vma. Even if we are already blocked at
i_mmap_lock_write() against N'th vma, we can still avoid needless delay if
i_mmap_lock_write() against subsequent vmas would also block.

> 
> > But there is no point with continuing the loop from the beginning if
> > current thread is killed. If there were __GFP_KILLABLE (or something
> > like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
> > to all allocations inside the loop. But since we don't have such flag,
> > this patch uses fatal_signal_pending() check inside the loop.
> 
> Dumb question: if a thread has been oom-killed and then tries to
> allocate memory, should the page allocator just fail the allocation
> attempt?  I suppose there are all sorts of reasons why not :(

Maybe because allocation failure paths are not tested enough
( https://lwn.net/Articles/627419/ ). But that should not prevent the page
allocator from just failing the allocation attempt.

I do want a mechanism for telling the page allocator whether we want to
give up upon SIGKILL. I've been proposing it as __GFP_KILLABLE.

> 
> In which case, yes, setting a new
> PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> tidy enough solution.  It would be a bit sad to add another test in the
> hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> already.

Maybe we can make "give up by default upon SIGKILL" and let callers
explicitly say "do not give up upon SIGKILL".

----------------------------------------
 include/linux/gfp.h            | 10 ++++++++++
 include/trace/events/mmflags.h |  1 +
 mm/page_alloc.c                | 15 +++++++++++++++
 tools/perf/builtin-kmem.c      |  1 +
 4 files changed, 27 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b..a0e8a9c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -24,6 +24,7 @@
 #define ___GFP_HIGH		0x20u
 #define ___GFP_IO		0x40u
 #define ___GFP_FS		0x80u
+#define ___GFP_UNKILLABLE	0x100u
 #define ___GFP_NOWARN		0x200u
 #define ___GFP_RETRY_MAYFAIL	0x400u
 #define ___GFP_NOFAIL		0x800u
@@ -122,6 +123,14 @@
  *   allocator recursing into the filesystem which might already be holding
  *   locks.
  *
+ * __GFP_UNKILLABLE: The VM implementation does not fail by simply because
+ *   fatal_signal_pending(current) is true when the current thread in task
+ *   context is doing memory allocations. Those allocations which do not want
+ *   to be disturbed by SIGKILL can add this flag. But note that those
+ *   allocations which must not fail have to add __GFP_NOFAIL, for
+ *   __GFP_UNKILLABLE allocations can still fail by other reasons such as
+ *   __GFP_NORETRY, __GFP_RETRY_MAYFAIL, being selected as an OOM victim.
+ *
  * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
  *   This flag can be cleared to avoid unnecessary delays when a fallback
  *   option is available.
@@ -181,6 +190,7 @@
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
+#define __GFP_UNKILLABLE ((__force gfp_t)___GFP_UNKILLABLE)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb..6a21654 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -32,6 +32,7 @@
 	{(unsigned long)__GFP_ATOMIC,		"__GFP_ATOMIC"},	\
 	{(unsigned long)__GFP_IO,		"__GFP_IO"},		\
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
+	{(unsigned long)__GFP_UNKILLABLE,	"__GFP_UNKILLABLE"},	\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
 	{(unsigned long)__GFP_RETRY_MAYFAIL,	"__GFP_RETRY_MAYFAIL"},	\
 	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d..c8af32e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		goto nopage;
 
+	/* Can give up if caller is willing to give up upon fatal signals */
+	if (fatal_signal_pending(current) &&
+	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
+		gfp_mask |= __GFP_NOWARN;
+		goto nopage;
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
 							&did_some_progress);
@@ -4301,6 +4308,14 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct alloc_context *ac, gfp_t *alloc_mask,
 		unsigned int *alloc_flags)
 {
+	/*
+	 * Can give up if caller in task context is willing to give up upon
+	 * fatal signals
+	 */
+	if (in_task() && fatal_signal_pending(current) &&
+	    (gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL)))
+		return false;
+
 	ac->high_zoneidx = gfp_zone(gfp_mask);
 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index ae11e4c..b36d945 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -641,6 +641,7 @@ static int gfpcmp(const void *a, const void *b)
 	{ "__GFP_ATOMIC",		"_A" },
 	{ "__GFP_IO",			"I" },
 	{ "__GFP_FS",			"F" },
+	{ "__GFP_UNKILLABLE",		"UK" },
 	{ "__GFP_NOWARN",		"NWR" },
 	{ "__GFP_RETRY_MAYFAIL",	"R" },
 	{ "__GFP_NOFAIL",		"NF" },
----------------------------------------

> 
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >  			continue;
> >  		}
> >  		charge = 0;
> > +		if (fatal_signal_pending(current)) {
> > +			retval = -EINTR;
> > +			goto out;
> > +		}
> >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> >  			unsigned long len = vma_pages(mpnt);
> 
> I think a comment explaining why we're doing this would help.

I think such comment can go to patch description like commit d1908f52557b3230
("fs: break out of iomap_file_buffered_write on fatal signals") did.

> 
> Better would be to add a new function "current_is_oom_killed()" or
> such, which becomes self-documenting.  Because there are other reasons
> why a task may have a fatal signal pending.
> 

current_is_oom_killed() is already there as tsk_is_oom_victim(current)
except that tsk_is_oom_victim(current) is not accurate because the OOM
killer sets ->signal->oom_mm field to only one thread group even when
the OOM victim consists of multiple thread groups.

But I don't think we need to distinguish "killed by the OOM killer" and
"killed by other than the OOM killer" if we can tell the page allocator
whether we want to give up upon SIGKILL. Any allocation which does not
want to give up upon SIGKILL could get preferred access to memory reserves
if SIGKILL is pending.

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-03-29 21:30 ` Andrew Morton
  2018-03-30 10:34   ` Tetsuo Handa
@ 2018-04-03 11:16   ` Michal Hocko
  2018-04-03 11:32     ` Tetsuo Handa
  2018-04-03 11:58   ` Matthew Wilcox
  2018-04-07 10:38     ` Tetsuo Handa
  3 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-04-03 11:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, Alexander Viro,
	Kirill A. Shutemov, Rik van Riel

On Thu 29-03-18 14:30:03, Andrew Morton wrote:
[...]
> Dumb question: if a thread has been oom-killed and then tries to
> allocate memory, should the page allocator just fail the allocation
> attempt?  I suppose there are all sorts of reasons why not :(

We give those tasks access to memory reserves to move on (see
oom_reserves_allowed) and fail allocation if reserves do not help

	if (tsk_is_oom_victim(current) &&
	    (alloc_flags == ALLOC_OOM ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;
So we...

> In which case, yes, setting a new
> PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> tidy enough solution.  It would be a bit sad to add another test in the
> hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> already.

... do not need this.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 11:16   ` Michal Hocko
@ 2018-04-03 11:32     ` Tetsuo Handa
  2018-04-03 11:38       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2018-04-03 11:32 UTC (permalink / raw)
  To: mhocko, akpm; +Cc: linux-mm, linux-fsdevel, viro, kirill.shutemov, riel

Michal Hocko wrote:
> On Thu 29-03-18 14:30:03, Andrew Morton wrote:
> [...]
> > Dumb question: if a thread has been oom-killed and then tries to
> > allocate memory, should the page allocator just fail the allocation
> > attempt?  I suppose there are all sorts of reasons why not :(
> 
> We give those tasks access to memory reserves to move on (see
> oom_reserves_allowed) and fail allocation if reserves do not help
> 
> 	if (tsk_is_oom_victim(current) &&
> 	    (alloc_flags == ALLOC_OOM ||
> 	     (gfp_mask & __GFP_NOMEMALLOC)))
> 		goto nopage;
> So we...
> 
> > In which case, yes, setting a new
> > PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> > tidy enough solution.  It would be a bit sad to add another test in the
> > hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> > already.
> 
> ... do not need this.

Excuse me? But that check is after

	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

which means that tsk_is_oom_victim(current) && alloc_flags == ALLOC_OOM threads
can still trigger the OOM killer as soon as the OOM reaper sets MMF_OOM_SKIP.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 11:32     ` Tetsuo Handa
@ 2018-04-03 11:38       ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2018-04-03 11:38 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, linux-fsdevel, viro, kirill.shutemov, riel

On Tue 03-04-18 20:32:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 29-03-18 14:30:03, Andrew Morton wrote:
> > [...]
> > > Dumb question: if a thread has been oom-killed and then tries to
> > > allocate memory, should the page allocator just fail the allocation
> > > attempt?  I suppose there are all sorts of reasons why not :(
> > 
> > We give those tasks access to memory reserves to move on (see
> > oom_reserves_allowed) and fail allocation if reserves do not help
> > 
> > 	if (tsk_is_oom_victim(current) &&
> > 	    (alloc_flags == ALLOC_OOM ||
> > 	     (gfp_mask & __GFP_NOMEMALLOC)))
> > 		goto nopage;
> > So we...
> > 
> > > In which case, yes, setting a new
> > > PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> > > tidy enough solution.  It would be a bit sad to add another test in the
> > > hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> > > already.
> > 
> > ... do not need this.
> 
> Excuse me? But that check is after
> 
> 	/* Reclaim has failed us, start killing things */
> 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> 	if (page)
> 		goto got_pg;
> 
> which means that tsk_is_oom_victim(current) && alloc_flags == ALLOC_OOM threads
> can still trigger the OOM killer as soon as the OOM reaper sets MMF_OOM_SKIP.

Races are possible and I do not see them as critical _right now_. If
that turnes out to be not the case we can think of a more robust way.
The thing is that we have "bail out for OOM victims already".

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-03-29 21:30 ` Andrew Morton
  2018-03-30 10:34   ` Tetsuo Handa
  2018-04-03 11:16   ` Michal Hocko
@ 2018-04-03 11:58   ` Matthew Wilcox
  2018-04-03 12:08     ` Michal Hocko
  2018-04-07 10:38     ` Tetsuo Handa
  3 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-04-03 11:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, Alexander Viro,
	Kirill A. Shutemov, Michal Hocko, Rik van Riel

On Thu, Mar 29, 2018 at 02:30:03PM -0700, Andrew Morton wrote:
> > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >  			continue;
> >  		}
> >  		charge = 0;
> > +		if (fatal_signal_pending(current)) {
> > +			retval = -EINTR;
> > +			goto out;
> > +		}
> >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> >  			unsigned long len = vma_pages(mpnt);
> 
> I think a comment explaining why we're doing this would help.
> 
> Better would be to add a new function "current_is_oom_killed()" or
> such, which becomes self-documenting.  Because there are other reasons
> why a task may have a fatal signal pending.

I disagree that we need a comment here, or to create an alias.  Someone
who knows nothing of the oom-killer (like, er, me) reading that code sees
"Oh, we're checking for fatal signals here.  I guess it doesn't make sense
to continue forking a process if it's already received a fatal signal."

One might speculate about the causes of the fatal signal having been
received and settle on reasons which make sense even without thinking
of the OOM case.  Because it's why it was introduced, I always think
about a task blocked on a dead NFS mount.  If it's multithreaded and
one of the threads called fork() while another thread was blocked on a
page fault and the dup_mmap() had to wait for the page fault to finish
... that would make some kind of sense.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 11:58   ` Matthew Wilcox
@ 2018-04-03 12:08     ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2018-04-03 12:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Tetsuo Handa, linux-mm, linux-fsdevel,
	Alexander Viro, Kirill A. Shutemov, Rik van Riel

On Tue 03-04-18 04:58:57, Matthew Wilcox wrote:
> On Thu, Mar 29, 2018 at 02:30:03PM -0700, Andrew Morton wrote:
> > > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >  			continue;
> > >  		}
> > >  		charge = 0;
> > > +		if (fatal_signal_pending(current)) {
> > > +			retval = -EINTR;
> > > +			goto out;
> > > +		}
> > >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> > >  			unsigned long len = vma_pages(mpnt);
> > 
> > I think a comment explaining why we're doing this would help.
> > 
> > Better would be to add a new function "current_is_oom_killed()" or
> > such, which becomes self-documenting.  Because there are other reasons
> > why a task may have a fatal signal pending.
> 
> I disagree that we need a comment here, or to create an alias.  Someone
> who knows nothing of the oom-killer (like, er, me) reading that code sees
> "Oh, we're checking for fatal signals here.  I guess it doesn't make sense
> to continue forking a process if it's already received a fatal signal."
> 
> One might speculate about the causes of the fatal signal having been
> received and settle on reasons which make sense even without thinking
> of the OOM case.  Because it's why it was introduced, I always think
> about a task blocked on a dead NFS mount.  If it's multithreaded and
> one of the threads called fork() while another thread was blocked on a
> page fault and the dup_mmap() had to wait for the page fault to finish
> ... that would make some kind of sense.

I completely agree. If the check is really correct then it should be
pretty much self explanatory like many other checks. There is absolutely
zero oom specific in there. If a check _is_ oom specific then there is
something fishy going on.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-03-30 10:34   ` Tetsuo Handa
@ 2018-04-03 12:14     ` Matthew Wilcox
  2018-04-03 12:19       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-04-03 12:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, linux-fsdevel, viro, kirill.shutemov, mhocko, riel

On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> Maybe we can make "give up by default upon SIGKILL" and let callers
> explicitly say "do not give up upon SIGKILL".

I really strongly disapprove of this patch.  This GFP flag will be abused
like every other GFP flag.

> +++ b/mm/page_alloc.c
> @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (current->flags & PF_MEMALLOC)
>  		goto nopage;
>  
> +	/* Can give up if caller is willing to give up upon fatal signals */
> +	if (fatal_signal_pending(current) &&
> +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> +		gfp_mask |= __GFP_NOWARN;
> +		goto nopage;
> +	}
> +
>  	/* Try direct reclaim and then allocating */

This part is superficially tempting, although without the UNKILLABLE.  ie:

+	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
+		gfp_mask |= __GFP_NOWARN;
+		goto nopage;
+	}

It makes some sense to me to prevent tasks with a fatal signal pending
from being able to trigger reclaim.  But I'm worried about what memory
allocation failures it might trigger on paths that aren't accustomed to
seeing failures.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 12:14     ` Matthew Wilcox
@ 2018-04-03 12:19       ` Michal Hocko
  2018-04-03 12:25         ` Matthew Wilcox
  2018-04-03 12:29         ` Tetsuo Handa
  0 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2018-04-03 12:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tetsuo Handa, akpm, linux-mm, linux-fsdevel, viro, kirill.shutemov, riel

On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > Maybe we can make "give up by default upon SIGKILL" and let callers
> > explicitly say "do not give up upon SIGKILL".
> 
> I really strongly disapprove of this patch.  This GFP flag will be abused
> like every other GFP flag.
> 
> > +++ b/mm/page_alloc.c
> > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  	if (current->flags & PF_MEMALLOC)
> >  		goto nopage;
> >  
> > +	/* Can give up if caller is willing to give up upon fatal signals */
> > +	if (fatal_signal_pending(current) &&
> > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > +		gfp_mask |= __GFP_NOWARN;
> > +		goto nopage;
> > +	}
> > +
> >  	/* Try direct reclaim and then allocating */
> 
> This part is superficially tempting, although without the UNKILLABLE.  ie:
> 
> +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> +		gfp_mask |= __GFP_NOWARN;
> +		goto nopage;
> +	}
> 
> It makes some sense to me to prevent tasks with a fatal signal pending
> from being able to trigger reclaim.  But I'm worried about what memory
> allocation failures it might trigger on paths that aren't accustomed to
> seeing failures.

Please be aware that we _do_ allocate in the exit path. I have a strong
suspicion that even while fatal signal is pending. Do we really want
fail those really easily.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 12:19       ` Michal Hocko
@ 2018-04-03 12:25         ` Matthew Wilcox
  2018-04-03 14:54           ` Tetsuo Handa
  2018-04-03 12:29         ` Tetsuo Handa
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2018-04-03 12:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, akpm, linux-mm, linux-fsdevel, viro, kirill.shutemov, riel

On Tue, Apr 03, 2018 at 02:19:50PM +0200, Michal Hocko wrote:
> On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> > On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > > Maybe we can make "give up by default upon SIGKILL" and let callers
> > > explicitly say "do not give up upon SIGKILL".
> > 
> > I really strongly disapprove of this patch.  This GFP flag will be abused
> > like every other GFP flag.
> > 
> > > +++ b/mm/page_alloc.c
> > > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  	if (current->flags & PF_MEMALLOC)
> > >  		goto nopage;
> > >  
> > > +	/* Can give up if caller is willing to give up upon fatal signals */
> > > +	if (fatal_signal_pending(current) &&
> > > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > > +		gfp_mask |= __GFP_NOWARN;
> > > +		goto nopage;
> > > +	}
> > > +
> > >  	/* Try direct reclaim and then allocating */
> > 
> > This part is superficially tempting, although without the UNKILLABLE.  ie:
> > 
> > +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> > +		gfp_mask |= __GFP_NOWARN;
> > +		goto nopage;
> > +	}
> > 
> > It makes some sense to me to prevent tasks with a fatal signal pending
> > from being able to trigger reclaim.  But I'm worried about what memory
> > allocation failures it might trigger on paths that aren't accustomed to
> > seeing failures.
> 
> Please be aware that we _do_ allocate in the exit path. I have a strong
> suspicion that even while fatal signal is pending. Do we really want
> fail those really easily.

I agree.  The allocations I'm thinking about are NFS wanting to send
I/Os in order to fsync each file that gets closed.  We probably don't
want those to fail.  And we definitely don't want to chase around the
kernel adding __GFP_KILLABLE to each place that we discover needs to
allocate on the exit path.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 12:19       ` Michal Hocko
  2018-04-03 12:25         ` Matthew Wilcox
@ 2018-04-03 12:29         ` Tetsuo Handa
  2018-04-03 13:06           ` Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2018-04-03 12:29 UTC (permalink / raw)
  To: mhocko, willy; +Cc: akpm, linux-mm, linux-fsdevel, viro, kirill.shutemov, riel

Michal Hocko wrote:
> On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> > On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > > Maybe we can make "give up by default upon SIGKILL" and let callers
> > > explicitly say "do not give up upon SIGKILL".
> > 
> > I really strongly disapprove of this patch.  This GFP flag will be abused
> > like every other GFP flag.
> > 
> > > +++ b/mm/page_alloc.c
> > > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  	if (current->flags & PF_MEMALLOC)
> > >  		goto nopage;
> > >  
> > > +	/* Can give up if caller is willing to give up upon fatal signals */
> > > +	if (fatal_signal_pending(current) &&
> > > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > > +		gfp_mask |= __GFP_NOWARN;
> > > +		goto nopage;
> > > +	}
> > > +
> > >  	/* Try direct reclaim and then allocating */
> > 
> > This part is superficially tempting, although without the UNKILLABLE.  ie:
> > 
> > +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> > +		gfp_mask |= __GFP_NOWARN;
> > +		goto nopage;
> > +	}
> > 
> > It makes some sense to me to prevent tasks with a fatal signal pending
> > from being able to trigger reclaim.  But I'm worried about what memory
> > allocation failures it might trigger on paths that aren't accustomed to
> > seeing failures.

Userspace tasks might call routines which need GFP_FS or GFP_NOIO (for
direct reclaim), and giving up upon fatal signals leads to more problems
like FS error or I/O error. Thus, we can't unconditionally give up upon
fatal signals.

> 
> Please be aware that we _do_ allocate in the exit path. I have a strong
> suspicion that even while fatal signal is pending. Do we really want
> fail those really easily.

Does the exit path mean inside do_exit() ? If yes, fatal signals are already
cleared before reaching do_exit().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 12:29         ` Tetsuo Handa
@ 2018-04-03 13:06           ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2018-04-03 13:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: willy, akpm, linux-mm, linux-fsdevel, viro, kirill.shutemov, riel

On Tue 03-04-18 21:29:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Please be aware that we _do_ allocate in the exit path. I have a strong
> > suspicion that even while fatal signal is pending. Do we really want
> > fail those really easily.
> 
> Does the exit path mean inside do_exit() ? If yes, fatal signals are already
> cleared before reaching do_exit().

They usually are. But we can send a SIGKILL on an already killed task
after it removed the previously deadly signal already AFAIR. Maybe I
mis-remember of course. Signal handling code always makes my head
explode and I tend to forget all the details. Anyway relying on
fatal_signal_pending for some allocator semantic is just too subtle.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-03 12:25         ` Matthew Wilcox
@ 2018-04-03 14:54           ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2018-04-03 14:54 UTC (permalink / raw)
  To: willy, mhocko; +Cc: akpm, linux-mm, linux-fsdevel, viro, kirill.shutemov, riel

Matthew Wilcox wrote:
> On Tue, Apr 03, 2018 at 02:19:50PM +0200, Michal Hocko wrote:
> > On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> > > On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > > > Maybe we can make "give up by default upon SIGKILL" and let callers
> > > > explicitly say "do not give up upon SIGKILL".
> > > 
> > > I really strongly disapprove of this patch.  This GFP flag will be abused
> > > like every other GFP flag.
> > > 
> > > > +++ b/mm/page_alloc.c
> > > > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > > >  	if (current->flags & PF_MEMALLOC)
> > > >  		goto nopage;
> > > >  
> > > > +	/* Can give up if caller is willing to give up upon fatal signals */
> > > > +	if (fatal_signal_pending(current) &&
> > > > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > > > +		gfp_mask |= __GFP_NOWARN;
> > > > +		goto nopage;
> > > > +	}
> > > > +
> > > >  	/* Try direct reclaim and then allocating */
> > > 
> > > This part is superficially tempting, although without the UNKILLABLE.  ie:
> > > 
> > > +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> > > +		gfp_mask |= __GFP_NOWARN;
> > > +		goto nopage;
> > > +	}
> > > 
> > > It makes some sense to me to prevent tasks with a fatal signal pending
> > > from being able to trigger reclaim.  But I'm worried about what memory
> > > allocation failures it might trigger on paths that aren't accustomed to
> > > seeing failures.
> > 
> > Please be aware that we _do_ allocate in the exit path. I have a strong
> > suspicion that even while fatal signal is pending. Do we really want
> > fail those really easily.
> 
> I agree.  The allocations I'm thinking about are NFS wanting to send
> I/Os in order to fsync each file that gets closed.  We probably don't
> want those to fail.  And we definitely don't want to chase around the
> kernel adding __GFP_KILLABLE to each place that we discover needs to
> allocate on the exit path.
> 

But memory allocations for syscalls are willing to give up upon SIGKILL
regardless of OOM.

If we worry the exit/nofs/noio paths, we can use scoped masking like
memalloc_nofs_save()/memalloc_nofs_restore() for ignoring __GFP_KILLABLE.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-03-29 21:30 ` Andrew Morton
@ 2018-04-07 10:38     ` Tetsuo Handa
  2018-04-03 11:16   ` Michal Hocko
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2018-04-07 10:38 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-fsdevel, viro, kirill.shutemov, mhocko, riel

>>From 31c863e57a4ab7dfb491b2860fe3653e1e8f593b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 7 Apr 2018 19:29:30 +0900
Subject: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.

As a theoretical problem, an mm_struct with 60000+ vmas can loop with
potentially allocating memory, with mm->mmap_sem held for write by current
thread. This is bad if current thread was selected as an OOM victim, for
current thread will continue allocations using memory reserves while OOM
reaper is unable to reclaim memory.

As an actually observable problem, it is not difficult to make OOM reaper
unable to reclaim memory if the OOM victim is blocked at
i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain
whether it is safe to use killable wait there, let's check for SIGKILL
before trying to allocate memory. Even without an OOM event, there is no
point with continuing the loop from the beginning if current thread is
killed.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 kernel/fork.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 242c8c9..8831bae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -441,6 +441,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			continue;
 		}
 		charge = 0;
+		if (fatal_signal_pending(current)) {
+			retval = -EINTR;
+			goto out;
+		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
@ 2018-04-07 10:38     ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2018-04-07 10:38 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-fsdevel, viro, kirill.shutemov, mhocko, riel

>From 31c863e57a4ab7dfb491b2860fe3653e1e8f593b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 7 Apr 2018 19:29:30 +0900
Subject: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.

As a theoretical problem, an mm_struct with 60000+ vmas can loop with
potentially allocating memory, with mm->mmap_sem held for write by current
thread. This is bad if current thread was selected as an OOM victim, for
current thread will continue allocations using memory reserves while OOM
reaper is unable to reclaim memory.

As an actually observable problem, it is not difficult to make OOM reaper
unable to reclaim memory if the OOM victim is blocked at
i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain
whether it is safe to use killable wait there, let's check for SIGKILL
before trying to allocate memory. Even without an OOM event, there is no
point with continuing the loop from the beginning if current thread is
killed.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 kernel/fork.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 242c8c9..8831bae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -441,6 +441,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			continue;
 		}
 		charge = 0;
+		if (fatal_signal_pending(current)) {
+			retval = -EINTR;
+			goto out;
+		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-07 10:38     ` Tetsuo Handa
  (?)
@ 2018-04-18 21:44     ` Andrew Morton
  2018-04-19  1:54       ` Tetsuo Handa
  -1 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2018-04-18 21:44 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, linux-fsdevel, viro, kirill.shutemov, mhocko, riel

On Sat, 7 Apr 2018 19:38:28 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> >From 31c863e57a4ab7dfb491b2860fe3653e1e8f593b Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 7 Apr 2018 19:29:30 +0900
> Subject: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
> 
> As a theoretical problem, an mm_struct with 60000+ vmas can loop with
> potentially allocating memory, with mm->mmap_sem held for write by current
> thread. This is bad if current thread was selected as an OOM victim, for
> current thread will continue allocations using memory reserves while OOM
> reaper is unable to reclaim memory.
> 
> As an actually observable problem, it is not difficult to make OOM reaper
> unable to reclaim memory if the OOM victim is blocked at
> i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain
> whether it is safe to use killable wait there, let's check for SIGKILL
> before trying to allocate memory. Even without an OOM event, there is no
> point with continuing the loop from the beginning if current thread is
> killed.
> 
> ...
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -441,6 +441,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			continue;
>  		}
>  		charge = 0;
> +		if (fatal_signal_pending(current)) {
> +			retval = -EINTR;
> +			goto out;
> +		}
>  		if (mpnt->vm_flags & VM_ACCOUNT) {
>  			unsigned long len = vma_pages(mpnt);

Seems sane.  Has this been runtime tested?

I would like to see a comment here explaining why we're testing for
this at this particualr place.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-18 21:44     ` Andrew Morton
@ 2018-04-19  1:54       ` Tetsuo Handa
  2018-04-19  2:32         ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2018-04-19  1:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: penguin-kernel, linux-mm, linux-fsdevel, viro, kirill.shutemov,
	mhocko, riel

Andrew Morton wrote:
> On Sat, 7 Apr 2018 19:38:28 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> > >From 31c863e57a4ab7dfb491b2860fe3653e1e8f593b Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Sat, 7 Apr 2018 19:29:30 +0900
> > Subject: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
> > 
> > As a theoretical problem, an mm_struct with 60000+ vmas can loop with
> > potentially allocating memory, with mm->mmap_sem held for write by current
> > thread. This is bad if current thread was selected as an OOM victim, for
> > current thread will continue allocations using memory reserves while OOM
> > reaper is unable to reclaim memory.
> > 
> > As an actually observable problem, it is not difficult to make OOM reaper
> > unable to reclaim memory if the OOM victim is blocked at
> > i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain
> > whether it is safe to use killable wait there, let's check for SIGKILL
> > before trying to allocate memory. Even without an OOM event, there is no
> > point with continuing the loop from the beginning if current thread is
> > killed.
> > 
> > ...
> >
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -441,6 +441,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >  			continue;
> >  		}
> >  		charge = 0;
> > +		if (fatal_signal_pending(current)) {
> > +			retval = -EINTR;
> > +			goto out;
> > +		}
> >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> >  			unsigned long len = vma_pages(mpnt);
> 
> Seems sane.  Has this been runtime tested?
> 

Yes, I tested with debug printk(). This patch should be safe
because we already fail if security_vm_enough_memory_mm() or
kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.

[  417.030691] ***** Aborting dup_mmap() due to SIGKILL *****
[  417.036129] ***** Aborting dup_mmap() due to SIGKILL *****
[  417.044544] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.116445] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.118401] ***** Aborting exit_mmap() due to NULL mmap *****
[  419.168917] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.169064] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.170913] ***** Aborting exit_mmap() due to NULL mmap *****
[  419.171411] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.171417] ***** Aborting exit_mmap() due to NULL mmap *****
[  419.172804] ***** Aborting exit_mmap() due to NULL mmap *****
[  419.176253] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.182676] ***** Aborting exit_mmap() due to NULL mmap *****

> I would like to see a comment here explaining why we're testing for
> this at this particualr place.
> 
Such comment goes to patch description.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-19  1:54       ` Tetsuo Handa
@ 2018-04-19  2:32         ` Andrew Morton
  2018-06-07 22:05           ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2018-04-19  2:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: penguin-kernel, linux-mm, linux-fsdevel, viro, kirill.shutemov,
	mhocko, riel

On Thu, 19 Apr 2018 10:54:44 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:

> Andrew Morton wrote:
> > On Sat, 7 Apr 2018 19:38:28 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > 
> > > >From 31c863e57a4ab7dfb491b2860fe3653e1e8f593b Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Sat, 7 Apr 2018 19:29:30 +0900
> > > Subject: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
> > > 
> > > As a theoretical problem, an mm_struct with 60000+ vmas can loop with
> > > potentially allocating memory, with mm->mmap_sem held for write by current
> > > thread. This is bad if current thread was selected as an OOM victim, for
> > > current thread will continue allocations using memory reserves while OOM
> > > reaper is unable to reclaim memory.
> > > 
> > > As an actually observable problem, it is not difficult to make OOM reaper
> > > unable to reclaim memory if the OOM victim is blocked at
> > > i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain
> > > whether it is safe to use killable wait there, let's check for SIGKILL
> > > before trying to allocate memory. Even without an OOM event, there is no
> > > point with continuing the loop from the beginning if current thread is
> > > killed.
> > > 
> > > ...
> > >
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -441,6 +441,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >  			continue;
> > >  		}
> > >  		charge = 0;
> > > +		if (fatal_signal_pending(current)) {
> > > +			retval = -EINTR;
> > > +			goto out;
> > > +		}
> > >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> > >  			unsigned long len = vma_pages(mpnt);
> > 
> > Seems sane.  Has this been runtime tested?
> > 
> 
> Yes, I tested with debug printk(). This patch should be safe
> because we already fail if security_vm_enough_memory_mm() or
> kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.
> 
> [  417.030691] ***** Aborting dup_mmap() due to SIGKILL *****
> [  417.036129] ***** Aborting dup_mmap() due to SIGKILL *****
> [  417.044544] ***** Aborting dup_mmap() due to SIGKILL *****
> [  419.116445] ***** Aborting dup_mmap() due to SIGKILL *****
> [  419.118401] ***** Aborting exit_mmap() due to NULL mmap *****
> [  419.168917] ***** Aborting dup_mmap() due to SIGKILL *****
> [  419.169064] ***** Aborting dup_mmap() due to SIGKILL *****
> [  419.170913] ***** Aborting exit_mmap() due to NULL mmap *****
> [  419.171411] ***** Aborting dup_mmap() due to SIGKILL *****
> [  419.171417] ***** Aborting exit_mmap() due to NULL mmap *****
> [  419.172804] ***** Aborting exit_mmap() due to NULL mmap *****
> [  419.176253] ***** Aborting dup_mmap() due to SIGKILL *****
> [  419.182676] ***** Aborting exit_mmap() due to NULL mmap *****

OK, thanks.

> > I would like to see a comment here explaining why we're testing for
> > this at this particualr place.
> > 
> Such comment goes to patch description.

I know.  I'm suggesting that it be in the code itself.  Making people
putz around with git to understand the code is unfriendly.

I'll add something in there tomorrow.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-04-19  2:32         ` Andrew Morton
@ 2018-06-07 22:05           ` Andrew Morton
  2018-06-08 17:05             ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2018-06-07 22:05 UTC (permalink / raw)
  To: Tetsuo Handa, penguin-kernel, linux-mm, linux-fsdevel, viro,
	kirill.shutemov, mhocko, riel, Matthew Wilcox

Despite all the discussion, we're short on formal review/ack tags on
this one.

Here's what I have:


From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Subject: mm: check for SIGKILL inside dup_mmap() loop

As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas can
loop while potentially allocating memory, with mm->mmap_sem held for write
by current thread.  This is bad if current thread was selected as an OOM
victim, for current thread will continue allocations using memory reserves
while OOM reaper is unable to reclaim memory.

As an actually observable problem, it is not difficult to make OOM reaper
unable to reclaim memory if the OOM victim is blocked at
i_mmap_lock_write() in this loop.  Unfortunately, since nobody can explain
whether it is safe to use killable wait there, let's check for SIGKILL
before trying to allocate memory.  Even without an OOM event, there is no
point with continuing the loop from the beginning if current thread is
killed.

I tested with debug printk().  This patch should be safe because we
already fail if security_vm_enough_memory_mm() or
kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.

[  417.030691] ***** Aborting dup_mmap() due to SIGKILL *****
[  417.036129] ***** Aborting dup_mmap() due to SIGKILL *****
[  417.044544] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.116445] ***** Aborting dup_mmap() due to SIGKILL *****
[  419.118401] ***** Aborting exit_mmap() due to NULL mmap *****

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/fork.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff -puN kernel/fork.c~mm-check-for-sigkill-inside-dup_mmap-loop kernel/fork.c
--- a/kernel/fork.c~mm-check-for-sigkill-inside-dup_mmap-loop
+++ a/kernel/fork.c
@@ -440,6 +440,14 @@ static __latent_entropy int dup_mmap(str
 			continue;
 		}
 		charge = 0;
+		/*
+		 * Don't duplicate many vmas if we've been oom-killed (for
+		 * example)
+		 */
+		if (fatal_signal_pending(current)) {
+			retval = -EINTR;
+			goto out;
+		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
 
_

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop.
  2018-06-07 22:05           ` Andrew Morton
@ 2018-06-08 17:05             ` Matthew Wilcox
  0 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2018-06-08 17:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, viro, kirill.shutemov,
	mhocko, riel

On Thu, Jun 07, 2018 at 03:05:46PM -0700, Andrew Morton wrote:
> [akpm@linux-foundation.org: add comment]

Can I fix the comment?  ;-)

> @@ -440,6 +440,14 @@ static __latent_entropy int dup_mmap(str
>  			continue;
>  		}
>  		charge = 0;
> +		/*
> +		 * Don't duplicate many vmas if we've been oom-killed (for
> +		 * example)
> +		 */

		/*
		 * No point in continuing if we're just going to die at
		 * the end of the fork.  This may happen due to being OOM.
		 */

> +		if (fatal_signal_pending(current)) {
> +			retval = -EINTR;
> +			goto out;
> +		}

Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-06-08 17:05 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-29 11:27 [PATCH] mm: Check for SIGKILL inside dup_mmap() loop Tetsuo Handa
2018-03-29 21:30 ` Andrew Morton
2018-03-30 10:34   ` Tetsuo Handa
2018-04-03 12:14     ` Matthew Wilcox
2018-04-03 12:19       ` Michal Hocko
2018-04-03 12:25         ` Matthew Wilcox
2018-04-03 14:54           ` Tetsuo Handa
2018-04-03 12:29         ` Tetsuo Handa
2018-04-03 13:06           ` Michal Hocko
2018-04-03 11:16   ` Michal Hocko
2018-04-03 11:32     ` Tetsuo Handa
2018-04-03 11:38       ` Michal Hocko
2018-04-03 11:58   ` Matthew Wilcox
2018-04-03 12:08     ` Michal Hocko
2018-04-07 10:38   ` Tetsuo Handa
2018-04-07 10:38     ` Tetsuo Handa
2018-04-18 21:44     ` Andrew Morton
2018-04-19  1:54       ` Tetsuo Handa
2018-04-19  2:32         ` Andrew Morton
2018-06-07 22:05           ` Andrew Morton
2018-06-08 17:05             ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.