linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@kernel.org>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap
Date: Thu, 7 Dec 2017 09:28:01 +0100	[thread overview]
Message-ID: <20171207082801.GB20234@dhcp22.suse.cz> (raw)
In-Reply-To: <201712070720.vB77KlBQ009754@www262.sakura.ne.jp>

On Thu 07-12-17 16:20:47, Tetsuo Handa wrote:
[...]
> int main(int argc, char *argv[])
> {
> 	int i;
> 	char *stack;
> 	if (fork() || fork() || setsid() == EOF || pipe(pipe_fd))
> 		_exit(0);
> 	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
> 		     MAP_ANONYMOUS | MAP_PRIVATE, EOF, 0);
> 	for (i = 0; i < NUMTHREADS; i++)
> 		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
> 			  /*CLONE_THREAD | CLONE_SIGHAND | */CLONE_VM | CLONE_FS |
> 			  CLONE_FILES, NULL) == -1)
> 			break;

Hmm, so you are creating a separate process (from the signal point of
view) and I suspect it is one of those that holds the last reference to
the mm_struct which is released here and it has tsk_oom_victim = F

[...]
> [  113.273394] Freed by task 1377:
> [  113.276211]  kasan_slab_free+0x71/0xc0
> [  113.279093]  kmem_cache_free+0xaf/0x1e0
> [  113.281974]  remove_vma+0x9d/0xb0
> [  113.284734]  exit_mmap+0x179/0x250
> [  113.287651]  mmput+0x7d/0x1b0
> [  113.290456]  do_exit+0x408/0x1290
> [  113.293268]  do_group_exit+0x84/0x140
> [  113.296109]  get_signal+0x291/0x9b0
> [  113.298915]  do_signal+0x8e/0xa70
> [  113.301637]  exit_to_usermode_loop+0x71/0xb0
> [  113.304632]  do_syscall_64+0x343/0x390
> [  113.307349]  return_from_SYSCALL_64+0x0/0x75

[...]

> What we overlooked is the fact that "it is not always the process which
> got ->signal->oom_mm set, it is any thread which called mmput() which
> invoked __mmput() path". Therefore, below patch fixes oops in my case.
> If some unrelated kernel thread was holding mm_users ref, it is possible
> that we miss down_write()/up_write() synchronization.

Very well spotted! It could be any task in fact (e.g. somebody reading
from /proc/<pid> file which requires mm_struct).

oom_reaper		oom_victim		task
						mmget_not_zero
			exit_mmap
			  mmput
__oom_reap_task_mm				mmput
  						  __mmput
						    exit_mmap
						      remove_vma
  unmap_page_range

So we need a more robust test for the oom victim. Your suggestion is
basically what I came up with originally [1] and which was deemed
ineffective because we took the mmap_sem even for regular paths and
Kirill was afraid this adds some unnecessary cycles to the exit path
which is quite hot.

So I guess we have to do something else instead. We have to store the
oom flag to the mm struct as well. Something like the patch below.

[1] http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@kernel.org
---
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 27cd36b762b5..b7668b5d3e14 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,6 +77,11 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
 	return tsk->signal->oom_mm;
 }
 
+static inline bool mm_is_oom_victim(struct mm_struct *mm)
+{
+	return test_bit(MMF_OOM_VICTIM, &mm->flags);
+}
+
 /*
  * Checks whether a page fault on the given mm is still reliable.
  * This is no longer true if the oom reaper started to reap the
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 9c8847395b5e..da673ca66e7a 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -68,8 +68,9 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
-#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
+#define MMF_OOM_VICTIM		23	/* mm is the oom victim */
+#define MMF_HUGE_ZERO_PAGE	24      /* mm has ever used the global huge zero page */
+#define MMF_DISABLE_THP		25	/* disable THP for all VMAs */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/mmap.c b/mm/mmap.c
index 476e810cf100..d00a06248ef1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3005,7 +3005,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 
 	set_bit(MMF_OOM_SKIP, &mm->flags);
-	if (unlikely(tsk_is_oom_victim(current))) {
+	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Wait for oom_reap_task() to stop working on this
 		 * mm. Because MMF_OOM_SKIP is already set before
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3b0d0fed8480..e4d290b6804b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -666,8 +666,10 @@ static void mark_oom_victim(struct task_struct *tsk)
 		return;
 
 	/* oom_mm is bound to the signal struct life time. */
-	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
 		mmgrab(tsk->signal->oom_mm);
+		set_bit(MMF_OOM_VICTIM, &mm->flags);
+	}
 
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
-- 
Michal Hocko
SUSE Labs

  parent reply	other threads:[~2017-12-07  8:28 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-06  2:43 Multiple oom_reaper BUGs: unmap_page_range racing with exit_mmap David Rientjes
2017-12-06  2:58 ` David Rientjes
     [not found]   ` <201712060328.vB63SrDK069830@www262.sakura.ne.jp>
2017-12-06  7:48     ` David Rientjes
2017-12-06  9:00       ` Michal Hocko
     [not found]         ` <201712070720.vB77KlBQ009754@www262.sakura.ne.jp>
2017-12-07  8:28           ` Michal Hocko [this message]
2017-12-07  8:44             ` Michal Hocko
2017-12-07 11:00             ` Tetsuo Handa
2017-12-07 21:22             ` David Rientjes
2017-12-08  7:50               ` Michal Hocko
2017-12-06 11:37       ` Tetsuo Handa
2017-12-06  8:50     ` Michal Hocko
2017-12-06  8:31 ` Michal Hocko
2017-12-07 11:35 ` Michal Hocko
2017-12-07 15:44   ` Tetsuo Handa
2017-12-07 16:30     ` Michal Hocko
2017-12-07 21:55       ` David Rientjes
2017-12-08  9:26         ` David Rientjes
2017-12-08 11:27           ` Michal Hocko
2017-12-08 10:11       ` Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171207082801.GB20234@dhcp22.suse.cz \
    --to=mhocko@kernel.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).