[PATCH] mm, oom: fix for hiding mm which is shared with kthread or global init

* [PATCH] mm, oom: fix for hiding mm which is shared with kthread or global init
@ 2016-07-16  5:30 Tetsuo Handa
  2016-07-18  7:18 ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Tetsuo Handa @ 2016-07-16  5:30 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Tetsuo Handa, Michal Hocko, Oleg Nesterov, Vladimir Davydov,
	David Rientjes

Patch "mm, oom: hide mm which is shared with kthread or global init" tried
to guarantee a forward progress for the OOM killer even when the selected
victim is sharing memory with a kernel thread or global init, but a race
scenario still remains because it did not add a call to exit_oom_victim()
in oom_kill_process() in order to avoid a problem which is already worked
around by commit 74070542099c66d8 ("oom, suspend: fix oom_reaper vs.
oom_killer_disable race").

The race scenario is that a !can_oom_reap TIF_MEMDIE thread becomes
the only user of that mm (i.e. mm->mm_users drops to 1) and is later
blocked for unbounded period at __mmput() from mmput() from
exit_mm() from do_exit() by hitting e.g.

  (1) First round of OOM killer invocation starts.
  (2) select_bad_process() chooses P1 as an OOM victim because
      oom_scan_process_thread() does not find existing victims.
  (3) oom_kill_process() sets TIF_MEMDIE on P1, but does not put P1 under
      the OOM reaper's supervision due to (p->flags & PF_KTHREAD) being
      true, and instead sets MMF_OOM_REAPED on the P1's mm.
  (4) First round of OOM killer invocation finishes.
  (5) P1 is unable to arrive at do_exit() due to being blocked at
      unkillable event waiting for somebody else's memory allocation.
  (6) Second round of OOM killer invocation starts.
  (7) select_bad_process() chooses P2 as an OOM victim because
      oom_scan_process_thread() finds P1's mm with MMF_OOM_REAPED set.
  (8) oom_kill_process() sets TIF_MEMDIE on P2 via mark_oom_victim(),
      and puts P2 under the OOM reaper's supervision due to
      (p->flags & PF_KTHREAD) being false.
  (9) Second round of OOM killer invocation finishes.
  (10) The OOM reaper reaps P2's mm, and sets MMF_OOM_REAPED to
       P2's mm, and clears TIF_MEMDIE from P2.
  (11) Regarding P1's mm, (p->flags & PF_KTHREAD) becomes false because
       somebody else's memory allocation succeeds and unuse_mm(P1->mm)
       is called. At this point P1 becomes the only user of P1->mm.
  (12) P1 arrives at do_exit() due to no longer being blocked at
       unkillable event waiting for somebody else's memory allocation.
  (13) P1 reaches P1->mm = NULL line in exit_mm() from do_exit().
  (14) P1 is blocked at __mmput().
  (15) Third round of OOM killer invocation starts.
  (16) select_bad_process() does not choose new OOM victim because
       oom_scan_process_thread() fails to find P1's mm while
       P1->signal->oom_victims > 0.
  (17) Third round of OOM killer invocation finishes.
  (18) OOM livelock happens because nobody will clear TIF_MEMDIE from
       P1 (and decrement P1->signal->oom_victims) while P1 is blocked
       at __mmput().

sequence, but the patch "mm, oom: hide mm which is shared with kthread
or global init" is failing to return OOM_SCAN_CONTINUE when we hit
atomic_read(&task->signal->oom_victims) != 0 &&
find_lock_task_mm(task) == NULL in oom_scan_process_thread().

Long term we are planning to change oom_scan_process_thread() not to
depend on atomic_read(&task->signal->oom_victims) != 0 &&
find_lock_task_mm(task) != NULL, and remove exit_oom_victim() from
oom_kill_process() and oom_reap_task() along with signal->oom_victims
and commit 74070542099c66d8. But since we did not complete such changes
in time for 4.8 merge window, let's rely on commit 74070542099c66d8
for now in order to guarantee a forward progress for the OOM killer.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7d0a275..041373e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -922,6 +922,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 */
 			can_oom_reap = false;
 			set_bit(MMF_OOM_REAPED, &mm->flags);
+			exit_oom_victim(victim);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 9+ messages in thread