* [PATCH v2] mm, oom: Fix unnecessary killing of additional processes.
@ 2018-09-08  4:54 Tetsuo Handa
  2018-09-10  9:54 ` Michal Hocko
  2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
  0 siblings, 2 replies; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-08  4:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Michal Hocko, Roman Gushchin

David Rientjes is complaining about the current behavior where the OOM
killer selects the next OOM victim as soon as MMF_OOM_SKIP is set, even if
__oom_reap_task_mm() returned without making any progress. Michal Hocko
insisted that we should make efforts towards

  (i) making sure that __oom_reap_task_mm() can make more progress

and

  (ii) handing over to exit_mmap() instead of setting MMF_OOM_SKIP
       if the OOM reaper detects that exit_mmap() has already passed
       the last point of blocking

instead of using a timeout based back off. However, Tetsuo Handa is
observing that the OOM killer selects the next OOM victim even though

  (A) more than 50% of available RAM was reclaimed by __oom_reap_task_mm()

and

  (B) the last allocation attempt in __alloc_pages_may_oom() was made
      after MMF_OOM_SKIP was set.

This is an obviously unnecessary killing, and (i) and (ii) shown above are
irrelevant to it.

  Out of memory: Kill process 3359 (a.out) score 834 or sacrifice child
  Killed process 3359 (a.out) total-vm:4267252kB, anon-rss:2894908kB, file-rss:0kB, shmem-rss:0kB
  in:imjournal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
  in:imjournal cpuset=/ mems_allowed=0
  CPU: 2 PID: 1001 Comm: in:imjournal Tainted: G                T 4.19.0-rc2+ #692
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
  Call Trace:
   dump_stack+0x85/0xcb
   dump_header+0x69/0x2fe
   ? _raw_spin_unlock_irqrestore+0x41/0x70
   oom_kill_process+0x307/0x390
   out_of_memory+0x2f3/0x5d0
   __alloc_pages_slowpath+0xc01/0x1030
   __alloc_pages_nodemask+0x333/0x390
   filemap_fault+0x465/0x910
   ? xfs_ilock+0xbf/0x2b0 [xfs]
   ? __xfs_filemap_fault+0x7d/0x2c0 [xfs]
   ? down_read_nested+0x66/0xa0
   __xfs_filemap_fault+0x8e/0x2c0 [xfs]
   __do_fault+0x11/0x133
   __handle_mm_fault+0xa57/0xf40
   handle_mm_fault+0x1b7/0x3c0
   __do_page_fault+0x2a6/0x580
   do_page_fault+0x32/0x270
   ? page_fault+0x8/0x30
   page_fault+0x1e/0x30
  RIP: 0033:0x7f5f078f6e28
  Code: Bad RIP value.
  RSP: 002b:00007f5f05007c50 EFLAGS: 00010246
  RAX: 0000000000000300 RBX: 0000000000000009 RCX: 00007f5f05007d80
  RDX: 00000000000003dd RSI: 00007f5f07b1ca1a RDI: 00005596745e9bb0
  RBP: 00007f5f05007d70 R08: 0000000000000006 R09: 0000000000000078
  R10: 00005596745e9810 R11: 00007f5f082bb4a0 R12: 00007f5f05007d90
  R13: 00007f5f00035fc0 R14: 00007f5f0003c850 R15: 00007f5f0000d760
  Mem-Info:
  active_anon:243554 inactive_anon:3457 isolated_anon:0
   active_file:53 inactive_file:4386 isolated_file:3
   unevictable:0 dirty:0 writeback:0 unstable:0
   slab_reclaimable:8152 slab_unreclaimable:24512
   mapped:3730 shmem:3704 pagetables:3123 bounce:0
   free:562331 free_pcp:502 free_cma:0
  Node 0 active_anon:974216kB inactive_anon:13828kB active_file:212kB inactive_file:17544kB unevictable:0kB isolated(anon):0kB isolated(file):12kB mapped:14920kB dirty:0kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 305152kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
  Node 0 DMA free:13816kB min:308kB low:384kB high:460kB active_anon:1976kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 2717 3378 3378
  Node 0 DMA32 free:2077084kB min:54108kB low:67632kB high:81156kB active_anon:696988kB inactive_anon:12kB active_file:252kB inactive_file:756kB unevictable:0kB writepending:0kB present:3129152kB managed:2782296kB mlocked:0kB kernel_stack:352kB pagetables:388kB bounce:0kB free_pcp:1328kB local_pcp:248kB free_cma:0kB
  lowmem_reserve[]: 0 0 661 661
  Node 0 Normal free:158424kB min:13164kB low:16452kB high:19740kB active_anon:275676kB inactive_anon:13816kB active_file:0kB inactive_file:16084kB unevictable:0kB writepending:0kB present:1048576kB managed:676908kB mlocked:0kB kernel_stack:6720kB pagetables:12100kB bounce:0kB free_pcp:680kB local_pcp:148kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 3*32kB (U) 4*64kB (UM) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13816kB
  Node 0 DMA32: 122*4kB (UM) 83*8kB (UM) 56*16kB (UM) 70*32kB (UME) 47*64kB (UE) 46*128kB (UME) 41*256kB (UME) 27*512kB (UME) 20*1024kB (ME) 213*2048kB (M) 387*4096kB (M) = 2079360kB
  Node 0 Normal: 653*4kB (UM) 1495*8kB (UM) 1604*16kB (UME) 1069*32kB (UME) 278*64kB (UME) 147*128kB (UME) 85*256kB (ME) 6*512kB (M) 0*1024kB 7*2048kB (M) 2*4096kB (M) = 158412kB
  Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
  8221 total pagecache pages
  0 pages in swap cache
  Swap cache stats: add 0, delete 0, find 0/0
  Free swap  = 0kB
  Total swap = 0kB
  1048422 pages RAM
  0 pages HighMem/MovableOnly
  179652 pages reserved
  0 pages cma reserved
  0 pages hwpoisoned
  Out of memory: Kill process 1089 (java) score 52 or sacrifice child
  Killed process 1089 (java) total-vm:5555688kB, anon-rss:181976kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 3359 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

This happens because we have to wait for a while until the reclaimed memory
becomes re-allocatable by get_page_from_freelist(). Michal Hocko finally
admitted that there is no reliable way to handle this race.
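
For illustration, the window looks roughly like this. This is a hypothetical,
heavily simplified helper (the name and the reduced argument list are made up
for this sketch; it is not the actual mm/page_alloc.c code):

  /*
   * The "last chance" attempt under oom_lock can run after the OOM reaper
   * has set MMF_OOM_SKIP but before the reaped pages become visible to
   * get_page_from_freelist(), so out_of_memory() goes on to select another
   * victim even though a lot of memory was just freed.
   */
  static struct page *last_chance_then_kill(struct oom_control *oc,
                                            gfp_t gfp_mask, unsigned int order,
                                            const struct alloc_context *ac)
  {
    struct page *page;

    if (!mutex_trylock(&oom_lock))
      return NULL;  /* somebody else is already handling the OOM */

    /* (B): this attempt can still miss memory the OOM reaper just freed */
    page = get_page_from_freelist(gfp_mask, order, ALLOC_WMARK_HIGH, ac);
    if (!page)
      out_of_memory(oc);  /* selects the next OOM victim */

    mutex_unlock(&oom_lock);
    return page;
  }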

Before the OOM reaper was introduced, the OOM killer waited until
TIF_MEMDIE was cleared, which implied completion of __mmput() as long as
the thread which got TIF_MEMDIE was the one invoking __mmput(). But
currently, the OOM killer waits only for completion of
__oom_reap_task_mm(), which is a much shorter wait than waiting for
completion of __mmput().

We could again try to wait for completion of __mmput(). But as described
in commit 97b1255cb27c551d ("mm,oom_reaper: check for MMF_OOM_SKIP before
complaining"), there is a memory allocation dependency after returning from
__oom_reap_task_mm(). Thus, a timeout based back off is effectively the
only choice we have for now.

To mitigate this race, this patch adds a timeout based back off which is
applied after MMF_OOM_SKIP is set by the OOM reaper or exit_mmap(), and
which uses as feedback whether the OOM score of the OOM victim's memory is
still decreasing over time.
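
The core of that feedback is the victim_mm_stalling() helper added below; a
condensed sketch of it (same HZ/10 sampling interval and 30-sample limit as
in the patch):

  static bool victim_stalled(struct task_struct *p, struct mm_struct *mm)
  {
    unsigned long score;

    if (time_before(jiffies, p->last_oom_compared + HZ / 10))
      return false;  /* sample at most once per HZ/10 */

    score = oom_victim_mm_score(mm);  /* rss + swap entries + page tables */
    if (score < p->last_oom_score)
      p->oom_reap_stall_count = 0;  /* still shrinking; keep waiting */
    else
      p->oom_reap_stall_count++;
    p->last_oom_score = score;
    p->last_oom_compared = jiffies;

    /* give up only after 30 samples (about 3 seconds) without progress */
    return p->oom_reap_stall_count >= 30;
  }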

This patch also fixes three possible bugs

  (1) oom_task_origin() tasks can be selected forever because it did not
      check for MMF_OOM_SKIP.

  (2) sysctl_oom_kill_allocating_task path can be selected forever
      because it did not check for MMF_OOM_SKIP.

  (3) CONFIG_MMU=n kernels might livelock because nobody except
      is_global_init() case in __oom_kill_process() sets MMF_OOM_SKIP.

which prevent proof of "the forward progress guarantee"
and adds one optimization

  (4) oom_evaluate_task() was calling oom_unkillable_task() twice because
      oom_evaluate_task() needs to check for !MMF_OOM_SKIP and
      oom_task_origin() tasks before calling oom_badness().

in addition to mitigating the race.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |   9 +-
 include/linux/oom.h        |   1 +
 include/linux/sched.h      |   7 +-
 kernel/fork.c              |   2 +
 mm/memcontrol.c            |  18 +---
 mm/oom_kill.c              | 242 ++++++++++++++++++++++++---------------------
 6 files changed, 148 insertions(+), 131 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 652f602..396b01d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -417,8 +417,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
-int mem_cgroup_scan_tasks(struct mem_cgroup *,
-			  int (*)(struct task_struct *, void *), void *);
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+			   void (*fn)(struct task_struct *, void *), void *arg);
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
@@ -917,10 +917,9 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
-static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
-		int (*fn)(struct task_struct *, void *), void *arg)
+static inline void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+		void (*fn)(struct task_struct *, void *), void *arg)
 {
-	return 0;
 }
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 69864a5..4a14787 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -104,6 +104,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
+extern void exit_oom_mm(struct mm_struct *mm);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..4e0e357 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1174,9 +1174,10 @@ struct task_struct {
 	unsigned long			task_state_change;
 #endif
 	int				pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
-#endif
+	struct list_head		oom_victim_list;
+	unsigned long			last_oom_compared;
+	unsigned long			last_oom_score;
+	unsigned char			oom_reap_stall_count;
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index f0b5847..205386b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1008,6 +1008,8 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	if (unlikely(mm_is_oom_victim(mm)))
+		exit_oom_mm(mm);
 	mmdrop(mm);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59..53c8d2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1058,17 +1058,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * @arg: argument passed to @fn
  *
  * This function iterates over tasks attached to @memcg or to any of its
- * descendants and calls @fn for each task. If @fn returns a non-zero
- * value, the function breaks the iteration loop and returns the value.
- * Otherwise, it will iterate over all tasks and return 0.
+ * descendants and calls @fn for each task.
  *
  * This function must not be called for the root memory cgroup.
  */
-int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
-			  int (*fn)(struct task_struct *, void *), void *arg)
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+			   void (*fn)(struct task_struct *, void *), void *arg)
 {
 	struct mem_cgroup *iter;
-	int ret = 0;
 
 	BUG_ON(memcg == root_mem_cgroup);
 
@@ -1077,15 +1074,10 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct task_struct *task;
 
 		css_task_iter_start(&iter->css, 0, &it);
-		while (!ret && (task = css_task_iter_next(&it)))
-			ret = fn(task, arg);
+		while ((task = css_task_iter_next(&it)))
+			fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
-			mem_cgroup_iter_break(memcg, iter);
-			break;
-		}
 	}
-	return ret;
 }
 
 /**
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..713545e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -49,6 +49,12 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
 
+static inline unsigned long oom_victim_mm_score(struct mm_struct *mm)
+{
+	return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+		mm_pgtables_bytes(mm) / PAGE_SIZE;
+}
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
@@ -206,22 +212,30 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	long points;
 	long adj;
 
+	/*
+	 * Ignore if this task was already marked as an OOM victim, for
+	 * oom_has_pending_victims() already waited for this task.
+	 */
+	if (tsk_is_oom_victim(p))
+		return 0;
+
 	if (oom_unkillable_task(p, memcg, nodemask))
 		return 0;
 
+	/* Select if this task is marked to be killed first. */
+	if (unlikely(oom_task_origin(p)))
+		return ULONG_MAX;
+
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 0;
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped or the are in
-	 * the middle of vfork
+	 * unkillable or are in the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
-			in_vfork(p)) {
+	if (adj == OOM_SCORE_ADJ_MIN || (in_vfork(p) && totalpages)) {
 		task_unlock(p);
 		return 0;
 	}
@@ -230,8 +244,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
-	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+	points = oom_victim_mm_score(p->mm);
 	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
@@ -312,60 +325,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static void oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
 
-	if (oom_unkillable_task(task, NULL, oc->nodemask))
-		goto next;
-
-	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_SKIP because chances that it would release
-	 * any memory is quite low.
-	 */
-	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
-			goto next;
-		goto abort;
-	}
-
-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task)) {
-		points = ULONG_MAX;
-		goto select;
-	}
-
 	points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
 	if (!points || points < oc->chosen_points)
-		goto next;
-
+		return;
 	/* Prefer thread group leaders for display purposes */
 	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
-		goto next;
-select:
+		return;
 	if (oc->chosen)
 		put_task_struct(oc->chosen);
 	get_task_struct(task);
 	oc->chosen = task;
 	oc->chosen_points = points;
-next:
-	return 0;
-abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
-	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -376,8 +356,7 @@ static void select_bad_process(struct oom_control *oc)
 
 		rcu_read_lock();
 		for_each_process(p)
-			if (oom_evaluate_task(p, oc))
-				break;
+			oom_evaluate_task(p, oc);
 		rcu_read_unlock();
 	}
 
@@ -478,6 +457,8 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
+static LIST_HEAD(oom_victim_list);
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -485,8 +466,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
-static DEFINE_SPINLOCK(oom_reaper_lock);
+static struct task_struct *oom_reap_target;
 
 bool __oom_reap_task_mm(struct mm_struct *mm)
 {
@@ -591,73 +571,32 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
-
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
-		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		wait_event_freezable(oom_reaper_wait, oom_reap_target != NULL);
+		oom_reap_task(oom_reap_target);
+		/* Drop a reference taken by oom_has_pending_victims(). */
+		put_task_struct(oom_reap_target);
+		oom_reap_target = NULL;
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
-		return;
-
-	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
-	trace_wake_reaper(tsk->pid);
-	wake_up(&oom_reaper_wait);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif /* CONFIG_MMU */
 
 /**
@@ -683,6 +622,11 @@ static void mark_oom_victim(struct task_struct *tsk)
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
 		mmgrab(tsk->signal->oom_mm);
 		set_bit(MMF_OOM_VICTIM, &mm->flags);
+		tsk->last_oom_compared = jiffies;
+		tsk->last_oom_score = oom_victim_mm_score(mm);
+		tsk->oom_reap_stall_count = 0;
+		get_task_struct(tsk);
+		list_add(&tsk->oom_victim_list, &oom_victim_list);
 	}
 
 	/*
@@ -696,6 +640,21 @@ static void mark_oom_victim(struct task_struct *tsk)
 	trace_mark_victim(tsk->pid);
 }
 
+void exit_oom_mm(struct mm_struct *mm)
+{
+	struct task_struct *p, *tmp;
+
+	mutex_lock(&oom_lock);
+	list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
+		if (mm != p->signal->oom_mm)
+			continue;
+		list_del(&p->oom_victim_list);
+		/* Drop a reference taken by mark_oom_victim(). */
+		put_task_struct(p);
+	}
+	mutex_unlock(&oom_lock);
+}
+
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
@@ -834,7 +793,6 @@ static void __oom_kill_process(struct task_struct *victim)
 {
 	struct task_struct *p;
 	struct mm_struct *mm;
-	bool can_oom_reap = true;
 
 	p = find_lock_task_mm(victim);
 	if (!p) {
@@ -884,7 +842,6 @@ static void __oom_kill_process(struct task_struct *victim)
 		if (same_thread_group(p, victim))
 			continue;
 		if (is_global_init(p)) {
-			can_oom_reap = false;
 			set_bit(MMF_OOM_SKIP, &mm->flags);
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
@@ -901,25 +858,20 @@ static void __oom_kill_process(struct task_struct *victim)
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-#undef K
 
 /*
  * Kill provided task unless it's secured by setting
  * oom_score_adj to OOM_SCORE_ADJ_MIN.
  */
-static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+static void oom_kill_memcg_member(struct task_struct *task, void *unused)
 {
 	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(task);
 		__oom_kill_process(task);
 	}
-	return 0;
 }
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
@@ -942,7 +894,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	task_lock(p);
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -1041,6 +992,76 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+static bool victim_mm_stalling(struct task_struct *p, struct mm_struct *mm)
+{
+	unsigned long score;
+
+	if (time_before(jiffies, p->last_oom_compared + HZ / 10))
+		return false;
+	score = oom_victim_mm_score(mm);
+	if (score < p->last_oom_score)
+		p->oom_reap_stall_count = 0;
+	else
+		p->oom_reap_stall_count++;
+	p->last_oom_score = oom_victim_mm_score(mm);
+	p->last_oom_compared = jiffies;
+	if (p->oom_reap_stall_count < 30)
+		return false;
+	pr_info("Gave up waiting for process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+		task_pid_nr(p), p->comm, K(mm->total_vm),
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)));
+	return true;
+}
+
+static bool oom_has_pending_victims(struct oom_control *oc)
+{
+	struct task_struct *p, *tmp;
+	bool ret = false;
+	bool gaveup = false;
+
+	if (is_sysrq_oom(oc))
+		return false;
+	/*
+	 * Wait for pending victims until __mmput() completes or stalled
+	 * too long.
+	 */
+	list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
+		struct mm_struct *mm = p->signal->oom_mm;
+
+		if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
+			continue;
+		ret = true;
+#ifdef CONFIG_MMU
+		/*
+		 * Since the OOM reaper exists, we can safely wait until
+		 * MMF_OOM_SKIP is set.
+		 */
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+			if (!oom_reap_target) {
+				get_task_struct(p);
+				oom_reap_target = p;
+				trace_wake_reaper(p->pid);
+				wake_up(&oom_reaper_wait);
+			}
+			continue;
+		}
+#endif
+		/* We can wait as long as OOM score is decreasing over time. */
+		if (!victim_mm_stalling(p, mm))
+			continue;
+		gaveup = true;
+		list_del(&p->oom_victim_list);
+		/* Drop a reference taken by mark_oom_victim(). */
+		put_task_struct(p);
+	}
+	if (gaveup)
+		debug_show_all_locks();
+
+	return ret;
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1072,7 +1093,6 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
 		return true;
 	}
 
@@ -1094,9 +1114,11 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint);
 
+	if (oom_has_pending_victims(oc))
+		return true;
+
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
-	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+	    oom_badness(current, NULL, oc->nodemask, 0)) {
 		get_task_struct(current);
 		oc->chosen = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
@@ -1115,11 +1137,11 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		return false;
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL)
-		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
-				 "Memory cgroup out of memory");
-	return !!oc->chosen;
+	oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
+			 "Memory cgroup out of memory");
+	return true;
 }
 
 /*
-- 
1.8.3.1


* Re: [PATCH v2] mm, oom: Fix unnecessary killing of additional processes.
  2018-09-08  4:54 [PATCH v2] mm, oom: Fix unnecessary killing of additional processes Tetsuo Handa
@ 2018-09-10  9:54 ` Michal Hocko
  2018-09-10 11:27   ` Tetsuo Handa
  2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
  1 sibling, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-10  9:54 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, linux-mm, David Rientjes, Roman Gushchin

On Sat 08-09-18 13:54:12, Tetsuo Handa wrote:
[...]

I will not comment on the above because I have already done so and you
keep ignoring it so I will not waste my time again. But let me ask about
the following though

> This patch also fixes three possible bugs
> 
>   (1) oom_task_origin() tasks can be selected forever because it did not
>       check for MMF_OOM_SKIP.

Is this a real problem. Could you point to any path that wouldn't bail
out and oom_origin task would keep trying for ever? If such a path
doesn't exist and you believe it is too fragile and point out the older
bugs proving that then I can imagine we should care.

>   (2) sysctl_oom_kill_allocating_task path can be selected forever
>       because it did not check for MMF_OOM_SKIP.

Why is that a problem? sysctl_oom_kill_allocating_task doesn't have any
well defined semantic. It is a gross hack to save long and expensive oom
victim selection. If we were too strict we should even not allow anybody
else but an allocating task to be killed. So selecting it multiple times
doesn't sound harmful to me.

>   (3) CONFIG_MMU=n kernels might livelock because nobody except
>       is_global_init() case in __oom_kill_process() sets MMF_OOM_SKIP.

And now the obligatory question. Is this a real problem?
 
> which prevent proof of "the forward progress guarantee"
> and adds one optimization
> 
>   (4) oom_evaluate_task() was calling oom_unkillable_task() twice because
>       oom_evaluate_task() needs to check for !MMF_OOM_SKIP and
>       oom_task_origin() tasks before calling oom_badness().

ENOPARSE
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm, oom: Fix unnecessary killing of additional processes.
  2018-09-10  9:54 ` Michal Hocko
@ 2018-09-10 11:27   ` Tetsuo Handa
  2018-09-10 11:40     ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-10 11:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, David Rientjes, Roman Gushchin

On 2018/09/10 18:54, Michal Hocko wrote:
> On Sat 08-09-18 13:54:12, Tetsuo Handa wrote:
> [...]
> 
> I will not comment on the above because I have already done so and you
> keep ignoring it so I will not waste my time again.

Then, your NACK no longer stands.

>                                                     But let me ask about
> the following though
> 
>> This patch also fixes three possible bugs
>>
>>   (1) oom_task_origin() tasks can be selected forever because it did not
>>       check for MMF_OOM_SKIP.
> 
> Is this a real problem. Could you point to any path that wouldn't bail
> out and oom_origin task would keep trying for ever? If such a path
> doesn't exist and you believe it is too fragile and point out the older
> bugs proving that then I can imagine we should care.

That was my confusion: MMF_OOM_SKIP is checked before the oom_task_origin() test.

> 
>>   (2) sysctl_oom_kill_allocating_task path can be selected forever
>>       because it did not check for MMF_OOM_SKIP.
> 
> Why is that a problem? sysctl_oom_kill_allocating_task doesn't have any
> well defined semantic. It is a gross hack to save long and expensive oom
> victim selection. If we were too strict we should even not allow anybody
> else but an allocating task to be killed. So selecting it multiple times
> doesn't sound harmful to me.

After current thread was selected as an OOM victim by that code path and
the OOM reaper reaped current thread's memory, the OOM killer has to select
next OOM victim, for such situation means that current thread cannot bail
out due to __GFP_NOFAIL allocation. That is, similar to what you ignored

  if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
      return true;

change. That is, when

  If current thread is an OOM victim, it is guaranteed to make forward
  progress (unless __GFP_NOFAIL) by failing that allocation attempt after
  trying memory reserves. The OOM path does not need to do anything at all.

failed due to __GFP_NOFAIL, sysctl_oom_kill_allocating_task has to select
next OOM victim.

> 
>>   (3) CONFIG_MMU=n kernels might livelock because nobody except
>>       is_global_init() case in __oom_kill_process() sets MMF_OOM_SKIP.
> 
> And now the obligatory question. Is this a real problem?

I SAID "POSSIBLE BUGS". You have never heard is not a proof that the problem
is not occurring in the world. Not everybody is skillful enough to report
OOM (or low memory) problems to you!

>  
>> which prevent proof of "the forward progress guarantee"
>> and adds one optimization
>>
>>   (4) oom_evaluate_task() was calling oom_unkillable_task() twice because
>>       oom_evaluate_task() needs to check for !MMF_OOM_SKIP and
>>       oom_task_origin() tasks before calling oom_badness().
> 
> ENOPARSE
> 

Not difficult to parse at all.

oom_evaluate_task() {

  if (oom_unkillable_task(task, NULL, oc->nodemask))
    goto next;

  if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
    if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
      goto next;
    goto abort;
  }

  if (oom_task_origin(task)) {
    points = ULONG_MAX;
    goto select;
  }

  points = oom_badness(task, NULL, oc->nodemask, oc->totalpages) {

    if (oom_unkillable_task(p, memcg, nodemask))
      return 0;

  }
}

By moving oom_task_origin() to inside oom_badness(), and
by bringing !MMF_OOM_SKIP test earlier, we can eliminate

  oom_unkillable_task(task, NULL, oc->nodemask)

test in oom_evaluate_task().
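
Roughly, the flow after the patch collapses to the following (this is only a
sketch condensing the mm/oom_kill.c hunks in the patch, in the same notation
as above):

  oom_badness(p, memcg, nodemask, totalpages) {

    if (tsk_is_oom_victim(p))  /* waited for in oom_has_pending_victims() */
      return 0;

    if (oom_unkillable_task(p, memcg, nodemask))
      return 0;

    if (unlikely(oom_task_origin(p)))  /* marked to be killed first */
      return ULONG_MAX;

    /* OOM_SCORE_ADJ_MIN / vfork checks and the usual score as before */
  }

  oom_evaluate_task(task, arg) {

    points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
    /* the only oom_unkillable_task() test now happens inside oom_badness() */
    if (!points || points < oc->chosen_points)
      return;

    /* remember task as oc->chosen */
  }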


* Re: [PATCH v2] mm, oom: Fix unnecessary killing of additional processes.
  2018-09-10 11:27   ` Tetsuo Handa
@ 2018-09-10 11:40     ` Michal Hocko
  2018-09-10 12:52       ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-10 11:40 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, linux-mm, David Rientjes, Roman Gushchin

On Mon 10-09-18 20:27:21, Tetsuo Handa wrote:
> On 2018/09/10 18:54, Michal Hocko wrote:
> > On Sat 08-09-18 13:54:12, Tetsuo Handa wrote:
> > [...]
> > 
> > I will not comment on the above because I have already done so and you
> > keep ignoring it so I will not waste my time again.
> 
> Then, your NACK no longer stands.

And how exactly have you reached that conclusion? Nothing has really
changed. Except you keep pushing this crap no matter what you keep
hearing. You obviously do not worry to put words into my mouth.

> >>   (2) sysctl_oom_kill_allocating_task path can be selected forever
> >>       because it did not check for MMF_OOM_SKIP.
> > 
> > Why is that a problem? sysctl_oom_kill_allocating_task doesn't have any
> > well defined semantic. It is a gross hack to save long and expensive oom
> > victim selection. If we were too strict we should even not allow anybody
> > else but an allocating task to be killed. So selecting it multiple times
> > doesn't sound harmful to me.
> 
> After current thread was selected as an OOM victim by that code path and
> the OOM reaper reaped current thread's memory, the OOM killer has to select
> next OOM victim,

And how have you reached this conclusion. What kind of "kill the
allocating task" semantic really implies this?

> for such situation means that current thread cannot bail
> out due to __GFP_NOFAIL allocation. That is, similar to what you ignored
> 
>   if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
>       return true;
> 
> change. That is, when
> 
>   If current thread is an OOM victim, it is guaranteed to make forward
>   progress (unless __GFP_NOFAIL) by failing that allocation attempt after
>   trying memory reserves. The OOM path does not need to do anything at all.
> 
> failed due to __GFP_NOFAIL, sysctl_oom_kill_allocating_task has to select
> next OOM victim.

this doesn't make any sense

> >>   (3) CONFIG_MMU=n kernels might livelock because nobody except
> >>       is_global_init() case in __oom_kill_process() sets MMF_OOM_SKIP.
> > 
> > And now the obligatory question. Is this a real problem?
> 
> I SAID "POSSIBLE BUGS". You have never heard is not a proof that the problem
> is not occurring in the world. Not everybody is skillful enough to report
> OOM (or low memory) problems to you!

No, we are not making the code overly complex or convoluted for
theoretically possible issues we have never heard before.

> >> which prevent proof of "the forward progress guarantee"
> >> and adds one optimization
> >>
> >>   (4) oom_evaluate_task() was calling oom_unkillable_task() twice because
> >>       oom_evaluate_task() needs to check for !MMF_OOM_SKIP and
> >>       oom_task_origin() tasks before calling oom_badness().
> > 
> > ENOPARSE
> > 
> 
> Not difficult to parse at all.
> 
> oom_evaluate_task() {
> 
>   if (oom_unkillable_task(task, NULL, oc->nodemask))
>     goto next;
> 
>   if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
>     if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
>       goto next;
>     goto abort;
>   }
> 
>   if (oom_task_origin(task)) {
>     points = ULONG_MAX;
>     goto select;
>   }
> 
>   points = oom_badness(task, NULL, oc->nodemask, oc->totalpages) {
> 
>     if (oom_unkillable_task(p, memcg, nodemask))
>       return 0;
> 
>   }
> }
> 
> By moving oom_task_origin() to inside oom_badness(), and
> by bringing !MMF_OOM_SKIP test earlier, we can eliminate
> 
>   oom_unkillable_task(task, NULL, oc->nodemask)
> 
> test in oom_evaluate_task().

And so what?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] mm, oom: Fix unnecessary killing of additional processes.
  2018-09-10 11:40     ` Michal Hocko
@ 2018-09-10 12:52       ` Tetsuo Handa
  0 siblings, 0 replies; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-10 12:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, David Rientjes, Roman Gushchin

On 2018/09/10 20:40, Michal Hocko wrote:
> On Mon 10-09-18 20:27:21, Tetsuo Handa wrote:
>> On 2018/09/10 18:54, Michal Hocko wrote:
>>> On Sat 08-09-18 13:54:12, Tetsuo Handa wrote:
>>> [...]
>>>
>>> I will not comment on the above because I have already done so and you
>>> keep ignoring it so I will not waste my time again.
>>
>> Then, your NACK no longer stands.
> 
> And how exactly have you reached that conclusion? Nothing has really
> changed. Except you keep pushing this crap no matter what you keep
> hearing. You obviously do not worry to put words into my mouth.

THEN, PLEASE SHOW US YOUR PATCH. WHAT YOU ARE SAYING IS ONLY "WE DON'T CARE"
AND WHAT I AND DAVID ARE SAYING IS "WE DO CARE".

> 
>>>>   (2) sysctl_oom_kill_allocating_task path can be selected forever
>>>>       because it did not check for MMF_OOM_SKIP.
>>>
>>> Why is that a problem? sysctl_oom_kill_allocating_task doesn't have any
>>> well defined semantic. It is a gross hack to save long and expensive oom
>>> victim selection. If we were too strict we should even not allow anybody
>>> else but an allocating task to be killed. So selecting it multiple times
>>> doesn't sound harmful to me.
>>
>> After current thread was selected as an OOM victim by that code path and
>> the OOM reaper reaped current thread's memory, the OOM killer has to select
>> next OOM victim,
> 
> And how have you reached this conclusion. What kind of "kill the
> allocating task" semantic really implies this?

FOR THE PROOF OF "THE FORWARD PROGRESS GUARANTEE".

>>>>   (3) CONFIG_MMU=n kernels might livelock because nobody except
>>>>       is_global_init() case in __oom_kill_process() sets MMF_OOM_SKIP.
>>>
>>> And now the obligatory question. Is this a real problem?
>>
>> I SAID "POSSIBLE BUGS". You have never heard is not a proof that the problem
>> is not occurring in the world. Not everybody is skillful enough to report
>> OOM (or low memory) problems to you!
> 
> No, we are not making the code overly complex or convoluted for
> theoretically possible issues we have never heard before.

FOR THE PROOF OF "THE FORWARD PROGRESS GUARANTEE".

>>>> which prevent proof of "the forward progress guarantee"
>>>> and adds one optimization
>>>>
>>>>   (4) oom_evaluate_task() was calling oom_unkillable_task() twice because
>>>>       oom_evaluate_task() needs to check for !MMF_OOM_SKIP and
>>>>       oom_task_origin() tasks before calling oom_badness().
>>>
>>> ENOPARSE
>>>
>>
>> Not difficult to parse at all.
>>
>> oom_evaluate_task() {
>>
>>   if (oom_unkillable_task(task, NULL, oc->nodemask))
>>     goto next;
>>
>>   if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
>>     if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
>>       goto next;
>>     goto abort;
>>   }
>>
>>   if (oom_task_origin(task)) {
>>     points = ULONG_MAX;
>>     goto select;
>>   }
>>
>>   points = oom_badness(task, NULL, oc->nodemask, oc->totalpages) {
>>
>>     if (oom_unkillable_task(p, memcg, nodemask))
>>       return 0;
>>
>>   }
>> }
>>
>> By moving oom_task_origin() to inside oom_badness(), and
>> by bringing !MMF_OOM_SKIP test earlier, we can eliminate
>>
>>   oom_unkillable_task(task, NULL, oc->nodemask)
>>
>> test in oom_evaluate_task().
> 
> And so what?
> 

WE CAN MAKE THE RCU PROTECTED SECTION SHORTER. IT IS 99% WASTE TO TEST
oom_unkillable_task(task, NULL, oc->nodemask) FOR UNLIKELY MMF_OOM_SKIP
OR oom_task_origin() CASES.


* [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-08  4:54 [PATCH v2] mm, oom: Fix unnecessary killing of additional processes Tetsuo Handa
  2018-09-10  9:54 ` Michal Hocko
@ 2018-09-10 12:55 ` Michal Hocko
  2018-09-10 12:55   ` [RFC PATCH 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
                     ` (4 more replies)
  1 sibling, 5 replies; 31+ messages in thread
From: Michal Hocko @ 2018-09-10 12:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Roman Gushchin, Andrew Morton

I am sending this as a follow up to yet another timeout based proposal
by Tetsuo, because I was accused of pushing for a solution which I am not
working on.

This is a very coarse implementation of the idea I've had before.
Please note that I haven't tested it yet. It is mostly to show the
direction I would wish to go for.

I have already explained why I hate timeout based retry loops and why
I believe a more direct, feedback based approach is much better.

The locking protocol between the oom_reaper and the exit path is as
follows.

All parts which cannot race should use the exclusive lock on the exit
path. Once the exit path has passed the point after which no blocking
locks are taken, it clears mm->mmap under the exclusive lock. It is
trivial to use an MMF_$FOO flag for this purpose instead if people think
that is safer or better for any other reason.

The oom proper is the only one which sets MMF_OOM_SKIP with these
patches, which is IMHO better because it is much easier to understand the
interaction this way.
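
Roughly, the handover looks like this (a condensed sketch of patches 1/3 and
3/3 below; the *_sketch names exist only for this summary):

  /* exit path side */
  void exit_mmap_sketch(struct mm_struct *mm)
  {
    bool oom = mm_is_oom_victim(mm);

    /* munlock_vma_pages_all() under the exclusive lock (patch 1/3),
       then unmap_vmas(), which may race with oom_reap_task_mm() */

    if (oom)
      down_write(&mm->mmap_sem);
    /* unlink_anon_vmas()/unlink_file_vma() for every vma */
    if (oom) {
      mm->mmap = NULL;  /* past the last point of blocking */
      up_write(&mm->mmap_sem);
    }
    /* free the page tables and remove the vmas without further blocking */
  }

  /* oom_reaper side */
  void oom_reap_task_sketch(struct task_struct *tsk, struct mm_struct *mm)
  {
    /* __oom_reap_task_mm() retries run here */

    /*
     * Don't set MMF_OOM_SKIP from the reaper once the exit path has
     * signalled, by clearing mm->mmap, that it will finish without
     * blocking.
     */
    if (mm->mmap)
      set_bit(MMF_OOM_SKIP, &mm->flags);
  }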

Last but not least, this is the core of the implementation. We might
want to tune the number of oom_reaper retries or think about a more
effective tear down when there are multiple oom victims in the queue. I
would simply care about those later once we see clear evidence that this
is needed, ideally with a workload description and some numbers.

If this looks like a feasible idea I can spend more time on this and
turn it into something mergeable.


* [RFC PATCH 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization
  2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
@ 2018-09-10 12:55   ` Michal Hocko
  2018-09-10 12:55   ` [RFC PATCH 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left Michal Hocko
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-09-10 12:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Roman Gushchin, Andrew Morton, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The oom_reaper cannot handle mlocked vmas right now, and therefore we
rely on exit_mmap to reap the memory before it clears the mlock flags on
mappings. This is all good, but we would like to have a better hand over
protocol between the oom_reaper and exit_mmap paths.

Therefore use the exclusive mmap_sem in exit_mmap whenever exit_mmap has to
synchronize with the oom_reaper. There are two notable places: mlocked
vmas (munlock_vma_pages_all) and the page table tear down path. All others
should be fine to race with oom_reap_task_mm.

This is mostly a preparatory patch which shouldn't introduce functional
changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmap.c | 48 +++++++++++++++++++++++-------------------------
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b184c60..3481424717ac 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3042,39 +3042,29 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool oom = mm_is_oom_victim(mm);
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_sem here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
-		 */
-		(void)__oom_reap_task_mm(mm);
-
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
-
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
-			if (vma->vm_flags & VM_LOCKED)
+			if (vma->vm_flags & VM_LOCKED) {
+				/*
+				 * oom_reaper cannot handle mlocked vmas but we
+				 * need to serialize it with munlock_vma_pages_all
+				 * which clears VM_LOCKED, otherwise the oom reaper
+				 * cannot reliably test it.
+				 */
+				if (oom)
+					down_write(&mm->mmap_sem);
+
 				munlock_vma_pages_all(vma);
+
+				if (oom)
+					up_write(&mm->mmap_sem);
+			}
 			vma = vma->vm_next;
 		}
 	}
@@ -3091,6 +3081,11 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
+
+	/* oom_reaper cannot race with the page tables teardown */
+	if (oom)
+		down_write(&mm->mmap_sem);
+
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3104,6 +3099,9 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+
+	if (oom)
+		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
-- 
2.18.0


* [RFC PATCH 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left
  2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
  2018-09-10 12:55   ` [RFC PATCH 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
@ 2018-09-10 12:55   ` Michal Hocko
  2018-09-10 12:55   ` [RFC PATCH 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guaranteed to finish Michal Hocko
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-09-10 12:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Roman Gushchin, Andrew Morton, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reaper is not able to reap all types of memory. E.g. mlocked
mappings or page tables. In some cases this might be a lot of memory
and we do rely on exit_mmap to release that memory. Yet we cannot rely
on exit_mmap to set MMF_OOM_SKIP right now because there are several
places where sleeping locks are taken.

This patch adds a simple heuristic which checks the amount of memory
the mm is still sitting on after the oom_reaper is done with it. If this
is still a few megabytes (the threshold is a subject for further tuning
based on real world use cases) then simply keep retrying oom_reap_task_mm.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..049e67dc039b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -189,6 +189,16 @@ static bool is_dump_unreclaim_slabs(void)
 	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
 }
 
+/*
+ * Rough memory consumption of the given mm which should be theoretically freed
+ * when the mm is removed.
+ */
+static unsigned long oom_badness_pages(struct mm_struct *mm)
+{
+	return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+		mm_pgtables_bytes(mm) / PAGE_SIZE;
+}
+
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -230,8 +240,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
-	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+	points = oom_badness_pages(p->mm);
 	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
@@ -532,6 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 		}
 	}
 
+	/*
+	 * If we still sit on a noticeable amount of memory even after successfully
+	 * reaping the address space then keep retrying until exit_mmap makes some
+	 * further progress.
+	 * TODO: add a flag for a stage when the exit path doesn't block anymore
+	 * and hand over MMF_OOM_SKIP handling there in that case
+	 */
+	if (ret && oom_badness_pages(mm) > 1024)
+		ret = false;
+
 	return ret;
 }
 
-- 
2.18.0


* [RFC PATCH 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guaranteed to finish
  2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
  2018-09-10 12:55   ` [RFC PATCH 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
  2018-09-10 12:55   ` [RFC PATCH 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left Michal Hocko
@ 2018-09-10 12:55   ` Michal Hocko
  2018-09-10 14:59   ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Tetsuo Handa
  2018-09-11 14:01   ` Tetsuo Handa
  4 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-09-10 12:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Roman Gushchin, Andrew Morton, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

David Rientjes has noted that certain user space memory allocators leave
a lot of page tables behind and the current implementation of oom_reaper
doesn't deal with those workloads very well. In order to improve these
workloads define a point when exit_mmap is guaranteed to finish the tear
down without any further blocking etc. This is right after we unlink
vmas (those still depend on locks which are held while performing memory
allocations from other contexts) and before we start releasing page
tables.

Opencode free_pgtables and explicitly unlink all vmas first. Then set
mm->mmap to NULL (there shouldn't be anybody looking at it at this
stage) and check for mm->mmap in the oom_reaper path. If the mm->mmap
is NULL we rely on the exit path and won't set MMF_OOM_SKIP from the
reaper.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmap.c     | 24 ++++++++++++++++++++----
 mm/oom_kill.c | 13 +++++++------
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3481424717ac..99bb9ce29bc5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3085,8 +3085,27 @@ void exit_mmap(struct mm_struct *mm)
 	/* oom_reaper cannot race with the page tables teardown */
 	if (oom)
 		down_write(&mm->mmap_sem);
+	/*
+	 * Hide vma from rmap and truncate_pagecache before freeing
+	 * pgtables
+	 */
+	while (vma) {
+		unlink_anon_vmas(vma);
+		unlink_file_vma(vma);
+		vma = vma->vm_next;
+	}
+	vma = mm->mmap;
+	if (oom) {
+		/*
+		 * the exit path is guaranteed to finish without any unbound
+		 * blocking at this stage so make it clear to the caller.
+		 */
+		mm->mmap = NULL;
+		up_write(&mm->mmap_sem);
+	}
 
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
+			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
@@ -3099,9 +3118,6 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
-
-	if (oom)
-		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 049e67dc039b..0ebf93c76c81 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -570,12 +570,10 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * If exit path clear mm->mmap then we know it will finish the tear down
+	 * and we can go and bail out here.
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (!mm->mmap) {
 		trace_skip_task_reaping(tsk->pid);
 		goto out_unlock;
 	}
@@ -624,8 +622,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 * Leave the MMF_OOM_SKIP to the exit path if it managed to reach the
+	 * point it is guaranteed to finish without any blocking
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (mm->mmap)
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-- 
2.18.0


* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
                     ` (2 preceding siblings ...)
  2018-09-10 12:55   ` [RFC PATCH 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guaranteed to finish Michal Hocko
@ 2018-09-10 14:59   ` Tetsuo Handa
  2018-09-10 15:11     ` Michal Hocko
  2018-09-11 14:01   ` Tetsuo Handa
  4 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-10 14:59 UTC (permalink / raw)
  To: Michal Hocko, linux-mm; +Cc: Roman Gushchin, Andrew Morton

Thank you for proposing a patch.

On 2018/09/10 21:55, Michal Hocko wrote:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5f2b2b1..99bb9ce 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3091,7 +3081,31 @@ void exit_mmap(struct mm_struct *mm)
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);

unmap_vmas() might involve hugepage path. Is it safe to race with the OOM reaper?

  i_mmap_lock_write(vma->vm_file->f_mapping);
  __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
  i_mmap_unlock_write(vma->vm_file->f_mapping);

> -	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> +
> +	/* oom_reaper cannot race with the page tables teardown */
> +	if (oom)
> +		down_write(&mm->mmap_sem);
> +	/*
> +	 * Hide vma from rmap and truncate_pagecache before freeing
> +	 * pgtables
> +	 */
> +	while (vma) {
> +		unlink_anon_vmas(vma);
> +		unlink_file_vma(vma);
> +		vma = vma->vm_next;
> +	}
> +	vma = mm->mmap;
> +	if (oom) {
> +		/*
> +		 * the exit path is guaranteed to finish without any unbound
> +		 * blocking at this stage so make it clear to the caller.
> +		 */
> +		mm->mmap = NULL;
> +		up_write(&mm->mmap_sem);
> +	}
> +
> +	free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
> +			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);

Are you trying to inline free_pgtables() here? But some architectures are
using hugetlb_free_pgd_range() which does more than free_pgd_range(). Are
they really safe (with regard to memory allocation dependency and flags
manipulation) ?

>  	tlb_finish_mmu(&tlb, 0, -1);
>  
>  	/*

Also, how do you plan to give this thread enough CPU resources, for this thread might
be SCHED_IDLE priority? Since this thread might not be a thread which is exiting
(because this is merely a thread which invoked __mmput()), we can't use boosting
approach. CPU resource might be given eventually unless schedule_timeout_*() is used,
but it might be deadly slow if allocating threads keep wasting CPU resources.

Also, why MMF_OOM_SKIP will not be set if the OOM reaper handed over?


* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-10 14:59   ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Tetsuo Handa
@ 2018-09-10 15:11     ` Michal Hocko
  2018-09-10 15:40       ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-10 15:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Mon 10-09-18 23:59:02, Tetsuo Handa wrote:
> Thank you for proposing a patch.
> 
> On 2018/09/10 21:55, Michal Hocko wrote:
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 5f2b2b1..99bb9ce 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3091,7 +3081,31 @@ void exit_mmap(struct mm_struct *mm)
> >  	/* update_hiwater_rss(mm) here? but nobody should be looking */
> >  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
> >  	unmap_vmas(&tlb, vma, 0, -1);
> 
> unmap_vmas() might involve hugepage path. Is it safe to race with the OOM reaper?
> 
>   i_mmap_lock_write(vma->vm_file->f_mapping);
>   __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
>   i_mmap_unlock_write(vma->vm_file->f_mapping);

We do not unmap hugetlb pages in the oom reaper.

> > -	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > +
> > +	/* oom_reaper cannot race with the page tables teardown */
> > +	if (oom)
> > +		down_write(&mm->mmap_sem);
> > +	/*
> > +	 * Hide vma from rmap and truncate_pagecache before freeing
> > +	 * pgtables
> > +	 */
> > +	while (vma) {
> > +		unlink_anon_vmas(vma);
> > +		unlink_file_vma(vma);
> > +		vma = vma->vm_next;
> > +	}
> > +	vma = mm->mmap;
> > +	if (oom) {
> > +		/*
> > +		 * the exit path is guaranteed to finish without any unbound
> > +		 * blocking at this stage so make it clear to the caller.
> > +		 */
> > +		mm->mmap = NULL;
> > +		up_write(&mm->mmap_sem);
> > +	}
> > +
> > +	free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
> > +			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> 
> Are you trying to inline free_pgtables() here? But some architectures are
> using hugetlb_free_pgd_range() which does more than free_pgd_range(). Are
> they really safe (with regard to memory allocation dependency and flags
> manipulation) ?

This is something for me to double check of course. A cursory look
suggests that ppc just does some address manipulations because
free_pgtables can be called from the unmap path and that might cut a
mapping into non-hugetlb pieces. This is not possible in the full tear
down though.

> 
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  
> >  	/*
> 
> Also, how do you plan to give this thread enough CPU resources, for this thread might
> be SCHED_IDLE priority? Since this thread might not be a thread which is exiting
> (because this is merely a thread which invoked __mmput()), we can't use boosting
> approach. CPU resource might be given eventually unless schedule_timeout_*() is used,
> but it might be deadly slow if allocating threads keep wasting CPU resources.

This is OOM path which is glacial slow path. This is btw. no different
from any other low priority tasks sitting on a lot of memory trying to
release the memory (either by unmapping or exiting). Why should this
particular case be any different?

> Also, why MMF_OOM_SKIP will not be set if the OOM reaper handed over?

The idea is that the mm is not visible to anybody (except for the oom
reaper) anymore. So MMF_OOM_SKIP shouldn't matter.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-10 15:11     ` Michal Hocko
@ 2018-09-10 15:40       ` Tetsuo Handa
  2018-09-10 16:44         ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-10 15:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On 2018/09/11 0:11, Michal Hocko wrote:
> On Mon 10-09-18 23:59:02, Tetsuo Handa wrote:
>> Thank you for proposing a patch.
>>
>> On 2018/09/10 21:55, Michal Hocko wrote:
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 5f2b2b1..99bb9ce 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -3091,7 +3081,31 @@ void exit_mmap(struct mm_struct *mm)
>>>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>>>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>>>  	unmap_vmas(&tlb, vma, 0, -1);
>>
>> unmap_vmas() might involve hugepage path. Is it safe to race with the OOM reaper?
>>
>>   i_mmap_lock_write(vma->vm_file->f_mapping);
>>   __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
>>   i_mmap_unlock_write(vma->vm_file->f_mapping);
> 
> We do not unmap hugetlb pages in the oom reaper.
> 

But the OOM reaper can run while __unmap_hugepage_range_final() is in progress.
Then, I worry about an overlooked race similar to the one with clearing the
VM_LOCKED flag.

> 
>>
>>>  	tlb_finish_mmu(&tlb, 0, -1);
>>>  
>>>  	/*
>>
>> Also, how do you plan to give this thread enough CPU resources, for this thread might
>> be SCHED_IDLE priority? Since this thread might not be a thread which is exiting
>> (because this is merely a thread which invoked __mmput()), we can't use boosting
>> approach. CPU resource might be given eventually unless schedule_timeout_*() is used,
>> but it might be deadly slow if allocating threads keep wasting CPU resources.
> 
> This is OOM path which is glacial slow path. This is btw. no different
> from any other low priority tasks sitting on a lot of memory trying to
> release the memory (either by unmapping or exiting). Why should this
> particular case be any different?
> 

Not a problem if not under OOM situation. Since the OOM killer keeps wasting
CPU resources until memory reclaim completes, we want to solve OOM situation
as soon as possible.

>> Also, why MMF_OOM_SKIP will not be set if the OOM reaper handed over?
> 
> The idea is that the mm is not visible to anybody (except for the oom
> reaper) anymore. So MMF_OOM_SKIP shouldn't matter.
> 

I think it absolutely matters. The OOM killer waits until MMF_OOM_SKIP is set
on a mm which is visible via task_struct->signal->oom_mm .
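
Roughly speaking (paraphrasing the oom_evaluate_task() check from memory, not
verbatim), victim selection keeps backing off as long as the previous victim's
oom_mm is not flagged:

        if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                /* a previous victim which has not finished yet */
                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                        goto next;      /* nothing more to free, look at other tasks */
                goto abort;             /* otherwise no new victim is selected */
        }

So unless somebody eventually sets MMF_OOM_SKIP on signal->oom_mm, no further
victim is selected while allocating threads keep retrying.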

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-10 15:40       ` Tetsuo Handa
@ 2018-09-10 16:44         ` Michal Hocko
  2018-09-12  3:06           ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-10 16:44 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Tue 11-09-18 00:40:23, Tetsuo Handa wrote:
> On 2018/09/11 0:11, Michal Hocko wrote:
> > On Mon 10-09-18 23:59:02, Tetsuo Handa wrote:
> >> Thank you for proposing a patch.
> >>
> >> On 2018/09/10 21:55, Michal Hocko wrote:
> >>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>> index 5f2b2b1..99bb9ce 100644
> >>> --- a/mm/mmap.c
> >>> +++ b/mm/mmap.c
> >>> @@ -3091,7 +3081,31 @@ void exit_mmap(struct mm_struct *mm)
> >>>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
> >>>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
> >>>  	unmap_vmas(&tlb, vma, 0, -1);
> >>
> >> unmap_vmas() might involve hugepage path. Is it safe to race with the OOM reaper?
> >>
> >>   i_mmap_lock_write(vma->vm_file->f_mapping);
> >>   __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
> >>   i_mmap_unlock_write(vma->vm_file->f_mapping);
> > 
> > We do not unmap hugetlb pages in the oom reaper.
> > 
> 
> But the OOM reaper can run while __unmap_hugepage_range_final() is in progress.
> Then, I worry about an overlooked race similar to the one with clearing the
> VM_LOCKED flag.

But VM_HUGETLB is a persistent flag unlike VM_LOCKED IIRC.
 
> >>>  	tlb_finish_mmu(&tlb, 0, -1);
> >>>  
> >>>  	/*
> >>
> >> Also, how do you plan to give this thread enough CPU resources, for this thread might
> >> be SCHED_IDLE priority? Since this thread might not be a thread which is exiting
> >> (because this is merely a thread which invoked __mmput()), we can't use boosting
> >> approach. CPU resource might be given eventually unless schedule_timeout_*() is used,
> >> but it might be deadly slow if allocating threads keep wasting CPU resources.
> > 
> > This is OOM path which is glacial slow path. This is btw. no different
> > from any other low priority tasks sitting on a lot of memory trying to
> > release the memory (either by unmapping or exiting). Why should this
> > particular case be any different?
> > 
> 
> Not a problem if not under OOM situation. Since the OOM killer keeps wasting
> CPU resources until memory reclaim completes, we want to solve OOM situation
> as soon as possible.

OK, it seems that yet again we have a deep disagreement here. The point
of the OOM is to get the system out of the desperate situation. The
objective is to free up _some_ memory. Your QoS is completely off at
that moment.
 
> >> Also, why MMF_OOM_SKIP will not be set if the OOM reaper handed over?
> > 
> > The idea is that the mm is not visible to anybody (except for the oom
> > reaper) anymore. So MMF_OOM_SKIP shouldn't matter.
> > 
> 
> I think it absolutely matters. The OOM killer waits until MMF_OOM_SKIP is set
> on a mm which is visible via task_struct->signal->oom_mm .

Hmm, I have to re-read the exit path once again and see when we unhash
the task and how many dangerous things we do in the mean time. I might
have been overly optimistic and you might be right that we indeed have
to set MMF_OOM_SKIP after all.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
                     ` (3 preceding siblings ...)
  2018-09-10 14:59   ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Tetsuo Handa
@ 2018-09-11 14:01   ` Tetsuo Handa
  2018-09-12  7:50     ` Michal Hocko
  4 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-11 14:01 UTC (permalink / raw)
  To: Michal Hocko, linux-mm; +Cc: Roman Gushchin, Andrew Morton

On 2018/09/10 21:55, Michal Hocko wrote:
> This is a very coarse implementation of the idea I've had before.
> Please note that I haven't tested it yet. It is mostly to show the
> direction I would wish to go for.

Hmm, this patchset does not allow me to boot. ;-)

        free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
                        FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);

[    1.875675] sched_clock: Marking stable (1810466565, 65169393)->(1977240380, -101604422)
[    1.877833] registered taskstats version 1
[    1.877853] Loading compiled-in X.509 certificates
[    1.878835] zswap: loaded using pool lzo/zbud
[    1.880835] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[    1.881792] PGD 0 P4D 0 
[    1.881812] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[    1.882792] CPU: 1 PID: 121 Comm: modprobe Not tainted 4.19.0-rc3+ #469
[    1.883803] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[    1.884792] RIP: 0010:exit_mmap+0x122/0x1f0
[    1.884812] Code: 8b 5b 10 48 85 db 75 e7 45 84 e4 48 8b 45 00 0f 85 9a 00 00 00 48 8b 50 18 48 8b 30 48 8d 7c 24 08 45 31 c0 31 c9 48 89 04 24 <48> 8b 52 08 e8 45 3b ff ff 48 8d 7c 24 08 31 f6 48 c7 c2 ff ff ff
[    1.886793] RSP: 0018:ffffc90000897de0 EFLAGS: 00010246
[    1.887812] RAX: ffff88013494fcc0 RBX: 0000000000000000 RCX: 0000000000000000
[    1.888872] RDX: 0000000000000000 RSI: 0000000000400000 RDI: ffffc90000897de8
[    1.889794] RBP: ffff880134950040 R08: 0000000000000000 R09: 0000000000000000
[    1.890792] R10: 0000000000000001 R11: 0000000000081741 R12: 0000000000000000
[    1.891794] R13: ffff8801348fe240 R14: ffff8801348fe928 R15: 0000000000000000
[    1.892836] FS:  0000000000000000(0000) GS:ffff88013b240000(0000) knlGS:0000000000000000
[    1.893792] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.894792] CR2: 0000000000000008 CR3: 000000000220f001 CR4: 00000000001606e0
[    1.895797] Call Trace:
[    1.895817]  ? switch_mm_irqs_off+0x2e1/0x870
[    1.895837]  mmput+0x63/0x130
[    1.895857]  do_exit+0x2a7/0xc80
[    1.895877]  ? __do_page_fault+0x219/0x520
[    1.896793]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[    1.896813]  do_group_exit+0x41/0xc0
[    1.896833]  __x64_sys_exit_group+0xf/0x10
[    1.896853]  do_syscall_64+0x4f/0x1f0
[    1.896873]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[    1.896893] RIP: 0033:0x7fa50e122909
[    1.896913] Code: Bad RIP value.
[    1.896933] RSP: 002b:00007fff0fdb96a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[    1.897792] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa50e122909
[    1.898792] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
[    1.899795] RBP: 00007fa50e41f838 R08: 000000000000003c R09: 00000000000000e7
[    1.900800] R10: ffffffffffffff70 R11: 0000000000000246 R12: 00007fa50e41f838
[    1.901793] R13: 00007fa50e424e80 R14: 0000000000000000 R15: 0000000000000000
[    1.902796] Modules linked in:
[    1.902816] CR2: 0000000000000008
[    1.902836] ---[ end trace a1a4ea7953190d43 ]---
[    1.902856] RIP: 0010:exit_mmap+0x122/0x1f0
[    1.902876] Code: 8b 5b 10 48 85 db 75 e7 45 84 e4 48 8b 45 00 0f 85 9a 00 00 00 48 8b 50 18 48 8b 30 48 8d 7c 24 08 45 31 c0 31 c9 48 89 04 24 <48> 8b 52 08 e8 45 3b ff ff 48 8d 7c 24 08 31 f6 48 c7 c2 ff ff ff
[    1.905792] RSP: 0018:ffffc90000897de0 EFLAGS: 00010246
[    1.906792] RAX: ffff88013494fcc0 RBX: 0000000000000000 RCX: 0000000000000000
[    1.907799] RDX: 0000000000000000 RSI: 0000000000400000 RDI: ffffc90000897de8
[    1.908837] RBP: ffff880134950040 R08: 0000000000000000 R09: 0000000000000000
[    1.909814] R10: 0000000000000001 R11: 0000000000081741 R12: 0000000000000000
[    1.910812] R13: ffff8801348fe240 R14: ffff8801348fe928 R15: 0000000000000000
[    1.911807] FS:  0000000000000000(0000) GS:ffff88013b240000(0000) knlGS:0000000000000000
[    1.912792] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.913820] CR2: 00007fa50e1228df CR3: 000000000220f001 CR4: 00000000001606e0
[    1.914812] Fixing recursive fault but reboot is needed!
[    2.076860] input: ImPS/2 Generic Wheel Mouse as /devices/platform/i8042/serio1/input/input3
[    2.667963] tsc: Refined TSC clocksource calibration: 2793.558 MHz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-10 16:44         ` Michal Hocko
@ 2018-09-12  3:06           ` Tetsuo Handa
  2018-09-12  7:18             ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-12  3:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton

Michal Hocko wrote:
> On Tue 11-09-18 00:40:23, Tetsuo Handa wrote:
> > >> Also, why MMF_OOM_SKIP will not be set if the OOM reaper handed over?
> > > 
> > > The idea is that the mm is not visible to anybody (except for the oom
> > > reaper) anymore. So MMF_OOM_SKIP shouldn't matter.
> > > 
> > 
> > I think it absolutely matters. The OOM killer waits until MMF_OOM_SKIP is set
> > on a mm which is visible via task_struct->signal->oom_mm .
> 
> Hmm, I have to re-read the exit path once again and see when we unhash
> the task and how many dangerous things we do in the mean time. I might
> have been overly optimistic and you might be right that we indeed have
> to set MMF_OOM_SKIP after all.

What a foolhardy attempt!

Commit d7a94e7e11badf84 ("oom: don't count on mm-less current process") says

    out_of_memory() doesn't trigger the OOM killer if the current task is
    already exiting or it has fatal signals pending, and gives the task
    access to memory reserves instead.  However, doing so is wrong if
    out_of_memory() is called by an allocation (e.g. from exit_task_work())
    after the current task has already released its memory and cleared
    TIF_MEMDIE at exit_mm().  If we again set TIF_MEMDIE to post-exit_mm()
    current task, the OOM killer will be blocked by the task sitting in the
    final schedule() waiting for its parent to reap it.  It will trigger an
    OOM livelock if its parent is unable to reap it due to doing an
    allocation and waiting for the OOM killer to kill it.

and your

+               /*
+                * the exit path is guaranteed to finish without any unbound
+                * blocking at this stage so make it clear to the caller.
+                */

attempt is essentially the same as "we keep TIF_MEMDIE of post-exit_mm() task".

That is, we can't expect that the OOM victim can finish without any unbound
blocking. We have no choice but timeout based heuristic if we don't want to
set MMF_OOM_SKIP even with your customized version of free_pgtables().
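
For reference, a timeout based back off does not have to be anything fancier
than what oom_reap_task() already does for the down_read_trylock() retries;
roughly (paraphrased, the numbers are just what the current code happens to use):

        while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
                schedule_timeout_idle(HZ/10);

        /*
         * Whether or not the reap made progress, eventually give up and let
         * the OOM killer move on to the next victim.
         */
        set_bit(MMF_OOM_SKIP, &mm->flags);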

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-12  3:06           ` Tetsuo Handa
@ 2018-09-12  7:18             ` Michal Hocko
  2018-09-12  7:58               ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-12  7:18 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Wed 12-09-18 12:06:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 11-09-18 00:40:23, Tetsuo Handa wrote:
> > > >> Also, why MMF_OOM_SKIP will not be set if the OOM reaper handed over?
> > > > 
> > > > The idea is that the mm is not visible to anybody (except for the oom
> > > > reaper) anymore. So MMF_OOM_SKIP shouldn't matter.
> > > > 
> > > 
> > > I think it absolutely matters. The OOM killer waits until MMF_OOM_SKIP is set
> > > on a mm which is visible via task_struct->signal->oom_mm .
> > 
> > Hmm, I have to re-read the exit path once again and see when we unhash
> > the task and how many dangerous things we do in the mean time. I might
> > have been overly optimistic and you might be right that we indeed have
> > to set MMF_OOM_SKIP after all.
> 
> What a foolhardy attempt!
> 
> Commit d7a94e7e11badf84 ("oom: don't count on mm-less current process") says
> 
>     out_of_memory() doesn't trigger the OOM killer if the current task is
>     already exiting or it has fatal signals pending, and gives the task
>     access to memory reserves instead.  However, doing so is wrong if
>     out_of_memory() is called by an allocation (e.g. from exit_task_work())
>     after the current task has already released its memory and cleared
>     TIF_MEMDIE at exit_mm().  If we again set TIF_MEMDIE to post-exit_mm()
>     current task, the OOM killer will be blocked by the task sitting in the
>     final schedule() waiting for its parent to reap it.  It will trigger an
>     OOM livelock if its parent is unable to reap it due to doing an
>     allocation and waiting for the OOM killer to kill it.
> 
> and your
> 
> +               /*
> +                * the exit path is guaranteed to finish without any unbound
> +                * blocking at this stage so make it clear to the caller.
> +                */

This comment was meant to tell that the tear down will not block for
unbound amount of time.

> attempt is essentially the same as "we keep TIF_MEMDIE of post-exit_mm() task".
> 
> That is, we can't expect that the OOM victim can finish without any unbound
> blocking. We have no choice but timeout based heuristic if we don't want to
> set MMF_OOM_SKIP even with your customized version of free_pgtables().

OK, I will fold the following to the patch

commit e57a1e84db95906e6505de26db896f1b66b5b057
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Sep 11 13:09:16 2018 +0200

    fold me "mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish"
    
    - the task is still visible to the OOM killer after exit_mmap terminates
      so we should set MMF_OOM_SKIP from that path to be sure the oom killer
      doesn't get stuck on this task (see d7a94e7e11badf84 for more context)
      - per Tetsuo

diff --git a/mm/mmap.c b/mm/mmap.c
index 99bb9ce29bc5..64e8ccce5282 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3097,8 +3097,9 @@ void exit_mmap(struct mm_struct *mm)
 	vma = mm->mmap;
 	if (oom) {
 		/*
-		 * the exit path is guaranteed to finish without any unbound
-		 * blocking at this stage so make it clear to the caller.
+		 * the exit path is guaranteed to finish the memory tear down
+		 * without any unbound blocking at this stage so make it clear
+		 * to the oom_reaper
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
@@ -3118,6 +3119,13 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+
+	/*
+	 * Now that the full address space is torn down, make sure the
+	 * OOM killer skips over this task
+	 */
+	if (oom)
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 }
 
 /* Insert vm structure into process list sorted by address

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-11 14:01   ` Tetsuo Handa
@ 2018-09-12  7:50     ` Michal Hocko
  2018-09-12 13:42       ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-12  7:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Tue 11-09-18 23:01:57, Tetsuo Handa wrote:
> On 2018/09/10 21:55, Michal Hocko wrote:
> > This is a very coarse implementation of the idea I've had before.
> > Please note that I haven't tested it yet. It is mostly to show the
> > direction I would wish to go for.
> 
> Hmm, this patchset does not allow me to boot. ;-)
> 
>         free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
>                         FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> 
> [    1.875675] sched_clock: Marking stable (1810466565, 65169393)->(1977240380, -101604422)
> [    1.877833] registered taskstats version 1
> [    1.877853] Loading compiled-in X.509 certificates
> [    1.878835] zswap: loaded using pool lzo/zbud
> [    1.880835] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

This is vm_prev == NULL. I thought we always have vm_prev as long as
this is not a single VMA in the address space. I will double check this.
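
FWIW the faulting address matches that diagnosis: vm_end is the second
unsigned long in vm_area_struct, so loading vm_prev->vm_end with a NULL
vm_prev faults at offset 8, which is exactly what the oops reports:

        struct vm_area_struct {
                unsigned long vm_start;         /* offset 0 */
                unsigned long vm_end;           /* offset 8 on 64bit */
                /* ... */
        };

        /* the crashing access, with vma == mm->mmap and vma->vm_prev == NULL */
        free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
                        FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);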
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-12  7:18             ` Michal Hocko
@ 2018-09-12  7:58               ` Tetsuo Handa
  2018-09-12  8:17                 ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-12  7:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton

Michal Hocko wrote:
> OK, I will fold the following to the patch

OK. But at that point, my patch which tries to wait for reclaimed memory
to be re-allocatable addresses a different problem which you are refusing.



By the way, is it guaranteed that vma->vm_ops->close(vma) in remove_vma() never
sleeps? Since remove_vma() has might_sleep() since 2005, and that might_sleep()
predates the git history, I don't know what that ->close() would do.

Anyway, please fix free_pgd_range() crash in this patchset.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-12  7:58               ` Tetsuo Handa
@ 2018-09-12  8:17                 ` Michal Hocko
  2018-09-12 10:59                   ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-12  8:17 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Wed 12-09-18 16:58:53, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, I will fold the following to the patch
> 
> OK. But at that point, my patch which tries to wait for reclaimed memory
> to be re-allocatable addresses a different problem which you are refusing.

I am trying to address a real world example of when the excessive amount
of memory is in page tables. As David pointed, this can happen with some
userspace allocators.

> By the way, is it guaranteed that vma->vm_ops->close(vma) in remove_vma() never
> sleeps? Since remove_vma() has might_sleep() since 2005, and that might_sleep()
> predates the git history, I don't know what that ->close() would do.

Hmm, I am afraid we cannot assume anything so we have to consider it
unsafe. A cursory look at some callers shows that they are taking locks.
E.g. drm_gem_object_put_unlocked might take a mutex. So MMF_OOM_SKIP
would have to be set right after releasing page tables.
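
I.e. roughly this ordering in exit_mmap(), sketch only, not a tested delta:

        tlb_finish_mmu(&tlb, 0, -1);

        /*
         * The page tables are gone at this point, so there is nothing left
         * for the oom_reaper to free; flag the mm before the remove_vma()
         * loop which might sleep in vma->vm_ops->close().
         */
        if (oom)
                set_bit(MMF_OOM_SKIP, &mm->flags);

        /*
         * Walk the list again, actually closing and freeing it,
         * with preemption enabled, without holding any MM locks.
         */
        while (vma) {
                if (vma->vm_flags & VM_ACCOUNT)
                        nr_accounted += vma_pages(vma);
                vma = remove_vma(vma);
        }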

> Anyway, please fix free_pgd_range() crash in this patchset.

I will try to get to this later today.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-12  8:17                 ` Michal Hocko
@ 2018-09-12 10:59                   ` Tetsuo Handa
  2018-09-12 11:22                     ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-12 10:59 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On 2018/09/12 17:17, Michal Hocko wrote:
> On Wed 12-09-18 16:58:53, Tetsuo Handa wrote:
>> Michal Hocko wrote:
>>> OK, I will fold the following to the patch
>>
>> OK. But at that point, my patch which tries to wait for reclaimed memory
>> to be re-allocatable addresses a different problem which you are refusing.
> 
> I am trying to address a real world example of when the excessive amount
> of memory is in page tables. As David pointed, this can happen with some
> userspace allocators.

My patch or David's patch will address it as well, without scattering
down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) like your attempt.

> 
>> By the way, is it guaranteed that vma->vm_ops->close(vma) in remove_vma() never
>> sleeps? Since remove_vma() has might_sleep() since 2005, and that might_sleep()
>> predates the git history, I don't know what that ->close() would do.
> 
> Hmm, I am afraid we cannot assume anything so we have to consider it
> unsafe. A cursory look at some callers shows that they are taking locks.
> E.g. drm_gem_object_put_unlocked might take a mutex. So MMF_OOM_SKIP
> would have to be set right after releasing page tables.

I won't be happy unless handed over section can run in atomic context
(e.g. preempt_disable()/preempt_enable()) because current thread might be
SCHED_IDLE priority.

If current thread is SCHED_IDLE priority, it might be difficult to hand over
because current thread is unlikely able to reach

+	if (oom) {
+		/*
+		 * the exit path is guaranteed to finish without any unbound
+		 * blocking at this stage so make it clear to the caller.
+		 */
+		mm->mmap = NULL;
+		up_write(&mm->mmap_sem);
+	}

before the OOM reaper kernel thread (which is not SCHED_IDLE priority) checks
whether mm->mmap is already NULL.

Honestly, I'm not sure whether current thread (even !SCHED_IDLE priority) can
reach there before the OOM killer checks whether mm->mmap is already NULL, for
current thread has to do more things than the OOM reaper can do.

Also, in the worst case,

+                               /*
+                                * oom_reaper cannot handle mlocked vmas but we
+                                * need to serialize it with munlock_vma_pages_all
+                                * which clears VM_LOCKED, otherwise the oom reaper
+                                * cannot reliably test it.
+                                */
+                               if (oom)
+                                       down_write(&mm->mmap_sem);

would cause the OOM reaper to set MMF_OOM_SKIP without reclaiming any memory
if munlock_vma_pages_all(vma) by current thread did not complete quick enough
to make down_read_trylock(&mm->mmap_sem) attempt by the OOM reaper succeed.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-12 10:59                   ` Tetsuo Handa
@ 2018-09-12 11:22                     ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-09-12 11:22 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Wed 12-09-18 19:59:24, Tetsuo Handa wrote:
> On 2018/09/12 17:17, Michal Hocko wrote:
> > On Wed 12-09-18 16:58:53, Tetsuo Handa wrote:
> >> Michal Hocko wrote:
> >>> OK, I will fold the following to the patch
> >>
> >> OK. But at that point, my patch which tries to wait for reclaimed memory
> >> to be re-allocatable addresses a different problem which you are refusing.
> > 
> > I am trying to address a real world example of when the excessive amount
> > of memory is in page tables. As David pointed, this can happen with some
> > userspace allocators.
> 
> My patch or David's patch will address it as well, without scattering
> down_write(&mm->mmap_sem)/up_write(&mm->mmap_sem) like your attempt.

Based on a timeout. I am trying to fit the fix into an existing retry
logic and it seems this is possible.
 
> >> By the way, is it guaranteed that vma->vm_ops->close(vma) in remove_vma() never
> >> sleeps? Since remove_vma() has might_sleep() since 2005, and that might_sleep()
> >> predates the git history, I don't know what that ->close() would do.
> > 
> > Hmm, I am afraid we cannot assume anything so we have to consider it
> > unsafe. A cursory look at some callers shows that they are taking locks.
> > E.g. drm_gem_object_put_unlocked might take a mutex. So MMF_OOM_SKIP
> > would have to be set right after releasing page tables.
> 
> I won't be happy unless handed over section can run in atomic context
> (e.g. preempt_disable()/preempt_enable()) because current thread might be
> SCHED_IDLE priority.
> 
> If current thread is SCHED_IDLE priority, it might be difficult to hand over
> because current thread is unlikely able to reach
> 
> +	if (oom) {
> +		/*
> +		 * the exit path is guaranteed to finish without any unbound
> +		 * blocking at this stage so make it clear to the caller.
> +		 */
> +		mm->mmap = NULL;
> +		up_write(&mm->mmap_sem);
> +	}
> 
> before the OOM reaper kernel thread (which is not SCHED_IDLE priority) checks
> whether mm->mmap is already NULL.
> 
> Honestly, I'm not sure whether current thread (even !SCHED_IDLE priority) can
> reach there before the OOM killer checks whether mm->mmap is already NULL, for
> current thread has to do more things than the OOM reaper can do.
> 
> Also, in the worst case,
> 
> +                               /*
> +                                * oom_reaper cannot handle mlocked vmas but we
> +                                * need to serialize it with munlock_vma_pages_all
> +                                * which clears VM_LOCKED, otherwise the oom reaper
> +                                * cannot reliably test it.
> +                                */
> +                               if (oom)
> +                                       down_write(&mm->mmap_sem);
> 
> would cause the OOM reaper to set MMF_OOM_SKIP without reclaiming any memory
> if munlock_vma_pages_all(vma) by current thread did not complete quick enough
> to make down_read_trylock(&mm->mmap_sem) attempt by the OOM reaper succeed.

Which is kind of inherent problem, isn't it? Unless we add some sort of
priority inheritance then this will be always the case. And this is by
far not OOM specific. It is the full memory reclaim path. So I really do
not see this as an argument. More importantly though, no timeout based
solution will handle it better.

As I've said countless times. I do not really consider this to be
interesting case until we see that actual real world workloads really
suffer from this problem.

So can we focus on the practical problems please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-12  7:50     ` Michal Hocko
@ 2018-09-12 13:42       ` Michal Hocko
  2018-09-13  2:44         ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-12 13:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Wed 12-09-18 09:50:54, Michal Hocko wrote:
> On Tue 11-09-18 23:01:57, Tetsuo Handa wrote:
> > On 2018/09/10 21:55, Michal Hocko wrote:
> > > This is a very coarse implementation of the idea I've had before.
> > > Please note that I haven't tested it yet. It is mostly to show the
> > > direction I would wish to go for.
> > 
> > Hmm, this patchset does not allow me to boot. ;-)
> > 
> >         free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
> >                         FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > 
> > [    1.875675] sched_clock: Marking stable (1810466565, 65169393)->(1977240380, -101604422)
> > [    1.877833] registered taskstats version 1
> > [    1.877853] Loading compiled-in X.509 certificates
> > [    1.878835] zswap: loaded using pool lzo/zbud
> > [    1.880835] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> 
> This is vm_prev == NULL. I thought we always have vm_prev as long as
> this is not a single VMA in the address space. I will double check this.

So this is me misunderstanding the code. vm_next, vm_prev are not a full
doubly linked list. The first entry doesn't really refer to the last
entry. So the above cannot work at all. We can go around this in two
ways. Either keep the iteration or use the following which should cover
the full mapped range, unless I am missing something

diff --git a/mm/mmap.c b/mm/mmap.c
index 64e8ccce5282..078295344a17 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3105,7 +3105,7 @@ void exit_mmap(struct mm_struct *mm)
 		up_write(&mm->mmap_sem);
 	}
 
-	free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
+	free_pgd_range(&tlb, vma->vm_start, mm->highest_vm_end,
 			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-12 13:42       ` Michal Hocko
@ 2018-09-13  2:44         ` Tetsuo Handa
  2018-09-13  9:09           ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-13  2:44 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On 2018/09/12 22:42, Michal Hocko wrote:
> On Wed 12-09-18 09:50:54, Michal Hocko wrote:
>> On Tue 11-09-18 23:01:57, Tetsuo Handa wrote:
>>> On 2018/09/10 21:55, Michal Hocko wrote:
>>>> This is a very coarse implementation of the idea I've had before.
>>>> Please note that I haven't tested it yet. It is mostly to show the
>>>> direction I would wish to go for.
>>>
>>> Hmm, this patchset does not allow me to boot. ;-)
>>>
>>>         free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
>>>                         FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>>>
>>> [    1.875675] sched_clock: Marking stable (1810466565, 65169393)->(1977240380, -101604422)
>>> [    1.877833] registered taskstats version 1
>>> [    1.877853] Loading compiled-in X.509 certificates
>>> [    1.878835] zswap: loaded using pool lzo/zbud
>>> [    1.880835] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
>>
>> This is vm_prev == NULL. I thought we always have vm_prev as long as
>> this is not a single VMA in the address space. I will double check this.
> 
> So this is me misunderstanding the code. vm_next, vm_prev are not a full
> doubly linked list. The first entry doesn't really refer to the last
> entry. So the above cannot work at all. We can go around this in two
> ways. Either keep the iteration or use the following which should cover
> the full mapped range, unless I am missing something
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 64e8ccce5282..078295344a17 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3105,7 +3105,7 @@ void exit_mmap(struct mm_struct *mm)
>  		up_write(&mm->mmap_sem);
>  	}
>  
> -	free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
> +	free_pgd_range(&tlb, vma->vm_start, mm->highest_vm_end,
>  			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);
>  

This is bad because architectures where hugetlb_free_pgd_range() does
more than free_pgd_range() need to check VM_HUGETLB flag for each "vma".
Thus, I think we need to keep the iteration.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-13  2:44         ` Tetsuo Handa
@ 2018-09-13  9:09           ` Michal Hocko
  2018-09-13 11:20             ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-13  9:09 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Thu 13-09-18 11:44:03, Tetsuo Handa wrote:
> On 2018/09/12 22:42, Michal Hocko wrote:
> > On Wed 12-09-18 09:50:54, Michal Hocko wrote:
> >> On Tue 11-09-18 23:01:57, Tetsuo Handa wrote:
> >>> On 2018/09/10 21:55, Michal Hocko wrote:
> >>>> This is a very coarse implementation of the idea I've had before.
> >>>> Please note that I haven't tested it yet. It is mostly to show the
> >>>> direction I would wish to go for.
> >>>
> >>> Hmm, this patchset does not allow me to boot. ;-)
> >>>
> >>>         free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
> >>>                         FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >>>
> >>> [    1.875675] sched_clock: Marking stable (1810466565, 65169393)->(1977240380, -101604422)
> >>> [    1.877833] registered taskstats version 1
> >>> [    1.877853] Loading compiled-in X.509 certificates
> >>> [    1.878835] zswap: loaded using pool lzo/zbud
> >>> [    1.880835] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> >>
> >> This is vm_prev == NULL. I thought we always have vm_prev as long as
> >> this is not a single VMA in the address space. I will double check this.
> > 
> > So this is me misunderstanding the code. vm_next, vm_prev are not a full
> > doubly linked list. The first entry doesn't really refer to the last
> > entry. So the above cannot work at all. We can go around this in two
> > ways. Either keep the iteration or use the following which should cover
> > the full mapped range, unless I am missing something
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 64e8ccce5282..078295344a17 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3105,7 +3105,7 @@ void exit_mmap(struct mm_struct *mm)
> >  		up_write(&mm->mmap_sem);
> >  	}
> >  
> > -	free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
> > +	free_pgd_range(&tlb, vma->vm_start, mm->highest_vm_end,
> >  			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  
> 
> This is bad because architectures where hugetlb_free_pgd_range() does
> more than free_pgd_range() need to check VM_HUGETLB flag for each "vma".
> Thus, I think we need to keep the iteration.

Fair point. I have looked more closely and most of them simply redirect
to free_pgd_range but ppc and sparc are doing some pretty involved
tricks which we cannot really skip. So I will go and split
free_pgtables into two phases and keep per vma loops. So this
incremental update on top

commit e568c3f34e11c1a7abb4fe6f26e51eb8f60620c3
Author: Michal Hocko <mhocko@suse.com>
Date:   Thu Sep 13 11:08:00 2018 +0200

    fold me "mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish"
    
    - split free_pgtables into unlinking and actual freeing part. We cannot
      rely on free_pgd_range because of hugetlb pages on ppc resp. sparc
      which do their own tear down

diff --git a/mm/internal.h b/mm/internal.h
index 87256ae1bef8..35adbfec4935 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,9 @@ void page_writeback_init(void);
 
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
+void __unlink_vmas(struct vm_area_struct *vma);
+void __free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
+		unsigned long floor, unsigned long ceiling);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..cf910ed5f283 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -612,20 +612,23 @@ void free_pgd_range(struct mmu_gather *tlb,
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+void __unlink_vmas(struct vm_area_struct *vma)
+{
+	while (vma) {
+		unlink_anon_vmas(vma);
+		unlink_file_vma(vma);
+		vma = vma->vm_next;
+	}
+}
+
+/* expects that __unlink_vmas has been called already */
+void __free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unsigned long floor, unsigned long ceiling)
 {
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
 
-		/*
-		 * Hide vma from rmap and truncate_pagecache before freeing
-		 * pgtables
-		 */
-		unlink_anon_vmas(vma);
-		unlink_file_vma(vma);
-
 		if (is_vm_hugetlb_page(vma)) {
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -637,8 +640,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
 				next = vma->vm_next;
-				unlink_anon_vmas(vma);
-				unlink_file_vma(vma);
 			}
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -647,6 +648,13 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 }
 
+void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long floor, unsigned long ceiling)
+{
+	__unlink_vmas(vma);
+	__free_pgtables(tlb, vma, floor, ceiling);
+}
+
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
 	spinlock_t *ptl;
diff --git a/mm/mmap.c b/mm/mmap.c
index 078295344a17..f4b562e21764 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3082,20 +3082,14 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
-	/* oom_reaper cannot race with the page tables teardown */
-	if (oom)
-		down_write(&mm->mmap_sem);
 	/*
-	 * Hide vma from rmap and truncate_pagecache before freeing
-	 * pgtables
+	 * oom_reaper cannot race with the page tables teardown but we
+	 * want to make sure that the exit path can take over the full
+	 * tear down when it is safe to do so
 	 */
-	while (vma) {
-		unlink_anon_vmas(vma);
-		unlink_file_vma(vma);
-		vma = vma->vm_next;
-	}
-	vma = mm->mmap;
 	if (oom) {
+		down_write(&mm->mmap_sem);
+		__unlink_vmas(vma);
 		/*
 		 * the exit path is guaranteed to finish the memory tear down
 		 * without any unbound blocking at this stage so make it clear
@@ -3103,10 +3097,11 @@ void exit_mmap(struct mm_struct *mm)
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
+		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	} else {
+		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	}
 
-	free_pgd_range(&tlb, vma->vm_start, mm->highest_vm_end,
-			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-13  9:09           ` Michal Hocko
@ 2018-09-13 11:20             ` Tetsuo Handa
  2018-09-13 11:35               ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-13 11:20 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On 2018/09/13 18:09, Michal Hocko wrote:
>> This is bad because architectures where hugetlb_free_pgd_range() does
>> more than free_pgd_range() need to check VM_HUGETLB flag for each "vma".
>> Thus, I think we need to keep the iteration.
> 
> Fair point. I have looked more closely and most of them simply redirect
> to free_pgd_range but ppc and sparc are doing some pretty involved
> tricks which we cannot really skip. So I will go and split
> free_pgtables into two phases and keep per vma loops. So this
> incremental update on top

Next question.

        /* Use -1 here to ensure all VMAs in the mm are unmapped */
        unmap_vmas(&tlb, vma, 0, -1);

in exit_mmap() will now race with the OOM reaper. And unmap_vmas() will handle
VM_HUGETLB or VM_PFNMAP or VM_SHARED or !vma_is_anonymous() vmas, won't it?
Then, is it guaranteed to be safe if the OOM reaper raced with unmap_vmas() ?



By the way, there is a potential bug in hugepd_free() in arch/powerpc/mm/hugetlbpage.c

        if (*batchp == NULL) {
                *batchp = (struct hugepd_freelist *)__get_free_page(GFP_ATOMIC);
                (*batchp)->index = 0;
        }

because GFP_ATOMIC allocation might fail if ALLOC_OOM allocations are in progress?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-13 11:20             ` Tetsuo Handa
@ 2018-09-13 11:35               ` Michal Hocko
  2018-09-13 11:53                 ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-13 11:35 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Thu 13-09-18 20:20:12, Tetsuo Handa wrote:
> On 2018/09/13 18:09, Michal Hocko wrote:
> >> This is bad because architectures where hugetlb_free_pgd_range() does
> >> more than free_pgd_range() need to check VM_HUGETLB flag for each "vma".
> >> Thus, I think we need to keep the iteration.
> > 
> > Fair point. I have looked more closely and most of them simply redirect
> > to free_pgd_range but ppc and sparc are doing some pretty involved
> > tricks which we cannot really skip. So I will go and split
> > free_pgtables into two phases and keep per vma loops. So this
> > incremental update on top
> 
> Next question.
> 
>         /* Use -1 here to ensure all VMAs in the mm are unmapped */
>         unmap_vmas(&tlb, vma, 0, -1);
> 
> in exit_mmap() will now race with the OOM reaper. And unmap_vmas() will handle
> VM_HUGETLB or VM_PFNMAP or VM_SHARED or !vma_is_anonymous() vmas, won't it?
> Then, is it guaranteed to be safe if the OOM reaper raced with unmap_vmas() ?

I do not understand the question. unmap_vmas is basically MADV_DONTNEED
and that doesn't require the exclusive mmap_sem lock so yes it should be
safe for those two to race (modulo bugs of course but I am not aware of any
there).
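
To be more specific, the reaper only zaps what MADV_DONTNEED could zap as well;
the core of __oom_reap_task_mm() is roughly the following (paraphrased, with
the non-blocking mmu notifier handling left out):

        struct mmu_gather tlb;

        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                /* skips VM_LOCKED, VM_HUGETLB and VM_PFNMAP vmas */
                if (!can_madv_dontneed_vma(vma))
                        continue;

                /* only private or anonymous mappings are touched */
                if (!vma_is_anonymous(vma) && (vma->vm_flags & VM_SHARED))
                        continue;

                tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
                mmu_notifier_invalidate_range_start(mm, vma->vm_start, vma->vm_end);
                unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
                mmu_notifier_invalidate_range_end(mm, vma->vm_start, vma->vm_end);
                tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
        }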
 
> By the way, there is a potential bug in hugepd_free() in arch/powerpc/mm/hugetlbpage.c
> 
>         if (*batchp == NULL) {
>                 *batchp = (struct hugepd_freelist *)__get_free_page(GFP_ATOMIC);
>                 (*batchp)->index = 0;
>         }
> 
> because GFP_ATOMIC allocation might fail if ALLOC_OOM allocations are in progress?

I am not familiar with that code so I would recommend to ask
maintainers.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-13 11:35               ` Michal Hocko
@ 2018-09-13 11:53                 ` Tetsuo Handa
  2018-09-13 13:40                   ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-13 11:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On 2018/09/13 20:35, Michal Hocko wrote:
>> Next question.
>>
>>         /* Use -1 here to ensure all VMAs in the mm are unmapped */
>>         unmap_vmas(&tlb, vma, 0, -1);
>>
>> in exit_mmap() will now race with the OOM reaper. And unmap_vmas() will handle
>> VM_HUGETLB or VM_PFNMAP or VM_SHARED or !vma_is_anonymous() vmas, won't it?
>> Then, is it guaranteed to be safe if the OOM reaper raced with unmap_vmas() ?
> 
> I do not understand the question. unmap_vmas is basically MADV_DONTNEED
> and that doesn't require the exclusive mmap_sem lock so yes it should be
> safe for those two to race (modulo bugs of course but I am not aware of any
> there).
>  

You need to verify that races we observed with VM_LOCKED can't happen
for VM_HUGETLB / VM_PFNMAP / VM_SHARED / !vma_is_anonymous() cases.

                for (vma = mm->mmap; vma; vma = vma->vm_next) {
                        if (!(vma->vm_flags & VM_LOCKED))
                                continue;
                        /*
                         * oom_reaper cannot handle mlocked vmas but we
                         * need to serialize it with munlock_vma_pages_all
                         * which clears VM_LOCKED, otherwise the oom reaper
                         * cannot reliably test it.
                         */
                        if (oom)
                                down_write(&mm->mmap_sem);

                        munlock_vma_pages_all(vma);

                        if (oom)
                                up_write(&mm->mmap_sem);
                }

Without enough comments, future changes might overlook the assumption.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-13 11:53                 ` Tetsuo Handa
@ 2018-09-13 13:40                   ` Michal Hocko
  2018-09-14 13:54                     ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-13 13:40 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton

On Thu 13-09-18 20:53:24, Tetsuo Handa wrote:
> On 2018/09/13 20:35, Michal Hocko wrote:
> >> Next question.
> >>
> >>         /* Use -1 here to ensure all VMAs in the mm are unmapped */
> >>         unmap_vmas(&tlb, vma, 0, -1);
> >>
> >> in exit_mmap() will now race with the OOM reaper. And unmap_vmas() will handle
> >> VM_HUGETLB or VM_PFNMAP or VM_SHARED or !vma_is_anonymous() vmas, won't it?
> >> Then, is it guaranteed to be safe if the OOM reaper raced with unmap_vmas() ?
> > 
> > I do not understand the question. unmap_vmas is basically MADV_DONTNEED
> > and that doesn't require the exclusive mmap_sem lock so yes it should be
> > safe for those two to race (modulo bugs of course but I am not aware of any
> > there).
> >  
> 
> You need to verify that races we observed with VM_LOCKED can't happen
> for VM_HUGETLB / VM_PFNMAP / VM_SHARED / !vma_is_anonymous() cases.

Well, VM_LOCKED is kind of special because it is not a permanent state;
it might change. VM_HUGETLB, VM_PFNMAP resp. VM_SHARED are not changed
throughout the vma lifetime.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-13 13:40                   ` Michal Hocko
@ 2018-09-14 13:54                     ` Tetsuo Handa
  2018-09-14 14:14                       ` Michal Hocko
  0 siblings, 1 reply; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-14 13:54 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton, David Rientjes

On 2018/09/13 22:40, Michal Hocko wrote:
> On Thu 13-09-18 20:53:24, Tetsuo Handa wrote:
>> On 2018/09/13 20:35, Michal Hocko wrote:
>>>> Next question.
>>>>
>>>>         /* Use -1 here to ensure all VMAs in the mm are unmapped */
>>>>         unmap_vmas(&tlb, vma, 0, -1);
>>>>
>>>> in exit_mmap() will now race with the OOM reaper. And unmap_vmas() will handle
>>>> VM_HUGETLB or VM_PFNMAP or VM_SHARED or !vma_is_anonymous() vmas, won't it?
>>>> Then, is it guaranteed to be safe if the OOM reaper raced with unmap_vmas() ?
>>>
>>> I do not understand the question. unmap_vmas is basically MADV_DONTNEED
>>> and that doesn't require the exclusive mmap_sem lock so yes it should be
>>> safe for those two to race (modulo bugs of course but I am not aware of any
>>> there).
>>>  
>>
>> You need to verify that races we observed with VM_LOCKED can't happen
>> for VM_HUGETLB / VM_PFNMAP / VM_SHARED / !vma_is_anonymous() cases.
> 
> Well, VM_LOCKED is kind of special because it is not a permanent state;
> it might change. VM_HUGETLB, VM_PFNMAP resp. VM_SHARED are not changed
> throughout the vma lifetime.
> 
OK, next question.
Is it guaranteed that arch_exit_mmap(mm) is safe with the OOM reaper?



Well, anyway, diffstat of your proposal would be

 include/linux/oom.h |  2 --
 mm/internal.h       |  3 +++
 mm/memory.c         | 28 ++++++++++++--------
 mm/mmap.c           | 73 +++++++++++++++++++++++++++++++----------------------
 mm/oom_kill.c       | 46 ++++++++++++++++++++++++---------
 5 files changed, 98 insertions(+), 54 deletions(-)

trying to hand over only __free_pgtables() part at the risk of
setting MMF_OOM_SKIP without reclaiming any memory due to dropping
__oom_reap_task_mm() and scattering down_write()/up_write() inside
exit_mmap(), while diffstat of my proposal (not tested yet) would be

 include/linux/mm_types.h |   2 +
 include/linux/oom.h      |   3 +-
 include/linux/sched.h    |   2 +-
 kernel/fork.c            |  11 +++
 mm/mmap.c                |  42 ++++-------
 mm/oom_kill.c            | 182 ++++++++++++++++++++++-------------------------
 6 files changed, 117 insertions(+), 125 deletions(-)

trying to wait until __mmput() completes and also trying to handle
multiple OOM victims in parallel.

You are refusing the timeout based approach, but I don't think we should have
to fret about the possibility of overlooking races/bugs just because you don't
want to use a timeout. And you have never shown that a timeout based approach
cannot work well enough.



Michal's proposal:

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 69864a5..11e26ca 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,8 +95,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
-
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/internal.h b/mm/internal.h
index 87256ae..35adbfe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,9 @@
 
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
+void __unlink_vmas(struct vm_area_struct *vma);
+void __free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
+		unsigned long floor, unsigned long ceiling);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/memory.c b/mm/memory.c
index c467102..cf910ed 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -612,20 +612,23 @@ void free_pgd_range(struct mmu_gather *tlb,
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+void __unlink_vmas(struct vm_area_struct *vma)
+{
+	while (vma) {
+		unlink_anon_vmas(vma);
+		unlink_file_vma(vma);
+		vma = vma->vm_next;
+	}
+}
+
+/* expects that __unlink_vmas has been called already */
+void __free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unsigned long floor, unsigned long ceiling)
 {
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
 
-		/*
-		 * Hide vma from rmap and truncate_pagecache before freeing
-		 * pgtables
-		 */
-		unlink_anon_vmas(vma);
-		unlink_file_vma(vma);
-
 		if (is_vm_hugetlb_page(vma)) {
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -637,8 +640,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
 				next = vma->vm_next;
-				unlink_anon_vmas(vma);
-				unlink_file_vma(vma);
 			}
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -647,6 +648,13 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 }
 
+void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long floor, unsigned long ceiling)
+{
+	__unlink_vmas(vma);
+	__free_pgtables(tlb, vma, floor, ceiling);
+}
+
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
 	spinlock_t *ptl;
diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b1..67bd8a0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3042,40 +3042,26 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	const bool oom = mm_is_oom_victim(mm);
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_sem here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
-		 */
-		(void)__oom_reap_task_mm(mm);
-
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
-
 	if (mm->locked_vm) {
-		vma = mm->mmap;
-		while (vma) {
-			if (vma->vm_flags & VM_LOCKED)
-				munlock_vma_pages_all(vma);
-			vma = vma->vm_next;
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (!(vma->vm_flags & VM_LOCKED))
+				continue;
+			/*
+			 * oom_reaper cannot handle mlocked vmas but we
+			 * need to serialize it with munlock_vma_pages_all
+			 * which clears VM_LOCKED, otherwise the oom reaper
+			 * cannot reliably test it.
+			 */
+			if (oom)
+				down_write(&mm->mmap_sem);
+			munlock_vma_pages_all(vma);
+			if (oom)
+				up_write(&mm->mmap_sem);
 		}
 	}
 
@@ -3091,10 +3077,37 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+
+	/*
+	 * oom_reaper cannot race with the page tables teardown but we
+	 * want to make sure that the exit path can take over the full
+	 * tear down when it is safe to do so
+	 */
+	if (oom) {
+		down_write(&mm->mmap_sem);
+		__unlink_vmas(vma);
+		/*
+		 * the exit path is guaranteed to finish the memory tear down
+		 * without any unbound blocking at this stage so make it clear
+		 * to the oom_reaper
+		 */
+		mm->mmap = NULL;
+		up_write(&mm->mmap_sem);
+		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	} else {
+		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	}
+
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
+	 * Now that the full address space is torn down, make sure the
+	 * OOM killer skips over this task
+	 */
+	if (oom)
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+
+	/*
 	 * Walk the list again, actually closing and freeing it,
 	 * with preemption enabled, without holding any MM locks.
 	 */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..abddcde 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -189,6 +189,16 @@ static bool is_dump_unreclaim_slabs(void)
 	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
 }
 
+/*
+ * Rough memory consumption of the given mm which should be theoretically freed
+ * when the mm is removed.
+ */
+static unsigned long oom_badness_pages(struct mm_struct *mm)
+{
+	return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+		mm_pgtables_bytes(mm) / PAGE_SIZE;
+}
+
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -230,8 +240,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
-	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+	points = oom_badness_pages(p->mm);
 	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
@@ -483,12 +492,11 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
  */
-static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
@@ -501,7 +509,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
 
@@ -532,6 +540,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 		}
 	}
 
+	/*
+	 * If we still sit on a noticeable amount of memory even after successfully
+	 * reaping the address space then keep retrying until exit_mmap makes some
+	 * further progress.
+	 * TODO: add a flag for a stage when the exit path doesn't block anymore
+	 * and hand over MMF_OOM_SKIP handling there in that case
+	 */
+	if (ret && oom_badness_pages(mm) > 1024)
+		ret = false;
+
 	return ret;
 }
 
@@ -551,12 +569,10 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * If the exit path has cleared mm->mmap then we know it will finish
+	 * the tear down and we can bail out here.
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (!mm->mmap) {
 		trace_skip_task_reaping(tsk->pid);
 		goto out_unlock;
 	}
@@ -605,8 +621,14 @@ static void oom_reap_task(struct task_struct *tsk)
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 * Leave setting MMF_OOM_SKIP to the exit path if it has managed to
+	 * reach the point where it is guaranteed to finish without any blocking
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (mm->mmap)
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+		pr_info("oom_reaper: handed over pid:%d (%s) to exit path\n",
+			task_pid_nr(tsk), tsk->comm);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -650,7 +672,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 
 static int __init oom_init(void)
 {
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	kthread_run(oom_reaper, NULL, "oom_reaper");
 	return 0;
 }
 subsys_initcall(oom_init)
-- 
1.8.3.1
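
Condensed, the handover in the patch above hinges on mm->mmap acting as the
signal between exit_mmap() and the oom_reaper: the exit path clears it under
mmap_sem held for write once it can finish without blocking, and only sets
MMF_OOM_SKIP after the teardown is done. An illustrative sketch of the two
sides, simplified from the hunks above (not the literal patch code):

/* exit_mmap() side, after unmap_vmas() has finished: */
if (oom) {					/* oom == mm_is_oom_victim(mm) */
	down_write(&mm->mmap_sem);
	__unlink_vmas(vma);
	mm->mmap = NULL;			/* tell the oom_reaper we take over */
	up_write(&mm->mmap_sem);
	__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
}
/* ... tlb_finish_mmu(&tlb, 0, -1) ... */
if (oom)					/* only after the teardown is done */
	set_bit(MMF_OOM_SKIP, &mm->flags);

/* oom_reaper side, under down_read(&mm->mmap_sem): */
if (!mm->mmap) {				/* exit path owns the teardown now */
	trace_skip_task_reaping(tsk->pid);
	goto out_unlock;
}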



Tetsuo's proposal:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cd2bc93..3c48c08 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -486,6 +486,8 @@ struct mm_struct {
 		atomic_long_t hugetlb_usage;
 #endif
 		struct work_struct async_put_work;
+		unsigned long last_oom_pages;
+		unsigned char oom_stalled_count;
 
 #if IS_ENABLED(CONFIG_HMM)
 		/* HMM needs to track a few things per mm */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 69864a5..8a987c6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,7 +95,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
+void __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -104,6 +104,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
+extern void exit_oom_mm(struct mm_struct *mm);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..efed2ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1175,7 +1175,7 @@ struct task_struct {
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
+	struct list_head		oom_victim;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
index f0b5847..3e662bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -992,7 +992,16 @@ struct mm_struct *mm_alloc(void)
 
 static inline void __mmput(struct mm_struct *mm)
 {
+	const bool oom = IS_ENABLED(CONFIG_MMU) && mm_is_oom_victim(mm);
+
 	VM_BUG_ON(atomic_read(&mm->mm_users));
+	if (oom) {
+		/* Try what the OOM reaper kernel thread can afford. */
+		__oom_reap_task_mm(mm);
+		/* Shut out the OOM reaper kernel thread. */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
 
 	uprobe_clear_state(mm);
 	exit_aio(mm);
@@ -1008,6 +1017,8 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	if (oom)
+		exit_oom_mm(mm);
 	mmdrop(mm);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b1..f1b27f7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3042,41 +3042,27 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	const bool oom = mm_is_oom_victim(mm);
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_sem here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
-		 */
-		(void)__oom_reap_task_mm(mm);
-
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
+	/*
+	 * Retry what the OOM reaper kernel thread can afford, for
+	 * all MMU notifiers are now gone.
+	 */
+	if (oom)
+		__oom_reap_task_mm(mm);
 
 	if (mm->locked_vm) {
-		vma = mm->mmap;
-		while (vma) {
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
 			if (vma->vm_flags & VM_LOCKED)
 				munlock_vma_pages_all(vma);
-			vma = vma->vm_next;
-		}
+		/*
+		 * Retry what the OOM reaper kernel thread can afford, for
+		 * all mlocked vmas are now unlocked.
+		 */
+		if (oom)
+			__oom_reap_task_mm(mm);
 	}
 
 	arch_exit_mmap(mm);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..01fa0d7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -189,6 +189,16 @@ static bool is_dump_unreclaim_slabs(void)
 	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
 }
 
+/*
+ * Rough memory consumption of the given mm which should be theoretically freed
+ * when the mm is removed.
+ */
+static unsigned long oom_badness_pages(struct mm_struct *mm)
+{
+	return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+		mm_pgtables_bytes(mm) / PAGE_SIZE;
+}
+
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -230,8 +240,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
-	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+	points = oom_badness_pages(p->mm);
 	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
@@ -483,25 +492,15 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
  */
-static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+void __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
 
@@ -523,7 +522,6 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 			tlb_gather_mmu(&tlb, mm, start, end);
 			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
 				tlb_finish_mmu(&tlb, start, end);
-				ret = false;
 				continue;
 			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
@@ -531,118 +529,110 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
-
-	return ret;
 }
 
 /*
  * Reaps the address space of the give task.
- *
- * Returns true on success and false if none or part of the address space
- * has been reclaimed and the caller should retry later.
  */
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	bool ret = true;
-
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		trace_skip_task_reaping(tsk->pid);
-		return false;
-	}
-
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * Reaping operation needs mmap_sem held for read. Also, the check for
+	 * mm_users must run under mmap_sem for reading because it serializes
+	 * against the down_write()/up_write() cycle in __mmput().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		trace_skip_task_reaping(tsk->pid);
-		goto out_unlock;
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		/* Do nothing if already in __mmput() */
+		if (atomic_read(&mm->mm_users))
+			trace_skip_task_reaping(tsk->pid);
+		return;
+	}
+	/* Do nothing if already in __mmput() */
+	if (atomic_read(&mm->mm_users)) {
+		trace_start_task_reaping(tsk->pid);
+		__oom_reap_task_mm(mm);
+		trace_finish_task_reaping(tsk->pid);
 	}
-
-	trace_start_task_reaping(tsk->pid);
-
-	/* failed to reap part of the address space. Try again later */
-	ret = __oom_reap_task_mm(mm);
-	if (!ret)
-		goto out_finish;
-
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-			task_pid_nr(tsk), tsk->comm,
-			K(get_mm_counter(mm, MM_ANONPAGES)),
-			K(get_mm_counter(mm, MM_FILEPAGES)),
-			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-out_finish:
-	trace_finish_task_reaping(tsk->pid);
-out_unlock:
 	up_read(&mm->mmap_sem);
-
-	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
-	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
-
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
-
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	tsk->oom_reaper_list = NULL;
-
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
+	unsigned long pages;
+
+	oom_reap_task_mm(tsk, mm);
+	pages = oom_badness_pages(mm);
+	/* Hide this mm from OOM killer if stalled for too long. */
+	if (mm->last_oom_pages > pages) {
+		mm->last_oom_pages = pages;
+		mm->oom_stalled_count = 0;
+	} else if (mm->oom_stalled_count++ > 10) {
+		pr_info("oom_reaper: gave up process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+			task_pid_nr(tsk), tsk->comm,
+			K(get_mm_counter(mm, MM_ANONPAGES)),
+			K(get_mm_counter(mm, MM_FILEPAGES)),
+			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+	}
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct task_struct *tsk;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		if (!list_empty(&oom_reaper_list))
+			schedule_timeout_uninterruptible(HZ / 10);
+		else
+			wait_event_freezable(oom_reaper_wait,
+					     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		list_for_each_entry(tsk, &oom_reaper_list, oom_victim) {
+			get_task_struct(tsk);
+			spin_unlock(&oom_reaper_lock);
+			oom_reap_task(tsk);
+			spin_lock(&oom_reaper_lock);
+			put_task_struct(tsk);
 		}
 		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
 	}
-
 	return 0;
 }
 
+void exit_oom_mm(struct mm_struct *mm)
+{
+	struct task_struct *tsk;
+
+	spin_lock(&oom_reaper_lock);
+	list_for_each_entry(tsk, &oom_reaper_list, oom_victim)
+		if (tsk->signal->oom_mm == mm)
+			break;
+	list_del(&tsk->oom_victim);
+	spin_unlock(&oom_reaper_lock);
+	/* Drop a reference taken by wake_oom_reaper(). */
+	put_task_struct(tsk);
+}
+
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	struct mm_struct *mm = tsk->signal->oom_mm;
+
+	/* There is no point in processing the same mm twice. */
+	if (test_bit(MMF_UNSTABLE, &mm->flags))
 		return;
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
 
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add_tail(&tsk->oom_victim, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
@@ -650,7 +640,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 
 static int __init oom_init(void)
 {
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	kthread_run(oom_reaper, NULL, "oom_reaper");
 	return 0;
 }
 subsys_initcall(oom_init)
@@ -681,8 +671,10 @@ static void mark_oom_victim(struct task_struct *tsk)
 
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
-		mmgrab(tsk->signal->oom_mm);
+		mmgrab(mm);
 		set_bit(MMF_OOM_VICTIM, &mm->flags);
+		mm->last_oom_pages = oom_badness_pages(mm);
+		mm->oom_stalled_count = 0;
 	}
 
 	/*
-- 
1.8.3.1
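
Condensed, the retry policy in the proposal above is a per-mm stall counter:
the reaper revisits each queued victim roughly every 100ms and only sets
MMF_OOM_SKIP once oom_badness_pages() has stopped shrinking for more than ten
passes. An illustrative sketch, simplified from oom_reap_task() in the diff
above (the pr_info() reporting is omitted):

static void oom_reap_task(struct task_struct *tsk)
{
	struct mm_struct *mm = tsk->signal->oom_mm;
	unsigned long pages;

	oom_reap_task_mm(tsk, mm);	/* best-effort reap under mmap_sem read */
	pages = oom_badness_pages(mm);
	if (mm->last_oom_pages > pages) {
		/* Still making progress; remember it and reset the counter. */
		mm->last_oom_pages = pages;
		mm->oom_stalled_count = 0;
	} else if (mm->oom_stalled_count++ > 10) {
		/* No progress for more than 10 passes (about a second); give up. */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}
}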


* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-14 13:54                     ` Tetsuo Handa
@ 2018-09-14 14:14                       ` Michal Hocko
  2018-09-14 17:07                         ` Tetsuo Handa
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-14 14:14 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Roman Gushchin, Andrew Morton, David Rientjes

On Fri 14-09-18 22:54:45, Tetsuo Handa wrote:
> On 2018/09/13 22:40, Michal Hocko wrote:
> > On Thu 13-09-18 20:53:24, Tetsuo Handa wrote:
> >> On 2018/09/13 20:35, Michal Hocko wrote:
> >>>> Next question.
> >>>>
> >>>>         /* Use -1 here to ensure all VMAs in the mm are unmapped */
> >>>>         unmap_vmas(&tlb, vma, 0, -1);
> >>>>
> >>>> in exit_mmap() will now race with the OOM reaper. And unmap_vmas() will handle
> >>>> VM_HUGETLB or VM_PFNMAP or VM_SHARED or !vma_is_anonymous() vmas, won't it?
> >>>> Then, is it guaranteed to be safe if the OOM reaper raced with unmap_vmas() ?
> >>>
> >>> I do not understand the question. unmap_vmas is basically MADV_DONTNEED
> >>> and that doesn't require the exclusive mmap_sem lock, so yes, it should be
> >>> safe for those two to race (modulo bugs of course, but I am not aware of
> >>> any there).
> >>>  
> >>
> >> You need to verify that races we observed with VM_LOCKED can't happen
> >> for VM_HUGETLB / VM_PFNMAP / VM_SHARED / !vma_is_anonymous() cases.
> > 
> > Well, VM_LOCKED is kind of special because it is not a permanent state;
> > it might change. VM_HUGETLB, VM_PFNMAP and VM_SHARED are not changed
> > throughout the vma lifetime.
> > 
> OK, next question.
> Is it guaranteed that arch_exit_mmap(mm) is safe with the OOM reaper?

I do not see any obvious problem, and we used to allow the exit and
oom_reaper paths to race over unmapping before we had to handle mlocked
vmas specially.
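
For reference, the mlocked-vma handling being referred to is why exit_mmap()
in the patch earlier in the thread briefly takes mmap_sem for write around
each munlock, while the reaper simply skips VM_LOCKED vmas. Roughly, as an
illustrative sketch of the two sides (simplified from those hunks):

/* exit_mmap(): VM_LOCKED is about to be cleared, so serialize with the
 * reaper, which tests it under mmap_sem held for read. */
if (oom)
	down_write(&mm->mmap_sem);
munlock_vma_pages_all(vma);		/* clears VM_LOCKED */
if (oom)
	up_write(&mm->mmap_sem);

/* __oom_reap_task_mm(): mlocked vmas are skipped entirely. */
if (!can_madv_dontneed_vma(vma))	/* skips VM_LOCKED (and hugetlb/pfnmap) vmas */
	continue;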
 
> Well, anyway, diffstat of your proposal would be
> 
>  include/linux/oom.h |  2 --
>  mm/internal.h       |  3 +++
>  mm/memory.c         | 28 ++++++++++++--------
>  mm/mmap.c           | 73 +++++++++++++++++++++++++++++++----------------------
>  mm/oom_kill.c       | 46 ++++++++++++++++++++++++---------
>  5 files changed, 98 insertions(+), 54 deletions(-)
> 
> trying to hand over only the __free_pgtables() part at the risk of
> setting MMF_OOM_SKIP without reclaiming any memory due to dropping
> __oom_reap_task_mm() and scattering down_write()/up_write() inside
> exit_mmap(), while the diffstat of my proposal (not tested yet) would be
> 
>  include/linux/mm_types.h |   2 +
>  include/linux/oom.h      |   3 +-
>  include/linux/sched.h    |   2 +-
>  kernel/fork.c            |  11 +++
>  mm/mmap.c                |  42 ++++-------
>  mm/oom_kill.c            | 182 ++++++++++++++++++++++-------------------------
>  6 files changed, 117 insertions(+), 125 deletions(-)
> 
> trying to wait until __mmput() completes and also trying to handle
> multiple OOM victims in parallel.
> 
> You are refusing the timeout based approach, but I don't think we have to
> risk overlooking races/bugs just because you don't want to use a timeout.
> And you have never shown that a timeout based approach cannot work well
> enough.

I have tried to explain why I do not like the timeout based approach
several times already and I am getting fed up repeating it over and over
again.  The main point, though, is that we know _what_ we are waiting for
and _how_ we are synchronizing the different parts, rather than waiting
for some time and hoping that something happens.

Moreover, we have a backoff mechanism. The new class of oom victims
with a large amount of memory in page tables can fit into that
model. The new model adds a few more branches to the exit path, so if this
is acceptable to the other mm developers then I think it is much
preferable to adding a different retry mechanism.
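
For reference, the backoff being referred to is the retry check added to
__oom_reap_task_mm() earlier in the thread; roughly, as a sketch (the
1024-page threshold corresponds to about 4MB with 4kB pages):

	/*
	 * A successful pass can still leave a victim sitting on a lot of
	 * memory (e.g. in page tables); report failure so the reaper
	 * retries instead of setting MMF_OOM_SKIP right away.
	 */
	if (ret && oom_badness_pages(mm) > 1024)
		ret = false;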
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover
  2018-09-14 14:14                       ` Michal Hocko
@ 2018-09-14 17:07                         ` Tetsuo Handa
  0 siblings, 0 replies; 31+ messages in thread
From: Tetsuo Handa @ 2018-09-14 17:07 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Roman Gushchin, Andrew Morton, David Rientjes

On 2018/09/14 23:14, Michal Hocko wrote:
> On Fri 14-09-18 22:54:45, Tetsuo Handa wrote:
>> OK, next question.
>> Is it guaranteed that arch_exit_mmap(mm) is safe with the OOM reaper?
> 
> I do not see any obvious problem, and we used to allow the exit and
> oom_reaper paths to race over unmapping before we had to handle mlocked
> vmas specially.

Although we used to allow arch_exit_mmap() to race, it might simply be that
we hit the mlock() problem first. I want "clearly no problem".



>> Well, anyway, diffstat of your proposal would be
>>
>>  include/linux/oom.h |  2 --
>>  mm/internal.h       |  3 +++
>>  mm/memory.c         | 28 ++++++++++++--------
>>  mm/mmap.c           | 73 +++++++++++++++++++++++++++++++----------------------
>>  mm/oom_kill.c       | 46 ++++++++++++++++++++++++---------
>>  5 files changed, 98 insertions(+), 54 deletions(-)
>>
>> trying to hand over only the __free_pgtables() part at the risk of
>> setting MMF_OOM_SKIP without reclaiming any memory due to dropping
>> __oom_reap_task_mm() and scattering down_write()/up_write() inside
>> exit_mmap(), while the diffstat of my proposal (not tested yet) would be
>>
>>  include/linux/mm_types.h |   2 +
>>  include/linux/oom.h      |   3 +-
>>  include/linux/sched.h    |   2 +-
>>  kernel/fork.c            |  11 +++
>>  mm/mmap.c                |  42 ++++-------
>>  mm/oom_kill.c            | 182 ++++++++++++++++++++++-------------------------
>>  6 files changed, 117 insertions(+), 125 deletions(-)
>>
>> trying to wait until __mmput() completes and also trying to handle
>> multiple OOM victims in parallel.

Below is the fix-up for my proposal. It seems to be working well enough.

 include/linux/oom.h |  1 -
 kernel/fork.c       |  2 +-
 mm/oom_kill.c       | 30 ++++++++++++------------------
 3 files changed, 13 insertions(+), 20 deletions(-)



>>
>> You are refusing the timeout based approach, but I don't think we have to
>> risk overlooking races/bugs just because you don't want to use a timeout.
>> And you have never shown that a timeout based approach cannot work well
>> enough.
> 
> I have tried to explain why I do not like the timeout based approach
> several times already and I am getting fed up repeating it over and over
> again.  The main point, though, is that we know _what_ we are waiting for
> and _how_ we are synchronizing the different parts, rather than waiting
> for some time and hoping that something happens.

At the risk of overlooking bugs. Quite few people are checking for the
possibility of OOM lockups, which makes your aggressive approach a dangerous
one to take.

> 
> Moreover, we have a backoff mechanism. The new class of oom victims
> with a large amount of memory in page tables can fit into that
> model. The new model adds a few more branches to the exit path, so if this
> is acceptable to the other mm developers then I think it is much
> preferable to adding a different retry mechanism.
> 

These "few more branches" have to be "clearly no problem" rather than just
"passed some stress tests". And so far there has been no response from other
mm developers.






diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a987c6..9d30c15 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -104,7 +104,6 @@ extern unsigned long oom_badness(struct task_struct *p,
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
-extern void exit_oom_mm(struct mm_struct *mm);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3e662bb..5c32791 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1018,7 +1018,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	if (oom)
-		exit_oom_mm(mm);
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 	mmdrop(mm);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 01fa0d7..cff41fa 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -561,6 +561,14 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 	unsigned long pages;
 
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		spin_lock(&oom_reaper_lock);
+		list_del(&tsk->oom_victim);
+		spin_unlock(&oom_reaper_lock);
+		/* Drop a reference taken by wake_oom_reaper(). */
+		put_task_struct(tsk);
+		return;
+	}
 	oom_reap_task_mm(tsk, mm);
 	pages = oom_badness_pages(mm);
 	/* Hide this mm from OOM killer if stalled for too long. */
@@ -581,6 +589,7 @@ static int oom_reaper(void *unused)
 {
 	while (true) {
 		struct task_struct *tsk;
+		struct task_struct *tmp;
 
 		if (!list_empty(&oom_reaper_list))
 			schedule_timeout_uninterruptible(HZ / 10);
@@ -588,32 +597,17 @@ static int oom_reaper(void *unused)
 			wait_event_freezable(oom_reaper_wait,
 					     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		list_for_each_entry(tsk, &oom_reaper_list, oom_victim) {
-			get_task_struct(tsk);
+		list_for_each_entry_safe(tsk, tmp, &oom_reaper_list,
+					 oom_victim) {
 			spin_unlock(&oom_reaper_lock);
 			oom_reap_task(tsk);
 			spin_lock(&oom_reaper_lock);
-			put_task_struct(tsk);
 		}
 		spin_unlock(&oom_reaper_lock);
 	}
 	return 0;
 }
 
-void exit_oom_mm(struct mm_struct *mm)
-{
-	struct task_struct *tsk;
-
-	spin_lock(&oom_reaper_lock);
-	list_for_each_entry(tsk, &oom_reaper_list, oom_victim)
-		if (tsk->signal->oom_mm == mm)
-			break;
-	list_del(&tsk->oom_victim);
-	spin_unlock(&oom_reaper_lock);
-	/* Drop a reference taken by wake_oom_reaper(). */
-	put_task_struct(tsk);
-}
-
 static void wake_oom_reaper(struct task_struct *tsk)
 {
 	struct mm_struct *mm = tsk->signal->oom_mm;
@@ -632,7 +626,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	list_add_tail(&tsk->oom_victim, &oom_reaper_list);
+	list_add(&tsk->oom_victim, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
-- 
1.8.3.1
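
With this fix-up applied, the exit side of the proposal reduces to roughly
the following sketch, combining the two diffs above (the elided middle is
the unchanged __mmput() body):

static inline void __mmput(struct mm_struct *mm)
{
	const bool oom = IS_ENABLED(CONFIG_MMU) && mm_is_oom_victim(mm);

	VM_BUG_ON(atomic_read(&mm->mm_users));
	if (oom) {
		/* Try what the OOM reaper kernel thread can afford. */
		__oom_reap_task_mm(mm);
		/* Shut out the OOM reaper kernel thread. */
		down_write(&mm->mmap_sem);
		up_write(&mm->mmap_sem);
	}
	/* ... uprobe_clear_state(), exit_aio(), exit_mmap(), etc. ... */
	if (oom)
		/* The reaper will drop this victim from its list. */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	mmdrop(mm);
}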


Thread overview: 31+ messages
2018-09-08  4:54 [PATCH v2] mm, oom: Fix unnecessary killing of additional processes Tetsuo Handa
2018-09-10  9:54 ` Michal Hocko
2018-09-10 11:27   ` Tetsuo Handa
2018-09-10 11:40     ` Michal Hocko
2018-09-10 12:52       ` Tetsuo Handa
2018-09-10 12:55 ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Michal Hocko
2018-09-10 12:55   ` [RFC PATCH 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
2018-09-10 12:55   ` [RFC PATCH 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left Michal Hocko
2018-09-10 12:55   ` [RFC PATCH 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish Michal Hocko
2018-09-10 14:59   ` [RFC PATCH 0/3] rework mmap-exit vs. oom_reaper handover Tetsuo Handa
2018-09-10 15:11     ` Michal Hocko
2018-09-10 15:40       ` Tetsuo Handa
2018-09-10 16:44         ` Michal Hocko
2018-09-12  3:06           ` Tetsuo Handa
2018-09-12  7:18             ` Michal Hocko
2018-09-12  7:58               ` Tetsuo Handa
2018-09-12  8:17                 ` Michal Hocko
2018-09-12 10:59                   ` Tetsuo Handa
2018-09-12 11:22                     ` Michal Hocko
2018-09-11 14:01   ` Tetsuo Handa
2018-09-12  7:50     ` Michal Hocko
2018-09-12 13:42       ` Michal Hocko
2018-09-13  2:44         ` Tetsuo Handa
2018-09-13  9:09           ` Michal Hocko
2018-09-13 11:20             ` Tetsuo Handa
2018-09-13 11:35               ` Michal Hocko
2018-09-13 11:53                 ` Tetsuo Handa
2018-09-13 13:40                   ` Michal Hocko
2018-09-14 13:54                     ` Tetsuo Handa
2018-09-14 14:14                       ` Michal Hocko
2018-09-14 17:07                         ` Tetsuo Handa
