* [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
@ 2018-07-03 14:25 Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: torvalds, Tetsuo Handa

This series provides

  (1) Mitigation and a fix for CVE-2016-10723.

  (2) A mitigation for needlessly selecting a next OOM victim, a problem
      reported by David Rientjes (plain timeout based mitigations were
      rejected by Michal Hocko).

  (3) A preparation for handling many concurrent OOM victims, which
      could become a reality once the memcg-aware OOM killer is
      introduced.

Tetsuo Handa (7):
  mm,oom: Don't call schedule_timeout_killable() with oom_lock held.
  mm,oom: Check pending victims earlier in out_of_memory().
  mm,oom: Fix unnecessary killing of additional processes.
  mm,page_alloc: Make oom_reserves_allowed() even.
  mm,oom: Bring OOM notifier to outside of oom_lock.
  mm,oom: Make oom_lock static variable.
  mm,oom: Do not sleep with oom_lock held.
Michal Hocko (1):
  mm,page_alloc: Move the short sleep to should_reclaim_retry().

 drivers/tty/sysrq.c        |   2 -
 include/linux/memcontrol.h |   9 +-
 include/linux/oom.h        |   6 +-
 include/linux/sched.h      |   7 +-
 include/trace/events/oom.h |  64 -------
 kernel/fork.c              |   2 +
 mm/memcontrol.c            |  24 +--
 mm/mmap.c                  |  17 +-
 mm/oom_kill.c              | 439 +++++++++++++++++++++------------------------
 mm/page_alloc.c            | 134 ++++++--------
 10 files changed, 287 insertions(+), 417 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 14:38   ` Michal Hocko
  2018-07-03 14:25 ` [PATCH 2/8] mm,oom: Check pending victims earlier in out_of_memory() Tetsuo Handa
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

When I was examining a bug which occurs under CPU + memory pressure, I
observed that a thread which called out_of_memory() can sleep for minutes
at schedule_timeout_killable(1) with oom_lock held when many threads are
doing direct reclaim.

The whole point of the sleep is to give the OOM victim some time to exit.
But since commit 27ae357fa82be5ab ("mm, oom: fix concurrent munlock and
oom reaper unmap, v3") changed the OOM victim to wait for oom_lock in
order to close the race window at exit_mmap(), that purpose is now
defeated. We need to make sure that the thread which called
out_of_memory() releases oom_lock shortly. Therefore, this patch moves
the sleep outside of the OOM path.
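
As an illustration only (a userspace pthread analogy with made-up names,
not the kernel code), the resulting pattern is to sleep after dropping
the lock instead of while holding it:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

static void *allocator(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100; i++) {
		if (pthread_mutex_trylock(&oom_lock)) {
			/* Somebody else is handling the OOM; retry. */
			sched_yield();
			continue;
		}
		/* ... select and kill a victim here ... */
		pthread_mutex_unlock(&oom_lock);
		/*
		 * Sleep only after dropping the lock, so that other
		 * contenders are not stuck behind our sleep.
		 */
		usleep(1000);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, allocator, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	puts("done");
	return 0;
}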

Although the sleep is removed entirely by the last patch in this series,
this patch is kept separate for ease of backporting to stable kernels,
which are waiting for patches that can mitigate CVE-2016-10723.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2016-10723
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/oom_kill.c   | 38 +++++++++++++++++---------------------
 mm/page_alloc.c |  7 ++++++-
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e7..d3fb4e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,6 +479,21 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+/*
+ * We have to make sure not to cause premature new oom victim selection.
+ *
+ * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
+ *   mutex_trylock(&oom_lock)
+ *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+ *                               unmap_page_range() # frees some memory
+ *                               set_bit(MMF_OOM_SKIP)
+ *   out_of_memory()
+ *     select_bad_process()
+ *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+ *   mutex_unlock(&oom_lock)
+ *
+ * Therefore, the callers hold oom_lock when calling this function.
+ */
 void __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -523,20 +538,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
@@ -1077,15 +1078,9 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
-	}
 	return !!oc->chosen;
 }
 
@@ -1111,4 +1106,5 @@ void pagefault_out_of_memory(void)
 		return;
 	out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+	schedule_timeout_killable(1);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100..6205d34 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3468,7 +3468,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
@@ -4244,6 +4243,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
+		/*
+		 * This schedule_timeout_*() serves as a guaranteed sleep for
+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
+		 */
+		if (!tsk_is_oom_victim(current))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/8] mm,oom: Check pending victims earlier in out_of_memory().
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes Tetsuo Handa
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

The "mm, oom: cgroup-aware OOM killer" patchset is trying to introduce
INFLIGHT_VICTIM in order to replace open-coded ((void *)-1UL). But
(regarding CONFIG_MMU=y case) we have a list of inflight OOM victim
threads which are connected to oom_reaper_list. Thus we can check
whether there are inflight OOM victims before starting process/memcg
list traversal. Since it is likely that only few threads are linked to
oom_reaper_list, checking all victims' OOM domain will not matter.

Thus, check whether there are inflight OOM victims before starting
process/memcg list traversal and eliminate the "abort" path.

Note that this patch could temporarily regress CONFIG_MMU=n kernels
because this patch selects same victims rather than waits for victims
if CONFIG_MMU=n. This will be fixed by the next patch in this series.
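
For illustration, here is a kernel-style sketch (stand-alone, with
made-up names, not the code added by this patch) of a list_head based
pending-victim check; the result is tracked in an explicit boolean:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/bitops.h>
#include <linux/types.h>

struct victim {
	struct list_head node;		/* like task_struct.oom_victim_list */
	unsigned long flags;		/* bit 0 plays the role of MMF_OOM_SKIP */
};

static LIST_HEAD(victim_list);		/* like oom_victim_list */
static DEFINE_SPINLOCK(victim_lock);	/* like oom_reaper_lock */

static void queue_victim(struct victim *v)
{
	spin_lock(&victim_lock);
	list_add_tail(&v->node, &victim_list);
	spin_unlock(&victim_lock);
}

static bool has_pending_victim(void)
{
	struct victim *v;
	bool pending = false;

	spin_lock(&victim_lock);
	list_for_each_entry(v, &victim_list, node) {
		if (!test_bit(0, &v->flags)) {
			pending = true;
			break;
		}
	}
	spin_unlock(&victim_lock);
	return pending;
}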

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h |   9 ++--
 include/linux/sched.h      |   2 +-
 mm/memcontrol.c            |  18 +++-----
 mm/oom_kill.c              | 103 +++++++++++++++++++++++++--------------------
 4 files changed, 67 insertions(+), 65 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c6fb11..a82360a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -382,8 +382,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
-int mem_cgroup_scan_tasks(struct mem_cgroup *,
-			  int (*)(struct task_struct *, void *), void *);
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+			   void (*fn)(struct task_struct *, void *), void *arg);
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
@@ -850,10 +850,9 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
-static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
-		int (*fn)(struct task_struct *, void *), void *arg)
+static inline void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+		void (*fn)(struct task_struct *, void *), void *arg)
 {
-	return 0;
 }
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9256118..d56ae68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1163,7 +1163,7 @@ struct task_struct {
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
+	struct list_head		oom_victim_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5e..c8a75c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -884,17 +884,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * @arg: argument passed to @fn
  *
  * This function iterates over tasks attached to @memcg or to any of its
- * descendants and calls @fn for each task. If @fn returns a non-zero
- * value, the function breaks the iteration loop and returns the value.
- * Otherwise, it will iterate over all tasks and return 0.
+ * descendants and calls @fn for each task.
  *
  * This function must not be called for the root memory cgroup.
  */
-int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
-			  int (*fn)(struct task_struct *, void *), void *arg)
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+			   void (*fn)(struct task_struct *, void *), void *arg)
 {
 	struct mem_cgroup *iter;
-	int ret = 0;
 
 	BUG_ON(memcg == root_mem_cgroup);
 
@@ -903,15 +900,10 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct task_struct *task;
 
 		css_task_iter_start(&iter->css, 0, &it);
-		while (!ret && (task = css_task_iter_next(&it)))
-			ret = fn(task, arg);
+		while ((task = css_task_iter_next(&it)))
+			fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
-			mem_cgroup_iter_break(memcg, iter);
-			break;
-		}
 	}
-	return ret;
 }
 
 /**
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d3fb4e4..f58281e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -304,25 +304,13 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static void oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
 
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
-		goto next;
-
-	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves unless
-	 * the task has MMF_OOM_SKIP because chances that it would release
-	 * any memory is quite low.
-	 */
-	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
-			goto next;
-		goto abort;
-	}
+		return;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -335,29 +323,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 
 	points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
 	if (!points || points < oc->chosen_points)
-		goto next;
+		return;
 
 	/* Prefer thread group leaders for display purposes */
 	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
-		goto next;
+		return;
 select:
 	if (oc->chosen)
 		put_task_struct(oc->chosen);
 	get_task_struct(task);
 	oc->chosen = task;
 	oc->chosen_points = points;
-next:
-	return 0;
-abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
-	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -368,8 +349,7 @@ static void select_bad_process(struct oom_control *oc)
 
 		rcu_read_lock();
 		for_each_process(p)
-			if (oom_evaluate_task(p, oc))
-				break;
+			oom_evaluate_task(p, oc);
 		rcu_read_unlock();
 	}
 
@@ -476,7 +456,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_victim_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 /*
@@ -488,7 +468,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  *                               unmap_page_range() # frees some memory
  *                               set_bit(MMF_OOM_SKIP)
  *   out_of_memory()
- *     select_bad_process()
+ *     oom_has_pending_victims()
  *       test_bit(MMF_OOM_SKIP) # selects new oom victim
  *   mutex_unlock(&oom_lock)
  *
@@ -606,14 +586,16 @@ static void oom_reap_task(struct task_struct *tsk)
 	debug_show_all_locks();
 
 done:
-	tsk->oom_reaper_list = NULL;
-
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 
+	spin_lock(&oom_reaper_lock);
+	list_del(&tsk->oom_victim_list);
+	spin_unlock(&oom_reaper_lock);
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -623,12 +605,13 @@ static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_victim_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
-		}
+		if (!list_empty(&oom_victim_list))
+			tsk = list_first_entry(&oom_victim_list,
+					       struct task_struct,
+					       oom_victim_list);
 		spin_unlock(&oom_reaper_lock);
 
 		if (tsk)
@@ -640,15 +623,11 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	if (tsk->oom_victim_list.next)
 		return;
-
 	get_task_struct(tsk);
-
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add_tail(&tsk->oom_victim_list, &oom_victim_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
@@ -1010,6 +989,34 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+static bool oom_has_pending_victims(struct oom_control *oc)
+{
+#ifdef CONFIG_MMU
+	struct task_struct *p;
+
+	if (is_sysrq_oom(oc))
+		return false;
+	/*
+	 * Since oom_reap_task_mm()/exit_mmap() will set MMF_OOM_SKIP, let's
+	 * wait for pending victims until MMF_OOM_SKIP is set.
+	 */
+	spin_lock(&oom_reaper_lock);
+	list_for_each_entry(p, &oom_victim_list, oom_victim_list)
+		if (!oom_unkillable_task(p, oc->memcg, oc->nodemask) &&
+		    !test_bit(MMF_OOM_SKIP, &p->signal->oom_mm->flags))
+			break;
+	spin_unlock(&oom_reaper_lock);
+	return p != NULL;
+#else
+	/*
+	 * Since nobody except oom_kill_process() sets MMF_OOM_SKIP, waiting
+	 * for pending victims until MMF_OOM_SKIP is set is useless. Therefore,
+	 * simply let the OOM killer select pending victims again.
+	 */
+	return false;
+#endif
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1063,6 +1070,9 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint);
 
+	if (oom_has_pending_victims(oc))
+		return true;
+
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
@@ -1074,14 +1084,15 @@ bool out_of_memory(struct oom_control *oc)
 
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen) {
+		if (is_sysrq_oom(oc) || is_memcg_oom(oc))
+			return false;
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL)
-		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
-				 "Memory cgroup out of memory");
-	return !!oc->chosen;
+	oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
+			 "Memory cgroup out of memory");
+	return true;
 }
 
 /*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes.
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 2/8] mm,oom: Check pending victims earlier in out_of_memory() Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 14:58   ` Michal Hocko
  2018-07-03 14:25 ` [PATCH 4/8] mm,page_alloc: Make oom_reserves_allowed() even Tetsuo Handa
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

David Rientjes is complaining that memcg OOM events needlessly select
more OOM victims when the OOM reaper is unable to reclaim memory. This
is because exit_mmap() sets MMF_OOM_SKIP before calling free_pgtables().
While David is trying to introduce a timeout based hold off approach,
Michal Hocko is rejecting plain timeout based approaches.

Therefore, this patch gets rid of the OOM reaper kernel thread and
introduces an OOM-badness-score based hold off approach. The reason for
getting rid of the OOM reaper kernel thread is explained below.

    We are about to start getting a lot of OOM victim processes by
    introducing the "mm, oom: cgroup-aware OOM killer" patchset.

    When there are multiple OOM victims, we should try reclaiming memory
    in parallel rather than sequentially waiting until the conditions for
    reclaiming memory are met. Also, we should preferentially reclaim from
    the OOM domains which the currently allocating threads belong to.
    Also, we want to get rid of the schedule_timeout_*(1) heuristic which
    is used for trying to give CPU resource to the owner of oom_lock.
    Therefore, direct OOM reaping by the allocating threads can do the job
    better than the OOM reaper kernel thread.

This patch changes the OOM killer to wait until either __mmput()
completes or the OOM badness score has not decreased for 3 seconds.
This patch also changes the meaning of MMF_OOM_SKIP from "start
selecting the next OOM victim" to "stop calling oom_reap_mm() from
direct OOM reaping". In order to allow direct OOM reaping, counter
variables which remember "the current OOM badness score", "when the OOM
badness score was compared for the last time" and "how many times the
OOM badness score did not decrease" are added to "struct task_struct".

Note that the four tracepoints which were used for recording the OOM
reaper are removed, because the OOM reaper kernel thread is removed and
concurrently executed direct OOM reaping could generate too many trace
logs.

As a side note, the potential regression for CONFIG_MMU=n kernels is
fixed by doing the same things except oom_reap_mm(), for there is now no
difference between CONFIG_MMU=y and CONFIG_MMU=n kernels other than that
CONFIG_MMU=n kernels cannot implement oom_reap_mm(). As a bonus,
CONFIG_MMU=n kernels save the memory of one kernel thread.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/oom.h        |   3 +-
 include/linux/sched.h      |   5 +-
 include/trace/events/oom.h |  64 ----------
 kernel/fork.c              |   2 +
 mm/mmap.c                  |  17 +--
 mm/oom_kill.c              | 299 ++++++++++++++++-----------------------------
 6 files changed, 116 insertions(+), 274 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac11..eab409f 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-void __oom_reap_task_mm(struct mm_struct *mm);
+extern void oom_reap_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -104,6 +104,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
+extern void exit_oom_mm(struct mm_struct *mm);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d56ae68..dc47976 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1162,9 +1162,10 @@ struct task_struct {
 	unsigned long			task_state_change;
 #endif
 	int				pagefault_disabled;
-#ifdef CONFIG_MMU
 	struct list_head		oom_victim_list;
-#endif
+	unsigned long			last_oom_compared;
+	unsigned long			last_oom_score;
+	unsigned char			oom_reap_stall_count;
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
 #endif
diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 26a11e4..5228fc2 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -87,70 +87,6 @@
 	TP_printk("pid=%d", __entry->pid)
 );
 
-TRACE_EVENT(wake_reaper,
-	TP_PROTO(int pid),
-
-	TP_ARGS(pid),
-
-	TP_STRUCT__entry(
-		__field(int, pid)
-	),
-
-	TP_fast_assign(
-		__entry->pid = pid;
-	),
-
-	TP_printk("pid=%d", __entry->pid)
-);
-
-TRACE_EVENT(start_task_reaping,
-	TP_PROTO(int pid),
-
-	TP_ARGS(pid),
-
-	TP_STRUCT__entry(
-		__field(int, pid)
-	),
-
-	TP_fast_assign(
-		__entry->pid = pid;
-	),
-
-	TP_printk("pid=%d", __entry->pid)
-);
-
-TRACE_EVENT(finish_task_reaping,
-	TP_PROTO(int pid),
-
-	TP_ARGS(pid),
-
-	TP_STRUCT__entry(
-		__field(int, pid)
-	),
-
-	TP_fast_assign(
-		__entry->pid = pid;
-	),
-
-	TP_printk("pid=%d", __entry->pid)
-);
-
-TRACE_EVENT(skip_task_reaping,
-	TP_PROTO(int pid),
-
-	TP_ARGS(pid),
-
-	TP_STRUCT__entry(
-		__field(int, pid)
-	),
-
-	TP_fast_assign(
-		__entry->pid = pid;
-	),
-
-	TP_printk("pid=%d", __entry->pid)
-);
-
 #ifdef CONFIG_COMPACTION
 TRACE_EVENT(compact_retry,
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61..63a0bc6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -977,6 +977,8 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	if (unlikely(mm_is_oom_victim(mm)))
+		exit_oom_mm(mm);
 	mmdrop(mm);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87e..a12e6bf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,24 +3059,19 @@ void exit_mmap(struct mm_struct *mm)
 	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
+		 * Then, tell oom_has_pending_victims() to stop calling
+		 * oom_reap_mm(), by taking mm->mmap_sem for write after
+		 * setting MMF_OOM_SKIP.
 		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
+		 * oom_reap_mm() will not block.
 		 *
 		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
+		 * which clears VM_LOCKED, otherwise oom_reap_mm() cannot
 		 * reliably test it.
 		 */
-		mutex_lock(&oom_lock);
-		__oom_reap_task_mm(mm);
-		mutex_unlock(&oom_lock);
-
+		oom_reap_mm(mm);
 		set_bit(MMF_OOM_SKIP, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f58281e..1a9fae4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -49,6 +49,12 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
 
+static inline unsigned long oom_victim_mm_score(struct mm_struct *mm)
+{
+	return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+		mm_pgtables_bytes(mm) / PAGE_SIZE;
+}
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
@@ -198,6 +204,9 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	long points;
 	long adj;
 
+	if (tsk_is_oom_victim(p))
+		return 0;
+
 	if (oom_unkillable_task(p, memcg, nodemask))
 		return 0;
 
@@ -207,13 +216,10 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped or the are in
-	 * the middle of vfork
+	 * unkillable or are in the middle of vfork
 	 */
 	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
-			in_vfork(p)) {
+	if (adj == OOM_SCORE_ADJ_MIN || in_vfork(p)) {
 		task_unlock(p);
 		return 0;
 	}
@@ -222,8 +228,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
-	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+	points = oom_victim_mm_score(p->mm);
 	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
@@ -431,6 +436,29 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
+static bool victim_mm_stalling(struct task_struct *p, struct mm_struct *mm)
+{
+	unsigned long score;
+
+	if (time_before(jiffies, p->last_oom_compared + HZ / 10))
+		return false;
+	score = oom_victim_mm_score(mm);
+	if (score < p->last_oom_score)
+		p->oom_reap_stall_count = 0;
+	else
+		p->oom_reap_stall_count++;
+	p->last_oom_score = oom_victim_mm_score(mm);
+	p->last_oom_compared = jiffies;
+	if (p->oom_reap_stall_count < 30)
+		return false;
+	pr_info("Gave up waiting for process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+		task_pid_nr(p), p->comm, K(mm->total_vm),
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)));
+	return true;
+}
+
 /*
  * task->mm can be NULL if the task is the exited group leader.  So to
  * determine whether the task is using a particular mm, we examine all the
@@ -449,32 +477,10 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-#ifdef CONFIG_MMU
-/*
- * OOM Reaper kernel thread which tries to reap the memory used by the OOM
- * victim (if that is possible) to help the OOM killer to move on.
- */
-static struct task_struct *oom_reaper_th;
-static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static LIST_HEAD(oom_victim_list);
-static DEFINE_SPINLOCK(oom_reaper_lock);
 
-/*
- * We have to make sure not to cause premature new oom victim selection.
- *
- * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
- *   mutex_trylock(&oom_lock)
- *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
- *                               unmap_page_range() # frees some memory
- *                               set_bit(MMF_OOM_SKIP)
- *   out_of_memory()
- *     oom_has_pending_victims()
- *       test_bit(MMF_OOM_SKIP) # selects new oom victim
- *   mutex_unlock(&oom_lock)
- *
- * Therefore, the callers hold oom_lock when calling this function.
- */
-void __oom_reap_task_mm(struct mm_struct *mm)
+#ifdef CONFIG_MMU
+void oom_reap_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 
@@ -513,137 +519,7 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 		}
 	}
 }
-
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
-{
-	bool ret = true;
-
-	mutex_lock(&oom_lock);
-
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
-
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
-	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
-	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
-
-	trace_start_task_reaping(tsk->pid);
-
-	__oom_reap_task_mm(mm);
-
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-			task_pid_nr(tsk), tsk->comm,
-			K(get_mm_counter(mm, MM_ANONPAGES)),
-			K(get_mm_counter(mm, MM_FILEPAGES)),
-			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-	up_read(&mm->mmap_sem);
-
-	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
-}
-
-#define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
-{
-	int attempts = 0;
-	struct mm_struct *mm = tsk->signal->oom_mm;
-
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
-
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
-
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
-
-done:
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
-
-	spin_lock(&oom_reaper_lock);
-	list_del(&tsk->oom_victim_list);
-	spin_unlock(&oom_reaper_lock);
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
-}
-
-static int oom_reaper(void *unused)
-{
-	while (true) {
-		struct task_struct *tsk = NULL;
-
-		wait_event_freezable(oom_reaper_wait,
-				     !list_empty(&oom_victim_list));
-		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_victim_list))
-			tsk = list_first_entry(&oom_victim_list,
-					       struct task_struct,
-					       oom_victim_list);
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
-	}
-
-	return 0;
-}
-
-static void wake_oom_reaper(struct task_struct *tsk)
-{
-	if (tsk->oom_victim_list.next)
-		return;
-	get_task_struct(tsk);
-	spin_lock(&oom_reaper_lock);
-	list_add_tail(&tsk->oom_victim_list, &oom_victim_list);
-	spin_unlock(&oom_reaper_lock);
-	trace_wake_reaper(tsk->pid);
-	wake_up(&oom_reaper_wait);
-}
-
-static int __init oom_init(void)
-{
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
-	return 0;
-}
-subsys_initcall(oom_init)
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
-}
-#endif /* CONFIG_MMU */
+#endif
 
 /**
  * mark_oom_victim - mark the given task as OOM victim
@@ -668,6 +544,13 @@ static void mark_oom_victim(struct task_struct *tsk)
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
 		mmgrab(tsk->signal->oom_mm);
 		set_bit(MMF_OOM_VICTIM, &mm->flags);
+		/* Save current values for victim_mm_stalling() test. */
+		tsk->oom_reap_stall_count = 0;
+		tsk->last_oom_score = oom_victim_mm_score(mm);
+		tsk->last_oom_compared = jiffies;
+		get_task_struct(tsk);
+		lockdep_assert_held(&oom_lock);
+		list_add_tail(&tsk->oom_victim_list, &oom_victim_list);
 	}
 
 	/*
@@ -786,10 +669,11 @@ static bool task_will_free_mem(struct task_struct *task)
 		return false;
 
 	/*
-	 * This task has already been drained by the oom reaper so there are
-	 * only small chances it will free some more
+	 * If memory reserves granted to this task was not sufficient, allow
+	 * killing more processes after oom_has_pending_victims() completed
+	 * reaping this mm.
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+	if (tsk_is_oom_victim(task))
 		return false;
 
 	if (atomic_read(&mm->mm_users) <= 1)
@@ -826,7 +710,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -836,7 +719,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	task_lock(p);
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -925,8 +807,11 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		if (same_thread_group(p, victim))
 			continue;
 		if (is_global_init(p)) {
-			can_oom_reap = false;
 			set_bit(MMF_OOM_SKIP, &mm->flags);
+			/*
+			 * Since oom_lock is held, unlike exit_mmap(), there is
+			 * no need to take mm->mmap_sem for write here.
+			 */
 			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
@@ -942,9 +827,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -989,32 +871,59 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+void exit_oom_mm(struct mm_struct *mm)
+{
+	struct task_struct *p, *tmp;
+
+	mutex_lock(&oom_lock);
+	list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
+		if (mm != p->signal->oom_mm)
+			continue;
+		list_del(&p->oom_victim_list);
+		/* Drop a reference taken by mark_oom_victim(). */
+		put_task_struct(p);
+	}
+	mutex_unlock(&oom_lock);
+}
+
 static bool oom_has_pending_victims(struct oom_control *oc)
 {
-#ifdef CONFIG_MMU
-	struct task_struct *p;
+	struct task_struct *p, *tmp;
+	bool ret = false;
+	bool gaveup = false;
 
-	if (is_sysrq_oom(oc))
-		return false;
-	/*
-	 * Since oom_reap_task_mm()/exit_mmap() will set MMF_OOM_SKIP, let's
-	 * wait for pending victims until MMF_OOM_SKIP is set.
-	 */
-	spin_lock(&oom_reaper_lock);
-	list_for_each_entry(p, &oom_victim_list, oom_victim_list)
-		if (!oom_unkillable_task(p, oc->memcg, oc->nodemask) &&
-		    !test_bit(MMF_OOM_SKIP, &p->signal->oom_mm->flags))
-			break;
-	spin_unlock(&oom_reaper_lock);
-	return p != NULL;
-#else
-	/*
-	 * Since nobody except oom_kill_process() sets MMF_OOM_SKIP, waiting
-	 * for pending victims until MMF_OOM_SKIP is set is useless. Therefore,
-	 * simply let the OOM killer select pending victims again.
-	 */
-	return false;
+	lockdep_assert_held(&oom_lock);
+	list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
+		struct mm_struct *mm = p->signal->oom_mm;
+
+		/* Skip OOM victims which current thread cannot select. */
+		if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
+			continue;
+		ret = true;
+#ifdef CONFIG_MMU
+		/*
+		 * We need to hold mmap_sem for read, in order to safely test
+		 * MMF_UNSTABLE flag and blockable invalidate notifiers.
+		 */
+		if (down_read_trylock(&mm->mmap_sem)) {
+			if (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
+			    !mm_has_blockable_invalidate_notifiers(mm))
+				oom_reap_mm(mm);
+			up_read(&mm->mmap_sem);
+		}
 #endif
+		/* Forget if this mm didn't complete __mmput() for too long. */
+		if (!victim_mm_stalling(p, mm))
+			continue;
+		gaveup = true;
+		list_del(&p->oom_victim_list);
+		/* Drop a reference taken by mark_oom_victim(). */
+		put_task_struct(p);
+	}
+	if (gaveup)
+		debug_show_all_locks();
+
+	return ret && !is_sysrq_oom(oc);
 }
 
 /**
@@ -1048,7 +957,6 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
 		return true;
 	}
 
@@ -1074,8 +982,7 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
-	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+	    oom_badness(current, NULL, oc->nodemask, oc->totalpages) > 0) {
 		get_task_struct(current);
 		oc->chosen = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/8] mm,page_alloc: Make oom_reserves_allowed() even.
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
                   ` (2 preceding siblings ...)
  2018-07-03 14:25 ` [PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 5/8] mm,oom: Bring OOM notifier to outside of oom_lock Tetsuo Handa
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

Since CONFIG_MMU=n kernels no longer wait for OOM victims forever,
there is no possibility of an OOM lockup. Therefore, we can get rid of
the special handling in oom_reserves_allowed().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/page_alloc.c | 17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6205d34..b915533 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3862,21 +3862,6 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 	return alloc_flags;
 }
 
-static bool oom_reserves_allowed(struct task_struct *tsk)
-{
-	if (!tsk_is_oom_victim(tsk))
-		return false;
-
-	/*
-	 * !MMU doesn't have oom reaper so give access to memory reserves
-	 * only to the thread with TIF_MEMDIE set
-	 */
-	if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
-		return false;
-
-	return true;
-}
-
 /*
  * Distinguish requests which really need access to full memory
  * reserves from oom victims which can live with a portion of it
@@ -3892,7 +3877,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
 	if (!in_interrupt()) {
 		if (current->flags & PF_MEMALLOC)
 			return ALLOC_NO_WATERMARKS;
-		else if (oom_reserves_allowed(current))
+		else if (tsk_is_oom_victim(current))
 			return ALLOC_OOM;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/8] mm,oom: Bring OOM notifier to outside of oom_lock.
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
                   ` (3 preceding siblings ...)
  2018-07-03 14:25 ` [PATCH 4/8] mm,page_alloc: Make oom_reserves_allowed() even Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 14:59   ` Michal Hocko
  2018-07-03 14:25 ` [PATCH 6/8] mm,oom: Make oom_lock static variable Tetsuo Handa
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

Since blocking_notifier_call_chain() in out_of_memory() might sleep,
sleeping with oom_lock held is currently unavoidable.

As a preparation for not sleeping with oom_lock held, this patch moves
the OOM notifier callbacks outside of oom_lock. We plan to eventually
replace the OOM notifier callbacks with different mechanisms (e.g. the
shrinker API), but such changes are out of scope for this series.
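
For reference, this is roughly what an OOM notifier user looks like (a
minimal sketch with made-up module and variable names); such a callback
is what try_oom_notifier() ends up invoking through
blocking_notifier_call_chain(&oom_notify_list, 0, &freed):

#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/oom.h>

static unsigned long my_cache_pages;	/* hypothetical driver-private cache */

static int my_oom_notify(struct notifier_block *nb, unsigned long unused,
			 void *parm)
{
	unsigned long *freed = parm;

	/* Drop the hypothetical cache and report how many pages it held. */
	*freed += my_cache_pages;
	my_cache_pages = 0;
	return NOTIFY_OK;
}

static struct notifier_block my_oom_nb = {
	.notifier_call = my_oom_notify,
};

static int __init my_oom_example_init(void)
{
	return register_oom_notifier(&my_oom_nb);
}

static void __exit my_oom_example_exit(void)
{
	unregister_oom_notifier(&my_oom_nb);
}

module_init(my_oom_example_init);
module_exit(my_oom_example_exit);
MODULE_LICENSE("GPL");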

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 38 +++++++++++++++++++++------
 mm/page_alloc.c     | 76 +++++++++++++++++++++++++++++++----------------------
 3 files changed, 76 insertions(+), 39 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index eab409f..d8da2cb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -101,6 +101,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
 
+extern unsigned long try_oom_notifier(void);
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1a9fae4..d18fe1e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -871,6 +871,36 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/**
+ * try_oom_notifier - Try to reclaim memory from OOM notifier list.
+ *
+ * Returns non-zero if notifier callbacks released something, zero otherwise.
+ */
+unsigned long try_oom_notifier(void)
+{
+	static DEFINE_MUTEX(oom_notifier_lock);
+	unsigned long freed = 0;
+
+	/*
+	 * In order to protect OOM notifiers which are not thread safe and to
+	 * avoid excessively releasing memory from OOM notifiers which release
+	 * memory every time, this lock serializes/excludes concurrent calls to
+	 * OOM notifiers.
+	 */
+	if (!mutex_trylock(&oom_notifier_lock))
+		return 1;
+	/*
+	 * But teach the lockdep that mutex_trylock() above acts like
+	 * mutex_lock(), for we are not allowed to depend on
+	 * __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation here.
+	 */
+	mutex_release(&oom_notifier_lock.dep_map, 1, _THIS_IP_);
+	mutex_acquire(&oom_notifier_lock.dep_map, 0, 0, _THIS_IP_);
+	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+	mutex_unlock(&oom_notifier_lock);
+	return freed;
+}
+
 void exit_oom_mm(struct mm_struct *mm)
 {
 	struct task_struct *p, *tmp;
@@ -937,19 +967,11 @@ static bool oom_has_pending_victims(struct oom_control *oc)
  */
 bool out_of_memory(struct oom_control *oc)
 {
-	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	if (oom_killer_disabled)
 		return false;
 
-	if (!is_memcg_oom(oc)) {
-		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-		if (freed > 0)
-			/* Got some memory back in the last second. */
-			return true;
-	}
-
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b915533..4cb3602 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3447,10 +3447,50 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	return page;
 }
 
+static inline bool can_oomkill(gfp_t gfp_mask, unsigned int order,
+			       const struct alloc_context *ac)
+{
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		return false;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return false;
+	/*
+	 * We have already exhausted all our reclaim opportunities without any
+	 * success so it is time to admit defeat. We will skip the OOM killer
+	 * because it is very likely that the caller has a more reasonable
+	 * fallback than shooting a random task.
+	 */
+	if (gfp_mask & __GFP_RETRY_MAYFAIL)
+		return false;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		return false;
+	if (pm_suspended_storage())
+		return false;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		return false;
+
+	return true;
+}
+
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
+	const bool oomkill = can_oomkill(gfp_mask, order, ac);
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
 		.nodemask = ac->nodemask,
@@ -3462,6 +3502,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	*did_some_progress = 0;
 
+	/* Try to reclaim via OOM notifier callback. */
+	if (oomkill)
+		*did_some_progress = try_oom_notifier();
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
@@ -3484,37 +3528,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (page)
 		goto out;
 
-	/* Coredumps can quickly deplete all memory reserves */
-	if (current->flags & PF_DUMPCORE)
-		goto out;
-	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
-		goto out;
-	/*
-	 * We have already exhausted all our reclaim opportunities without any
-	 * success so it is time to admit defeat. We will skip the OOM killer
-	 * because it is very likely that the caller has a more reasonable
-	 * fallback than shooting a random task.
-	 */
-	if (gfp_mask & __GFP_RETRY_MAYFAIL)
-		goto out;
-	/* The OOM killer does not needlessly kill tasks for lowmem */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		goto out;
-	if (pm_suspended_storage())
-		goto out;
-	/*
-	 * XXX: GFP_NOFS allocations should rather fail than rely on
-	 * other request to make a forward progress.
-	 * We are in an unfortunate situation where out_of_memory cannot
-	 * do much for this context but let's try it to at least get
-	 * access to memory reserved if the current task is killed (see
-	 * out_of_memory). Once filesystems are ready to handle allocation
-	 * failures more gracefully we should just bail out here.
-	 */
-
-	/* The OOM killer may not free memory on a specific node */
-	if (gfp_mask & __GFP_THISNODE)
+	if (!oomkill)
 		goto out;
 
 	/* Exhausted what can be done so it's blame time */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 6/8] mm,oom: Make oom_lock static variable.
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
                   ` (4 preceding siblings ...)
  2018-07-03 14:25 ` [PATCH 5/8] mm,oom: Bring OOM notifier to outside of oom_lock Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 7/8] mm,oom: Do not sleep with oom_lock held Tetsuo Handa
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

As a preparation for not sleeping with oom_lock held, this patch makes
oom_lock local to the OOM killer.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 drivers/tty/sysrq.c |  2 --
 include/linux/oom.h |  2 --
 mm/memcontrol.c     |  6 +-----
 mm/oom_kill.c       | 47 ++++++++++++++++++++++++++++-------------------
 mm/page_alloc.c     | 24 ++++--------------------
 5 files changed, 33 insertions(+), 48 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 6364890..c8b66b9 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -376,10 +376,8 @@ static void moom_callback(struct work_struct *ignored)
 		.order = -1,
 	};
 
-	mutex_lock(&oom_lock);
 	if (!out_of_memory(&oc))
 		pr_info("OOM request ignored. No task eligible\n");
-	mutex_unlock(&oom_lock);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index d8da2cb..5ad2927 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -44,8 +44,6 @@ struct oom_control {
 	unsigned long chosen_points;
 };
 
-extern struct mutex oom_lock;
-
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flag_origin = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8a75c8..35c33bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1198,12 +1198,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
 
-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
-	return ret;
+	return out_of_memory(&oc);
 }
 
 #if MAX_NUMNODES > 1
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d18fe1e..a1d3616 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -59,7 +59,7 @@ static inline unsigned long oom_victim_mm_score(struct mm_struct *mm)
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 
-DEFINE_MUTEX(oom_lock);
+static DEFINE_MUTEX(oom_lock);
 
 #ifdef CONFIG_NUMA
 /**
@@ -965,10 +965,9 @@ static bool oom_has_pending_victims(struct oom_control *oc)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-bool out_of_memory(struct oom_control *oc)
+static bool __out_of_memory(struct oom_control *oc,
+			    enum oom_constraint constraint)
 {
-	enum oom_constraint constraint = CONSTRAINT_NONE;
-
 	if (oom_killer_disabled)
 		return false;
 
@@ -991,18 +990,8 @@ bool out_of_memory(struct oom_control *oc)
 	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
-	/*
-	 * Check if there were limitations on the allocation (only relevant for
-	 * NUMA and memcg) that may require different handling.
-	 */
-	constraint = constrained_alloc(oc);
-	if (constraint != CONSTRAINT_MEMORY_POLICY)
-		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint);
 
-	if (oom_has_pending_victims(oc))
-		return true;
-
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    oom_badness(current, NULL, oc->nodemask, oc->totalpages) > 0) {
 		get_task_struct(current);
@@ -1024,10 +1013,33 @@ bool out_of_memory(struct oom_control *oc)
 	return true;
 }
 
+bool out_of_memory(struct oom_control *oc)
+{
+	enum oom_constraint constraint;
+	bool ret;
+	/*
+	 * Check if there were limitations on the allocation (only relevant for
+	 * NUMA and memcg) that may require different handling.
+	 */
+	constraint = constrained_alloc(oc);
+	if (constraint != CONSTRAINT_MEMORY_POLICY)
+		oc->nodemask = NULL;
+	/*
+	 * If there are OOM victims which current thread can select,
+	 * wait for them to reach __mmput().
+	 */
+	mutex_lock(&oom_lock);
+	if (oom_has_pending_victims(oc))
+		ret = true;
+	else
+		ret = __out_of_memory(oc, constraint);
+	mutex_unlock(&oom_lock);
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * memory-hogging task.
  */
 void pagefault_out_of_memory(void)
 {
@@ -1042,9 +1054,6 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
-		return;
 	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
 	schedule_timeout_killable(1);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4cb3602..4c648f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3500,32 +3500,17 @@ static inline bool can_oomkill(gfp_t gfp_mask, unsigned int order,
 	};
 	struct page *page;
 
-	*did_some_progress = 0;
-
 	/* Try to reclaim via OOM notifier callback. */
-	if (oomkill)
-		*did_some_progress = try_oom_notifier();
-
-	/*
-	 * Acquire the oom lock.  If that fails, somebody else is
-	 * making progress for us.
-	 */
-	if (!mutex_trylock(&oom_lock)) {
-		*did_some_progress = 1;
-		return NULL;
-	}
+	*did_some_progress = oomkill ? try_oom_notifier() : 0;
 
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
+	 * we're still under heavy pressure.
 	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
+	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
 				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
-	if (page)
+	if (page || *did_some_progress)
 		goto out;
 
 	if (!oomkill)
@@ -3544,7 +3529,6 @@ static inline bool can_oomkill(gfp_t gfp_mask, unsigned int order,
 					ALLOC_NO_WATERMARKS, ac);
 	}
 out:
-	mutex_unlock(&oom_lock);
 	return page;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 7/8] mm,oom: Do not sleep with oom_lock held.
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
                   ` (5 preceding siblings ...)
  2018-07-03 14:25 ` [PATCH 6/8] mm,oom: Make oom_lock static variable Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 14:25 ` [PATCH 8/8] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
  2018-07-03 15:12 ` [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Michal Hocko
  8 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Tetsuo Handa, David Rientjes, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Tejun Heo, Vladimir Davydov

Since oom_reap_mm() might take quite a long time, it is not a good thing
to block other threads in different OOM domains. This patch allows
calling oom_reap_mm() from multiple concurrently allocating threads. By
this change, the page allocator can spend CPU resources on oom_reap_mm()
for the OOM domains it is interested in.

Also, out_of_memory() no longer sleeps while holding oom_lock (except for
cond_resched() and CONFIG_PREEMPT=y cases), for both OOM notifiers and
oom_reap_mm() are now called outside of oom_lock. This means that
oom_lock is almost a spinlock now. But this patch does not convert
oom_lock, because saving CPU resources for selecting OOM victims,
printk() etc. is still a good thing to do.
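
In other words, the allocation-side flow becomes roughly the following
(an illustrative sketch only: the reference counting and the bookkeeping
which avoids re-reaping the same victim are elided, and
victim_is_reapable() is a placeholder for the mmap_sem / MMF_OOM_SKIP /
mmu notifier checks shown in the diff below):

	mutex_lock(&oom_lock);
 retry:
	list_for_each_entry(p, &oom_victim_list, oom_victim_list) {
		if (!victim_is_reapable(p))	/* placeholder for the checks */
			continue;
		mutex_unlock(&oom_lock);	/* don't block other OOM domains */
		oom_reap_mm(p->signal->oom_mm);	/* possibly long, no oom_lock */
		mutex_lock(&oom_lock);
		goto retry;			/* the list may have changed */
	}
	/* A new victim is selected only if no pending victim was found. */
	mutex_unlock(&oom_lock);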

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: CVE-2016-10723
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/oom_kill.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a1d3616..d534684 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -921,11 +921,18 @@ static bool oom_has_pending_victims(struct oom_control *oc)
 	struct task_struct *p, *tmp;
 	bool ret = false;
 	bool gaveup = false;
+	unsigned int pos = 0;
+	unsigned int last_pos = 0;
 
+ retry:
 	lockdep_assert_held(&oom_lock);
 	list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
 		struct mm_struct *mm = p->signal->oom_mm;
 
+		if (pos++ < last_pos)
+			continue;
+		last_pos = pos;
+
 		/* Skip OOM victims which current thread cannot select. */
 		if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
 			continue;
@@ -937,8 +944,23 @@ static bool oom_has_pending_victims(struct oom_control *oc)
 		 */
 		if (down_read_trylock(&mm->mmap_sem)) {
 			if (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
-			    !mm_has_blockable_invalidate_notifiers(mm))
+			    !mm_has_blockable_invalidate_notifiers(mm)) {
+				get_task_struct(p);
+				mmgrab(mm);
+				mutex_unlock(&oom_lock);
 				oom_reap_mm(mm);
+				up_read(&mm->mmap_sem);
+				mmdrop(mm);
+				put_task_struct(p);
+				mutex_lock(&oom_lock);
+				/*
+				 * Since ret == true, skipping some OOM victims
+				 * by racing with exit_oom_mm() will not cause
+				 * premature OOM victim selection.
+				 */
+				pos = 0;
+				goto retry;
+			}
 			up_read(&mm->mmap_sem);
 		}
 #endif
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 8/8] mm,page_alloc: Move the short sleep to should_reclaim_retry().
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
                   ` (6 preceding siblings ...)
  2018-07-03 14:25 ` [PATCH 7/8] mm,oom: Do not sleep with oom_lock held Tetsuo Handa
@ 2018-07-03 14:25 ` Tetsuo Handa
  2018-07-03 15:12 ` [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Michal Hocko
  8 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-03 14:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: torvalds, Michal Hocko, Tetsuo Handa, David Rientjes,
	Johannes Weiner, Roman Gushchin, Tejun Heo, Vladimir Davydov

From: Michal Hocko <mhocko@suse.com>

Since the page allocator can now spend CPU resources on oom_reap_mm()
for the OOM domains it is interested in, the short sleep for waiting for
the owner of oom_lock no longer makes sense.

should_reclaim_retry() should be a natural reschedule point. PF_WQ_WORKER
is a special case which needs a stronger rescheduling policy. Doing that
unconditionally seems more straightforward than depending on a zone being
a good candidate for a further reclaim.

Thus, move the short sleep for waiting for the owner of oom_lock (which
coincidentally also serves as a guaranteed sleep for PF_WQ_WORKER threads)
to should_reclaim_retry().

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/page_alloc.c | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c648f7..010b536 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3904,6 +3904,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
 	struct zone *zone;
 	struct zoneref *z;
+	bool ret = false;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3967,25 +3968,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 				}
 			}
 
-			/*
-			 * Memory allocation/reclaim might be called from a WQ
-			 * context and the current implementation of the WQ
-			 * concurrency control doesn't recognize that
-			 * a particular WQ is congested if the worker thread is
-			 * looping without ever sleeping. Therefore we have to
-			 * do a short sleep here rather than calling
-			 * cond_resched().
-			 */
-			if (current->flags & PF_WQ_WORKER)
-				schedule_timeout_uninterruptible(1);
-			else
-				cond_resched();
-
-			return true;
+			ret = true;
+			goto out;
 		}
 	}
 
-	return false;
+out:
+	/*
+	 * Memory allocation/reclaim might be called from a WQ
+	 * context and the current implementation of the WQ
+	 * concurrency control doesn't recognize that
+	 * a particular WQ is congested if the worker thread is
+	 * looping without ever sleeping. Therefore we have to
+	 * do a short sleep here rather than calling
+	 * cond_resched().
+	 */
+	if (current->flags & PF_WQ_WORKER)
+		schedule_timeout_uninterruptible(1);
+	else
+		cond_resched();
+	return ret;
 }
 
 static inline bool
@@ -4226,12 +4228,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
-		/*
-		 * This schedule_timeout_*() serves as a guaranteed sleep for
-		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
-		 */
-		if (!tsk_is_oom_victim(current))
-			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.
  2018-07-03 14:25 ` [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
@ 2018-07-03 14:38   ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2018-07-03 14:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, torvalds, David Rientjes, Johannes Weiner,
	Roman Gushchin, Tejun Heo, Vladimir Davydov

On Tue 03-07-18 23:25:02, Tetsuo Handa wrote:
> When I was examining a bug which occurs under CPU + memory pressure, I
> observed that a thread which called out_of_memory() can sleep for minutes
> at schedule_timeout_killable(1) with oom_lock held when many threads are
> doing direct reclaim.
> 
> The whole point of the sleep is to give the OOM victim some time to exit.
> But since commit 27ae357fa82be5ab ("mm, oom: fix concurrent munlock and
> oom reaper unmap, v3") changed the OOM victim to wait for oom_lock in order
> to close race window at exit_mmap(), the whole point of this sleep is lost
> now. We need to make sure that the thread which called out_of_memory() will
> release oom_lock shortly. Therefore, this patch brings the sleep to outside
> of the OOM path.
> 
> Although the sleep will be after all removed by the last patch in this
> series, this patch is meant for ease of backport to stable kernels, for
> we are waiting for patches which can mitigate CVE-2016-10723.

Come on. This is beyond annoying. You have posted this patch a few times
and received exactly the same feedback. Do not mindlessly move code
around just because you want to preserve the status quo which you even
demonstrated you do not care to understand.

So NO, NACK and stop this insanity. Even if you remove the moved sleep in
a later patch I completely hate how you are trying to make a security
issue out of it.  It simply doesn't mitigate anything! It might help
with !PREEMPT but still doesn't solve any problem.

I have already told you that I am ok with removing the sleep. Not
because it tries to pretend to be a CVE fix. But rather because its
relevance is gone now. See the difference?

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Mitigates: CVE-2016-10723
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> ---
>  mm/oom_kill.c   | 38 +++++++++++++++++---------------------
>  mm/page_alloc.c |  7 ++++++-
>  2 files changed, 23 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e7..d3fb4e4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,6 +479,21 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> +/*
> + * We have to make sure not to cause premature new oom victim selection.
> + *
> + * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
> + *   mutex_trylock(&oom_lock)
> + *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
> + *                               unmap_page_range() # frees some memory
> + *                               set_bit(MMF_OOM_SKIP)
> + *   out_of_memory()
> + *     select_bad_process()
> + *       test_bit(MMF_OOM_SKIP) # selects new oom victim
> + *   mutex_unlock(&oom_lock)
> + *
> + * Therefore, the callers hold oom_lock when calling this function.
> + */
>  void __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
> @@ -523,20 +538,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	bool ret = true;
>  
> -	/*
> -	 * We have to make sure to not race with the victim exit path
> -	 * and cause premature new oom victim selection:
> -	 * oom_reap_task_mm		exit_mm
> -	 *   mmget_not_zero
> -	 *				  mmput
> -	 *				    atomic_dec_and_test
> -	 *				  exit_oom_victim
> -	 *				[...]
> -	 *				out_of_memory
> -	 *				  select_bad_process
> -	 *				    # no TIF_MEMDIE task selects new victim
> -	 *  unmap_page_range # frees some memory
> -	 */
>  	mutex_lock(&oom_lock);
>  
>  	if (!down_read_trylock(&mm->mmap_sem)) {
> @@ -1077,15 +1078,9 @@ bool out_of_memory(struct oom_control *oc)
>  		dump_header(oc, NULL);
>  		panic("Out of memory and no killable processes...\n");
>  	}
> -	if (oc->chosen && oc->chosen != (void *)-1UL) {
> +	if (oc->chosen && oc->chosen != (void *)-1UL)
>  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
>  				 "Memory cgroup out of memory");
> -		/*
> -		 * Give the killed process a good chance to exit before trying
> -		 * to allocate memory again.
> -		 */
> -		schedule_timeout_killable(1);
> -	}
>  	return !!oc->chosen;
>  }
>  
> @@ -1111,4 +1106,5 @@ void pagefault_out_of_memory(void)
>  		return;
>  	out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
> +	schedule_timeout_killable(1);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1521100..6205d34 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3468,7 +3468,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	 */
>  	if (!mutex_trylock(&oom_lock)) {
>  		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}
>  
> @@ -4244,6 +4243,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> +		/*
> +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> +		 */
> +		if (!tsk_is_oom_victim(current))
> +			schedule_timeout_uninterruptible(1);
>  		goto retry;
>  	}
>  
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes.
  2018-07-03 14:25 ` [PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes Tetsuo Handa
@ 2018-07-03 14:58   ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2018-07-03 14:58 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, torvalds, David Rientjes, Johannes Weiner,
	Roman Gushchin, Tejun Heo, Vladimir Davydov

On Tue 03-07-18 23:25:04, Tetsuo Handa wrote:
> David Rientjes is complaining that memcg OOM events needlessly select
> more OOM victims when the OOM reaper was unable to reclaim memory. This
> is because exit_mmap() is setting MMF_OOM_SKIP before calling
> free_pgtables(). While David is trying to introduce timeout based hold
> off approach, Michal Hocko is rejecting plain timeout based approaches.
> 
> Therefore, this patch gets rid of the OOM reaper kernel thread and
> introduces OOM-badness-score based hold off approach. The reason for
> getting rid of the OOM reaper kernel thread is explained below.
> 
>     We are about to start getting a lot of OOM victim processes by
>     introducing "mm, oom: cgroup-aware OOM killer" patchset.
> 
>     When there are multiple OOM victims, we should try reclaiming memory
>     in parallel rather than sequentially wait until conditions to be able
>     to reclaim memory are met. Also, we should preferentially reclaim from
>     OOM domains where currently allocating threads belong to. Also, we
>     want to get rid of schedule_timeout_*(1) heuristic which is used for
>     trying to give CPU resource to the owner of oom_lock. Therefire,
>     trying to give CPU resource to the owner of oom_lock. Therefore,
>     the OOM reaper kernel thread.
> 
> This patch changes the OOM killer to wait until either __mmput()
> completes or OOM badness score did not decrease for 3 seconds.

So this is yet another timeout based thing... I am really getting tired
of this coming back and forth. I will get ignored most probably again,
but let me repeat. Once you make this timeout based you will really have
to make it tunable by userspace because one timeout never fits all
needs. And that would be quite stupid. Because what we have now is a
feedback based approach. So we retry as long as we can make progress and
fail eventually because we cannot retry for ever. How many times we
retry is an implementation detail so we do not have to expose that.

Anyway, you failed to explain whether there is any fundamental problem
that the current approach has and won't be able to handle, or that this
new approach would handle much better. So what are the sound reasons to
rewrite the whole thing?

Why do you think that pulling the memory reaping into the oom context is
any better?

So color me unconvinced
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/8] mm,oom: Bring OOM notifier to outside of oom_lock.
  2018-07-03 14:25 ` [PATCH 5/8] mm,oom: Bring OOM notifier to outside of oom_lock Tetsuo Handa
@ 2018-07-03 14:59   ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2018-07-03 14:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, torvalds, David Rientjes, Johannes Weiner,
	Roman Gushchin, Tejun Heo, Vladimir Davydov

On Tue 03-07-18 23:25:06, Tetsuo Handa wrote:
> Since blocking_notifier_call_chain() in out_of_memory() might sleep,
> sleeping with oom_lock held is currently an unavoidable problem.

This is wrong. The problem is avoidable by simply removing the oom
notifiers as they are pure ugliness which we should have never allowed
to live in the first place.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
                   ` (7 preceding siblings ...)
  2018-07-03 14:25 ` [PATCH 8/8] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
@ 2018-07-03 15:12 ` Michal Hocko
  2018-07-03 15:29   ` Michal Hocko
  8 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2018-07-03 15:12 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, torvalds

On Tue 03-07-18 23:25:01, Tetsuo Handa wrote:
> This series provides
> 
>   (1) Mitigation and a fix for CVE-2016-10723.
> 
>   (2) A mitigation for needlessly selecting next OOM victim reported
>       by David Rientjes and rejected by Michal Hocko.
> 
>   (3) A preparation for handling many concurrent OOM victims which
>       could become real by introducing memcg-aware OOM killer.

It would have been great to describe the overall design in the cover
letter. So let me summarize just to be sure I understand the proposal.
You are removing the oom_reaper and moving the oom victim tear down to
the oom path. To handle cases where we cannot get mmap_sem to do that
work you simply decay oom_badness over time if there are no changes in
the victim's oom score. In order to not block in the oom context for too
long because the address space might be quite large, you allow direct
oom reaping from multiple contexts.

You fail to explain why this approach is more appropriate and how you
settled on your current tuning with the 3s timeout etc...

Considering how subtle this whole area is, I am not overly happy about
another rewrite without really strong reasoning behind it. There is none
here, unfortunately. Well, except for statements about how I reject
something without telling the whole story etc...
 
> Tetsuo Handa (7):
>   mm,oom: Don't call schedule_timeout_killable() with oom_lock held.
>   mm,oom: Check pending victims earlier in out_of_memory().
>   mm,oom: Fix unnecessary killing of additional processes.
>   mm,page_alloc: Make oom_reserves_allowed() even.
>   mm,oom: Bring OOM notifier to outside of oom_lock.
>   mm,oom: Make oom_lock static variable.
>   mm,oom: Do not sleep with oom_lock held.
> Michal Hocko (1):
>   mm,page_alloc: Move the short sleep to should_reclaim_retry().
> 
>  drivers/tty/sysrq.c        |   2 -
>  include/linux/memcontrol.h |   9 +-
>  include/linux/oom.h        |   6 +-
>  include/linux/sched.h      |   7 +-
>  include/trace/events/oom.h |  64 -------
>  kernel/fork.c              |   2 +
>  mm/memcontrol.c            |  24 +--
>  mm/mmap.c                  |  17 +-
>  mm/oom_kill.c              | 439 +++++++++++++++++++++------------------------
>  mm/page_alloc.c            | 134 ++++++--------
>  10 files changed, 287 insertions(+), 417 deletions(-)
> 
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-03 15:12 ` [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Michal Hocko
@ 2018-07-03 15:29   ` Michal Hocko
  2018-07-04  2:22     ` penguin-kernel
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2018-07-03 15:29 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, torvalds

On Tue 03-07-18 17:12:23, Michal Hocko wrote:
> On Tue 03-07-18 23:25:01, Tetsuo Handa wrote:
> > This series provides
> > 
> >   (1) Mitigation and a fix for CVE-2016-10723.
> > 
> >   (2) A mitigation for needlessly selecting next OOM victim reported
> >       by David Rientjes and rejected by Michal Hocko.
> > 
> >   (3) A preparation for handling many concurrent OOM victims which
> >       could become real by introducing memcg-aware OOM killer.
> 
> It would have been great to describe the overal design in the cover
> letter. So let me summarize just to be sure I understand the proposal.
> You are removing the oom_reaper and moving the oom victim tear down to
> the oom path. To handle cases where we cannot get mmap_sem to do that
> work you simply decay oom_badness over time if there are no changes in
> the victims oom score.

Correction. You do not decay oom_badness. You simply increase a stall
counter anytime oom_badness hasn't changed since the last check (if that
check happened at least HZ/10 ago) and get the victim out of sight if the
counter is larger than 30. This is where the 3s are coming from. So in
fact this is a lower bound, while it might be considerably larger
depending on how often we examine the victim.
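
In rough pseudo-C the back off looks like this (the field and helper
names below are placeholders of mine, not the identifiers used in the
series):

	/* Sketch: per-victim stall tracking for the oom_badness back off.
	 * The oom_last_* / oom_stall_count fields are hypothetical. */
	static bool victim_looks_stuck(struct task_struct *victim,
				       struct oom_control *oc)
	{
		unsigned long score = oom_badness(victim, oc->memcg,
						  oc->nodemask, oc->totalpages);

		/* Re-examine the victim at most once per HZ/10. */
		if (time_before(jiffies, victim->oom_last_check + HZ / 10))
			return false;
		victim->oom_last_check = jiffies;

		if (score != victim->oom_last_score) {
			/* The score changed, so the victim is making progress. */
			victim->oom_last_score = score;
			victim->oom_stall_count = 0;
			return false;
		}

		/* Unchanged for more than 30 checks, i.e. at least ~3 seconds. */
		return ++victim->oom_stall_count > 30;
	}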

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-03 15:29   ` Michal Hocko
@ 2018-07-04  2:22     ` penguin-kernel
  2018-07-04  7:16       ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: penguin-kernel @ 2018-07-04  2:22 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, akpm, torvalds

Michal Hocko wrote:
> > On Tue 03-07-18 23:25:01, Tetsuo Handa wrote:
> > > This series provides
> > > 
> > >   (1) Mitigation and a fix for CVE-2016-10723.
> > > 
> > >   (2) A mitigation for needlessly selecting next OOM victim reported
> > >       by David Rientjes and rejected by Michal Hocko.
> > > 
> > >   (3) A preparation for handling many concurrent OOM victims which
> > >       could become real by introducing memcg-aware OOM killer.
> > 
> > It would have been great to describe the overal design in the cover
> > letter. So let me summarize just to be sure I understand the proposal.

You understood the proposal correctly.

> > You are removing the oom_reaper and moving the oom victim tear down to
> > the oom path.

Yes. This is for getting rid of the lie

	/*
	 * Acquire the oom lock.  If that fails, somebody else is
	 * making progress for us.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

which leads to CVE-2016-10723. By reclaiming memory from the OOM killer
path, we can eliminate this heuristic.

Of course, we don't have to remove the OOM reaper kernel thread. But the OOM
killer path knows better than the OOM reaper path regarding which domains need
to be reclaimed faster than other domains.

> >               To handle cases where we cannot get mmap_sem to do that
> > work you simply decay oom_badness over time if there are no changes in
> > the victims oom score.

Yes. It is a feedback based approach which handles not only the case where
mmap_sem cannot be taken but also the case where free_pgtables()/remove_vma()
has already started.

However, you are not understanding [PATCH 3/8] correctly. It IS A FEEDBACK
BASED APPROACH which corresponds to

	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ/10);

in the current code (though the timeout is different from the current code's).

> 
> Correction. You do not decay oom_badness. You simply increase a stall
> counter anytime oom_badness hasn't changed since the last check (if that
> check happend at least HZ/10 ago) and get the victim out of sight if the
> counter is larger than 30. This is where 3s are coming from. So in fact
> this is the low boundary while it might be considerably larger depending
> on how often we examine the victim.

Yes, it can become larger. But if we don't need to examine the OOM victim, it means
that we are not OOM. We need to examine the OOM victim only if we are still OOM.
(If we are no longer OOM, the OOM victim will disappear via exit_oom_mm() before
the counter reaches 30.)

Note that not even the current

#define MAX_OOM_REAP_RETRIES 10

	/* Retry the down_read_trylock(mmap_sem) a few times */
	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ/10);

code guarantees that the OOM reaper will give up in 1 second. If oom_reap_task_mm()
for one OOM domain takes e.g. one minute when there are OOM victims in multiple OOM
domains, we will needlessly block another OOM domain where allocating threads are
wasting CPU resources due to the lie mentioned above.

By allowing allocating threads to do direct OOM reaping, we won't needlessly block
allocating threads and we can eliminate the lie mentioned above.

> 
> >                        In order to not block in the oom context for too
> > long because the address space might be quite large, you allow to
> > direct oom reap from multiple contexts.

Yes. Since the thread which acquired the oom_lock can be running at SCHED_IDLE
priority, this patch tries to speed up direct OOM reaping by also letting other
threads which acquire the oom_lock with !SCHED_IDLE priority do it. Thus,
[PATCH 7/8] tries to reduce the duration of holding the oom_lock, by releasing
the oom_lock before doing direct OOM reaping.

We could make oom_lock a spinlock (or "mutex_lock(&oom_lock); preempt_disable();"
and "preempt_enable(); mutex_unlock(&oom_lock);") but there is no perfect answer
for whether we should do so.

> > 
> > You fail to explain why is this approach more appropriate and how you
> > have settled with your current tuning with 3s timeout etc...
> > 
> > Considering how subtle this whole area is I am not overly happy about
> > another rewrite without a really strong reasoning behind. There is none
> > here, unfortunately. Well, except for statements how I reject something
> > without telling the whole story etc...

However, you are not understanding [PATCH 1/8] correctly. You are simply
rejecting my patch instead of proving that removing the short sleep _is_ safe.
Since you haven't proven it, I won't remove the short sleep until [PATCH 8/8].

In kernel space, not yielding enough CPU resources for other threads to
make forward progress is a BUG. You learned that with warn_alloc() for
allocation stall reporting, didn't you?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-04  2:22     ` penguin-kernel
@ 2018-07-04  7:16       ` Michal Hocko
  2018-07-04  7:22         ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2018-07-04  7:16 UTC (permalink / raw)
  To: penguin-kernel; +Cc: linux-mm, akpm, torvalds

On Wed 04-07-18 11:22:55, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > On Tue 03-07-18 23:25:01, Tetsuo Handa wrote:
> > > > This series provides
> > > > 
> > > >   (1) Mitigation and a fix for CVE-2016-10723.
> > > > 
> > > >   (2) A mitigation for needlessly selecting next OOM victim reported
> > > >       by David Rientjes and rejected by Michal Hocko.
> > > > 
> > > >   (3) A preparation for handling many concurrent OOM victims which
> > > >       could become real by introducing memcg-aware OOM killer.
> > > 
> > > It would have been great to describe the overal design in the cover
> > > letter. So let me summarize just to be sure I understand the proposal.
> 
> You understood the proposal correctly.
> 
> > > You are removing the oom_reaper and moving the oom victim tear down to
> > > the oom path.
> 
> Yes. This is for getting rid of the lie
> 
> 	/*
> 	 * Acquire the oom lock.  If that fails, somebody else is
> 	 * making progress for us.
> 	 */
> 	if (!mutex_trylock(&oom_lock)) {
> 		*did_some_progress = 1;
> 		schedule_timeout_uninterruptible(1);
> 		return NULL;
> 	}
> 
> which is leading to CVE-2016-10723. By reclaiming from the OOM killer path,
> we can eliminate this heuristic.
> 
> Of course, we don't have to remove the OOM reaper kernel thread.

The thing is that the current design uses the oom_reaper only as a
backup to get the situation unstuck. Once you move all that heavy lifting
into the oom path directly then you will have to handle all sorts of
issues. E.g. how do you handle that a random process hitting the OOM path
has to pay the full price to tear down a multi-TB process? This is a lot
of time.

> But the OOM
> killer path knows better than the OOM reaper path regarding which domains need
> to be reclaimed faster than other domains.

How can it tell any priority? 

> > >               To handle cases where we cannot get mmap_sem to do that
> > > work you simply decay oom_badness over time if there are no changes in
> > > the victims oom score.
> 
> Yes. It is a feedback based approach which handles not only unable to get
> mmap_sem case but also already started free_pgtables()/remove_vma() case.
> 
> However, you are not understanding [PATCH 3/8] correctly. It IS A FEEDBACK
> BASED APPROACH which corresponds to
> 
> 	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> 		schedule_timeout_idle(HZ/10);
> 
> in current code (though the timeout is different from current code).

Yes, the timeout is different. The current solution is based on retry
loops while you are proposing a fuzzy number-of-retries scheme. You can
retry many more times if the checks happen often and there is _some_
change in the oom_score. I am not yet convinced that the oom_score based
back off is a good idea. If you wanted to give some comparable results
you could easily implement it on top of the current oom_reaper and argue
with some numbers.
 
> > Correction. You do not decay oom_badness. You simply increase a stall
> > counter anytime oom_badness hasn't changed since the last check (if that
> > check happend at least HZ/10 ago) and get the victim out of sight if the
> > counter is larger than 30. This is where 3s are coming from. So in fact
> > this is the low boundary while it might be considerably larger depending
> > on how often we examine the victim.
> 
> Yes, it can become larger. But if we don't need to examine the OOM victim, it means
> that we are not OOM. We need to examine the OOM victim only if we are still OOM.
> (If we are no longer OOM, the OOM victim will disappear via exit_oom_mm() before
> the counter reaches 30.)
> 
> Note that neither current
> 
> #define MAX_OOM_REAP_RETRIES 10
> 
> 	/* Retry the down_read_trylock(mmap_sem) a few times */
> 	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> 		schedule_timeout_idle(HZ/10);
> 
> code guarantees that the OOM reaper will give up in 1 second.

And that is a genuine design decision. Because the timeout is not
really important. It is the retry loop feedback that matters. We do not
make any promises that something will happen after one or however many
seconds. We only care about retrying as long as it makes sense but not
for ever. That "not for ever" part is an implementation detail that we can
tune based on real bug reports.

> If oom_reap_task_mm()
> for one OOM domain takes e.g. one minute when there are OOM victims in multiple OOM
> domains, we will needlessly block another OOM domain where allocating threads are
> wasting CPU resources due to the lie mentioned above.

Not at all. This would only happen when oom victims from other domains
are stuck, unable to make forward progress. Is this something to
optimize the behavior for?

> By allowing allocating threads to do direct OOM reaping, we won't needlessly block
> allocating threads and we can eliminate the lie mentioned above.

Yes, this would help _but_ there are other downsides as mentioned above.
You have to carefully weigh them.

> > 
> > >                        In order to not block in the oom context for too
> > > long because the address space might be quite large, you allow to
> > > direct oom reap from multiple contexts.
> 
> Yes. Since the thread which acquired the oom_lock can be SCHED_IDLE priority,
> this patch is trying to speed up direct OOM reaping by allowing other threads
> who acquired the oom_lock with !SCHED_IDLE priority. Thus, [PATCH 7/8] tries to
> reduce the duration of holding the oom_lock, by releasing the oom_lock before
> doing direct OOM reaping.
> 
> We could make oom_lock a spinlock (or "mutex_lock(&oom_lock); preempt_disable();"
> and "preempt_enable(); mutex_unlock(&oom_lock);") but there is no perfect answer
> for whether we should do so.

But there is a good example of why we shouldn't do that. Crawling over the
task list can take a long time. And you definitely do not want to stall
even more for PREEMPT kernels.
 
> > > You fail to explain why is this approach more appropriate and how you
> > > have settled with your current tuning with 3s timeout etc...
> > > 
> > > Considering how subtle this whole area is I am not overly happy about
> > > another rewrite without a really strong reasoning behind. There is none
> > > here, unfortunately. Well, except for statements how I reject something
> > > without telling the whole story etc...
> 
> However, you are not understanding [PATCH 1/8] correctly. You are simply
> refusing my patch instead of proving that removing the short sleep _is_ safe.
> Since you don't prove it, I won't remove the short sleep until [PATCH 8/8].

No, you are misunderstanding my point. Do not fiddle with the code you
are not willing to understand. The whole sleeping has been done
optimistically in the oom path. Removing it is perfectly reasonable with
the current code. We do have preemption points in the reclaim path. If
they are not sufficient then this is another story. But no sane design
relies on random sleeps in the OOM path, which is probably the coldest
path in the kernel.
 
> In the kernel space, not yielding enough CPU resources for other threads to
> make forward progress is a BUG. You learned it with warn_alloc() for allocation
> stall reporting, didn't you?

And again, you are blaming the wrong part of the pipe. The whole
warn_alloc fiasco is the result of printk allowing an unbounded amount of
work on behalf of the caller.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-04  7:16       ` Michal Hocko
@ 2018-07-04  7:22         ` Michal Hocko
  2018-07-05  3:05           ` Tetsuo Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2018-07-04  7:22 UTC (permalink / raw)
  To: penguin-kernel; +Cc: linux-mm, akpm, torvalds

On Wed 04-07-18 09:16:32, Michal Hocko wrote:
> On Wed 04-07-18 11:22:55, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > On Tue 03-07-18 23:25:01, Tetsuo Handa wrote:
> > > > > This series provides
> > > > > 
> > > > >   (1) Mitigation and a fix for CVE-2016-10723.
> > > > > 
> > > > >   (2) A mitigation for needlessly selecting next OOM victim reported
> > > > >       by David Rientjes and rejected by Michal Hocko.
> > > > > 
> > > > >   (3) A preparation for handling many concurrent OOM victims which
> > > > >       could become real by introducing memcg-aware OOM killer.
> > > > 
> > > > It would have been great to describe the overal design in the cover
> > > > letter. So let me summarize just to be sure I understand the proposal.
> > 
> > You understood the proposal correctly.
> > 
> > > > You are removing the oom_reaper and moving the oom victim tear down to
> > > > the oom path.
> > 
> > Yes. This is for getting rid of the lie
> > 
> > 	/*
> > 	 * Acquire the oom lock.  If that fails, somebody else is
> > 	 * making progress for us.
> > 	 */
> > 	if (!mutex_trylock(&oom_lock)) {
> > 		*did_some_progress = 1;
> > 		schedule_timeout_uninterruptible(1);
> > 		return NULL;
> > 	}
> > 
> > which is leading to CVE-2016-10723. By reclaiming from the OOM killer path,
> > we can eliminate this heuristic.
> > 
> > Of course, we don't have to remove the OOM reaper kernel thread.
> 
> The thing is that the current design uses the oom_reaper only as a
> backup to get situation unstuck. Once you move all that heavy lifting
> into the oom path directly then you will have to handle all sorts of
> issues. E.g. how do you handle that a random process hitting OOM path
> has to pay the full price to tear down multi TB process? This is a lot
> of time.

And one more thing. Your current design doesn't solve any of the current
shortcomings. mlocked pages are still not reclaimable from the direct
oom tear down. Blockable mmu notifiers still prevent the direct tear
down. So the only thing that you achieve with a large and disruptive
patch is that the exit vs. oom locking protocol got simplified and that
you can handle oom domains from tasks belonging to them. This is not bad
but it has its own downsides which you either fail to see or are
reluctant to describe and explain.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-04  7:22         ` Michal Hocko
@ 2018-07-05  3:05           ` Tetsuo Handa
  2018-07-05  7:24             ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-05  3:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, akpm, torvalds

Michal Hocko wrote:
> > > > You are removing the oom_reaper and moving the oom victim tear down to
> > > > the oom path.
> > 
> > Yes. This is for getting rid of the lie
> > 
> > 	/*
> > 	 * Acquire the oom lock.  If that fails, somebody else is
> > 	 * making progress for us.
> > 	 */
> > 	if (!mutex_trylock(&oom_lock)) {
> > 		*did_some_progress = 1;
> > 		schedule_timeout_uninterruptible(1);
> > 		return NULL;
> > 	}
> > 
> > which is leading to CVE-2016-10723. By reclaiming from the OOM killer path,
> > we can eliminate this heuristic.
> > 
> > Of course, we don't have to remove the OOM reaper kernel thread.
> 
> The thing is that the current design uses the oom_reaper only as a
> backup to get situation unstuck. Once you move all that heavy lifting
> into the oom path directly then you will have to handle all sorts of
> issues. E.g. how do you handle that a random process hitting OOM path
> has to pay the full price to tear down multi TB process? This is a lot
> of time.

We can add a threshold to unmap_page_range() (for direct OOM reaping threads)
which aborts after a given number of pages have been reclaimed. There is no
need to reclaim all pages at once if the caller is doing memory allocations.
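
As a rough sketch (the helper name, the skip checks and the batch size
below are only illustrative, not something from this series, and the
threshold is shown at the VMA walk level rather than pushed down into
unmap_page_range() itself):

	/*
	 * Stop tearing down the victim once one batch worth of address
	 * space has been unmapped; the allocating caller only needs
	 * _some_ memory back, not the whole multi-TB mm at once.
	 */
	#define OOM_REAP_BATCH_PAGES	(1UL << (30 - PAGE_SHIFT))	/* ~1GB */

	static void oom_reap_mm_batch(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;
		unsigned long done = 0;

		for (vma = mm->mmap; vma && done < OOM_REAP_BATCH_PAGES;
		     vma = vma->vm_next) {
			if (!can_madv_dontneed_vma(vma))
				continue;
			if (!vma_is_anonymous(vma) && (vma->vm_flags & VM_SHARED))
				continue;
			/* tlb_gather_mmu() + unmap_page_range() as today ... */
			done += (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
		}
	}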

But isn't "direct reclaim" already paying the full price (including being
blocked at unkillable locks inside shrinkers etc.), huh? If "direct reclaim"
did not exist, we would not have suffered this problem from the beginning. ;-)

> 
> And one more thing. Your current design doesn't solve any of the current
> shortcomings. mlocked pages are still not reclaimable from the direct
> oom tear down. Blockable mmu notifiers still prevent the direct tear
> down. So the only thing that you achieve with a large and disruptive
> patch is that the exit vs. oom locking protocol got simplified and that
> you can handle oom domains from tasks belonging to them. This is not bad
> but it has its own downsides which either fail to see or reluctant to
> describe and explain.

What this patchset is trying to do is "do not sleep (except cond_resched()
and PREEMPT) with oom_lock held". In other words, "sleeping with oom_lock
held (or at least, holding oom_lock more than needed)" is one of the current
shortcomings.

Handling mlocked/shared pages and blockable mmu notifiers is a different
story.

> 
> > But the OOM
> > killer path knows better than the OOM reaper path regarding which domains need
> > to be reclaimed faster than other domains.
> 
> How can it tell any priority?

I didn't catch what your question is.
Allowing direct OOM reaping itself reflects the priority.

> 
> > > >               To handle cases where we cannot get mmap_sem to do that
> > > > work you simply decay oom_badness over time if there are no changes in
> > > > the victims oom score.
> > 
> > Yes. It is a feedback based approach which handles not only unable to get
> > mmap_sem case but also already started free_pgtables()/remove_vma() case.
> > 
> > However, you are not understanding [PATCH 3/8] correctly. It IS A FEEDBACK
> > BASED APPROACH which corresponds to
> > 
> > 	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> > 		schedule_timeout_idle(HZ/10);
> > 
> > in current code (though the timeout is different from current code).
> 
> Yes the timeout is different. The current solution is based on the retry
> loops while you are proposing a fuzzy amount of retries scheme. You can
> retry much more times if they happen often and there is _some_ change in
> the oom_score. I am not yet convinced that the oom_score based back off
> is a good idea. If you wanted to give some comparable results you could
> easily implement it on top of the current oom_reaper and argue with some
> numbers.

The oom_score based back off is a mitigation for giving up too early due to
mmap_sem being held for write and/or mlocked/shared pages and/or blockable
mmu notifiers. Once it becomes possible to reclaim all pages, such back off
will be removed.

> > If oom_reap_task_mm()
> > for one OOM domain takes e.g. one minute when there are OOM victims in multiple OOM
> > domains, we will needlessly block another OOM domain where allocating threads are
> > wasting CPU resources due to the lie mentioned above.
> 
> Not at all. This would only happen when oom victims from other domains
> are stuck not able to make forward progress. Is this something to
> optimize the behavior for?

Yes, I consider that this is something to optimize the behavior for.

The current code forces a memcg OOM killer waiting at mutex_lock(&oom_lock) to
pay the full price of tearing down an OOM victim process (which might be a
multi-TB process), because exit_mmap()/oom_reap_task_mm() call
__oom_reap_task_mm(mm) with oom_lock held. The OOM victim which
exit_mmap()/oom_reap_task_mm() is tearing down is not always in the same OOM
domain that the mutex_lock(&oom_lock) caller is waiting for.

There is no need to hold oom_lock when calling __oom_reap_task_mm(mm).
Holding oom_lock helps with "serializing printk() messages from the OOM
killer" and with "avoiding selection of new OOM victims when there are
already OOM victims which the current thread should wait for". Holding
oom_lock does not help when waiting for unmap_page_range().

> > > 
> > > >                        In order to not block in the oom context for too
> > > > long because the address space might be quite large, you allow to
> > > > direct oom reap from multiple contexts.
> > 
> > Yes. Since the thread which acquired the oom_lock can be SCHED_IDLE priority,
> > this patch is trying to speed up direct OOM reaping by allowing other threads
> > who acquired the oom_lock with !SCHED_IDLE priority. Thus, [PATCH 7/8] tries to
> > reduce the duration of holding the oom_lock, by releasing the oom_lock before
> > doing direct OOM reaping.
> > 
> > We could make oom_lock a spinlock (or "mutex_lock(&oom_lock); preempt_disable();"
> > and "preempt_enable(); mutex_unlock(&oom_lock);") but there is no perfect answer
> > for whether we should do so.
> 
> But there is a good example, why we shouldn't do that. Crawling over the
> task list can take a long time. And you definitely do not want to stall
> even more for PREEMPT kernels.

I didn't catch what "do that" is referring to. Did you say

  But there is a good example, why we shouldn't make oom_lock a spinlock (or
  "mutex_lock(&oom_lock); preempt_disable();" and "preempt_enable();
  mutex_unlock(&oom_lock);").

? Then, [PATCH 2/8] will avoid needlessly crawling over the task list and
[PATCH 7/8] will reduce the needless blocking of other threads waiting for
oom_lock, which are appropriate changes for that guideline. And please stop
complaining about PREEMPT kernels (i.e. your "It might help with !PREEMPT but
still doesn't solve any problem." in response to [PATCH 1/8]). Again,
"sleeping with oom_lock held (or at least, holding oom_lock more than needed)"
is one of the current shortcomings.

>  
> > > > You fail to explain why is this approach more appropriate and how you
> > > > have settled with your current tuning with 3s timeout etc...
> > > > 
> > > > Considering how subtle this whole area is I am not overly happy about
> > > > another rewrite without a really strong reasoning behind. There is none
> > > > here, unfortunately. Well, except for statements how I reject something
> > > > without telling the whole story etc...
> > 
> > However, you are not understanding [PATCH 1/8] correctly. You are simply
> > refusing my patch instead of proving that removing the short sleep _is_ safe.
> > Since you don't prove it, I won't remove the short sleep until [PATCH 8/8].
> 
> No, you are misunderstanding my point. Do not fiddle with the code you
> are not willing to understand. The whole sleeping has been done
> optimistically in the oom path. Removing it perfectly reasonable with
> the current code. We do have preemption points in the reclaim path. If
> they are not sufficient then this is another story. But no sane design
> relies on random sleeps in the OOM path which is probably the most cold
> path in the kernel.

I split this into [PATCH 1/8] and [PATCH 8/8] so that we can revert the
[PATCH 8/8] part in case it turns out that removing the short sleep has a
negative effect. You are still not proving that combining [PATCH 1/8] and
[PATCH 8/8] is safe. Note that "reasonable" and "safe" are different things.
We are prone to making mistakes which sounded "reasonable" but were not "safe"
(e.g. commit 212925802454672e "mm: oom: let oom_reap_task and exit_mmap run
concurrently").

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-05  3:05           ` Tetsuo Handa
@ 2018-07-05  7:24             ` Michal Hocko
  2018-07-06  2:40               ` Tetsuo Handa
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2018-07-05  7:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, torvalds

On Thu 05-07-18 12:05:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > > You are removing the oom_reaper and moving the oom victim tear down to
> > > > > the oom path.
> > > 
> > > Yes. This is for getting rid of the lie
> > > 
> > > 	/*
> > > 	 * Acquire the oom lock.  If that fails, somebody else is
> > > 	 * making progress for us.
> > > 	 */
> > > 	if (!mutex_trylock(&oom_lock)) {
> > > 		*did_some_progress = 1;
> > > 		schedule_timeout_uninterruptible(1);
> > > 		return NULL;
> > > 	}
> > > 
> > > which is leading to CVE-2016-10723. By reclaiming from the OOM killer path,
> > > we can eliminate this heuristic.
> > > 
> > > Of course, we don't have to remove the OOM reaper kernel thread.
> > 
> > The thing is that the current design uses the oom_reaper only as a
> > backup to get situation unstuck. Once you move all that heavy lifting
> > into the oom path directly then you will have to handle all sorts of
> > issues. E.g. how do you handle that a random process hitting OOM path
> > has to pay the full price to tear down multi TB process? This is a lot
> > of time.
> 
> We can add a threshold to unmap_page_range() (for direct OOM reaping threads)
> which aborts after given number of pages are reclaimed. There is no need to
> reclaim all pages at once if the caller is doing memory allocations. 

Yes, there is no need to reclaim all pages. OOM is after freeing _some_
memory after all. But that means further complications down the unmap
path. I do not really see any reason for that.

> But isn't "direct reclaim" already paying the full price (including being
> blocked at unkillable locks inside shrinkers etc.), huh? If "direct reclaim"
> did not exist, we would not have suffered this problem from the beginning. ;-)

No, direct reclaim is a way to throttle allocations to the reclaim
speed. You would have to achieve the same by other means.

> > And one more thing. Your current design doesn't solve any of the current
> > shortcomings. mlocked pages are still not reclaimable from the direct
> > oom tear down. Blockable mmu notifiers still prevent the direct tear
> > down. So the only thing that you achieve with a large and disruptive
> > patch is that the exit vs. oom locking protocol got simplified and that
> > you can handle oom domains from tasks belonging to them. This is not bad
> > but it has its own downsides which either fail to see or reluctant to
> > describe and explain.
> 
> What this patchset is trying to do is "do not sleep (except cond_resched()
> and PREEMPT) with oom_lock held". In other words, "sleeping with oom_lock
> held (or at least, holding oom_lock more than needed)" is one of the current
> shortcomings.
> 
> Handling mlocked/shared pages and blockable mmu notifieres are a different
> story.

Yes, and yet those are the only reasons why some cases are not handled
with the current approach which you are trying to replace completely.
So you replace one set of corner cases with another while you do not
really solve the reapability of the above. This doesn't sound like an
improvement to me.

> > > But the OOM
> > > killer path knows better than the OOM reaper path regarding which domains need
> > > to be reclaimed faster than other domains.
> > 
> > How can it tell any priority?
> 
> I didn't catch what your question is.
> Allowing direct OOM reaping itself reflects the priority.

I guess I've misunderstood. I thought you were talking about
prioritization of the task reaping. I guess I see your point now. You
are pointing to the oom_unkillable_task check before tearing down the
victim. That would indeed help when we have too many memcg ooms
happening in parallel. But it is something that your design has
introduced. If you kept the async tear down you would not really care,
because you would rely on victims cleaning up after themselves in the
vast majority of cases and async is there only to handle potential
lockups.

> > > > >               To handle cases where we cannot get mmap_sem to do that
> > > > > work you simply decay oom_badness over time if there are no changes in
> > > > > the victims oom score.
> > > 
> > > Yes. It is a feedback based approach which handles not only unable to get
> > > mmap_sem case but also already started free_pgtables()/remove_vma() case.
> > > 
> > > However, you are not understanding [PATCH 3/8] correctly. It IS A FEEDBACK
> > > BASED APPROACH which corresponds to
> > > 
> > > 	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> > > 		schedule_timeout_idle(HZ/10);
> > > 
> > > in current code (though the timeout is different from current code).
> > 
> > Yes the timeout is different. The current solution is based on the retry
> > loops while you are proposing a fuzzy amount of retries scheme. You can
> > retry much more times if they happen often and there is _some_ change in
> > the oom_score. I am not yet convinced that the oom_score based back off
> > is a good idea. If you wanted to give some comparable results you could
> > easily implement it on top of the current oom_reaper and argue with some
> > numbers.
> 
> The oom_score based back off is mitigation for giving up too early due to
> mmap_sem held for write and/or mlocked/shared pages and/or blockable mmu
> notifiers. When we became possible to reclaim all pages, such back off will
> be removed.

So why don't we fix those first and then try to rewrite the whole thing if
that is not sufficient?

> > > If oom_reap_task_mm()
> > > for one OOM domain takes e.g. one minute when there are OOM victims in multiple OOM
> > > domains, we will needlessly block another OOM domain where allocating threads are
> > > wasting CPU resources due to the lie mentioned above.
> > 
> > Not at all. This would only happen when oom victims from other domains
> > are stuck not able to make forward progress. Is this something to
> > optimize the behavior for?
> 
> Yes, I consider that this is something to optimize the behavior for.

The proper design should focus on the standard case while covering
corner cases as much as possible. Doing it the other way around risks
overcomplicating the design with hard-to-evaluate side effects.
 
> Current code is forcing memcg OOM killer waiting at mutex_lock(&oom_lock) to pay
> the full price to tear down an OOM victim process (which might be a multi TB process)
> because exit_mmap()/oom_reap_task_mm() are calling __oom_reap_task_mm(mm) with
> oom_lock held. The OOM victim which exit_mmap()/oom_reap_task_mm() is tearing
> down is not always in the same OOM domain mutex_lock(&oom_lock) is waiting for.

You are (yet again) ignoring many details and making incorrect
claims. For one, if somebody is reclaiming memory then those other
allocation paths can make forward progress and as such they do not
really have to wait for the full tear down. See the difference from the
sync oom reaping, where at least one task will have to pay the full price?
 
[...]
> > > We could make oom_lock a spinlock (or "mutex_lock(&oom_lock); preempt_disable();"
> > > and "preempt_enable(); mutex_unlock(&oom_lock);") but there is no perfect answer
> > > for whether we should do so.
> > 
> > But there is a good example, why we shouldn't do that. Crawling over the
> > task list can take a long time. And you definitely do not want to stall
> > even more for PREEMPT kernels.
> 
> I didn't catch what "do that" is referring to. Did you say
> 
>   But there is a good example, why we shouldn't make oom_lock a spinlock (or
>   "mutex_lock(&oom_lock); preempt_disable();" and "preempt_enable();
>   mutex_unlock(&oom_lock);").
> 
> ? Then, [PATCH 2/8] will avoid needlessly crawling over the task list and
> [PATCH 7/8] will reduce needlessly blocking other threads waiting for oom_lock,
> which are appropriate changes for that guideline. And please stop complaining
> about PREEMPT kernels (i.e. your "It might help with !PREEMPT but still doesn't
> solve any problem." in response to [PATCH 1/8]). Again, "sleeping with oom_lock
> held (or at least, holding oom_lock more than needed)" is one of the current
> shortcomings.

You are like a broken record. I have said several times already that we
do not have to sleep while holding the lock. You would still have to
crawl all tasks in case you have to select a victim, and that is a no-no
with preemption disabled.
 
> > > > > You fail to explain why is this approach more appropriate and how you
> > > > > have settled with your current tuning with 3s timeout etc...
> > > > > 
> > > > > Considering how subtle this whole area is I am not overly happy about
> > > > > another rewrite without a really strong reasoning behind. There is none
> > > > > here, unfortunately. Well, except for statements how I reject something
> > > > > without telling the whole story etc...
> > > 
> > > However, you are not understanding [PATCH 1/8] correctly. You are simply
> > > refusing my patch instead of proving that removing the short sleep _is_ safe.
> > > Since you don't prove it, I won't remove the short sleep until [PATCH 8/8].
> > 
> > No, you are misunderstanding my point. Do not fiddle with the code you
> > are not willing to understand. The whole sleeping has been done
> > optimistically in the oom path. Removing it perfectly reasonable with
> > the current code. We do have preemption points in the reclaim path. If
> > they are not sufficient then this is another story. But no sane design
> > relies on random sleeps in the OOM path which is probably the most cold
> > path in the kernel.
> 
> I split into [PATCH 1/8] and [PATCH 8/8] so that we can revert [PATCH 8/8] part
> in case it turns out that removing the short sleep affects negatively.

And again, if you read what I wrote and tried to think about it for a
while you would understand that reverting that would be a wrong
approach. I am not going to repeat myself again and again. Go and try to
reread and ask if something is not clear.

> You are still not proving that combining [PATCH 1/8] and [PATCH 8/8] is safe.
> Note that "reasonable" and "safe" are different things. We are prone to make
> failures which sounded "reasonable" but were not "safe" (e.g. commit
> 212925802454672e "mm: oom: let oom_reap_task and exit_mmap run concurrently").

We will always fail like that if the design is not clear. And random
sleeps at random places just because we have been doing that for some
time are not a design. It's a mess. And we have been piling that mess in
the oom path for years to tune for very specific workloads. It's time to
stop that, finally! If you have an overloaded page allocator you simply
do not depend on a random sleep in the oom path. Full stop. If that is
not clear to you, we do not have much to talk about.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-05  7:24             ` Michal Hocko
@ 2018-07-06  2:40               ` Tetsuo Handa
  2018-07-06  2:49                 ` Linus Torvalds
  2018-07-06  5:56                 ` Michal Hocko
  0 siblings, 2 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-06  2:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, akpm, torvalds, rientjes

Michal Hocko wrote:
> On Thu 05-07-18 12:05:09, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > > > You are removing the oom_reaper and moving the oom victim tear down to
> > > > > > the oom path.
> > > > 
> > > > Yes. This is for getting rid of the lie
> > > > 
> > > > 	/*
> > > > 	 * Acquire the oom lock.  If that fails, somebody else is
> > > > 	 * making progress for us.
> > > > 	 */
> > > > 	if (!mutex_trylock(&oom_lock)) {
> > > > 		*did_some_progress = 1;
> > > > 		schedule_timeout_uninterruptible(1);
> > > > 		return NULL;
> > > > 	}
> > > > 
> > > > which is leading to CVE-2016-10723. By reclaiming from the OOM killer path,
> > > > we can eliminate this heuristic.
> > > > 
> > > > Of course, we don't have to remove the OOM reaper kernel thread.
> > > 
> > > The thing is that the current design uses the oom_reaper only as a
> > > backup to get situation unstuck. Once you move all that heavy lifting
> > > into the oom path directly then you will have to handle all sorts of
> > > issues. E.g. how do you handle that a random process hitting OOM path
> > > has to pay the full price to tear down multi TB process? This is a lot
> > > of time.
> > 
> > We can add a threshold to unmap_page_range() (for direct OOM reaping threads)
> > which aborts after a given number of pages are reclaimed. There is no need to
> > reclaim all pages at once if the caller is doing memory allocations. 
> 
> Yes, there is no need to reclaim all pages. OOM is after freeing _some_
> memory after all. But that means further complications down the unmap
> path. I do not really see any reason for that.

"I do not see reason for that" cannot become a reason direct OOM reaping has to
reclaim all pages at once.

> 
> > But isn't "direct reclaim" already paying the full price (including being
> > blocked at unkillable locks inside shrinkers etc.), huh? If "direct reclaim"
> > did not exist, we would not have suffered this problem from the beginning. ;-)
> 
> No, direct reclaim is a way to throttle allocations to the reclaim
> speed. You would have to achieve the same by other means.

No. Direct reclaim is a way to lock up the system to an unusable level, by not giving
enough CPU resources to memory reclaim activities including the owner of oom_lock.

> 
> > > And one more thing. Your current design doesn't solve any of the current
> > > shortcomings. mlocked pages are still not reclaimable from the direct
> > > oom tear down. Blockable mmu notifiers still prevent the direct tear
> > > down. So the only thing that you achieve with a large and disruptive
> > > patch is that the exit vs. oom locking protocol got simplified and that
> > > you can handle oom domains from tasks belonging to them. This is not bad
> > > but it has its own downsides which you either fail to see or are reluctant to
> > > describe and explain.
> > 
> > What this patchset is trying to do is "do not sleep (except cond_resched()
> > and PREEMPT) with oom_lock held". In other words, "sleeping with oom_lock
> > held (or at least, holding oom_lock more than needed)" is one of the current
> > shortcomings.
> > 
> > Handling mlocked/shared pages and blockable mmu notifiers is a different
> > story.
> 
> Yes, and yet those are the only reason why some cases are not handled
> with the current approach which you are trying to replace completely.
> So you replace one set of corner cases with another while you do not
> really solve reapability of the above. This doesn't sound like an
> improvement to me.

"This doesn't sounds like an improvement to me." cannot become a reason we
keep [PATCH 1/8] away. Even if lockup is a corner case, it is a bug which
has to be fixed. [PATCH 1/8] is for mitigating user-triggerable lockup.

> 
> > > > But the OOM
> > > > killer path knows better than the OOM reaper path regarding which domains need
> > > > to be reclaimed faster than other domains.
> > > 
> > > How can it tell any priority?
> > 
> > I didn't catch what your question is.
> > Allowing direct OOM reaping itself reflects the priority.
> 
> I guess I've misunderstood. I thought you were talking about
> prioritization of the task ripping. I guess I see your point now. You
> are pointing to the oom_unkillable_task check before tearing down the
> victim. That would indeed help when we have too many memcg ooms
> happening in parallel. But it is something that your design has
> introduced. If you kept the async tear down you do not really care
> because you would rely on victims cleaning up after themselves in the
> vast majority of cases and async is there only to handle potential
> lockups.

The vast majority of OOM victims will tear themselves down (as long as allocating
threads give CPU resources to them). I'm fine with preserving async OOM reaping
(by the OOM reaper kernel thread). But as long as we limit the number of pages
that direct OOM reaping will reclaim at each attempt, it will not impact
allocating threads that much.
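
A minimal sketch of the bounded reap being suggested here (illustrative only;
DIRECT_REAP_BUDGET, reap_next_chunk() and victim_fully_reaped() are made-up
names, not existing kernel interfaces):

	/*
	 * Sketch only: a "direct OOM reap" bounded by a per-attempt page
	 * budget.  An allocating thread stops reaping once roughly
	 * DIRECT_REAP_BUDGET pages have been freed and goes back to retry
	 * its allocation, instead of paying the full price of tearing down
	 * a possibly huge victim.
	 */
	#define DIRECT_REAP_BUDGET	1024	/* pages per attempt (hypothetical) */

	struct victim_mm;	/* stand-in for the victim's mm */
	unsigned long reap_next_chunk(struct victim_mm *mm);	/* frees some pages, returns how many */
	int victim_fully_reaped(const struct victim_mm *mm);

	static unsigned long direct_oom_reap(struct victim_mm *mm)
	{
		unsigned long freed = 0;

		while (freed < DIRECT_REAP_BUDGET && !victim_fully_reaped(mm)) {
			freed += reap_next_chunk(mm);
			/* a real implementation would cond_resched() here */
		}
		return freed;	/* caller retries its allocation with what was freed */
	}
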

> 
> > > > > >               To handle cases where we cannot get mmap_sem to do that
> > > > > > work you simply decay oom_badness over time if there are no changes in
> > > > > > the victims oom score.
> > > > 
> > > > Yes. It is a feedback-based approach which handles not only the case where
> > > > we are unable to get mmap_sem but also the case where
> > > > free_pgtables()/remove_vma() has already started.
> > > > 
> > > > However, you are not understanding [PATCH 3/8] correctly. It IS A FEEDBACK
> > > > BASED APPROACH which corresponds to
> > > > 
> > > > 	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> > > > 		schedule_timeout_idle(HZ/10);
> > > > 
> > > > in current code (though the timeout is different from current code).
> > > 
> > > Yes the timeout is different. The current solution is based on the retry
> > > loops while you are proposing a fuzzy number-of-retries scheme. You can
> > > retry many more times if they happen often and there is _some_ change in
> > > the oom_score. I am not yet convinced that the oom_score based back off
> > > is a good idea. If you wanted to give some comparable results you could
> > > easily implement it on top of the current oom_reaper and argue with some
> > > numbers.
> > 
> > The oom_score based back off is a mitigation for giving up too early due to
> > mmap_sem held for write and/or mlocked/shared pages and/or blockable mmu
> > notifiers. Once it becomes possible to reclaim all pages, such back off will
> > be removed.
> 
> So why don't we fix those first and then try to rewrite the whole thing if
> that is not sufficient?

I'm repeating that [PATCH 1/8] needs to be applied now and backported to stable
kernels which got "mm, oom: fix concurrent munlock and oom reaper unmap, v3".

You are maliciously blocking me by demanding that I fix irrelevant things.

> 
> > > > If oom_reap_task_mm()
> > > > for one OOM domain takes e.g. one minute when there are OOM victims in multiple OOM
> > > > domains, we will needlessly block another OOM domain where allocating threads are
> > > > wasting CPU resources due to the lie mentioned above.
> > > 
> > > Not at all. This would only happen when oom victims from other domains
> > > are stuck not able to make forward progress. Is this something to
> > > optimize the behavior for?
> > 
> > Yes, I consider that this is something to optimize the behavior for.
> 
> The proper design should focus on the standard case while covering
> corner cases as much as possible. Doing it the other way around risks that
> you overcomplicate the design with hard-to-evaluate side effects.

Your "proper design" is broken because you completely ignore corner cases.
You don't regard user-triggerable DoS as a problem.

>  
> > Current code is forcing memcg OOM killer waiting at mutex_lock(&oom_lock) to pay
> > the full price to tear down an OOM victim process (which might be a multi TB process)
> > because exit_mmap()/oom_reap_task_mm() are calling __oom_reap_task_mm(mm) with
> > oom_lock held. The OOM victim which exit_mmap()/oom_reap_task_mm() is tearing
> > down is not always in the same OOM domain mutex_lock(&oom_lock) is waiting for.
> 
> You are (yet again) ignoring many details and making incorrect
> claims. For one, if somebody is reclaiming memory then those other
> allocation paths can make forward progress and as such they do not
> really have to wait for the full tear down. See the difference from the
> sync oom reaping when at least one task will have to pay the full price?

For the __alloc_pages_may_oom() path, which is currently using mutex_trylock(), you are
right except that nobody can reclaim memory due to schedule_timeout_killable(1)
with oom_lock held.

For other paths which are currently using mutex_lock(), you are ignoring what
I'm saying. A memcg-OOM event is paying the full price for tearing down an OOM
victim which is in a different memcg domain.
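
To make the point concrete, the "do not sleep with oom_lock held" idea boils
down to something like the following (a sketch only, with a made-up wrapper
name; this is not the actual patch):

	/*
	 * Sketch only, not the actual patch: any "give the victim some time"
	 * sleep happens after oom_lock is released, so that threads waiting
	 * for oom_lock are never stalled behind the sleep itself.
	 */
	static bool oom_kill_and_wait(struct oom_control *oc)
	{
		bool killed;

		mutex_lock(&oom_lock);
		killed = out_of_memory(oc);	/* select and kill a victim */
		mutex_unlock(&oom_lock);	/* drop the lock first ... */

		if (killed)
			schedule_timeout_killable(1);	/* ... then sleep without it */
		return killed;
	}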

>  
> [...]
> > > > We could make oom_lock a spinlock (or "mutex_lock(&oom_lock); preempt_disable();"
> > > > and "preempt_enable(); mutex_unlock(&oom_lock);") but there is no perfect answer
> > > > for whether we should do so.
> > > 
> > > But there is a good example, why we shouldn't do that. Crawling over the
> > > task list can take a long time. And you definitely do not want to stall
> > > even more for PREEMPT kernels.
> > 
> > I didn't catch what "do that" is referring to. Did you say
> > 
> >   But there is a good example, why we shouldn't make oom_lock a spinlock (or
> >   "mutex_lock(&oom_lock); preempt_disable();" and "preempt_enable();
> >   mutex_unlock(&oom_lock);").
> > 
> > ? Then, [PATCH 2/8] will avoid needlessly crawling over the task list and
> > [PATCH 7/8] will reduce needlessly blocking other threads waiting for oom_lock,
> > which are appropriate changes for that guideline. And please stop complaining
> > about PREEMPT kernels (i.e. your "It might help with !PREEMPT but still doesn't
> > solve any problem." in response to [PATCH 1/8]). Again, "sleeping with oom_lock
> > held (or at least, holding oom_lock more than needed)" is one of the current
> > shortcomings.
> 
> You are like a broken record. I have said that we do not have to sleep
> while holding the lock several times already. You would still have to
> crawl all tasks in case you have to select a victim and that is a no-no
> for preempt disable.

We know we have to crawl all tasks when we have to select a new OOM victim.
Since that might take a long time, [PATCH 2/8] reduces the number of tasks to
crawl when the allocating threads can wait for existing OOM victims.

But I still cannot catch what you are saying. You are saying that we should not
disable preemption while crawling all tasks in order to select a new OOM victim,
aren't you? Then, why are you complaining that "It might help with !PREEMPT but
still doesn't solve any problem." ? Are you saying that we should disable
preemption while crawling all tasks in order to select a new OOM victim?

(By the way, I think we can crawl all tasks in order to select a new OOM victim
without oom_lock held, provided that we recheck with oom_lock held whether somebody
else selected an OOM victim which the current thread can wait for. That will be part of
[PATCH 7/8].)
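
A rough sketch of that "select outside the lock, recheck under the lock" idea
(illustrative only; select_victim_unlocked(), oom_has_pending_victim() and
do_oom_kill() are invented helpers, not existing kernel functions):

	/*
	 * Sketch only: walk the task list and pick a candidate without
	 * holding oom_lock, then take the lock just long enough to recheck
	 * whether an existing victim can be waited for and, if not, to kill.
	 */
	struct task_struct *select_victim_unlocked(struct oom_control *oc);	/* invented */
	bool oom_has_pending_victim(struct oom_control *oc);			/* invented */
	void do_oom_kill(struct oom_control *oc, struct task_struct *victim);	/* invented */

	static bool try_oom_kill(struct oom_control *oc)
	{
		struct task_struct *victim;

		victim = select_victim_unlocked(oc);	/* task-list walk, no oom_lock */
		if (!victim)
			return false;

		mutex_lock(&oom_lock);
		if (oom_has_pending_victim(oc)) {
			/* somebody else's victim is still exiting; wait for it */
			mutex_unlock(&oom_lock);
			put_task_struct(victim);
			return true;
		}
		do_oom_kill(oc, victim);	/* kill while holding oom_lock */
		mutex_unlock(&oom_lock);
		return true;
	}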

>  
> > > > > > You fail to explain why is this approach more appropriate and how you
> > > > > > have settled with your current tuning with 3s timeout etc...
> > > > > > 
> > > > > > Considering how subtle this whole area is I am not overly happy about
> > > > > > another rewrite without a really strong reasoning behind. There is none
> > > > > > here, unfortunately. Well, except for statements how I reject something
> > > > > > without telling the whole story etc...
> > > > 
> > > > However, you are not understanding [PATCH 1/8] correctly. You are simply
> > > > refusing my patch instead of proving that removing the short sleep _is_ safe.
> > > > Since you don't prove it, I won't remove the short sleep until [PATCH 8/8].
> > > 
> > > No, you are misunderstanding my point. Do not fiddle with the code you
> > > are not willing to understand. The whole sleeping has been done
> > > optimistically in the oom path. Removing it is perfectly reasonable with
> > > the current code. We do have preemption points in the reclaim path. If
> > > they are not sufficient then this is another story. But no sane design
> > > relies on random sleeps in the OOM path which is probably the most cold
> > > path in the kernel.
> > 
> > I split this into [PATCH 1/8] and [PATCH 8/8] so that we can revert the [PATCH 8/8] part
> > in case it turns out that removing the short sleep has a negative effect.
> 
> And again, if you read what I wrote and tried to think about it for a
> while you would understand that reverting that would be a wrong
> approach. I am not going to repeat myself again and again. Go and try to
> reread and ask if something is not clear.

If you believe that schedule_timeout_*() in the memory allocating path is
wrong, immediately get rid of all sleeps except the one in
should_reclaim_retry(): for example, msleep(100) in shrink_inactive_list() and
congestion_wait(BLK_RW_ASYNC, HZ/50) in do_writepages(). Unless you do that
now, I won't agree with removing the short sleep for the owner of oom_lock.
Current code is not robust enough to survive without the short sleep.

Again, you are maliciously blocking me without admitting the fact that the
vast majority of OOM victims can tear themselves down only if we don't deprive
them of CPU resources. What I worry about is that applying [PATCH 8/8] will
deprive OOM victims of CPU resources.

> 
> > You are still not proving that combining [PATCH 1/8] and [PATCH 8/8] is safe.
> > Note that "reasonable" and "safe" are different things. We are prone to make
> > failures which sounded "reasonable" but were not "safe" (e.g. commit
> > 212925802454672e "mm: oom: let oom_reap_task and exit_mmap run concurrently").
> 
> We will always fail like that if the design is not clear. And random
> sleeps at random places just because we have been doing that for some
> time is not a design. It's a mess. And we have been piling that mess in
> the oom path for years to tune for very specific workloads. It's time to
> stop that finally! If you have an overloaded page allocator you simply
> do not depend on a random sleep in the oom path. Full stop. If that is
> not clear to you, we have not much to talk about.

You can try that after [PATCH 1/8] is applied.
After all, you have not explained any valid reason why we cannot apply [PATCH 1/8] now.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-06  2:40               ` Tetsuo Handa
@ 2018-07-06  2:49                 ` Linus Torvalds
  2018-07-07  1:12                   ` Tetsuo Handa
  2018-07-06  5:56                 ` Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2018-07-06  2:49 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Michal Hocko, linux-mm, Andrew Morton, David Rientjes

On Thu, Jul 5, 2018 at 7:40 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> >
> > No, direct reclaim is a way to throttle allocations to the reclaim
> > speed. You would have to achieve the same by other means.
>
> No. Direct reclaim is a way to lock up the system to an unusable level, by not giving
> enough CPU resources to memory reclaim activities including the owner of oom_lock.

No. Really.

Direct reclaim really really does what Michal claims. Yes, it has
other effects too, and it can be problematic, but direct reclaim is
important.

People have tried to remove it many times, but it's always been a
disaster. You need to synchronize with _something_ to make sure that
the thread that is causing a lot of allocations actually pays the
price, and slows down.

You want to have a balance between direct and indirect reclaim.
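
In outline, that throttling works because the allocator slow path makes the
allocating thread do reclaim work itself before it may retry (a much-simplified
sketch with invented helper names, not the real page allocator internals):

	/*
	 * Much-simplified sketch (not the real allocator): the thread that
	 * wants memory pays for (part of) the reclaim before it may retry,
	 * which is what throttles a heavy allocator to the reclaim speed.
	 */
	struct page *try_get_page_sketch(gfp_t gfp_mask, unsigned int order);	/* invented */
	bool direct_reclaim_sketch(gfp_t gfp_mask, unsigned int order);		/* invented */

	static struct page *alloc_slowpath_sketch(gfp_t gfp_mask, unsigned int order)
	{
		struct page *page;

		for (;;) {
			page = try_get_page_sketch(gfp_mask, order);
			if (page)
				return page;

			/* direct reclaim: the allocating thread does the work itself */
			if (!direct_reclaim_sketch(gfp_mask, order))
				return NULL;	/* no progress; OOM handling would follow */
		}
	}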

If you think direct reclaim is only a way to lock up the system to
unusable levels, you should stop doing VM development.

                   Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-06  2:40               ` Tetsuo Handa
  2018-07-06  2:49                 ` Linus Torvalds
@ 2018-07-06  5:56                 ` Michal Hocko
  2018-07-10  3:57                   ` Tetsuo Handa
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2018-07-06  5:56 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, torvalds, rientjes

On Fri 06-07-18 11:40:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 05-07-18 12:05:09, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > > > You are removing the oom_reaper and moving the oom victim tear down to
> > > > > > > the oom path.
> > > > > 
> > > > > Yes. This is for getting rid of the lie
> > > > > 
> > > > > 	/*
> > > > > 	 * Acquire the oom lock.  If that fails, somebody else is
> > > > > 	 * making progress for us.
> > > > > 	 */
> > > > > 	if (!mutex_trylock(&oom_lock)) {
> > > > > 		*did_some_progress = 1;
> > > > > 		schedule_timeout_uninterruptible(1);
> > > > > 		return NULL;
> > > > > 	}
> > > > > 
> > > > > which is leading to CVE-2016-10723. By reclaiming from the OOM killer path,
> > > > > we can eliminate this heuristic.
> > > > > 
> > > > > Of course, we don't have to remove the OOM reaper kernel thread.
> > > > 
> > > > The thing is that the current design uses the oom_reaper only as a
> > > > backup to get situation unstuck. Once you move all that heavy lifting
> > > > into the oom path directly then you will have to handle all sorts of
> > > > issues. E.g. how do you handle that a random process hitting OOM path
> > > > has to pay the full price to tear down multi TB process? This is a lot
> > > > of time.
> > > 
> > > We can add a threshold to unmap_page_range() (for direct OOM reaping threads)
> > > which aborts after a given number of pages are reclaimed. There is no need to
> > > reclaim all pages at once if the caller is doing memory allocations. 
> > 
> > Yes, there is no need to reclaim all pages. OOM is after freeing _some_
> > memory after all. But that means further complications down the unmap
> > path. I do not really see any reason for that.
> 
> "I do not see reason for that" cannot become a reason direct OOM reaping has to
> reclaim all pages at once.

We are not going to pollute deep mm guts for unlikely events like oom.

[...]
> > Yes, and yet those are the only reason why some cases are not handled
> > with the current approach which you are trying to replace completely.
> > So you replace one set of corner cases with another while you do not
> > really solve reapability of the above. This doesn't sound like an
> > improvement to me.
> 
> "This doesn't sounds like an improvement to me." cannot become a reason we
> keep [PATCH 1/8] away. Even if lockup is a corner case, it is a bug which
> has to be fixed. [PATCH 1/8] is for mitigating user-triggerable lockup.

It seems that any reasonable discussion with you is impossible. If you
are going to insist then you will not move any further with that patch.
CVE or not. I do not really care because that CVE is dubious at best.
There is a really simple way out of this situation. Just drop the sleep
from the oom path and be done with it. If you are afraid of
regression and do not want to have your name on the patch then fine. I
will post the patch myself and also handle any fallouts.
 
[...]
> > The proper design should focus on the standard case while covering
> > corner cases as much as possible. Doing it the other way around risks that
> > you overcomplicate the design with hard-to-evaluate side effects.
> 
> Your "proper design" is broken because you completely ignore corner cases.

I am very much interested in corner cases and you haven't given any
relevant argument that would show the current approach is broken. It
doesn't handle a certain class of mappings, alright. Mlocked memory is
limited to a small value by default and you really have to trust your
userspace to raise a limit for it. Blockable notifiers are a real pain
as well but there is a patch to reduce the impact posted already [1].
I plan to work on mlock as well. Neither of the two has been handled by
your new rewrite.

> You don't regard user-triggerable DoS as a problem.

I do care, of course. I am just saying that you are exaggerating the whole
thing. Once you trigger a workload where some set of tasks can easily make
other tasks not runnable for a long time just because one of them slept for a
while, then you are screwed anyway. This can be
any other sleep holding any other lock in the kernel.

Yes, it sucks that this is possible and you really have to be careful
when running untrusted code on your system.

[1] http://lkml.kernel.org/r/20180627074421.GF32348@dhcp22.suse.cz
 
> > > Current code is forcing memcg OOM killer waiting at mutex_lock(&oom_lock) to pay
> > > the full price to tear down an OOM victim process (which might be a multi TB process)
> > > because exit_mmap()/oom_reap_task_mm() are calling __oom_reap_task_mm(mm) with
> > > oom_lock held. The OOM victim which exit_mmap()/oom_reap_task_mm() is tearing
> > > down is not always in the same OOM domain mutex_lock(&oom_lock) is waiting for.
> > 
> > You are (yet again) ignoring many details and making incorrect
> > claims. For one, if somebody is reclaiming memory then those other
> > allocation paths can make forward progress and as such they do not
> > really have to wait for the full tear down. See the difference from the
> > sync oom reaping when at least one task will have to pay the full price?
> 
> For the __alloc_pages_may_oom() path, which is currently using mutex_trylock(), you are
> right except that nobody can reclaim memory due to schedule_timeout_killable(1)
> with oom_lock held.
> 
> For other paths which are currently using mutex_lock(), you are ignoring what
> I'm saying. A memcg-OOM event is paying the full price for tearing down an OOM
> victim which is in a different memcg domain.

Yes, this is possible. And do we have any real life example where that
is a noticeable problem?

Skipping the rest because it is repeating already discussed and
explained points. I am tired of repeating the same things over and over for
you just to ignore them.
[...]

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-06  2:49                 ` Linus Torvalds
@ 2018-07-07  1:12                   ` Tetsuo Handa
  2018-07-09  7:45                     ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-07  1:12 UTC (permalink / raw)
  To: Linus Torvalds, Michal Hocko; +Cc: linux-mm, Andrew Morton, David Rientjes

On 2018/07/06 11:49, Linus Torvalds wrote:
> On Thu, Jul 5, 2018 at 7:40 PM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>>>
>>> No, direct reclaim is a way to throttle allocations to the reclaim
>>> speed. You would have to achieve the same by other means.
>>
>> No. Direct reclaim is a way to lock up the system to an unusable level, by not giving
>> enough CPU resources to memory reclaim activities including the owner of oom_lock.
> 
> No. Really.
> 
> Direct reclaim really really does what Michal claims. Yes, it has
> other effects too, and it can be problematic, but direct reclaim is
> important.

I'm saying that even an unprivileged user can drive the reclaim speed down to
"0 pages per minute".

[PATCH 1/8] is for reducing the possibility of hitting "0 pages per minute"
whereas [PATCH 8/8] is for increasing that possibility. They are contradicting
directions which should not be combined in one patch.

> 
> People have tried to remove it many times, but it's always been a
> disaster. You need to synchronize with _something_ to make sure that
> the thread that is causing a lot of allocations actually pays the
> price, and slows down.

[PATCH 3/8] is for reducing the possibility of hitting "0 pages per minute"
by making sure that allocating threads pay some of the price of reclaiming
memory by doing direct OOM reaping.

> 
> You want to have a balance between direct and indirect reclaim.

While [PATCH 3/8] currently pays the full price, we can improve it to pay
only part of the price by changing direct OOM reaping to use a threshold.

That is the direction towards a balance between "direct reclaim (direct OOM
reaping by allocating threads)" and "indirect reclaim (by the OOM reaper kernel
thread and exit_mmap())".

> 
> If you think direct reclaim is only a way to lock up the system to
> unusable levels, you should stop doing VM development.

Locking up the system to unusable levels due to things outside of the page
allocator is a different bug.

The page allocator chokes itself through lack of "direct reclaim" when we are
waiting for the owner of oom_lock to make progress. This series is for getting
rid of the lie

	/*
	 * Acquire the oom lock.  If that fails, somebody else is
	 * making progress for us.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

caused by lack of "direct reclaim".

> 
>                    Linus
> 

On 2018/07/06 14:56, Michal Hocko wrote:
>>> Yes, there is no need to reclaim all pages. OOM is after freeing _some_
>>> memory after all. But that means further complications down the unmap
>>> path. I do not really see any reason for that.
>>
>> "I do not see reason for that" cannot become a reason direct OOM reaping has to
>> reclaim all pages at once.
> 
> We are not going to pollute deep mm guts for unlikely events like oom.

And since Michal is refusing to make changes for balancing "direct reclaim by
threads waiting for oom_lock" against "indirect reclaim by a thread holding
oom_lock", we will keep increasing the possibility of hitting
"0 pages per minute". Therefore,

> If you are afraid of
> regression and do not want to have your name on the patch then fine. I
> will post the patch myself and also handle any fallouts.

PLEASE PLEASE PLEASE DO SO IMMEDIATELY!!!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-07  1:12                   ` Tetsuo Handa
@ 2018-07-09  7:45                     ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2018-07-09  7:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Linus Torvalds, linux-mm, Andrew Morton, David Rientjes

On Sat 07-07-18 10:12:57, Tetsuo Handa wrote:
[...]
> On 2018/07/06 14:56, Michal Hocko wrote:
> >>> Yes, there is no need to reclaim all pages. OOM is after freeing _some_
> >>> memory after all. But that means further complications down the unmap
> >>> path. I do not really see any reason for that.
> >>
> >> "I do not see reason for that" cannot become a reason direct OOM reaping has to
> >> reclaim all pages at once.
> > 
> > We are not going to pollute deep mm guts for unlikely events like oom.
> 
> And since Michal is refusing to make changes for balancing "direct reclaim by
> threads waiting for oom_lock" against "indirect reclaim by a thread holding
> oom_lock", we will keep increasing the possibility of hitting
> "0 pages per minute". Therefore,
> 
> > If you are afraid of
> > regression and do not want to have your name on the patch then fine. I
> > will post the patch myself and also handle any fallouts.
> 
> PLEASE PLEASE PLEASE DO SO IMMEDIATELY!!!

Curiously enough, I've done so back in May [1], only to hear some really
weird arguments (e.g. that I am not solving an unrelated problem in the memcg
oom killer, etc.) and to see other changes of yours that were (again)
intermixing different things together. So then I ended up with [2].

I will resubmit that patch. But please note how your insisting has only
delayed the whole thing. If I were you, I would really think twice before
accusing someone of malicious intentions or blocking good changes from
being merged.

[1] http://lkml.kernel.org/r/20180528124313.GC27180@dhcp22.suse.cz
[2] http://lkml.kernel.org/r/20180601080443.GX15278@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem.
  2018-07-06  5:56                 ` Michal Hocko
@ 2018-07-10  3:57                   ` Tetsuo Handa
  0 siblings, 0 replies; 25+ messages in thread
From: Tetsuo Handa @ 2018-07-10  3:57 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, akpm, torvalds, rientjes

> On Fri 06-07-18 11:40:07, Tetsuo Handa wrote:
> > > > > > Of course, we don't have to remove the OOM reaper kernel thread.
> > > > > 
> > > > > The thing is that the current design uses the oom_reaper only as a
> > > > > backup to get situation unstuck. Once you move all that heavy lifting
> > > > > into the oom path directly then you will have to handle all sorts of
> > > > > issues. E.g. how do you handle that a random process hitting OOM path
> > > > > has to pay the full price to tear down multi TB process? This is a lot
> > > > > of time.
> > > > 
> > > > We can add a threshold to unmap_page_range() (for direct OOM reaping threads)
> > > > which aborts after a given number of pages are reclaimed. There is no need to
> > > > reclaim all pages at once if the caller is doing memory allocations. 
> > > 
> > > Yes, there is no need to reclaim all pages. OOM is after freeing _some_
> > > memory after all. But that means further complications down the unmap
> > > path. I do not really see any reason for that.
> > 
> > "I do not see reason for that" cannot become a reason direct OOM reaping has to
> > reclaim all pages at once.
> 
> We are not going to pollute deep mm guts for unlikely events like oom.
> 
As far as I have tested, the approach below does not pollute deep mm guts.
It should achieve what David wants to do without introducing a user-visible
tunable interface.

David, can you try these patches?

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-07-10  3:58 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-03 14:25 [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Tetsuo Handa
2018-07-03 14:25 ` [PATCH 1/8] mm,oom: Don't call schedule_timeout_killable() with oom_lock held Tetsuo Handa
2018-07-03 14:38   ` Michal Hocko
2018-07-03 14:25 ` [PATCH 2/8] mm,oom: Check pending victims earlier in out_of_memory() Tetsuo Handa
2018-07-03 14:25 ` [PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes Tetsuo Handa
2018-07-03 14:58   ` Michal Hocko
2018-07-03 14:25 ` [PATCH 4/8] mm,page_alloc: Make oom_reserves_allowed() even Tetsuo Handa
2018-07-03 14:25 ` [PATCH 5/8] mm,oom: Bring OOM notifier to outside of oom_lock Tetsuo Handa
2018-07-03 14:59   ` Michal Hocko
2018-07-03 14:25 ` [PATCH 6/8] mm,oom: Make oom_lock static variable Tetsuo Handa
2018-07-03 14:25 ` [PATCH 7/8] mm,oom: Do not sleep with oom_lock held Tetsuo Handa
2018-07-03 14:25 ` [PATCH 8/8] mm,page_alloc: Move the short sleep to should_reclaim_retry() Tetsuo Handa
2018-07-03 15:12 ` [PATCH 0/8] OOM killer/reaper changes for avoiding OOM lockup problem Michal Hocko
2018-07-03 15:29   ` Michal Hocko
2018-07-04  2:22     ` penguin-kernel
2018-07-04  7:16       ` Michal Hocko
2018-07-04  7:22         ` Michal Hocko
2018-07-05  3:05           ` Tetsuo Handa
2018-07-05  7:24             ` Michal Hocko
2018-07-06  2:40               ` Tetsuo Handa
2018-07-06  2:49                 ` Linus Torvalds
2018-07-07  1:12                   ` Tetsuo Handa
2018-07-09  7:45                     ` Michal Hocko
2018-07-06  5:56                 ` Michal Hocko
2018-07-10  3:57                   ` Tetsuo Handa
