linux-kernel.vger.kernel.org archive mirror
* [patch v3] mm, oom: fix unnecessary killing of additional processes
@ 2018-06-21 21:35 David Rientjes
  2018-07-04  1:43 ` David Rientjes
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: David Rientjes @ 2018-06-21 21:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kbuild test robot, Michal Hocko, Tetsuo Handa, linux-kernel, linux-mm

The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm.  This can happen for a variety of reasons,
including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper
   to stall indefinitely,

but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any substantial amount of memory:

 - when the mm's memory is mostly mlocked.

When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily.  This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.

This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap() and is separate from
the oom reaper setting MMF_OOM_SKIP prematurely.

We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.

To fix this, this patch introduces a per-mm reaping period, which is
configurable through the new oom_free_timeout_ms file in debugfs and
defaults to one second to match the current heuristics.  This support
requires that the oom reaper's list becomes a proper linked list so that
other mm's may be reaped while waiting for an mm's timeout to expire.

This replaces the current timeouts in the oom reaper: (1) when trying to
grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
a HZ sleep if there are blockable mmu notifiers.  It extends them with a
timeout that allows an oom victim to reach exit_mmap() before choosing
additional processes unnecessarily.

The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and will rely on
MMF_UNSTABLE to determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.  Since the
default oom_free_timeout_ms is one second, the same as the current
heuristics, there should be no functional change with this patch for users
who do not tune it to be longer, other than that MMF_OOM_SKIP is now set
by exit_mmap() after free_pgtables(), which is the preferred behavior.

The reaping timeout can intentionally be set for a substantial amount of
time, such as 10s, since oom livelock is a very rare occurrence and it's
better to optimize for preventing additional (unnecessary) oom killing
than for a scenario that is far less likely.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v3:
  - oom_free_timeout_ms is now static per kbuild test robot

 v2:
  - configurable timeout period through debugfs
  - change mm->reap_timeout to mm->oom_free_expire and add more
    descriptive comment per akpm
  - add comment to describe task->oom_reap_list locking based on
    oom_reaper_lock per akpm
  - rework the exit_mmap() comment and split into two parts to be more
    descriptive about the locking and the issue with the oom reaper
    racing with munlock_vma_pages_all() per akpm
---
 include/linux/mm_types.h |   7 ++
 include/linux/sched.h    |   3 +-
 kernel/fork.c            |   3 +
 mm/mmap.c                |  26 +++++---
 mm/oom_kill.c            | 140 +++++++++++++++++++++++++--------------
 5 files changed, 119 insertions(+), 60 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -449,6 +449,13 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+	/*
+	 * When to give up on memory freeing from this mm after its
+	 * threads have been oom killed, in jiffies.
+	 */
+	unsigned long oom_free_expire;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1163,7 +1163,8 @@ struct task_struct {
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
+	/* OOM victim queue for oom reaper, protected by oom_reaper_lock */
+	struct list_head		oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -842,6 +842,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_FAULT_INJECTION
 	tsk->fail_nth = 0;
 #endif
+#ifdef CONFIG_MMU
+	INIT_LIST_HEAD(&tsk->oom_reap_list);
+#endif
 
 	return tsk;
 
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,25 +3059,28 @@ void exit_mmap(struct mm_struct *mm)
 	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
 		 */
 		mutex_lock(&oom_lock);
 		__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		/*
+		 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
+		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * which clears VM_LOCKED, otherwise the oom reaper cannot
+		 * reliably test for it.  If the oom reaper races with
+		 * munlock_vma_pages_all(), this can result in a kernel oops if
+		 * a pmd is zapped, for example, after follow_page_mask() has
+		 * checked pmd_none().
+		 *
+		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+		 * guarantee that the oom reaper will not run on this mm again
+		 * after mmap_sem is dropped.
+		 */
+		set_bit(MMF_UNSTABLE, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
@@ -3105,6 +3108,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -476,7 +477,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 void __oom_reap_task_mm(struct mm_struct *mm)
@@ -519,10 +520,8 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 	}
 }
 
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	bool ret = true;
-
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
@@ -540,9 +539,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_oom;
 	}
 
 	/*
@@ -551,69 +549,81 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * TODO: we really want to get rid of this ugly hack and make sure that
 	 * notifiers cannot block for unbounded amount of time
 	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
+	if (mm_has_blockable_invalidate_notifiers(mm))
+		goto out_mm;
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
+	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_mm;
 	}
 
 	trace_start_task_reaping(tsk->pid);
-
 	__oom_reap_task_mm(mm);
+	trace_finish_task_reaping(tsk->pid);
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_mm:
 	up_read(&mm->mmap_sem);
-
-	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
+out_oom:
 	mutex_unlock(&oom_lock);
-	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
-	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
+	/*
+	 * If this mm has either been fully unmapped, or the oom reaper has
+	 * given up on it, nothing left to do except drop the refcount.
+	 */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
+	/*
+	 * If this mm has already been reaped, doing so again will not likely
+	 * free additional memory.
+	 */
+	if (!test_bit(MMF_UNSTABLE, &mm->flags))
+		oom_reap_task_mm(tsk, mm);
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
+	if (time_after_eq(jiffies, mm->oom_free_expire)) {
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			debug_show_all_locks();
 
-done:
-	tsk->oom_reaper_list = NULL;
+			/*
+			 * Reaping has failed for the timeout period, so give up
+			 * and allow additional processes to be oom killed.
+			 */
+			set_bit(MMF_OOM_SKIP, &mm->flags);
+		}
+		goto drop;
+	}
 
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
+
+	/* Enqueue to be reaped again */
+	spin_lock(&oom_reaper_lock);
+	list_add_tail(&tsk->oom_reap_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
 
-	/* Drop a reference taken by wake_oom_reaper */
+	schedule_timeout_idle(HZ/10);
+	return;
+
+drop:
+	/* Drop the reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -622,11 +632,13 @@ static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_entry(oom_reaper_list.next,
+					 struct task_struct, oom_reap_list);
+			list_del(&tsk->oom_reap_list);
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -637,25 +649,57 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
+/*
+ * Millisecs to wait for an oom mm to free memory before selecting another
+ * victim.
+ */
+static u64 oom_free_timeout_ms = 1000;
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/*
+	 * Set the reap timeout; if it's already set, the mm is enqueued and
+	 * this tsk can be ignored.
+	 */
+	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
+			jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
 		return;
 
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add(&tsk->oom_reap_list, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
 }
 
+#ifdef CONFIG_DEBUG_FS
+static int oom_free_timeout_ms_read(void *data, u64 *val)
+{
+	*val = oom_free_timeout_ms;
+	return 0;
+}
+
+static int oom_free_timeout_ms_write(void *data, u64 val)
+{
+	if (val > 60 * 1000)
+		return -EINVAL;
+
+	oom_free_timeout_ms = val;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(oom_free_timeout_ms_fops, oom_free_timeout_ms_read,
+			oom_free_timeout_ms_write, "%llu\n");
+#endif /* CONFIG_DEBUG_FS */
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+#ifdef CONFIG_DEBUG_FS
+	if (!IS_ERR(oom_reaper_th))
+		debugfs_create_file("oom_free_timeout_ms", 0200, NULL, NULL,
+				    &oom_free_timeout_ms_fops);
+#endif
 	return 0;
 }
 subsys_initcall(oom_init)


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-06-21 21:35 [patch v3] mm, oom: fix unnecessary killing of additional processes David Rientjes
@ 2018-07-04  1:43 ` David Rientjes
  2018-07-04  2:26   ` penguin-kernel
  2018-07-05 23:46 ` Andrew Morton
  2018-07-17 21:09 ` Tetsuo Handa
  2 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-04  1:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, Tetsuo Handa, linux-kernel, linux-mm

Ping?

This can easily be removed if it becomes obsolete, i.e. if the oom reaper 
becomes always able to free memory to the extent of exit_mmap().  I argue 
that it cannot, because it cannot do free_pgtables() for large amounts of 
virtual memory, but am fine with being proved wrong.

In the meantime, however, this patch should introduce no significant 
change in functionality, and the only interface it adds is in debugfs, 
which can easily be removed if it is obsoleted.

The work to make the oom reaper more effective or reliable can still 
continue with this patch.



On Thu, 21 Jun 2018, David Rientjes wrote:

> (...full v3 patch quoted verbatim; identical to the original posting above, snipped...)


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-04  1:43 ` David Rientjes
@ 2018-07-04  2:26   ` penguin-kernel
  0 siblings, 0 replies; 28+ messages in thread
From: penguin-kernel @ 2018-07-04  2:26 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

David Rientjes wrote:
> Ping?
> 
> This can easily be removed if it becomes obsolete, i.e. if the oom reaper 
> becomes always able to free memory to the extent of exit_mmap().  I argue 
> that it cannot, because it cannot do free_pgtables() for large amounts of 
> virtual memory, but am fine with being proved wrong.

This is "[PATCH 3/8] mm,oom: Fix unnecessary killing of additional processes." in my series.

> 
> In the meantime, however, this patch should introduce no significant 
> change in functionality, and the only interface it adds is in debugfs, 
> which can easily be removed if it is obsoleted.
> 
> The work to make the oom reaper more effective or reliable can still 
> continue with this patch.
> 


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-06-21 21:35 [patch v3] mm, oom: fix unnecessary killing of additional processes David Rientjes
  2018-07-04  1:43 ` David Rientjes
@ 2018-07-05 23:46 ` Andrew Morton
  2018-07-06  5:39   ` Michal Hocko
  2018-07-07  0:05   ` David Rientjes
  2018-07-17 21:09 ` Tetsuo Handa
  2 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2018-07-05 23:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: kbuild test robot, Michal Hocko, Tetsuo Handa, linux-kernel, linux-mm

On Thu, 21 Jun 2018 14:35:20 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm.  This can happen for a variety of reasons,
> including:
> 
>  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> 
>  - when the mm has blockable mmu notifiers that could cause the oom reaper
>    to stall indefinitely,
> 
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any substantial amount of memory:
> 
>  - when the mm's memory is mostly mlocked.

Michal has been talking about making the oom-reaper handle mlocked
memory.  Where are we at with that?

> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily.  This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
> 
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap() and is separate from
> the oom reaper setting MMF_OOM_SKIP prematurely.
> 
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
> 
> To fix this, this patch introduces a per-mm reaping period, which is
> configurable through the new oom_free_timeout_ms file in debugfs and
> defaults to one second to match the current heuristics.  This support
> requires that the oom reaper's list becomes a proper linked list so that
> other mm's may be reaped while waiting for an mm's timeout to expire.
> 
> This replaces the current timeouts in the oom reaper: (1) when trying to
> grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> a HZ sleep if there are blockable mmu notifiers.  It extends them with a
> timeout that allows an oom victim to reach exit_mmap() before choosing
> additional processes unnecessarily.
> 
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified, and will rely on
> MMF_UNSTABLE to determine when it can race with the oom reaper.
> 
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.  Since the
> default oom_free_timeout_ms is one second, the same as the current
> heuristics, there should be no functional change with this patch for users
> who do not tune it to be longer, other than that MMF_OOM_SKIP is now set
> by exit_mmap() after free_pgtables(), which is the preferred behavior.
> 
> The reaping timeout can intentionally be set for a substantial amount of
> time, such as 10s, since oom livelock is a very rare occurrence and it's
> better to optimize for preventing additional (unnecessary) oom killing
> than for a scenario that is far less likely.
> 
> ..
>
> +#ifdef CONFIG_DEBUG_FS
> +static int oom_free_timeout_ms_read(void *data, u64 *val)
> +{
> +	*val = oom_free_timeout_ms;
> +	return 0;
> +}
> +
> +static int oom_free_timeout_ms_write(void *data, u64 val)
> +{
> +	if (val > 60 * 1000)
> +		return -EINVAL;
> +
> +	oom_free_timeout_ms = val;
> +	return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(oom_free_timeout_ms_fops, oom_free_timeout_ms_read,
> +			oom_free_timeout_ms_write, "%llu\n");
> +#endif /* CONFIG_DEBUG_FS */

One of the several things I dislike about debugfs is that nobody
bothers documenting it anywhere.  But this should really be documented.
I'm not sure where, but the documentation will find itself alongside a
bunch of procfs things, which prompts the question "why is *this* one in
debugfs?"

>  static int __init oom_init(void)
>  {
>  	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> +#ifdef CONFIG_DEBUG_FS
> +	if (!IS_ERR(oom_reaper_th))
> +		debugfs_create_file("oom_free_timeout_ms", 0200, NULL, NULL,
> +				    &oom_free_timeout_ms_fops);
> +#endif
>  	return 0;
>  }
>  subsys_initcall(oom_init)


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-05 23:46 ` Andrew Morton
@ 2018-07-06  5:39   ` Michal Hocko
  2018-07-07  0:05   ` David Rientjes
  1 sibling, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2018-07-06  5:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, kbuild test robot, Tetsuo Handa, linux-kernel, linux-mm

On Thu 05-07-18 16:46:21, Andrew Morton wrote:
> On Thu, 21 Jun 2018 14:35:20 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm.  This can happen for a variety of reasons,
> > including:
> > 
> >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > 
> >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> >    to stall indefinitely,
> > 
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any substantial amount of memory:
> > 
> >  - when the mm's memory is mostly mlocked.
> 
> Michal has been talking about making the oom-reaper handle mlocked
> memory.  Where are we at with that?

I didn't get to mlocked memory yet because blockable mmu notifiers are
more important. And I've already posted a patch for that and it is under
discussion [1]. Mlocked memory is next. 
 
[1] http://lkml.kernel.org/r/20180627074421.GF32348@dhcp22.suse.cz

Btw. I still hate this patch and making any timeout user definable. It
is a wrong approach and my nack to this patch still applies.

-- 
Michal Hocko
SUSE Labs


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-05 23:46 ` Andrew Morton
  2018-07-06  5:39   ` Michal Hocko
@ 2018-07-07  0:05   ` David Rientjes
  2018-07-09 12:35     ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-07  0:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kbuild test robot, Michal Hocko, Tetsuo Handa, linux-kernel, linux-mm

On Thu, 5 Jul 2018, Andrew Morton wrote:

> > +#ifdef CONFIG_DEBUG_FS
> > +static int oom_free_timeout_ms_read(void *data, u64 *val)
> > +{
> > +	*val = oom_free_timeout_ms;
> > +	return 0;
> > +}
> > +
> > +static int oom_free_timeout_ms_write(void *data, u64 val)
> > +{
> > +	if (val > 60 * 1000)
> > +		return -EINVAL;
> > +
> > +	oom_free_timeout_ms = val;
> > +	return 0;
> > +}
> > +DEFINE_SIMPLE_ATTRIBUTE(oom_free_timeout_ms_fops, oom_free_timeout_ms_read,
> > +			oom_free_timeout_ms_write, "%llu\n");
> > +#endif /* CONFIG_DEBUG_FS */
> 
> One of the several things I dislike about debugfs is that nobody
> bothers documenting it anywhere.  But this should really be documented.
> I'm not sure where, but the documentation will find itself alongside a
> bunch of procfs things, which prompts the question "why is *this* one in
> debugfs?"
> 

The only reason I have placed it in debugfs, or made it tunable at all, 
is to appease others.  I know the non-default value we need to use to stop 
millions of processes being oom killed unnecessarily.  Michal suggested a 
tunable to disable the oom reaper entirely, which is not what we want, so 
I found this to be the best alternative.

I'd like to say that it is purposefully undocumented since it's not a 
sysctl and nobody can suggest that it is becoming a permanent API that we 
must maintain for backwards compatibility.  Having it be configurable is 
kind of ridiculous, but such is the nature of trying to get patches merged 
these days to prevent millions of processes being oom killed 
unnecessarily.

Blockable mmu notifiers and mlocked memory is not the extent of the 
problem, if a process has a lot of virtual memory we must wait until 
free_pgtables() completes in exit_mmap() to prevent unnecessary oom 
killing.  For implementations such as tcmalloc, which does not release 
virtual memory, this is important because, well, it releases this only at 
exit_mmap().  Of course we cannot do that with only the protection of 
mm->mmap_sem for read.
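
(As a rough illustration of the scale involved, assuming x86_64 with 4KB 
pages and 8-byte pte entries: a process with 1TB of populated virtual 
address space holds on the order of 1TB / 4KB * 8 bytes = 2GB in pte 
pages alone, and that memory is only returned by free_pgtables().)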

This is a patch that we'll always need if we continue with the current 
implementation of the oom reaper.  I wouldn't suggest it as a configurable 
value, but, oh well.

I'll document the tunable and purposefully repeat myself that this 
addresses millions of processes being oom killed unnecessarily so the 
rather important motivation of the change is clear to anyone who reads 
this thread now or in the future.  Nobody can guess an appropriate value 
until they have been hit by the issue themselves and need to deal with the 
loss of work from important processes being oom killed when some best 
effort logging cron job uses too much memory.  Or, of course, pissed off 
users who have their jobs killed off and you find yourself in the rather 
unfortunate situation of explaining why the Linux kernel in 2018 needs to 
immediately SIGKILL processes because of an arbitrary nack related to a 
timestamp.
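
(For reference, since the patch passes a NULL parent to 
debugfs_create_file(), the file lands at the debugfs root; so, assuming 
debugfs is mounted at the usual location, usage would presumably look like 
"echo 10000 > /sys/kernel/debug/oom_free_timeout_ms" to raise the period 
to 10s.)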

Thanks.


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-07  0:05   ` David Rientjes
@ 2018-07-09 12:35     ` Michal Hocko
  2018-07-09 20:30       ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2018-07-09 12:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, kbuild test robot, Tetsuo Handa, linux-kernel, linux-mm

On Fri 06-07-18 17:05:39, David Rientjes wrote:
[...]
> Blockable mmu notifiers and mlocked memory is not the extent of the 
> problem, if a process has a lot of virtual memory we must wait until 
> free_pgtables() completes in exit_mmap() to prevent unnecessary oom 
> killing.  For implementations such as tcmalloc, which does not release 
> virtual memory, this is important because, well, it releases this only at 
> exit_mmap().  Of course we cannot do that with only the protection of 
> mm->mmap_sem for read.

And how exactly does a timeout help to prevent "unnecessary killing" in
that case?
-- 
Michal Hocko
SUSE Labs


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-09 12:35     ` Michal Hocko
@ 2018-07-09 20:30       ` David Rientjes
  2018-07-10 11:01         ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-09 20:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, kbuild test robot, Tetsuo Handa, linux-kernel, linux-mm

On Mon, 9 Jul 2018, Michal Hocko wrote:

> > Blockable mmu notifiers and mlocked memory is not the extent of the 
> > problem, if a process has a lot of virtual memory we must wait until 
> > free_pgtables() completes in exit_mmap() to prevent unnecessary oom 
> > killing.  For implementations such as tcmalloc, which does not release 
> > virtual memory, this is important because, well, it releases this only at 
> > exit_mmap().  Of course we cannot do that with only the protection of 
> > mm->mmap_sem for read.
> 
> And how exactly does a timeout help to prevent "unnecessary killing" in
> that case?

As my patch does, it becomes mandatory to move MMF_OOM_SKIP to after 
free_pgtables() in exit_mmap() and then repurpose MMF_UNSTABLE to 
indicate that the oom reaper should not operate on a given mm.  In the 
event we cannot reach MMF_OOM_SKIP, we need to ensure forward progress and 
that is possible with a timeout period in the very rare instance where 
additional memory freeing is needed, and without unnecessary oom killing 
when it is not needed.


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-09 20:30       ` David Rientjes
@ 2018-07-10 11:01         ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2018-07-10 11:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, kbuild test robot, Tetsuo Handa, linux-kernel, linux-mm

On Mon 09-07-18 13:30:10, David Rientjes wrote:
> On Mon, 9 Jul 2018, Michal Hocko wrote:
> 
> > > Blockable mmu notifiers and mlocked memory is not the extent of the 
> > > problem, if a process has a lot of virtual memory we must wait until 
> > > free_pgtables() completes in exit_mmap() to prevent unnecessary oom 
> > > killing.  For implementations such as tcmalloc, which does not release 
> > > virtual memory, this is important because, well, it releases this only at 
> > > exit_mmap().  Of course we cannot do that with only the protection of 
> > > mm->mmap_sem for read.
> > 
> > And how exactly does a timeout help to prevent "unnecessary killing" in
> > that case?
> 
> As my patch does, it becomes mandatory to move MMF_OOM_SKIP to after 
> free_pgtables() in exit_mmap() and then repurpose MMF_UNSTABLE to 
> indicate that the oom reaper should not operate on a given mm.  In the 
> event we cannot reach MMF_OOM_SKIP, we need to ensure forward progress and 
> that is possible with a timeout period in the very rare instance where 
> additional memory freeing is needed, and without unnecessary oom killing 
> when it is not needed.

But such a timeout doesn't really know how long to wait so it is more
of a hack than anything else. The only reason why we set MMF_OOM_SKIP so
early in the exit path now is the inability to reap mlocked memory. That
is something fundamentally solvable. In fact we can really postpone
MMF_OOM_SKIP to after free_pgtables. It would require extending the
current handover between the oom reaper and the exit path but it is
doable AFAICS. Only the exit path can call free_pgtables but the oom
reaper doesn't have to set MMF_OOM_SKIP if it _knows_ that exit_mmap
is already past any point of blocking.
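
A minimal sketch of what such a handover could look like (MMF_EXIT_NO_BLOCK 
is a hypothetical flag name used purely for illustration, not an existing 
kernel flag or a posted patch):

	/* in exit_mmap(), after __oom_reap_task_mm() has returned: */
	/* nothing past this point can block the oom reaper indefinitely */
	set_bit(MMF_EXIT_NO_BLOCK, &mm->flags);
	...
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	set_bit(MMF_OOM_SKIP, &mm->flags);

	/* in the oom reaper, when it would otherwise give up: */
	/*
	 * If exit_mmap() is already past any point of blocking, leave
	 * MMF_OOM_SKIP for it to set after free_pgtables(); otherwise
	 * set it here to guarantee forward progress.
	 */
	if (!test_bit(MMF_EXIT_NO_BLOCK, &mm->flags))
		set_bit(MMF_OOM_SKIP, &mm->flags);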

Btw, I am quite surprised you are now worried about oom victims with
basically no memory mapped and a huge amount of memory in page tables.
We have never handled that case properly IIRC. So oom_reaper hasn't
added anything new here.

That being said, I haven't heard any bug reports about an over-eager oom
killer caused by the oom reaper, except your rather non-specific
claims about millions of pointless oom invocations. So I am not really
convinced we have to rush into a solution. I would much rather work
on a proper and comprehensible solution than put one band aid over
another. This has been the case in the oom proper for many years and
we have ended up with subtle code which is way too easy to break and
a nightmare to maintain. Let's not repeat that again please.

So do not rush into the first idea and let's do the proper development
here. This means a proper analysis of the problem, finding a solution
space and choosing one which is the most reasonable long term.
-- 
Michal Hocko
SUSE Labs


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-06-21 21:35 [patch v3] mm, oom: fix unnecessary killing of additional processes David Rientjes
  2018-07-04  1:43 ` David Rientjes
  2018-07-05 23:46 ` Andrew Morton
@ 2018-07-17 21:09 ` Tetsuo Handa
  2018-07-18 20:22   ` David Rientjes
  2 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-17 21:09 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: kbuild test robot, Michal Hocko, linux-kernel, linux-mm

This patch should be dropped from linux-next because it is incorrectly
using MMF_UNSTABLE.

On 2018/06/22 6:35, David Rientjes wrote:
> diff --git a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3059,25 +3059,28 @@ void exit_mmap(struct mm_struct *mm)
>  	if (unlikely(mm_is_oom_victim(mm))) {
>  		/*
>  		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration.  Taking mm->mmap_sem for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_sem is
> -		 * dropped.
> -		 *
>  		 * Nothing can be holding mm->mmap_sem here and the above call
>  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>  		 * __oom_reap_task_mm() will not block.
> -		 *
> -		 * This needs to be done before calling munlock_vma_pages_all(),
> -		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> -		 * reliably test it.
>  		 */
>  		mutex_lock(&oom_lock);
>  		__oom_reap_task_mm(mm);
>  		mutex_unlock(&oom_lock);
>  
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> +		/*
> +		 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
> +		 * This needs to be done before calling munlock_vma_pages_all(),
> +		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> +		 * reliably test for it.  If the oom reaper races with
> +		 * munlock_vma_pages_all(), this can result in a kernel oops if
> +		 * a pmd is zapped, for example, after follow_page_mask() has
> +		 * checked pmd_none().
> +		 *
> +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> +		 * guarantee that the oom reaper will not run on this mm again
> +		 * after mmap_sem is dropped.
> +		 */
> +		set_bit(MMF_UNSTABLE, &mm->flags);

Since MMF_UNSTABLE is set by __oom_reap_task_mm() from exit_mmap() before it starts reaping
(because the purpose of MMF_UNSTABLE is to "tell all users of get_user/copy_from_user
etc... that the content is no longer stable"), it cannot be used as a flag indicating
that the OOM reaper can't work on the mm anymore.

If the oom_lock serialization is removed, the OOM reaper will give up after (by default)
1 second even if the current thread is just past set_bit(MMF_UNSTABLE, &mm->flags) in
__oom_reap_task_mm() called from exit_mmap(). Thus, this patch and the other patch which
removes the oom_lock serialization should be dropped.

>  		down_write(&mm->mmap_sem);
>  		up_write(&mm->mmap_sem);
>  	}

> @@ -637,25 +649,57 @@ static int oom_reaper(void *unused)
>  	return 0;
>  }
>  
> +/*
> + * Millisecs to wait for an oom mm to free memory before selecting another
> + * victim.
> + */
> +static u64 oom_free_timeout_ms = 1000;
>  static void wake_oom_reaper(struct task_struct *tsk)
>  {
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> +	/*
> +	 * Set the reap timeout; if it's already set, the mm is enqueued and
> +	 * this tsk can be ignored.
> +	 */
> +	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
> +			jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
>  		return;

"expire" must not be 0 in order to avoid double list_add(). See
https://lore.kernel.org/lkml/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp/T/#u .
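
One possible fix, as a sketch (not from any posted patch): keep 0 as the 
"not queued" sentinel but never store it as a real expiry value:

	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);

	/* 0 means "not yet queued"; avoid wrapping onto the sentinel */
	if (!expire)
		expire = 1;
	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
		return;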

>  
>  	get_task_struct(tsk);
>  
>  	spin_lock(&oom_reaper_lock);
> -	tsk->oom_reaper_list = oom_reaper_list;
> -	oom_reaper_list = tsk;
> +	list_add(&tsk->oom_reap_list, &oom_reaper_list);
>  	spin_unlock(&oom_reaper_lock);
>  	trace_wake_reaper(tsk->pid);
>  	wake_up(&oom_reaper_wait);
>  }



* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-17 21:09 ` Tetsuo Handa
@ 2018-07-18 20:22   ` David Rientjes
  2018-07-18 21:21     ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-18 20:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, kbuild test robot, Michal Hocko, linux-kernel, linux-mm

On Wed, 18 Jul 2018, Tetsuo Handa wrote:

> > diff --git a/mm/mmap.c b/mm/mmap.c
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3059,25 +3059,28 @@ void exit_mmap(struct mm_struct *mm)
> >  	if (unlikely(mm_is_oom_victim(mm))) {
> >  		/*
> >  		 * Manually reap the mm to free as much memory as possible.
> > -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -		 * this mm from further consideration.  Taking mm->mmap_sem for
> > -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -		 * reaper will not run on this mm again after mmap_sem is
> > -		 * dropped.
> > -		 *
> >  		 * Nothing can be holding mm->mmap_sem here and the above call
> >  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> >  		 * __oom_reap_task_mm() will not block.
> > -		 *
> > -		 * This needs to be done before calling munlock_vma_pages_all(),
> > -		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> > -		 * reliably test it.
> >  		 */
> >  		mutex_lock(&oom_lock);
> >  		__oom_reap_task_mm(mm);
> >  		mutex_unlock(&oom_lock);
> >  
> > -		set_bit(MMF_OOM_SKIP, &mm->flags);
> > +		/*
> > +		 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
> > +		 * This needs to be done before calling munlock_vma_pages_all(),
> > +		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> > +		 * reliably test for it.  If the oom reaper races with
> > +		 * munlock_vma_pages_all(), this can result in a kernel oops if
> > +		 * a pmd is zapped, for example, after follow_page_mask() has
> > +		 * checked pmd_none().
> > +		 *
> > +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> > +		 * guarantee that the oom reaper will not run on this mm again
> > +		 * after mmap_sem is dropped.
> > +		 */
> > +		set_bit(MMF_UNSTABLE, &mm->flags);
> 
> Since MMF_UNSTABLE is set by __oom_reap_task_mm() from exit_mmap() before it starts reaping
> (because the purpose of MMF_UNSTABLE is to "tell all users of get_user/copy_from_user
> etc... that the content is no longer stable"), it cannot be used as a flag indicating
> that the OOM reaper can't work on the mm anymore.
> 

Why?  It should be able to be set by exit_mmap() since nothing else should 
be accessing this mm in the first place.  There is no reason to wait for 
the oom reaper and the following down_write();up_write(); cycle will 
guarantee it is not operating on the mm before munlocking.

> If the oom_lock serialization is removed, the OOM reaper will give up after (by default)
> 1 second even if the current thread is just past set_bit(MMF_UNSTABLE, &mm->flags) in
> __oom_reap_task_mm() called from exit_mmap(). Thus, this patch and the other patch which
> removes the oom_lock serialization should be dropped.
> 

No, it shouldn't, lol.  The oom reaper may give up because we have entered 
__oom_reap_task_mm() by way of exit_mmap(); there's no other purpose for 
it to act on the mm.  This is very different from giving up by setting 
MMF_OOM_SKIP, which it will wait for oom_free_timeout_ms to do unless the 
thread can make forward progress here in exit_mmap().

> >  		down_write(&mm->mmap_sem);
> >  		up_write(&mm->mmap_sem);
> >  	}
> 
> > @@ -637,25 +649,57 @@ static int oom_reaper(void *unused)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Millisecs to wait for an oom mm to free memory before selecting another
> > + * victim.
> > + */
> > +static u64 oom_free_timeout_ms = 1000;
> >  static void wake_oom_reaper(struct task_struct *tsk)
> >  {
> > -	/* tsk is already queued? */
> > -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> > +	/*
> > +	 * Set the reap timeout; if it's already set, the mm is enqueued and
> > +	 * this tsk can be ignored.
> > +	 */
> > +	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
> > +			jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
> >  		return;
> 
> "expire" must not be 0 in order to avoid double list_add(). See
> https://lore.kernel.org/lkml/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp/T/#u .
> 

We should not allow oom_free_timeout_ms to be 0, for sure; I assume 1000 is 
the sane minimum since we need to allow time for some memory freeing, and 
this will not be radically different from what existed before the patch 
for the various backoffs.  Or maybe you meant something else for "expire" 
here?


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-18 20:22   ` David Rientjes
@ 2018-07-18 21:21     ` Tetsuo Handa
  2018-07-19 14:23       ` Tetsuo Handa
  2018-07-20  8:41       ` David Rientjes
  0 siblings, 2 replies; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-18 21:21 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: kbuild test robot, Michal Hocko, linux-kernel, linux-mm

Sigh...

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

because David is not aware of what is wrong.

On 2018/07/19 5:22, David Rientjes wrote:
> On Wed, 18 Jul 2018, Tetsuo Handa wrote:
> 
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -3059,25 +3059,28 @@ void exit_mmap(struct mm_struct *mm)
>>>  	if (unlikely(mm_is_oom_victim(mm))) {
>>>  		/*
>>>  		 * Manually reap the mm to free as much memory as possible.
>>> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
>>> -		 * this mm from further consideration.  Taking mm->mmap_sem for
>>> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
>>> -		 * reaper will not run on this mm again after mmap_sem is
>>> -		 * dropped.
>>> -		 *
>>>  		 * Nothing can be holding mm->mmap_sem here and the above call
>>>  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>>>  		 * __oom_reap_task_mm() will not block.
>>> -		 *
>>> -		 * This needs to be done before calling munlock_vma_pages_all(),
>>> -		 * which clears VM_LOCKED, otherwise the oom reaper cannot
>>> -		 * reliably test it.
>>>  		 */
>>>  		mutex_lock(&oom_lock);
>>>  		__oom_reap_task_mm(mm);
>>>  		mutex_unlock(&oom_lock);
>>>  
>>> -		set_bit(MMF_OOM_SKIP, &mm->flags);
>>> +		/*
>>> +		 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
>>> +		 * This needs to be done before calling munlock_vma_pages_all(),
>>> +		 * which clears VM_LOCKED, otherwise the oom reaper cannot
>>> +		 * reliably test for it.  If the oom reaper races with
>>> +		 * munlock_vma_pages_all(), this can result in a kernel oops if
>>> +		 * a pmd is zapped, for example, after follow_page_mask() has
>>> +		 * checked pmd_none().
>>> +		 *
>>> +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
>>> +		 * guarantee that the oom reaper will not run on this mm again
>>> +		 * after mmap_sem is dropped.
>>> +		 */
>>> +		set_bit(MMF_UNSTABLE, &mm->flags);
>>
>> Since MMF_UNSTABLE is set by __oom_reap_task_mm() from exit_mmap() before it starts reaping
>> (because the purpose of MMF_UNSTABLE is to "tell all users of get_user/copy_from_user
>> etc... that the content is no longer stable"), it cannot be used as a flag indicating
>> that the OOM reaper can't work on the mm anymore.
>>
> 
> Why?  It can be set by exit_mmap() since nothing else should be accessing
> this mm in the first place.  There is no reason to wait for the oom
> reaper, and the following down_write();up_write(); cycle will guarantee it
> is not operating on the mm before munlocking.
> 

It does not make sense to call set_bit(MMF_UNSTABLE, &mm->flags) again after returning from
__oom_reap_task_mm() because MMF_UNSTABLE is _already_ set at the beginning of __oom_reap_task_mm().

void __oom_reap_task_mm(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        /*
         * Tell all users of get_user/copy_from_user etc... that the content
         * is no longer stable. No barriers really needed because unmapping
         * should imply barriers already and the reader would hit a page fault
         * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
         * reaping has already occurred so nothing left to do.
         */
        if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
                return;
(...snipped...)
}

void exit_mmap(struct mm_struct *mm)
{
        struct mmu_gather tlb;
        struct vm_area_struct *vma;
        unsigned long nr_accounted = 0;

        /* mm's last user has gone, and its about to be pulled down */
        mmu_notifier_release(mm);

        if (unlikely(mm_is_oom_victim(mm))) {
                /*
                 * Manually reap the mm to free as much memory as possible.
                 * Nothing can be holding mm->mmap_sem here and the above call
                 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
                 * __oom_reap_task_mm() will not block.
                 */
                __oom_reap_task_mm(mm);

                /*
                 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
                 * This needs to be done before calling munlock_vma_pages_all(),
                 * which clears VM_LOCKED, otherwise the oom reaper cannot
                 * reliably test for it.  If the oom reaper races with
                 * munlock_vma_pages_all(), this can result in a kernel oops if
                 * a pmd is zapped, for example, after follow_page_mask() has
                 * checked pmd_none().
                 *
                 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
                 * guarantee that the oom reaper will not run on this mm again
                 * after mmap_sem is dropped.
                 */
                set_bit(MMF_UNSTABLE, &mm->flags);
                down_write(&mm->mmap_sem);
                up_write(&mm->mmap_sem);
        }
(...snipped...)
}

>> If the oom_lock serialization is removed, the OOM reaper will give up after (by default)
>> 1 second even if the current thread has only just executed set_bit(MMF_UNSTABLE, &mm->flags)
>> in __oom_reap_task_mm() from exit_mmap(). Thus, this patch and the other patch which
>> removes oom_lock serialization should be dropped.
>>
> 
> No, it shouldn't.  The oom reaper may give up because we have entered
> __oom_reap_task_mm() by way of exit_mmap(); there's no other purpose for
> it to act on the mm.  This is very different from giving up by setting
> MMF_OOM_SKIP, which the oom reaper will wait oom_free_timeout_ms before
> doing, unless the thread can make forward progress here in exit_mmap().

Let's call "A" the thread doing exit_mmap(), and "B" the OOM reaper kernel thread.

(1) "A" finds that unlikely(mm_is_oom_victim(mm)) == true.
(2) "B" finds that test_bit(MMF_OOM_SKIP, &mm->flags) in oom_reap_task() is false.
(3) "B" finds that !test_bit(MMF_UNSTABLE, &mm->flags) in oom_reap_task() is true.
(4) "B" enters into oom_reap_task_mm(tsk, mm).
(5) "B" finds that !down_read_trylock(&mm->mmap_sem) is false.
(6) "B" finds that mm_has_blockable_invalidate_notifiers(mm) is false.
(7) "B" finds that test_bit(MMF_UNSTABLE, &mm->flags) is false.
(8) "B" enters into __oom_reap_task_mm(mm).
(9) "A" finds that test_and_set_bit(MMF_UNSTABLE, &mm->flags) is false.
(10) "A" is preempted by somebody else.
(11) "B" finds that test_and_set_bit(MMF_UNSTABLE, &mm->flags) is true.
(12) "B" leaves __oom_reap_task_mm(mm).
(13) "B" leaves oom_reap_task_mm().
(14) "B" finds that time_after_eq(jiffies, mm->oom_free_expire) became true.
(15) "B" finds that !test_bit(MMF_OOM_SKIP, &mm->flags) is true.
(16) "B" calls set_bit(MMF_OOM_SKIP, &mm->flags).
(17) "B" finds that test_bit(MMF_OOM_SKIP, &mm->flags) is true.
(18) select_bad_process() finds that MMF_OOM_SKIP is already set.
(19) out_of_memory() kills a new OOM victim.
(20) "A" resumes execution and starts reclaiming memory.

because oom_lock serialization was already removed.

> 
>>>  		down_write(&mm->mmap_sem);
>>>  		up_write(&mm->mmap_sem);
>>>  	}
>>
>>> @@ -637,25 +649,57 @@ static int oom_reaper(void *unused)
>>>  	return 0;
>>>  }
>>>  
>>> +/*
>>> + * Millisecs to wait for an oom mm to free memory before selecting another
>>> + * victim.
>>> + */
>>> +static u64 oom_free_timeout_ms = 1000;
>>>  static void wake_oom_reaper(struct task_struct *tsk)
>>>  {
>>> -	/* tsk is already queued? */
>>> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
>>> +	/*
>>> +	 * Set the reap timeout; if it's already set, the mm is enqueued and
>>> +	 * this tsk can be ignored.
>>> +	 */
>>> +	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
>>> +			jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
>>>  		return;
>>
>> "expire" must not be 0 in order to avoid double list_add(). See
>> https://lore.kernel.org/lkml/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp/T/#u .
>>
> 
> We should not allow oom_free_timeout_ms to be 0 for sure, I assume 1000 is 
> the sane minimum since we need to allow time for some memory freeing and 
> this will not be radically different from what existed before the patch 
> for the various backoffs.  Or maybe you meant something else for "expire" 
> here?
> 

I'm saying that jiffies + msecs_to_jiffies(oom_free_timeout_ms) == 0 will leave
tsk->signal->oom_mm->oom_free_expire == 0, and the list will be corrupted because
the cmpxchg() on tsk->signal->oom_mm->oom_free_expire can then succeed twice.
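
Concretely (a minimal sketch; the helper name is hypothetical):
oom_free_expire == 0 doubles as the "not queued" sentinel, so an expire
value that wraps around to exactly 0 lets the cmpxchg() succeed a second
time and the task be list_add()ed twice.

	static unsigned long oom_expire_jiffies(u64 timeout_ms)
	{
		unsigned long expire = jiffies + msecs_to_jiffies(timeout_ms);

		/* never hand out 0: it is the "not queued" sentinel */
		if (!expire)
			expire++;
		return expire;
	}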


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-18 21:21     ` Tetsuo Handa
@ 2018-07-19 14:23       ` Tetsuo Handa
  2018-07-20  8:41       ` David Rientjes
  1 sibling, 0 replies; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-19 14:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, kbuild test robot, Michal Hocko, linux-kernel,
	linux-mm, Roman Gushchin

David,

Now that your patches are about to be dropped from linux-next.git, please try the OOM lockup
(CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 )
and my cleanup patch ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 )
on top of linux.git. And please reply with the result, for I'm currently asking
Roman whether we can apply these patches before applying the cgroup-aware OOM killer.



* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-18 21:21     ` Tetsuo Handa
  2018-07-19 14:23       ` Tetsuo Handa
@ 2018-07-20  8:41       ` David Rientjes
  2018-07-20  9:57         ` Tetsuo Handa
  2018-07-20 20:14         ` [patch v4] " David Rientjes
  1 sibling, 2 replies; 28+ messages in thread
From: David Rientjes @ 2018-07-20  8:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, kbuild test robot, Michal Hocko, linux-kernel, linux-mm

On Thu, 19 Jul 2018, Tetsuo Handa wrote:

> Sigh...
> 
> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> because David is not aware of what is wrong.
> 

Hmm, didn't you incorporate this exact patch into your own patch series 
that you proposed? :)

I'm coming to the stark realization that all of this theater is only for
effect.  Perhaps other observers came to that understanding earlier and I
was late to the party.

You're nacking a patch because it does a double set_bit(), and because
jiffies can wrap around so that we add a process to the oom reaper list
twice if the oom happens at that exact moment.  Ok.  These are extremely
trivial to fix.

> Let's call "A" the thread doing exit_mmap(), and "B" the OOM reaper kernel thread.
> 
> (1) "A" finds that unlikely(mm_is_oom_victim(mm)) == true.
> (2) "B" finds that test_bit(MMF_OOM_SKIP, &mm->flags) in oom_reap_task() is false.
> (3) "B" finds that !test_bit(MMF_UNSTABLE, &mm->flags) in oom_reap_task() is true.
> (4) "B" enters into oom_reap_task_mm(tsk, mm).
> (5) "B" finds that !down_read_trylock(&mm->mmap_sem) is false.
> (6) "B" finds that mm_has_blockable_invalidate_notifiers(mm) is false.
> (7) "B" finds that test_bit(MMF_UNSTABLE, &mm->flags) is false.
> (8) "B" enters into __oom_reap_task_mm(mm).
> (9) "A" finds that test_and_set_bit(MMF_UNSTABLE, &mm->flags) is false.
> (10) "A" is preempted by somebody else.
> (11) "B" finds that test_and_set_bit(MMF_UNSTABLE, &mm->flags) is true.
> (12) "B" leaves __oom_reap_task_mm(mm).
> (13) "B" leaves oom_reap_task_mm().
> (14) "B" finds that time_after_eq(jiffies, mm->oom_free_expire) became true.
> (15) "B" finds that !test_bit(MMF_OOM_SKIP, &mm->flags) is true.
> (16) "B" calls set_bit(MMF_OOM_SKIP, &mm->flags).
> (17) "B" finds that test_bit(MMF_OOM_SKIP, &mm->flags) is true.
> (18) select_bad_process() finds that MMF_OOM_SKIP is already set.
> (19) out_of_memory() kills a new OOM victim.
> (20) "A" resumes execution and starts reclaiming memory.
> 
> because oom_lock serialization was already removed.
> 

Absent oom_lock serialization, this is working exactly as intended.  You
could argue that once the thread has reached exit_mmap() and begun oom
reaping, it should be allowed to finish before the oom reaper declares
MMF_OOM_SKIP.  That could certainly be helpful; I simply haven't
encountered a usecase where it was needed.  Or, we could restart the oom
expiration when MMF_UNSTABLE is set and deem that progress is being made,
to give it some extra time.  In practice, again, we haven't seen this
needed.  But either of those is very easy to add in as well.  Which would
you prefer?
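
A sketch of that second alternative (hypothetical; not part of the fix
below) would be a small addition to oom_reap_task():

	/*
	 * Treat MMF_UNSTABLE as evidence of forward progress in exit_mmap()
	 * and push the deadline out rather than letting it lapse.
	 */
	if (test_bit(MMF_UNSTABLE, &mm->flags))
		mm->oom_free_expire = jiffies +
				      msecs_to_jiffies(oom_free_timeout_ms);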


mm, oom: fix unnecessary killing of additional processes fix

Fix double set_bit() per Tetsuo.

Fix jiffies wraparound per Tetsuo.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/mmap.c     | 13 ++++++-------
 mm/oom_kill.c |  7 +++++--
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3069,23 +3069,22 @@ void exit_mmap(struct mm_struct *mm)
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
-		 */
-		__oom_reap_task_mm(mm);
-
-		/*
-		 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
+		 *
+		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
 		 * This needs to be done before calling munlock_vma_pages_all(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
 		 * reliably test for it.  If the oom reaper races with
 		 * munlock_vma_pages_all(), this can result in a kernel oops if
 		 * a pmd is zapped, for example, after follow_page_mask() has
 		 * checked pmd_none().
-		 *
+		 */
+		__oom_reap_task_mm(mm);
+
+		/*
 		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
 		 * guarantee that the oom reaper will not run on this mm again
 		 * after mmap_sem is dropped.
 		 */
-		set_bit(MMF_UNSTABLE, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -666,12 +666,15 @@ static int oom_reaper(void *unused)
 static u64 oom_free_timeout_ms = 1000;
 static void wake_oom_reaper(struct task_struct *tsk)
 {
+	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
+
+	if (!expire)
+		expire++;
 	/*
 	 * Set the reap timeout; if it's already set, the mm is enqueued and
 	 * this tsk can be ignored.
 	 */
-	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
-			jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
+	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
 		return;
 
 	get_task_struct(tsk);


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-20  8:41       ` David Rientjes
@ 2018-07-20  9:57         ` Tetsuo Handa
  2018-07-20 20:19           ` David Rientjes
  2018-07-20 20:14         ` [patch v4] " David Rientjes
  1 sibling, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-20  9:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, kbuild test robot, Michal Hocko, linux-kernel, linux-mm

On 2018/07/20 17:41, David Rientjes wrote:
> Absent oom_lock serialization, this is working exactly as intended.  You
> could argue that once the thread has reached exit_mmap() and begun oom
> reaping, it should be allowed to finish before the oom reaper declares
> MMF_OOM_SKIP.  That could certainly be helpful; I simply haven't
> encountered a usecase where it was needed.  Or, we could restart the oom
> expiration when MMF_UNSTABLE is set and deem that progress is being made,
> to give it some extra time.  In practice, again, we haven't seen this
> needed.  But either of those is very easy to add in as well.  Which would
> you prefer?

I don't think we need to introduce a user-visible knob interface (even if it is in
debugfs), for I think that my approach can solve your problem. Please try the OOM lockup
(CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 )
and my cleanup patch ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 )
on top of linux.git. And please reply with the result, for I'm currently asking
Roman whether we can apply these patches before applying the cgroup-aware OOM killer.


* [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-20  8:41       ` David Rientjes
  2018-07-20  9:57         ` Tetsuo Handa
@ 2018-07-20 20:14         ` David Rientjes
  2018-07-20 20:43           ` Tetsuo Handa
  2018-07-24 21:44           ` [patch v5] " David Rientjes
  1 sibling, 2 replies; 28+ messages in thread
From: David Rientjes @ 2018-07-20 20:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tetsuo Handa, Michal Hocko, linux-kernel, linux-mm

The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm.  This can happen for a variety of reasons,
including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper
   to stall indefinitely,

but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:

 - when the mm's memory is mostly mlocked.

When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily.  This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.

This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap() and is separate from
the oom reaper setting MMF_OOM_SKIP prematurely.

We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.

To fix this, this patch introduces a per-mm reaping period, which is
configurable through the new oom_free_timeout_ms file in debugfs and
defaults to one second to match the current heuristics.  This support
requires that the oom reaper's list becomes a proper linked list so that
other mm's may be reaped while waiting for an mm's timeout to expire.

This replaces the current timeouts in the oom reaper: (1) when trying to
grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
a HZ sleep if there are blockable mmu notifiers.  It extends it with
timeout to allow an oom victim to reach exit_mmap() before choosing
additional processes unnecessarily.

The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.  Since the
default oom_free_timeout_ms is one second, the same as current heuristics,
there should be no functional change with this patch for users who do not
tune it to be longer other than MMF_OOM_SKIP is set by exit_mmap() after
free_pgtables(), which is the preferred behavior.

The reaping timeout can intentionally be set for a substantial amount of
time, such as 10s, since oom livelock is a very rare occurrence and it's
better to optimize for preventing additional (unnecessary) oom killing
than a scenario that is much more unlikely.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v4:
 - fix double set_bit() per Tetsuo
 - fix jiffies wraparound per Tetsuo

 v3:
  - oom_free_timeout_ms is now static per kbuild test robot

 v2:
  - configurable timeout period through debugfs
  - change mm->reap_timeout to mm->oom_free_expire and add more
    descriptive comment per akpm
  - add comment to describe task->oom_reap_list locking based on
    oom_reaper_lock per akpm
  - rework the exit_mmap() comment and split into two parts to be more
    descriptive about the locking and the issue with the oom reaper
    racing with munlock_vma_pages_all() per akpm

 include/linux/mm_types.h |   7 ++
 include/linux/sched.h    |   3 +-
 kernel/fork.c            |   3 +
 mm/mmap.c                |  19 +++---
 mm/oom_kill.c            | 143 ++++++++++++++++++++++++++-------------
 5 files changed, 118 insertions(+), 57 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -449,6 +449,13 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+	/*
+	 * When to give up on memory freeing from this mm after its
+	 * threads have been oom killed, in jiffies.
+	 */
+	unsigned long oom_free_expire;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1184,7 +1184,8 @@ struct task_struct {
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
+	/* OOM victim queue for oom reaper, protected by oom_reaper_lock */
+	struct list_head		oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -843,6 +843,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_FAULT_INJECTION
 	tsk->fail_nth = 0;
 #endif
+#ifdef CONFIG_MMU
+	INIT_LIST_HEAD(&tsk->oom_reap_list);
+#endif
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3066,25 +3066,27 @@ void exit_mmap(struct mm_struct *mm)
 	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 *
+		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
 		 * This needs to be done before calling munlock_vma_pages_all(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
+		 * reliably test for it.  If the oom reaper races with
+		 * munlock_vma_pages_all(), this can result in a kernel oops if
+		 * a pmd is zapped, for example, after follow_page_mask() has
+		 * checked pmd_none().
 		 */
 		mutex_lock(&oom_lock);
 		__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		/*
+		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+		 * guarantee that the oom reaper will not run on this mm again
+		 * after mmap_sem is dropped.
+		 */
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
@@ -3112,6 +3114,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -484,7 +485,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 void __oom_reap_task_mm(struct mm_struct *mm)
@@ -527,10 +528,8 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 	}
 }
 
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	bool ret = true;
-
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
@@ -548,9 +547,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_oom;
 	}
 
 	/*
@@ -559,69 +557,81 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * TODO: we really want to get rid of this ugly hack and make sure that
 	 * notifiers cannot block for unbounded amount of time
 	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
+	if (mm_has_blockable_invalidate_notifiers(mm))
+		goto out_mm;
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
+	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_mm;
 	}
 
 	trace_start_task_reaping(tsk->pid);
-
 	__oom_reap_task_mm(mm);
+	trace_finish_task_reaping(tsk->pid);
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_mm:
 	up_read(&mm->mmap_sem);
-
-	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
+out_oom:
 	mutex_unlock(&oom_lock);
-	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
-	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
+	/*
+	 * If this mm has either been fully unmapped, or the oom reaper has
+	 * given up on it, nothing left to do except drop the refcount.
+	 */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
+	/*
+	 * If this mm has already been reaped, doing so again will not likely
+	 * free additional memory.
+	 */
+	if (!test_bit(MMF_UNSTABLE, &mm->flags))
+		oom_reap_task_mm(tsk, mm);
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
+	if (time_after_eq(jiffies, mm->oom_free_expire)) {
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			debug_show_all_locks();
 
-done:
-	tsk->oom_reaper_list = NULL;
+			/*
+			 * Reaping has failed for the timeout period, so give up
+			 * and allow additional processes to be oom killed.
+			 */
+			set_bit(MMF_OOM_SKIP, &mm->flags);
+		}
+		goto drop;
+	}
 
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
+
+	/* Enqueue to be reaped again */
+	spin_lock(&oom_reaper_lock);
+	list_add_tail(&tsk->oom_reap_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+
+	schedule_timeout_idle(HZ/10);
+	return;
 
-	/* Drop a reference taken by wake_oom_reaper */
+drop:
+	/* Drop the reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -630,11 +640,13 @@ static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_entry(oom_reaper_list.next,
+					 struct task_struct, oom_reap_list);
+			list_del(&tsk->oom_reap_list);
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -645,25 +657,60 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
+/*
+ * Millisecs to wait for an oom mm to free memory before selecting another
+ * victim.
+ */
+static u64 oom_free_timeout_ms = 1000;
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
+
+	if (!expire)
+		expire++;
+	/*
+	 * Set the reap timeout; if it's already set, the mm is enqueued and
+	 * this tsk can be ignored.
+	 */
+	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
 		return;
 
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add(&tsk->oom_reap_list, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
 }
 
+#ifdef CONFIG_DEBUG_FS
+static int oom_free_timeout_ms_read(void *data, u64 *val)
+{
+	*val = oom_free_timeout_ms;
+	return 0;
+}
+
+static int oom_free_timeout_ms_write(void *data, u64 val)
+{
+	if (val > 60 * 1000)
+		return -EINVAL;
+
+	oom_free_timeout_ms = val;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(oom_free_timeout_ms_fops, oom_free_timeout_ms_read,
+			oom_free_timeout_ms_write, "%llu\n");
+#endif /* CONFIG_DEBUG_FS */
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+#ifdef CONFIG_DEBUG_FS
+	if (!IS_ERR(oom_reaper_th))
+		debugfs_create_file("oom_free_timeout_ms", 0200, NULL, NULL,
+				    &oom_free_timeout_ms_fops);
+#endif
 	return 0;
 }
 subsys_initcall(oom_init)


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-20  9:57         ` Tetsuo Handa
@ 2018-07-20 20:19           ` David Rientjes
  2018-07-20 20:47             ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-20 20:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On Fri, 20 Jul 2018, Tetsuo Handa wrote:

> > Absent oom_lock serialization, this is working exactly as intended.  You
> > could argue that once the thread has reached exit_mmap() and begun oom
> > reaping, it should be allowed to finish before the oom reaper declares
> > MMF_OOM_SKIP.  That could certainly be helpful; I simply haven't
> > encountered a usecase where it was needed.  Or, we could restart the oom
> > expiration when MMF_UNSTABLE is set and deem that progress is being made,
> > to give it some extra time.  In practice, again, we haven't seen this
> > needed.  But either of those is very easy to add in as well.  Which would
> > you prefer?
> 
> I don't think we need to introduce a user-visible knob interface (even if it is in
> debugfs), for I think that my approach can solve your problem. Please try the OOM lockup
> (CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 )

The issue I am fixing has nothing to do with contention on oom_lock; it
has to do with the inability of the oom reaper to free memory for one or
more of several reasons: mlock, blockable mmu notifiers, ptes, mm->mmap_sem
contention, and then the setting of MMF_OOM_SKIP to choose another victim
before the original victim even reaches exit_mmap().  Thus, removing
oom_lock from exit_mmap() will not fix this issue.

I agree that oom_lock can be removed from exit_mmap() and it would be
helpful to do so, and it may address a series of problems that we have yet
to encounter, but this would not fix the almost immediate setting of
MMF_OOM_SKIP that occurs with minimal memory freeing due to the oom
reaper.
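
To make the mlock case concrete, the per-vma filter the reaper uses
(can_madv_dontneed_vma() in mm/internal.h at the time) skips mlocked vmas
outright, so reaping a mostly-mlocked mm frees almost nothing before
MMF_OOM_SKIP is set:

	static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
	{
		/* mlocked, hugetlb and pfn-mapped vmas are never reaped */
		return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
	}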


* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-20 20:14         ` [patch v4] " David Rientjes
@ 2018-07-20 20:43           ` Tetsuo Handa
  2018-07-20 22:13             ` David Rientjes
  2018-07-24 21:44           ` [patch v5] " David Rientjes
  1 sibling, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-20 20:43 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton; +Cc: Michal Hocko, linux-kernel, linux-mm

On 2018/07/21 5:14, David Rientjes wrote:
> diff --git a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3066,25 +3066,27 @@ void exit_mmap(struct mm_struct *mm)
>  	if (unlikely(mm_is_oom_victim(mm))) {
>  		/*
>  		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration.  Taking mm->mmap_sem for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_sem is
> -		 * dropped.
> -		 *
>  		 * Nothing can be holding mm->mmap_sem here and the above call
>  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>  		 * __oom_reap_task_mm() will not block.
>  		 *
> +		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
>  		 * This needs to be done before calling munlock_vma_pages_all(),
>  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> -		 * reliably test it.
> +		 * reliably test for it.  If the oom reaper races with
> +		 * munlock_vma_pages_all(), this can result in a kernel oops if
> +		 * a pmd is zapped, for example, after follow_page_mask() has
> +		 * checked pmd_none().
>  		 */
>  		mutex_lock(&oom_lock);
>  		__oom_reap_task_mm(mm);
>  		mutex_unlock(&oom_lock);

I don't like holding oom_lock for the full teardown of an mm, for an OOM victim's mm
might have multiple TB of memory, which could take a long time.

Of course, forcing whoever triggered __mmput() to pay the full cost is
not nice either, for it can even be a /proc/$pid/ reader, can't it?

>  
> -		set_bit(MMF_OOM_SKIP, &mm->flags);
> +		/*
> +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> +		 * guarantee that the oom reaper will not run on this mm again
> +		 * after mmap_sem is dropped.
> +		 */
>  		down_write(&mm->mmap_sem);
>  		up_write(&mm->mmap_sem);
>  	}



> -#define MAX_OOM_REAP_RETRIES 10
>  static void oom_reap_task(struct task_struct *tsk)
>  {
> -	int attempts = 0;
>  	struct mm_struct *mm = tsk->signal->oom_mm;
>  
> -	/* Retry the down_read_trylock(mmap_sem) a few times */
> -	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> -		schedule_timeout_idle(HZ/10);
> +	/*
> +	 * If this mm has either been fully unmapped, or the oom reaper has
> +	 * given up on it, nothing left to do except drop the refcount.
> +	 */
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +		goto drop;
>  
> -	if (attempts <= MAX_OOM_REAP_RETRIES ||
> -	    test_bit(MMF_OOM_SKIP, &mm->flags))
> -		goto done;
> +	/*
> +	 * If this mm has already been reaped, doing so again will not likely
> +	 * free additional memory.
> +	 */
> +	if (!test_bit(MMF_UNSTABLE, &mm->flags))
> +		oom_reap_task_mm(tsk, mm);

This is still wrong. If the exiting thread is preempted immediately after
set_bit(MMF_UNSTABLE, &mm->flags) in __oom_reap_task_mm() from exit_mmap(),
oom_reap_task() can give up before any memory has been reclaimed.
test_bit(MMF_UNSTABLE, &mm->flags) has to be done under oom_lock serialization, and
I don't like holding oom_lock while calling __oom_reap_task_mm(mm).
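
For clarity, a condensed sketch (hypothetical, not from either posted
series) of why the check is only sound under oom_lock: exit_mmap() holds
oom_lock across its reap, so a reaper that observes MMF_UNSTABLE under
the same lock knows the reap has already completed.

	mutex_lock(&oom_lock);
	if (!test_bit(MMF_UNSTABLE, &mm->flags))
		__oom_reap_task_mm(mm);	/* either we reap the mm here... */
	/* ...or exit_mmap() did, having held oom_lock until it finished */
	mutex_unlock(&oom_lock);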

>  
> -	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -		task_pid_nr(tsk), tsk->comm);
> -	debug_show_all_locks();
> +	if (time_after_eq(jiffies, mm->oom_free_expire)) {
> +		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
> +			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> +				task_pid_nr(tsk), tsk->comm);
> +			debug_show_all_locks();
>  
> -done:
> -	tsk->oom_reaper_list = NULL;
> +			/*
> +			 * Reaping has failed for the timeout period, so give up
> +			 * and allow additional processes to be oom killed.
> +			 */
> +			set_bit(MMF_OOM_SKIP, &mm->flags);
> +		}
> +		goto drop;
> +	}



> @@ -645,25 +657,60 @@ static int oom_reaper(void *unused)
>  	return 0;
>  }
>  
> +/*
> + * Millisecs to wait for an oom mm to free memory before selecting another
> + * victim.
> + */
> +static u64 oom_free_timeout_ms = 1000;
>  static void wake_oom_reaper(struct task_struct *tsk)
>  {
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> +	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
> +
> +	if (!expire)
> +		expire++;
> +	/*
> +	 * Set the reap timeout; if it's already set, the mm is enqueued and
> +	 * this tsk can be ignored.
> +	 */
> +	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
>  		return;

We don't need this if we do it from mark_oom_victim(), like my series does.
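
A hypothetical sketch of doing it from mark_oom_victim() (condensed, not
quoted from the posted series): stamp the deadline where the victim's mm
is first recorded, so wake_oom_reaper() needs no cmpxchg() sentinel at all.

	static void mark_oom_victim(struct task_struct *tsk)
	{
		struct mm_struct *mm = tsk->mm;

		/* ... */
		if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
			mmgrab(tsk->signal->oom_mm);
			mm->oom_free_expire = jiffies +
					msecs_to_jiffies(oom_free_timeout_ms);
		}
		/* ... */
	}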

>  
>  	get_task_struct(tsk);
>  
>  	spin_lock(&oom_reaper_lock);
> -	tsk->oom_reaper_list = oom_reaper_list;
> -	oom_reaper_list = tsk;
> +	list_add(&tsk->oom_reap_list, &oom_reaper_list);
>  	spin_unlock(&oom_reaper_lock);
>  	trace_wake_reaper(tsk->pid);
>  	wake_up(&oom_reaper_wait);
>  }



* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-20 20:19           ` David Rientjes
@ 2018-07-20 20:47             ` Tetsuo Handa
  2018-07-20 22:19               ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-20 20:47 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On 2018/07/21 5:19, David Rientjes wrote:
> On Fri, 20 Jul 2018, Tetsuo Handa wrote:
> 
>>> Absent oom_lock serialization, this is working exactly as intended.  You
>>> could argue that once the thread has reached exit_mmap() and begun oom
>>> reaping, it should be allowed to finish before the oom reaper declares
>>> MMF_OOM_SKIP.  That could certainly be helpful; I simply haven't
>>> encountered a usecase where it was needed.  Or, we could restart the oom
>>> expiration when MMF_UNSTABLE is set and deem that progress is being made,
>>> to give it some extra time.  In practice, again, we haven't seen this
>>> needed.  But either of those is very easy to add in as well.  Which would
>>> you prefer?
>>
>> I don't think we need to introduce a user-visible knob interface (even if it is in
>> debugfs), for I think that my approach can solve your problem. Please try the OOM lockup
>> (CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 )
> 
> The issue I am fixing has nothing to do with contention on oom_lock; it
> has to do with the inability of the oom reaper to free memory for one or
> more of several reasons: mlock, blockable mmu notifiers, ptes, mm->mmap_sem
> contention, and then the setting of MMF_OOM_SKIP to choose another victim
> before the original victim even reaches exit_mmap().  Thus, removing
> oom_lock from exit_mmap() will not fix this issue.
> 
> I agree that oom_lock can be removed from exit_mmap() and it would be
> helpful to do so, and it may address a series of problems that we have yet
> to encounter, but this would not fix the almost immediate setting of
> MMF_OOM_SKIP that occurs with minimal memory freeing due to the oom
> reaper.
> 

Why does [PATCH 2/2] in https://marc.info/?l=linux-mm&m=153119509215026&w=4 not
solve your problem?



* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-20 20:43           ` Tetsuo Handa
@ 2018-07-20 22:13             ` David Rientjes
  2018-07-21  2:47               ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-20 22:13 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On Sat, 21 Jul 2018, Tetsuo Handa wrote:

> > diff --git a/mm/mmap.c b/mm/mmap.c
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3066,25 +3066,27 @@ void exit_mmap(struct mm_struct *mm)
> >  	if (unlikely(mm_is_oom_victim(mm))) {
> >  		/*
> >  		 * Manually reap the mm to free as much memory as possible.
> > -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -		 * this mm from further consideration.  Taking mm->mmap_sem for
> > -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -		 * reaper will not run on this mm again after mmap_sem is
> > -		 * dropped.
> > -		 *
> >  		 * Nothing can be holding mm->mmap_sem here and the above call
> >  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> >  		 * __oom_reap_task_mm() will not block.
> >  		 *
> > +		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
> >  		 * This needs to be done before calling munlock_vma_pages_all(),
> >  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> > -		 * reliably test it.
> > +		 * reliably test for it.  If the oom reaper races with
> > +		 * munlock_vma_pages_all(), this can result in a kernel oops if
> > +		 * a pmd is zapped, for example, after follow_page_mask() has
> > +		 * checked pmd_none().
> >  		 */
> >  		mutex_lock(&oom_lock);
> >  		__oom_reap_task_mm(mm);
> >  		mutex_unlock(&oom_lock);
> 
> I don't like holding oom_lock for the full teardown of an mm, for an OOM victim's mm
> might have multiple TB of memory, which could take a long time.
> 

This patch does not involve deltas for oom_lock here; it can certainly be
changed on top of this patch.  I'm not attempting to address any oom_lock
issue here.  It should pose no roadblock for you.

I only propose this patch now since it fixes millions of processes being
oom killed unnecessarily; it was in -mm before a NACK over the most trivial
fixes, which have now been squashed into it; and it is actually tested.

> >  
> > -		set_bit(MMF_OOM_SKIP, &mm->flags);
> > +		/*
> > +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> > +		 * guarantee that the oom reaper will not run on this mm again
> > +		 * after mmap_sem is dropped.
> > +		 */
> >  		down_write(&mm->mmap_sem);
> >  		up_write(&mm->mmap_sem);
> >  	}
> 
> 
> 
> > -#define MAX_OOM_REAP_RETRIES 10
> >  static void oom_reap_task(struct task_struct *tsk)
> >  {
> > -	int attempts = 0;
> >  	struct mm_struct *mm = tsk->signal->oom_mm;
> >  
> > -	/* Retry the down_read_trylock(mmap_sem) a few times */
> > -	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> > -		schedule_timeout_idle(HZ/10);
> > +	/*
> > +	 * If this mm has either been fully unmapped, or the oom reaper has
> > +	 * given up on it, nothing left to do except drop the refcount.
> > +	 */
> > +	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> > +		goto drop;
> >  
> > -	if (attempts <= MAX_OOM_REAP_RETRIES ||
> > -	    test_bit(MMF_OOM_SKIP, &mm->flags))
> > -		goto done;
> > +	/*
> > +	 * If this mm has already been reaped, doing so again will not likely
> > +	 * free additional memory.
> > +	 */
> > +	if (!test_bit(MMF_UNSTABLE, &mm->flags))
> > +		oom_reap_task_mm(tsk, mm);
> 
> This is still wrong. If preempted immediately after set_bit(MMF_UNSTABLE, &mm->flags) from
> __oom_reap_task_mm() from exit_mmap(), oom_reap_task() can give up before reclaiming any memory.

If there is a single thread holding onto the mm that has reached
exit_mmap() and is in the process of starting oom reaping itself, there's
no advantage to the oom reaper trying to oom reap it.  The thread in
exit_mmap() will take care of it; __oom_reap_task_mm() does not block, and
oom_free_timeout_ms allows enough time for that memory freeing to
occur.  The oom reaper will not set MMF_OOM_SKIP until the timeout has
expired.

As I said before, you could make a case for extending the timeout once
MMF_UNSTABLE has been set.  In practice, we haven't encountered a case
where that matters.  But that's trivial to do if you would prefer.


* Re: [patch v3] mm, oom: fix unnecessary killing of additional processes
  2018-07-20 20:47             ` Tetsuo Handa
@ 2018-07-20 22:19               ` David Rientjes
  0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2018-07-20 22:19 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On Sat, 21 Jul 2018, Tetsuo Handa wrote:

> Why does [PATCH 2/2] in https://marc.info/?l=linux-mm&m=153119509215026&w=4 not
> solve your problem?
> 

Such an invasive patch, and it completely rewrites the oom reaper.  I now
fully understand your frustration with the cgroup aware oom killer being
merged into -mm without any roadmap to actually being merged.  I agree
with you that it should be dropped; I am not sure why it has not been,
since there is no active review of the proposed patchset from four months
ago, posted twice, that fixes the issues with it, nor are those patches
being merged so the damn thing can actually make progress.


* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-20 22:13             ` David Rientjes
@ 2018-07-21  2:47               ` Tetsuo Handa
  2018-07-24 21:45                 ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-21  2:47 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On 2018/07/21 7:13, David Rientjes wrote:
>>>  		mutex_lock(&oom_lock);
>>>  		__oom_reap_task_mm(mm);
>>>  		mutex_unlock(&oom_lock);
>>
>> I don't like holding oom_lock for the full teardown of an mm, for an OOM victim's mm
>> might have multiple TB of memory, which could take a long time.
>>
> 
> This patch does not involve deltas for oom_lock here; it can certainly be
> changed on top of this patch.  I'm not attempting to address any oom_lock
> issue here.  It should pose no roadblock for you.

You can't apply "[patch v4] mm, oom: fix unnecessary killing of additional processes"
because Michal's patch, which removes the oom_lock serialization, was added to the -mm tree.

> 
> I only propose this patch now since it fixes millions of processes being
> oom killed unnecessarily; it was in -mm before a NACK over the most trivial
> fixes, which have now been squashed into it; and it is actually tested.
> 

But I still refuse your patch because you have not tested my approach.

>>
>>> -#define MAX_OOM_REAP_RETRIES 10
>>>  static void oom_reap_task(struct task_struct *tsk)
>>>  {
>>> -	int attempts = 0;
>>>  	struct mm_struct *mm = tsk->signal->oom_mm;
>>>  
>>> -	/* Retry the down_read_trylock(mmap_sem) a few times */
>>> -	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
>>> -		schedule_timeout_idle(HZ/10);
>>> +	/*
>>> +	 * If this mm has either been fully unmapped, or the oom reaper has
>>> +	 * given up on it, nothing left to do except drop the refcount.
>>> +	 */
>>> +	if (test_bit(MMF_OOM_SKIP, &mm->flags))
>>> +		goto drop;
>>>  
>>> -	if (attempts <= MAX_OOM_REAP_RETRIES ||
>>> -	    test_bit(MMF_OOM_SKIP, &mm->flags))
>>> -		goto done;
>>> +	/*
>>> +	 * If this mm has already been reaped, doing so again will not likely
>>> +	 * free additional memory.
>>> +	 */
>>> +	if (!test_bit(MMF_UNSTABLE, &mm->flags))
>>> +		oom_reap_task_mm(tsk, mm);
>>
>> This is still wrong. If the exiting thread is preempted immediately after
>> set_bit(MMF_UNSTABLE, &mm->flags) in __oom_reap_task_mm() from exit_mmap(),
>> oom_reap_task() can give up before any memory has been reclaimed.
> 
> If there is a single thread holding onto the mm that has reached
> exit_mmap() and is in the process of starting oom reaping itself, there's
> no advantage to the oom reaper trying to oom reap it.

You might worry about situations where __oom_reap_task_mm() is a no-op.
But that is not always true. There is no point in emitting the

  pr_info("oom_reaper: unable to reap pid:%d (%s)\n", ...);
  debug_show_all_locks();

noise and doing

  set_bit(MMF_OOM_SKIP, &mm->flags);

because exit_mmap() will not release oom_lock until __oom_reap_task_mm()
completes. That is, apart from the extra noise, there is no difference from
the current behavior, which calls set_bit(MMF_OOM_SKIP, &mm->flags) after
returning from __oom_reap_task_mm().

>                                                        The thread in
> exit_mmap() will take care of it; __oom_reap_task_mm() does not block, and
> oom_free_timeout_ms allows enough time for that memory freeing to
> occur.  The oom reaper will not set MMF_OOM_SKIP until the timeout has
> expired.

Again, there is no point in emitting the noise.
And the oom_lock serialization will be removed before your patch is applied.

> 
> As I said before, you could make a case for extending the timeout once
> MMF_UNSTABLE has been set.  In practice, we haven't encountered a case
> where that matters.  But that's trivial to do if you would prefer.
> 

Again, I don't think that
"[patch v4] mm, oom: fix unnecessary killing of additional processes" can be
merged.



On 2018/07/21 7:19, David Rientjes wrote:
> On Sat, 21 Jul 2018, Tetsuo Handa wrote:
> 
>> Why does [PATCH 2/2] in https://marc.info/?l=linux-mm&m=153119509215026&w=4 not
>> solve your problem?
>>
> 
> Such an invasive patch, and it completely rewrites the oom reaper.  I now
> fully understand your frustration with the cgroup aware oom killer being
> merged into -mm without any roadmap to actually being merged.  I agree
> with you that it should be dropped; I am not sure why it has not been,
> since there is no active review of the proposed patchset from four months
> ago, posted twice, that fixes the issues with it, nor are those patches
> being merged so the damn thing can actually make progress.
> 

I'm not frustrated with the cgroup aware oom killer. I'm frustrated with
the fact that you keep dismissing my approach as "such an invasive patch"
without actually testing it.

At least, you can test up to my [PATCH 1/2], i.e. whether the cgroup aware
oom killer can be rebased on top of Michal's patches and my [PATCH 1/2].



* [patch v5] mm, oom: fix unnecessary killing of additional processes
  2018-07-20 20:14         ` [patch v4] " David Rientjes
  2018-07-20 20:43           ` Tetsuo Handa
@ 2018-07-24 21:44           ` David Rientjes
  1 sibling, 0 replies; 28+ messages in thread
From: David Rientjes @ 2018-07-24 21:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tetsuo Handa, Michal Hocko, linux-kernel, linux-mm

The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm.  This can happen for a variety of reasons,
including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper
   to stall indefinitely,

but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:

 - when the mm's memory is mostly mlocked.

When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily.  This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.

This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap() and is separate from
the oom reaper setting MMF_OOM_SKIP prematurely.

We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.

To fix this, this patch introduces a per-mm reaping period, which is
configurable through the new oom_free_timeout_ms file in debugfs and
defaults to one second to match the current heuristics.  This support
requires that the oom reaper's list becomes a proper linked list so that
other mm's may be reaped while waiting for an mm's timeout to expire.

This replaces the current timeouts in the oom reaper: (1) when trying to
grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
a HZ sleep if there are blockable mmu notifiers.  It extends it with
timeout to allow an oom victim to reach exit_mmap() before choosing
additional processes unnecessarily.

The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.  Since the
default oom_free_timeout_ms is one second, the same as current heuristics,
there should be no functional change with this patch for users who do not
tune it to be longer other than MMF_OOM_SKIP is set by exit_mmap() after
free_pgtables(), which is the preferred behavior.

The reaping timeout can intentionally be set for a substantial amount of
time, such as 10s, since oom livelock is a very rare occurrence and it's
better to optimize for preventing additional (unnecessary) oom killing
than a scenario that is much more unlikely.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v5:
  - rebased to linux-next
  - reworked serialization in exit_mmap() to no longer rely on
    MMF_UNSTABLE

 include/linux/mm_types.h |   7 ++
 include/linux/sched.h    |   3 +-
 kernel/fork.c            |   3 +
 mm/mmap.c                |  27 ++++----
 mm/oom_kill.c            | 139 ++++++++++++++++++++++++++-------------
 5 files changed, 119 insertions(+), 60 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -451,6 +451,13 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 		struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+		/*
+		 * When to give up on memory freeing from this mm after its
+		 * threads have been oom killed, in jiffies.
+		 */
+		unsigned long oom_free_expire;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1175,7 +1175,8 @@ struct task_struct {
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
+	/* OOM victim queue for oom reaper, protected by oom_reaper_lock */
+	struct list_head		oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -868,6 +868,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_FAULT_INJECTION
 	tsk->fail_nth = 0;
 #endif
+#ifdef CONFIG_MMU
+	INIT_LIST_HEAD(&tsk->oom_reap_list);
+#endif
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3050,32 +3050,30 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	const bool oom_victim = mm_is_oom_victim(mm);
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
+	if (unlikely(oom_victim)) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
 		 */
 		(void)__oom_reap_task_mm(mm);
 
-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		/*
+		 * Now, take mm->mmap_sem for write to avoid racing with the oom
+		 * reaper.  This needs to be done before munlock_vma_pages_all(),
+		 * which clears VM_LOCKED, otherwise the oom reaper cannot
+		 * reliably test for it.  If the oom reaper races with
+		 * munlock_vma_pages_all(), this can result in a kernel oops if
+		 * a pmd is zapped, for example, after follow_page_mask() has
+		 * checked pmd_none().
+		 */
 		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
 	}
 
 	if (mm->locked_vm) {
@@ -3101,6 +3099,9 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	if (unlikely(oom_victim))
+		up_write(&mm->mmap_sem);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -492,7 +493,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 bool __oom_reap_task_mm(struct mm_struct *mm)
@@ -542,18 +543,18 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 }
 
 /*
- * Reaps the address space of the give task.
+ * Reaps the address space of the given mm.
  *
- * Returns true on success and false if none or part of the address space
- * has been reclaimed and the caller should retry later.
+ * The tsk may be re-enqueued to be reaped again until either MMF_OOM_SKIP is
+ * set by exit_mmap() or the oom_free_timeout_ms reaping period expires.
  */
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	bool ret = true;
+	bool ret;
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		trace_skip_task_reaping(tsk->pid);
-		return false;
+		return;
 	}
 
 	/*
@@ -564,57 +565,65 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		trace_skip_task_reaping(tsk->pid);
-		goto out_unlock;
+		goto out;
 	}
 
 	trace_start_task_reaping(tsk->pid);
-
-	/* failed to reap part of the address space. Try again later */
 	ret = __oom_reap_task_mm(mm);
-	if (!ret)
-		goto out_finish;
-
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	if (ret) {
+		trace_finish_task_reaping(tsk->pid);
+		pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-out_finish:
-	trace_finish_task_reaping(tsk->pid);
-out_unlock:
+	} else {
+		/* Failed to reap part of the address space, try again later */
+		trace_skip_task_reaping(tsk->pid);
+	}
+out:
 	up_read(&mm->mmap_sem);
-
-	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
-	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
+	/*
+	 * If this mm has either been fully unmapped, or the oom reaper has
+	 * given up on it, nothing left to do except drop the refcount.
+	 */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
+	oom_reap_task_mm(tsk, mm);
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
+	if (time_after_eq(jiffies, mm->oom_free_expire) &&
+	    !test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+			task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
 
-done:
-	tsk->oom_reaper_list = NULL;
+		/*
+		 * Reaping has failed for the timeout period, so give up and
+		 * allow additional processes to be oom killed.
+		 */
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+	}
 
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
+
+	/* Enqueue to be reaped again */
+	spin_lock(&oom_reaper_lock);
+	list_add_tail(&tsk->oom_reap_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+
+	schedule_timeout_idle(HZ/10);
+	return;
 
-	/* Drop a reference taken by wake_oom_reaper */
+drop:
+	/* Drop the reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -623,11 +632,13 @@ static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_entry(oom_reaper_list.next,
+					 struct task_struct, oom_reap_list);
+			list_del(&tsk->oom_reap_list);
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -638,25 +649,61 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
+/*
+ * Millisecs to wait for an oom mm to free memory before selecting another
+ * victim.
+ */
+static u64 oom_free_timeout_ms = 1000;
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
+
+	if (unlikely(!expire))
+		expire++;
+
+	/*
+	 * Set the reap expiry, with 0 reserved to mean "unset": if it was
+	 * already set, the mm is already enqueued and this tsk can be ignored.
+	 */
+	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
 		return;
 
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add(&tsk->oom_reap_list, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
 }
 
+#ifdef CONFIG_DEBUG_FS
+static int oom_free_timeout_ms_read(void *data, u64 *val)
+{
+	*val = oom_free_timeout_ms;
+	return 0;
+}
+
+static int oom_free_timeout_ms_write(void *data, u64 val)
+{
+	if (val > 60 * 1000)
+		return -EINVAL;
+
+	oom_free_timeout_ms = val;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(oom_free_timeout_ms_fops, oom_free_timeout_ms_read,
+			oom_free_timeout_ms_write, "%llu\n");
+#endif /* CONFIG_DEBUG_FS */
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+#ifdef CONFIG_DEBUG_FS
+	if (!IS_ERR(oom_reaper_th))
+		debugfs_create_file("oom_free_timeout_ms", 0200, NULL, NULL,
+				    &oom_free_timeout_ms_fops);
+#endif
 	return 0;
 }
 subsys_initcall(oom_init)

* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-21  2:47               ` Tetsuo Handa
@ 2018-07-24 21:45                 ` David Rientjes
  2018-07-24 22:31                   ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-24 21:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On Sat, 21 Jul 2018, Tetsuo Handa wrote:

> You can't apply "[patch v4] mm, oom: fix unnecessary killing of additional processes"
> because Michal's patch which removes oom_lock serialization was added to -mm tree.
> 

I've rebased the patch to linux-next and posted a v5.

> You might worry about situations where __oom_reap_task_mm() is a no-op.
> But that is not always true. There is no point with emitting
> 
>   pr_info("oom_reaper: unable to reap pid:%d (%s)\n", ...);
>   debug_show_all_locks();
> 
> noise and doing
> 
>   set_bit(MMF_OOM_SKIP, &mm->flags);
> 
> because exit_mmap() will not release oom_lock until __oom_reap_task_mm()
> completes. That is, except extra noise, there is no difference with
> current behavior which sets set_bit(MMF_OOM_SKIP, &mm->flags) after
> returning from __oom_reap_task_mm().
> 

v5 has restructured how exit_mmap() serializes its unmapping with the oom 
reaper.  It sets MMF_OOM_SKIP while holding mm->mmap_sem.

* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-24 21:45                 ` David Rientjes
@ 2018-07-24 22:31                   ` Tetsuo Handa
  2018-07-24 22:51                     ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-24 22:31 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On 2018/07/25 6:45, David Rientjes wrote:
> On Sat, 21 Jul 2018, Tetsuo Handa wrote:
> 
>> You can't apply "[patch v4] mm, oom: fix unnecessary killing of additional processes"
>> because Michal's patch which removes oom_lock serialization was added to -mm tree.
>>
> 
> I've rebased the patch to linux-next and posted a v5.
> 
>> You might worry about situations where __oom_reap_task_mm() is a no-op.
>> But that is not always true. There is no point with emitting
>>
>>   pr_info("oom_reaper: unable to reap pid:%d (%s)\n", ...);
>>   debug_show_all_locks();
>>
>> noise and doing
>>
>>   set_bit(MMF_OOM_SKIP, &mm->flags);
>>
>> because exit_mmap() will not release oom_lock until __oom_reap_task_mm()
>> completes. That is, except extra noise, there is no difference with
>> current behavior which sets set_bit(MMF_OOM_SKIP, &mm->flags) after
>> returning from __oom_reap_task_mm().
>>
> 
> v5 has restructured how exit_mmap() serializes its unmapping with the oom 
> reaper.  It sets MMF_OOM_SKIP while holding mm->mmap_sem.
> 

I think that v5 is still wrong. That exit_mmap() keeps mmap_sem held for write
does not prevent oom_reap_task() from emitting the noise and setting
MMF_OOM_SKIP after the timeout. Since your purpose is to wait for the release
of memory which could not be reclaimed by __oom_reap_task_mm(), what if
__oom_reap_task_mm() was a no-op and exit_mmap() was preempted immediately
after returning from __oom_reap_task_mm()?

Also, I believe that a userspace-visible knob is not needed.


* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-24 22:31                   ` Tetsuo Handa
@ 2018-07-24 22:51                     ` David Rientjes
  2018-07-24 22:55                       ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: David Rientjes @ 2018-07-24 22:51 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On Wed, 25 Jul 2018, Tetsuo Handa wrote:

> >> You might worry about situations where __oom_reap_task_mm() is a no-op.
> >> But that is not always true. There is no point with emitting
> >>
> >>   pr_info("oom_reaper: unable to reap pid:%d (%s)\n", ...);
> >>   debug_show_all_locks();
> >>
> >> noise and doing
> >>
> >>   set_bit(MMF_OOM_SKIP, &mm->flags);
> >>
> >> because exit_mmap() will not release oom_lock until __oom_reap_task_mm()
> >> completes. That is, except extra noise, there is no difference with
> >> current behavior which sets set_bit(MMF_OOM_SKIP, &mm->flags) after
> >> returning from __oom_reap_task_mm().
> >>
> > 
> > v5 has restructured how exit_mmap() serializes its unmapping with the oom 
> > reaper.  It sets MMF_OOM_SKIP while holding mm->mmap_sem.
> > 
> 
> I think that v5 is still wrong. That exit_mmap() keeps mmap_sem held for write
> does not prevent oom_reap_task() from emitting the noise and setting
> MMF_OOM_SKIP after the timeout. Since your purpose is to wait for the release
> of memory which could not be reclaimed by __oom_reap_task_mm(), what if
> __oom_reap_task_mm() was a no-op and exit_mmap() was preempted immediately
> after returning from __oom_reap_task_mm()?
> 

If exit_mmap() gets preempted indefinitely before it can free any memory, 
we are better off oom killing another process.  The purpose of the timeout 
is to give an oom victim an amount of time to free its memory and exit 
before selecting another victim.

* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-24 22:51                     ` David Rientjes
@ 2018-07-24 22:55                       ` Tetsuo Handa
  2018-07-25  0:24                         ` David Rientjes
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2018-07-24 22:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On 2018/07/25 7:51, David Rientjes wrote:
> On Wed, 25 Jul 2018, Tetsuo Handa wrote:
> 
>>>> You might worry about situations where __oom_reap_task_mm() is a no-op.
>>>> But that is not always true. There is no point with emitting
>>>>
>>>>   pr_info("oom_reaper: unable to reap pid:%d (%s)\n", ...);
>>>>   debug_show_all_locks();
>>>>
>>>> noise and doing
>>>>
>>>>   set_bit(MMF_OOM_SKIP, &mm->flags);
>>>>
>>>> because exit_mmap() will not release oom_lock until __oom_reap_task_mm()
>>>> completes. That is, except extra noise, there is no difference with
>>>> current behavior which sets set_bit(MMF_OOM_SKIP, &mm->flags) after
>>>> returning from __oom_reap_task_mm().
>>>>
>>>
>>> v5 has restructured how exit_mmap() serializes its unmapping with the oom 
>>> reaper.  It sets MMF_OOM_SKIP while holding mm->mmap_sem.
>>>
>>
>> I think that v5 is still wrong. That exit_mmap() keeps mmap_sem held for write
>> does not prevent oom_reap_task() from emitting the noise and setting
>> MMF_OOM_SKIP after the timeout. Since your purpose is to wait for the release
>> of memory which could not be reclaimed by __oom_reap_task_mm(), what if
>> __oom_reap_task_mm() was a no-op and exit_mmap() was preempted immediately
>> after returning from __oom_reap_task_mm()?
>>
> 
> If exit_mmap() gets preempted indefinitely before it can free any memory, 
> we are better off oom killing another process.  The purpose of the timeout 
> is to give an oom victim an amount of time to free its memory and exit 
> before selecting another victim.
> 

There is no point in emitting the noise.

* Re: [patch v4] mm, oom: fix unnecessary killing of additional processes
  2018-07-24 22:55                       ` Tetsuo Handa
@ 2018-07-25  0:24                         ` David Rientjes
  0 siblings, 0 replies; 28+ messages in thread
From: David Rientjes @ 2018-07-25  0:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, Michal Hocko, linux-kernel, linux-mm

On Wed, 25 Jul 2018, Tetsuo Handa wrote:

> > If exit_mmap() gets preempted indefinitely before it can free any memory, 
> > we are better off oom killing another process.  The purpose of the timeout 
> > is to give an oom victim an amount of time to free its memory and exit 
> > before selecting another victim.
> > 
> 
> There is no point in emitting the noise.
> 

If you're concerned about too many printk's to the kernel log, 
oom_reap_task_mm() could store whether MMF_UNSTABLE was set or not before 
attempting to reap and then only printk if this was the first oom reaping.

We lose the ability to determine if subsequent reaps freed additional 
memory, but I don't suppose that's too concerning.
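
For illustration only, a minimal sketch of that idea on top of the v5
oom_reap_task_mm() above.  It assumes, as in linux-next at the time, that
__oom_reap_task_mm() sets MMF_UNSTABLE when it first reaps an mm:

	static void oom_reap_task_mm(struct task_struct *tsk,
				     struct mm_struct *mm)
	{
		bool first_reap;

		if (!down_read_trylock(&mm->mmap_sem)) {
			trace_skip_task_reaping(tsk->pid);
			return;
		}

		if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
			trace_skip_task_reaping(tsk->pid);
			goto out;
		}

		/* Assumes __oom_reap_task_mm() sets MMF_UNSTABLE, so if it
		 * is not yet set, no reap attempt has happened yet. */
		first_reap = !test_bit(MMF_UNSTABLE, &mm->flags);

		trace_start_task_reaping(tsk->pid);
		if (__oom_reap_task_mm(mm)) {
			trace_finish_task_reaping(tsk->pid);
			if (first_reap)
				pr_info("oom_reaper: reaped process %d (%s)\n",
					task_pid_nr(tsk), tsk->comm);
		} else {
			trace_skip_task_reaping(tsk->pid);
		}
	out:
		up_read(&mm->mmap_sem);
	}

Only the first successful reap would then log, at the cost noted above.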

Thread overview: 28+ messages
2018-06-21 21:35 [patch v3] mm, oom: fix unnecessary killing of additional processes David Rientjes
2018-07-04  1:43 ` David Rientjes
2018-07-04  2:26   ` penguin-kernel
2018-07-05 23:46 ` Andrew Morton
2018-07-06  5:39   ` Michal Hocko
2018-07-07  0:05   ` David Rientjes
2018-07-09 12:35     ` Michal Hocko
2018-07-09 20:30       ` David Rientjes
2018-07-10 11:01         ` Michal Hocko
2018-07-17 21:09 ` Tetsuo Handa
2018-07-18 20:22   ` David Rientjes
2018-07-18 21:21     ` Tetsuo Handa
2018-07-19 14:23       ` Tetsuo Handa
2018-07-20  8:41       ` David Rientjes
2018-07-20  9:57         ` Tetsuo Handa
2018-07-20 20:19           ` David Rientjes
2018-07-20 20:47             ` Tetsuo Handa
2018-07-20 22:19               ` David Rientjes
2018-07-20 20:14         ` [patch v4] " David Rientjes
2018-07-20 20:43           ` Tetsuo Handa
2018-07-20 22:13             ` David Rientjes
2018-07-21  2:47               ` Tetsuo Handa
2018-07-24 21:45                 ` David Rientjes
2018-07-24 22:31                   ` Tetsuo Handa
2018-07-24 22:51                     ` David Rientjes
2018-07-24 22:55                       ` Tetsuo Handa
2018-07-25  0:24                         ` David Rientjes
2018-07-24 21:44           ` [patch v5] " David Rientjes
